Hey John,
short on time, so just a brief response to "what does optimize compression actually do?".
(I believe I've covered that stuff in more detail in the book... ).
Optimize compression tries to figure out
a) by which column a table should be sorted and
b) which compression algorithm to use for the value vector of each column.
Obviously the overall compression efficiency depends on the type of data and the distribution of this data per column.
Since we reconstruct tuples based on the relative position (offset position) of the entries in the respective value vectors, the sort order for tuples has to be the same in all columns.
So, the tuple that is at position 42 in the first column has to be at position 42 in all columns.
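To make the "same position in every column" point a bit more tangible, here is a tiny Python sketch. It is only a toy model with plain lists standing in for value vectors; the names (columns, reconstruct_tuple, sort_table_by) are made up for illustration and have nothing to do with the actual implementation.

```python
# Toy column store: every column is its own list (a "value vector").
columns = {
    "id":    [1, 2, 3, 4],
    "make":  ["VW", "BMW", "Audi", "VW"],
    "color": ["red", "blue", "red", "green"],
}

def reconstruct_tuple(columns, position):
    """Rebuild one row by reading the same offset in every column."""
    return tuple(vector[position] for vector in columns.values())

def sort_table_by(columns, sort_column):
    """Sort by one column and apply the SAME permutation to all columns."""
    n = len(columns[sort_column])
    order = sorted(range(n), key=lambda i: columns[sort_column][i])
    return {name: [vector[i] for i in order] for name, vector in columns.items()}

row_before = reconstruct_tuple(columns, 2)
sorted_columns = sort_table_by(columns, "color")

print(row_before)
print([reconstruct_tuple(sorted_columns, i) for i in range(4)])
```

The rows come out in a different order after sorting, but each row still holds the values that belonged together before, because all columns were reordered identically.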
Now, the goal is clear: find the sort order and the compression algorithms (you know... DEFAULT, INDIRECT, CLUSTERED, RLE...) that allow the best overall compression.
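As a rough illustration of what "find the best sort order" means, the following Python sketch simply tries every column as the sort column and counts the runs of equal adjacent values across all columns. Big assumption here: the run count is just a stand-in cost that favors RLE-friendly layouts. The real optimizer weighs several algorithms per column and certainly does not work like this toy; all names are hypothetical.

```python
def count_runs(values):
    """Number of runs of equal adjacent values (fewer runs = better for RLE)."""
    if not values:
        return 0
    return 1 + sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)

def total_runs_when_sorted_by(columns, sort_column):
    """Sort the whole table by one column and sum the runs over all columns."""
    n = len(columns[sort_column])
    order = sorted(range(n), key=lambda i: columns[sort_column][i])
    return sum(count_runs([vector[i] for i in order]) for vector in columns.values())

def best_sort_column(columns):
    """Brute force: try each column as sort column, keep the cheapest one."""
    return min(columns, key=lambda name: total_runs_when_sorted_by(columns, name))

cars = {
    "color": ["red", "blue", "red", "green", "red", "blue"],
    "doors": [4, 2, 4, 4, 2, 2],
    "plant": ["A", "B", "A", "B", "A", "B"],
}

for name in cars:
    print(name, total_runs_when_sorted_by(cars, name))
print("best:", best_sort_column(cars))
```

Even this toy version has to re-sort the whole table once per candidate column, which gives a feeling for why the real thing is not something you want to run all the time.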
Still with me? good.
The one thing that makes one compression algorithm better than another for a specific column is the data distribution. Depending on things like whether there is one clearly dominant value or a recurring pattern of values, different algorithms are the better fit.
Stupid example: a column that holds the color of cars (yeah, the car analogy again... whatever). Let's say it has ten colors in total, but red really stands out with over one third of all cars.
It would probably make sense to sort the tuples by color then and apply RLE to the red entries.
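Here is what that looks like in a minimal Python sketch of run-length encoding. This is plain textbook RLE storing (value, run length) pairs, not the actual storage format; the function names and the sample data are made up.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the full value list."""
    return [value for value, length in runs for _ in range(length)]

colors = ["red", "blue", "red", "green", "red", "blue", "red"]

unsorted_runs = rle_encode(colors)
sorted_runs = rle_encode(sorted(colors))

print(len(unsorted_runs), unsorted_runs)
print(len(sorted_runs), sorted_runs)
```

Unsorted, every entry is its own run and RLE gains nothing. Sorted, all the red entries collapse into a single pair, and the more red dominates the column, the bigger the win.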
Alright... as anyone with some exposure to optimization problems will guess, finding the "best" combination might take some time. Also, in our usual database situations the actual kind of data and its distribution within the columns don't change that often.
Once the initial loading phase is passed, the data distribution is actually quite stable for a while.
Running the optimization every time with a delta merge would be pointless.
So, the compression optimization is only performed when a lot of data has changed - or when it's requested manually.
Bottom line: the compression optimization run tries to find the optimal compression types for all columns of a table. The actual compression is performed with every delta merge.
Ok, here you go, now you know -
A great weekend to everybody.
- Lars