
I have a huge table in a data warehouse (Vertica). I am accessing this table in chunks for optimization purposes. The way I decide my chunks is pretty straightforward: I have a primary key column, say A, and I take MAX(A). My chunk size is 20,000, so I create (MAX(A)/20000)+1 chunks. I then frame a query for each chunk range and retrieve the data.
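For illustration, each per-chunk query looks roughly like this (my_table and A stand in for the real table and column names):

    -- chunk i covers the key range [i*20000, (i+1)*20000)
    SELECT *
    FROM my_table
    WHERE A >= 0       -- i * 20000, shown here for i = 0
      AND A < 20000;   -- (i + 1) * 20000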

The problem with this approach is as follows:

  • My number of chunks depends on MAX(A), and since MAX(A) is growing very fast, the number of chunks grows with it.

I decided on 20,000 because that is what gives me optimal performance, but the distribution of primary keys within those 20,000-wide ranges is very scattered. For example, the range 0-20000 might contain only 3 keys, the range 20000-40000 might contain 500, and no range comes anywhere close to 20,000.

I am trying to figure out whether there is a good approximation algorithm for this problem that minimizes the number of chunks while packing close to 20,000 primary keys into each chunk.
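Roughly, I would like to compute chunk boundaries from the actual key distribution rather than from MAX(A), something like this sketch (again with made-up names my_table and A), and then frame one query per (chunk_start, chunk_end) range:

    SELECT chunk_id,
           MIN(A)   AS chunk_start,
           MAX(A)   AS chunk_end,
           COUNT(*) AS keys_in_chunk
    FROM (
        -- assign every existing key to a bucket of ~20,000 rows
        SELECT A,
               FLOOR((ROW_NUMBER() OVER (ORDER BY A) - 1) / 20000) AS chunk_id
        FROM my_table
    ) t
    GROUP BY chunk_id
    ORDER BY chunk_id;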

Any pointers towards a solution are appreciated.


1 Answer


I'm not sure what "optimization purposes" means here, but I think the best approach would be to create a timestamp column, or use an eligible existing timestamp column, and partition on it. You could then partition on a larger frame of reference so there isn't such a wide variation between partitions.
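For example, a rough sketch of month-level partitioning, assuming a hypothetical timestamp column created_at on a table my_table (syntax details may vary by Vertica version):

    ALTER TABLE my_table
    PARTITION BY DATE_TRUNC('month', created_at)::DATE
    REORGANIZE;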

If the table is partitioned, queries can benefit from partition pruning: during query execution, Vertica can eliminate storage containers that do not match the timestamp predicate.
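With the (hypothetical) monthly partitioning above, a query like this one should only need to touch the containers for the matching month:

    SELECT *
    FROM my_table
    WHERE created_at >= '2013-06-01'
      AND created_at <  '2013-07-01';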

Otherwise, you can look at the segmentation clause and use the max/min from the storage containers. This could be slightly more complicated.
