How to tile data evenly distributed across boundaries

Question

I have a time-based set of records that I need to load into a partitioned table which is used for staging data. The stage table is partitioned by day. I have distributed the data load into the stage table across multiple "processors" (streams in SSIS) for efficiency. Once the data is staged, I then perform a series of de-duplication operations prior to loading into the data mart. However, my challenge is that the staged data is not evenly split across the processors, as I am using the NTILE function across the set partitioned by the date.

So I might see the following distribution for 5 processors ...

Processor 1, >= 2011-01-01 and < 2011-05-01, Rows = 200,000
Processor 2, >= 2011-05-01 and < 2011-09-01, Rows = 3,000,000
Processor 3, >= 2011-09-01 and < 2012-01-01, Rows = 6,000,000
Processor 4, >= 2012-01-01 and < 2012-05-01, Rows = 6,000,000
Processor 5, >= 2012-05-01 and < 2012-09-01, Rows = 0

The amount of data is increasing exponentially, so although Processor 4 only has 6,000,000 at the time of the load today, once the full range is populated, Processor 4 may be working on 8,000,000+ rows (records) in total.

My goal is to distribute the work evenly across the processors, based upon the number of rows, while ensuring that any two processors are not contending for the same partition (day).

So, as a visual the distribution would need to look something like this ...

Processor 1, >= 2011-01-01 and < 2011-09-01, Rows (3,200,000)
Processor 2, >= 2011-09-01 and < 2011-11-01, Rows (3,000,000)
Processor 3, >= 2011-11-01 and < 2012-01-01, Rows (3,000,000)
Processor 4, >= 2012-01-01 and < 2012-01-03, Rows (3,000,000)
Processor 5, >= 2012-01-03 and < 2012-03-18, Rows (3,000,000; 2012-03-18 contains most current data)

Any feedback would be greatly appreciated.

score 0 · Answer 1 · answered Mar 27 '12 at 13:46

Here is an idea, but it depends on your de-duping requirements. In your current partitioning scheme, you are keeping months of records together. Would the de-duping work if all records for one day were in the same partiiton?

If so, then you can do a "round robin" partitioning by day. Instead of using ranges, convert the date to a number and assign the partition based on mod 5. Here is an example:

select CAST(<thedate> as int)%5

This should work much better for load balancing.

How to tile data evenly distributed across boundaries

1 Answers1