I have a time-based set of records that I need to load into a partitioned table which is used for staging data. The stage table is partitioned by day. I have distributed the data load into the stage table across multiple "processors" (streams in SSIS) for efficiency. Once the data is staged, I then perform a series of de-duplication operations prior to loading into the data mart. However, my challenge is that the staged data is not evenly split across the processors, as I am using the NTILE function across the set partitioned by the date.
So I might see the following distribution for 5 processors ...
Processor 1, >= 2011-01-01 and < 2011-05-01, Rows = 200,000
Processor 2, >= 2011-05-01 and < 2011-09-01, Rows = 3,000,000
Processor 3, >= 2011-09-01 and < 2012-01-01, Rows = 6,000,000
Processor 4, >= 2012-01-01 and < 2012-05-01, Rows = 6,000,000
Processor 5, >= 2012-05-01 and < 2012-09-01, Rows = 0
The amount of data is increasing exponentially, so although Processor 4 only has 6,000,000 at the time of the load today, once the full range is populated, Processor 4 may be working on 8,000,000+ rows (records) in total.
My goal is to distribute the work evenly across the processors, based upon the number of rows, while ensuring that any two processors are not contending for the same partition (day).
So, as a visual the distribution would need to look something like this ...
Processor 1, >= 2011-01-01 and < 2011-09-01, Rows (3,200,000)
Processor 2, >= 2011-09-01 and < 2011-11-01, Rows (3,000,000)
Processor 3, >= 2011-11-01 and < 2012-01-01, Rows (3,000,000)
Processor 4, >= 2012-01-01 and < 2012-01-03, Rows (3,000,000)
Processor 5, >= 2012-01-03 and < 2012-03-18, Rows (3,000,000; 2012-03-18 contains most current data)
Any feedback would be greatly appreciated.