Merging rows based on timeframe to reduce data redundancy in SQL Server 2012

Question

I have the following SQL problem, in which I am actually looking for the most efficient way to do this. Suppose I have the following table:

r|Company - StartPeriod - EndPeriod - NumberOfEmployees - NumberOfMachines ...
1|A - 2014-1-1 - 2014-2-5 - 30 - 1
2|A - 2014-2-6 - 2014-3-7 - 30 - 1
3|A - 2014-3-8 - 2014-4-9 - 31 - 1
4|A - 2014-4-9 - 2014-5-1 - 32 - 1
5|B - 2014-1-2 - 2014-2-1 - 20 - 2
6|B - 2014-2-2 - 2014-3-1 - 20 - 2
7|B - 2014-3-2 - 2014-4-1 - 20 - 2

Notice the the redundant amount of data? This is because the companies get sampled now and then and therefore the periods are actually kind of random in lengths.

Now the question is, how can merge the rows if and only if the variables NumberOfEmployees, NumberOfMachines etcetera are the same, but StartPeriod and Endperiod differ? Of course I would like to replace the start period with the earliest one and the endperiod with the latest one.

The r is just an indicator, not a variable in the database. In this case, 1 and 2 will be merged, but 3,4 will be kept, since it has different values (namely it has 31 employees instead of 30).

This is in my opinion such an hard problem, that I am not even sure whether SQL can do this.

The resulting output would be:

r|Company - StartPeriod - EndPeriod - NumberOfEmployees - NumberOfMachines ...
1|A - 2014-1-1 - 2014-3-7 - 30 - 1
2|A - 2014-3-8 - 2014-4-9 - 31 - 1
3|A - 2014-4-9 - 2014-5-1 - 32 - 1
4|B - 2014-1-2 - 2014-4-1 - 20 - 2

Thanks for everything!

score 1 · Answer 1 · answered Mar 12 '15 at 15:57

1

You may be able to just use group by:

select company, min(startperiod) as startperiod, max(endperiod) as endperiod,
       numberofemployees, numberofmachines . .
from table t
group by company, numberofemployees, numberofmachines;

If this doesn't solve the problem, please edit the question and provide more information on the additional logic needed for merging rows.

answered Mar 12 '15 at 15:57

Gordon Linoff

1,242,037
58
646
786

Hmm, I never looked at this way... I will try this! It looks promising. Thanks! It works! I think, wow it was pretty simple (I need six minutes before I can accept your answer, THANKS!) – Snowflake Mar 12 '15 at 16:00
1

Note that if the "same" row occurs later on that row will be merged in as well. – Lennart - Slava Ukraini Mar 12 '15 at 16:07

score 0 · Answer 2 · answered Mar 12 '15 at 16:03

You can use an "island and gap"" strategy:

select r, company, min(startperiod), max(EndPeriod)
     , NumberOfEmployees, NumberOfMachines, ..., grp
from (
    select r, company, startperiod, EndPeriod, ...
         , row_number() over (order by r)
         - row_number() over (partition by company, 
                                           NumberOfEmployees,
                                           NumberOfMachines, ...  
                              order by r) as grp
    from T
)
group by r, company, NumberOfEmployees, NumberOfMachines, ..., grp

Merging rows based on timeframe to reduce data redundancy in SQL Server 2012

2 Answers2