
I have an employee table with employee_id, name, and working_division, where employee_id is the primary key. I have an Excel source with these columns and more, where an employee has entered their hours, what type of work they have done, what division of the company it was for, and so forth.

So for any given day, an employee could have multiple rows showing their type of work, what division they worked for, and their charged hours to that division.

How do I get this into the OLE DB destination table in which employee_id is the primary key?

I am trying to use the Aggregate transform to group by employee_id; however, employee_id and working_division are not one-to-one, so grouping by both columns will try to insert the same employee_id into the employee table more than once (and employee_id is the primary key!). If I do not include working_division in the Aggregate transform, then I lose that data.

How can I group my data by employee_id and still keep all the other columns from that row?

Thanks for all the help!

  • First thing that goes through my head is: why don't you modify your PK on the destination table? If employee_id to working_division is not one to one, then you can't get all your data either way you group. Maybe I'm missing something. – Marcel N. Jun 18 '12 at 17:43
  • @marceln I need the employee_id to be the PK. Basically I have a very large unorganized data source, and I am breaking it apart into 4 to 5 separate tables to fit my model so I can make sense of the data with some data mining algorithms. – CodeKingPlusPlus Jun 18 '12 at 18:59
  • 2
    Your source data could look like `10, Bob, Div1` and `10, Bob, Div2` and your desire is to roll that data up to be `10, Bob, ?` in the table? Stated another way, how should the data be aggregated to satisfy the design of the Employee table? – billinkc Jun 20 '12 at 19:29
  • @thecoon Any feedback on my question/comment as you have the bounty? – billinkc Jun 22 '12 at 15:05
  • The short answer is that it can't be aggregated without modifications in the target table(s). Other possible answers would be: a surrogate PK in the employee table (which would also eliminate the need for a `Multicast`) or a compound PK on (`employee_id`, `working_division`). I prefer the first one, as compound PKs can have a negative impact on insert performance in large tables with random values for the PK entries. So, I started the bounty hoping to see which of these options (or others that I maybe can't see) is the most appropriate in such a situation. – Marcel N. Jun 22 '12 at 18:22
  • @CodeKingPlusPlus: Anything to say to this? :) – Marcel N. Jun 23 '12 at 15:13

1 Answer


> I need the employee_id to be the PK. Basically I have a very large unorganized data source, and I am breaking it apart into 4 to 5 separate tables to fit my model so I can make sense of the data with some data mining algorithms

OK, then why don't you split employee_id and working_division into two separate tables? The second table should keep a FK to the employee table (so one-to-many).
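A minimal sketch of what that split could look like; the table names, the `name` column, and the data types here are assumptions, only `employee_id` and `working_division` come from your description:

```sql
-- Employee table: one row per employee, employee_id stays the PK
CREATE TABLE dbo.employee (
    employee_id   INT           NOT NULL PRIMARY KEY,
    name          NVARCHAR(100) NOT NULL
);

-- One row per (employee, division) pair, FK back to the employee table
CREATE TABLE dbo.employee_division (
    employee_id      INT          NOT NULL,
    working_division NVARCHAR(50) NOT NULL,
    CONSTRAINT PK_employee_division
        PRIMARY KEY (employee_id, working_division),
    CONSTRAINT FK_employee_division_employee
        FOREIGN KEY (employee_id) REFERENCES dbo.employee (employee_id)
);
```

This way the Aggregate on employee_id feeds the employee table, while the distinct (employee_id, working_division) pairs go to the second table without violating any primary key.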

In the SSIS package you can then add a Multicast component right after the Aggregate on employee_id, in order to split your data source into the two target tables.

I think that without a modification to your target model you won't be able to achieve what you want; it basically violates the rules of an RDBMS. The grouping you are talking about couldn't be done even in plain SQL and still yield correct results.

Note: If you're worried about modifying your target data model, then perhaps you can normalize it as mentioned above and then denormalize it back through a view. You may even be able to create an indexed view in order to speed things up at read time (as far as I can see, an indexed view should be possible, since all you have is an inner join between the two tables).
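A rough sketch of such an indexed view, assuming the two tables from the example above (the view name is an assumption; note that indexed views require `WITH SCHEMABINDING`, two-part table names, and a unique clustered index before they are materialized):

```sql
-- Denormalized read-side view over the two normalized tables
CREATE VIEW dbo.v_employee_division
WITH SCHEMABINDING
AS
SELECT e.employee_id,
       e.name,
       d.working_division
FROM dbo.employee AS e
INNER JOIN dbo.employee_division AS d
    ON d.employee_id = e.employee_id;
GO

-- The unique clustered index is what actually materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_v_employee_division
    ON dbo.v_employee_division (employee_id, working_division);
```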
