I see two scenarios where you end up with inserts that should have been updates.
Duplicate source values
The first is that you have duplicate keys in your source data and nothing in the target table.
Source data
Key|Value
A |abc
B |bcd
A |cde
Destination data
Key|Value
C |yza
B |zab
In this situation, assuming the default behaviour of the Lookup Component, Full Cache, SSIS will run the source query for the Lookup's reference table before the package begins execution. Only once all the lookup table data has been cached will the data flow begin flowing data.
- The first row, A:abc hits the lookup. Nope, no data and off to the Insert path.
- The second row B:bcd hits the lookup. Nope, no data and off to the Insert path.
- The third row A:cde hits the lookup. Nope, no data and off to the Insert path (and hopefully a primary/unique key violation)
When the package started, it only knew about the data already in the destination table. During the run you added a row with that key value to the table, but you never asked the Lookup component to refresh its cache and check for the new row.
In this situation, there are two resolutions. The first is to change the cache mode from Full to None (or Partial). This makes the Lookup component issue a query against the target table for every row that flows through the data flow, which can get expensive for large row counts. It also isn't foolproof, because the data flow works in buffers, and in a situation like our sample 3 row load, everything fits in one buffer. All the rows in the buffer hit the Lookup at approximately the same time, so the target table will still not contain an A value when the third row flows through the component. You can put the brakes on the data flow and force it to process one row at a time by adjusting the buffer size to 1, but that's generally not going to be a good solution.
The other resolution is to dedupe the data and handle survivorship: which A row should make it to the database when the source has different values for the same business key? First, last, pick one? If you can't eliminate the duplicates before they hit the Data Flow, you'll need to deduplicate the data, using an Aggregate component to roll up your data as best you can, or in the source query as sketched below.
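If the source data is queryable, one option is to have the source query pick a winner per business key before SSIS ever sees the duplicates. This is only a sketch: the table and column names (dbo.StagingSource, BusinessKey, SomeValue, LoadDate) are invented, and the ORDER BY should be whatever your actual survivorship rule says it should be.

    -- Keep one row per business key; here "most recently loaded wins" is the assumed rule
    WITH Deduped AS
    (
        SELECT
            S.BusinessKey
        ,   S.SomeValue
        ,   ROW_NUMBER() OVER
            (
                PARTITION BY S.BusinessKey
                ORDER BY S.LoadDate DESC
            ) AS rn
        FROM
            dbo.StagingSource AS S
    )
    SELECT
        D.BusinessKey
    ,   D.SomeValue
    FROM
        Deduped AS D
    WHERE
        D.rn = 1;

Point the source component at a query like that and the duplicate-key insert problem never reaches the Lookup component.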
Case sensitive lookups
Source data
Key|Value
A |abc
B |bcd
a |cde
Destination data
Key|Value
C |yza
B |zab
The other scenario where the Lookup component bites you is that in the default Full Cache mode, matching is based on .NET string comparison rules. Thus AAA is not equal to AaA. If your lookup is doing string matching, then even if your database is case insensitive, the SSIS lookup will not be.
In situations where I need to match alpha data, I usually make an extra/duplicate column in my source data which is the key data in all upper (or lower) case. If I am querying the data, I add it to my query. If I am loading from a flat file, I use a Derived Column Component to add the column to the data flow.
I then ensure the data in my reference table is equally cased when I use the Lookup component.
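As a rough sketch of that approach, with invented object names (dbo.StagingSource and dbo.DestinationTable, key column KeyCol), the source query and the Lookup's reference query both expose an upper-cased copy of the key, and the Lookup then matches KeyColUpper to KeyColUpper.

    -- Source query: add an upper-cased copy of the business key
    SELECT
        S.KeyCol
    ,   S.SomeValue
    ,   UPPER(S.KeyCol) AS KeyColUpper
    FROM
        dbo.StagingSource AS S;

    -- Lookup reference query: case the key the same way so the .NET string match lines up
    SELECT
        D.KeyCol
    ,   UPPER(D.KeyCol) AS KeyColUpper
    FROM
        dbo.DestinationTable AS D;

For the flat file case, a Derived Column expression along the lines of UPPER(KeyCol) adds the equivalent column to the data flow without touching the file.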
Lookup Component caveats
Full Cache mode:
- insensitive to changes to the reference data
- Case sensitive matches
- generally faster overall
- delays data flow until the lookup data has been cached
- NULL matches empty string
- Cached data requires RAM
No Cache mode:
- sensitive to changes in the reference data
- Case sensitivity of matches follows the rules of the lookup system (if the database is case sensitive, you're getting a case-sensitive match; if it's insensitive, so is the match)
- Speed: it depends (100 rows of source data against 1,000 rows of reference data, no one will notice; 1B rows of source data against 10B rows of reference data, someone will notice; are there indexes to support the lookups, etc.)
- NULL matches nothing (see the illustration after this list)
- No appreciable SSIS memory overhead
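The NULL difference makes sense when you consider where the comparison happens: the full cache compares values in memory under .NET rules, while the uncached lookup sends an equality query to the reference database for each row, and in T-SQL NULL = NULL does not evaluate to true. A quick illustration, nothing SSIS-specific about it:

    -- NULL = NULL is UNKNOWN under SQL's three-valued logic, so the CASE falls through to ELSE;
    -- the same logic means a no-cache lookup on a NULL key never finds a match
    SELECT CASE WHEN NULL = NULL THEN 'match' ELSE 'no match' END AS Result;
    -- Returns: no match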
Partial Cache:
The partial cache is mostly like the No Cache option, except that once it gets a match against the reference table, it will cache that value until execution is over or until it gets pushed out due to memory pressure.
See also: Lookup Cache NULL answer