In Kettle (aka Pentaho Data Integration), I read an XLS file containing products linked to categories and insert them into a database.
The category-product relationship is 1:n (one category has many products, each product belongs to exactly one category). I insert the category first, then the product.
CASE 1:
- Insert/update category (in practice, I only insert);
- Lookup category by code and return the id used in the other steps;
CASE 2:
- Lookup category by code;
- Filter row: if(id>0) then go to other steps; else go to step 3;
- Insert category and return the id;
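To make the two flows concrete, here is a rough sketch in plain Python with an in-memory SQLite database. It is not PDI itself: the `category` table, the function names `case1` and `case2`, and the use of `INSERT OR IGNORE` to stand in for "insert only" are all assumptions for illustration. It only counts how many SQL statements each flow issues per input row, which is a crude proxy and not a measurement of PDI speed or memory.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one category table with a unique code.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE category (id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE)"
    )
    return conn


def case1(conn, code):
    """Case 1: try the insert for every row, then look up the id by code."""
    # Assumption: INSERT OR IGNORE models "insert only" (existing codes are skipped).
    conn.execute("INSERT OR IGNORE INTO category (code) VALUES (?)", (code,))
    row = conn.execute(
        "SELECT id FROM category WHERE code = ?", (code,)
    ).fetchone()
    return row[0], 2  # always two statements per input row


def case2(conn, code):
    """Case 2: look up first, insert only when the code is missing."""
    row = conn.execute(
        "SELECT id FROM category WHERE code = ?", (code,)
    ).fetchone()
    if row is not None:
        return row[0], 1  # category already exists: one statement
    cur = conn.execute("INSERT INTO category (code) VALUES (?)", (code,))
    return cur.lastrowid, 2  # new category: lookup plus insert


def run(case, codes):
    # Feed a list of category codes through one flow on a fresh database.
    conn = make_db()
    ids, total = [], 0
    for code in codes:
        cat_id, statements = case(conn, code)
        ids.append(cat_id)
        total += statements
    conn.close()
    return ids, total


codes = ["A", "B", "A", "A", "B"]  # made-up sample: 2 categories, 5 product rows
ids1, n1 = run(case1, codes)
ids2, n2 = run(case2, codes)
print("case 1:", ids1, "statements:", n1)
print("case 2:", ids2, "statements:", n2)
```

On this made-up sample both flows return the same ids, but case 1 issues 10 statements and case 2 issues 7. The gap widens as more product rows share the same category, which is what I would expect with a 1:n relationship.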
Which is better in terms of speed and memory use, case 1 or case 2?
The same choice applies to sub-category, supplier and other related entities.
Currently I use case 1: PDI processes 4 records per second, and my files contain 100k records, so a single file takes roughly 7 hours.