Pentaho run contains list to file

Question

I have the following situation with 2 files.

Input file, 2 fields, 6 rows:

1|BANANA ON CAGES    
2|APPLE CHIPS    
3|SPORT CARS    
4|PLANES    
5|HOUSE    
6|BOTTLES

List file, 2 fields, 4 rows:

BANANA|FRUIT    
APPLE|FRUIT    
CAR|TRANSPORT    
PLANE|TRANSPORT

And I want this result:

Output file, 3 fields, 6 rows:

1|BANANA ON CAGES|FRUIT    
2|APPLE CHIPS|FRUIT    
3|SPORT CARS|TRANSPORT    
4|PLANES|TRANSPORT    
5|HOUSE    
6|BOTTLES

It is mandatory for me to use PDI. Join files (Cartesian Product) is too slow: the input file has around 1,000,000 rows and the list file around 300,000 rows.

Either Cartesian product is the solution, or there has to be some join condition. — Nikhil, Nov 16 '16 at 21:03
OK, thanks. Is there any way to get the same number of rows in the join output as in the input file, like in my example? If the condition does not match, I lose the row. — Agustín Graña, Nov 17 '16 at 14:47
You need more data. There's nothing in the data that tells whether an entry in the input file is fruit or transport. This distinction must exist somewhere in the data for the computer to know which is which, and "CAR" != "SPORT CARS". — Brian.D.Myers, Nov 18 '16 at 17:08
"There's nothing in the data that tells whether an entry..." Yes, there is, in the list file, and "car" is a part of "sport cars"... — Agustín Graña, Nov 19 '16 at 19:18

score 0 · Answer 1 · answered Dec 07 '16 at 22:36

Does your list file need to be dynamic, or is its content reasonably static?

If static, you can try String Replace with RegEx. Something like:

After setting the category, you would just need to filter for the rows where the category differs from the item description; rows where the two are still equal matched nothing in the list.
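The replace-then-filter idea can be illustrated outside PDI with a small Python sketch. This is not a PDI transformation, only the logic the step would apply. The names `RULES`, `categorize` and `enrich` are hypothetical, and the `.*KEYWORD.*` pattern shape is an assumption about how the regex replacement could be configured, since the original example is not shown.

```python
import re

# Keyword -> category pairs, taken from the list file in the question.
RULES = [
    ("BANANA", "FRUIT"),
    ("APPLE", "FRUIT"),
    ("CAR", "TRANSPORT"),
    ("PLANE", "TRANSPORT"),
]

# Assumed pattern shape: if the keyword occurs anywhere in the description,
# the whole description is replaced by the category.
COMPILED = [(re.compile(".*" + re.escape(kw) + ".*"), cat) for kw, cat in RULES]


def categorize(description):
    """Return the category of the first matching keyword, or None."""
    category = description  # the category field starts as a copy
    for pattern, cat in COMPILED:
        replaced = pattern.sub(cat, category, count=1)
        if replaced != category:
            category = replaced
            break  # stop at the first keyword that matched
    # Filter step: an unchanged value means no keyword matched.
    return None if category == description else category


def enrich(rows):
    """Append the category to each (id, description) row when one is found."""
    out = []
    for row_id, description in rows:
        fields = [str(row_id), description]
        category = categorize(description)
        if category is not None:
            fields.append(category)
        out.append("|".join(fields))
    return out
```

Every input row is kept, matched or not, which is the row count the question asks for. The loop still tries keywords one by one, so it shows the logic only and says nothing about how the PDI step performs on 1,000,000 rows against 300,000 keywords.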

I don't know how it will perform with so many records, though. I have only used this step with a few records so far.

EDIT: I've just seen that Join (Cartesian) has a REGEXP option. Maybe it's faster than CONTAINS (which I think you've been using?). That would be by far the better thing to set up.

Good luck!
