I've got two data sets (csv files) which I bring into a Pentaho DI transformation using a "CSV file input" step. File A has a column joincol
whose values appear in file B in column joincol
. I want to left join the data in file A with the data in file B on the condition A.joincol = B.joincol
to get the corresponding data from column wantcol
from file B.
Pretty simple stuff.
As a (necessary) preliminary step I sort both datasets on the columns to be joined on (namely A.joincol
and B.joincol
) ascending.
I then use a "Merge Join" step with the sorted data from file A as the "First Step" and the sorted data from file B as the "Second Step", and select the "Join Type" to be LEFT OUTER. I select "Get key fields" under the "Keys for 1st step" section and specify that joincol
is the only key field to join on, then do the same for "Keys for 2nd step".
So, the join partially works except:
- rows are dropped, which should be absolutely impossible with a left outer join
- the rows which are dropped are precisely those in the group of rows corresponding to the last value of
joincol
when it is sorted ascending.
I've confirmed that the values in file A's joincol
all appear in file B's joincol
by doing a vlookup in Excel. In any case, even if none of the values appeared in file B a left outer join should still leave me with at least all the rows I had going in.
I am new to Pentaho so I find it hard to believe that I stumbled upon such a crazy bug so early on in my exploration, which means I must be missing a step. However, I read the documentation for the merge join and believe I have followed all necessary steps (the ones I've laid out here). Can anyone let me know if I've missed something? Perhaps this is the result of a common error?
Thanks very much!