0

I've got two data sets (csv files) which I bring into a Pentaho DI transformation using a "CSV file input" step. File A has a column joincol whose values appear in file B in column joincol. I want to left join the data in file A with the data in file B on the condition A.joincol = B.joincol to get the corresponding data from column wantcol from file B.

Pretty simple stuff.

As a (necessary) preliminary step I sort both datasets on the columns to be joined on (namely A.joincol and B.joincol) ascending.

I then use a "Merge Join" step with the sorted data from file A as the "First Step" and the sorted data from file B as the "Second Step", and select the "Join Type" to be LEFT OUTER. I select "Get key fields" under the "Keys for 1st step" section and specify that joincol is the only key field to join on, then do the same for "Keys for 2nd step".

So, the join partially works except:

  • rows are dropped, which should be absolutely impossible with a left outer join
  • the rows which are dropped are precisely those in the group of rows corresponding to the last value of joincol when it is sorted ascending.

I've confirmed that the values in file A's joincol all appear in file B's joincol by doing a vlookup in Excel. In any case, even if none of the values appeared in file B a left outer join should still leave me with at least all the rows I had going in.

I am new to Pentaho so I find it hard to believe that I stumbled upon such a crazy bug so early on in my exploration, which means I must be missing a step. However, I read the documentation for the merge join and believe I have followed all necessary steps (the ones I've laid out here). Can anyone let me know if I've missed something? Perhaps this is the result of a common error?

Thanks very much!

Rookatu
  • 1,487
  • 3
  • 21
  • 50
  • Have you confirmed that those last rows are present and passed by the CSV file input and Sort Rows steps? You can make the sort descending and then preview the step, your rows should be first. – Cyrus Jun 07 '17 at 09:50
  • Yeah, I'm not dealing with a lot of data, I manually inspected every row of the output of each step. They were there in both inputs to the join. – Rookatu Jun 08 '17 at 00:28
  • Please read & act on [mcve]. Your question is clear but more details are needed. Also, please explain where you are getting the "steps" from. (http://www.bidimensions.com/category/tutorial/?) PS If all the values of the left appear in the right, then a left join result is the same as an inner join result. So why are you left joining? – philipxy Jun 08 '17 at 05:17

1 Answers1

1

Joins is typical of SQL not of PDI.

With PDI, there is a main stream of data, and for each row you look for a correspondence in the other stream and return the required columns.

If you look for one value, for example the Customer of an Order, use the Lookup/Lookup Stream step. When you expect some values to be missing (LEFT JOIN), you can use a default. When you want only row with a correspondence (INNER JOIN), you can filter out the default you just put.

If you need to return many rows, for example all the Orders of a Customer, you may the Joins/Merge Join. From experience, it is usually easier to reverse the question.

In case you cannot, you are in the case of a FULL OUTER JOIN, you have to use Joins/Join Rows (Cartesian Product), in which you can add one or many conditions.

AlainD
  • 6,187
  • 3
  • 17
  • 31
  • Hi @AlainD, I appreciate the answer. My process is to usually use left joins so that, in case the lookup value is not present, I do not lose data. For rows where no lookup is found I will coalesce the lookup column value to some default and later inspect such rows as "kickouts". I believe this is standard practice in ETL development. I'm just not clear on WHY the left join doesn't work. Do you have any insights there? Thanks! – Rookatu Jul 02 '17 at 21:17
  • Hi @Rookatu, it is difficult to say without an example. My opinion is that the issue is in the data in the incoming flow, not the PDI step. And more probably in the sort order, for which you have lots of options: Case sensitive, Locale, Collator, and overall type. So check the CSV input is not in Lazy conversion mode, so that the data is converted from internal binary bytes into proper String, Numbers, Dates, etc, BEFORE they enter the sort step. – AlainD Jul 04 '17 at 07:24
  • Its the issue with the input data or the file which you considered on left side has less rows then the rows on the right side file. – karan arora Oct 16 '17 at 12:52