0

I have a sources table that has ID and Source(varchar)

1 Facebook
2 Twitter
3 Google

I have incoming data that has Source(varchar) and Views(Int)

Facebook 10
Twitter 12
Reddit 14

I want the kettle job to do this:

  1. Check if the source exists in the source table, and if so replace add a field of type INT called sourceID with the respective ID from source
  2. If it doesn't exist add it to the source table.

E.g. from the above data the result should be this

sourceID,Views
1,10
2,12
4,14 (Reddit wasn't in the table so it created it and the autoincrement gives it ID 4).

I am having trouble finding the right steps to achieve this

Killerpixler
  • 4,200
  • 11
  • 42
  • 82

2 Answers2

2

My answer uses two transformations: in the first steps an intermediate result is created by joining the sources and the keys using the match key source. Also the maximum source id is determined and stored in a variable. Note the two sort elements and the merge element can be replaced by a single "merge in memory" element.

First transformation: merge sources and views

In the second transformation the intermediate result is read and depending on whether there is a existing source either an update is performed on existing data or a new entry is inserted in to the sources table. The source id of the new entry is computed as a sum of the maximum id retrieved from the variable and a sequence starting in one.

Second transformation: insert and update

Marcus Rickert
  • 4,138
  • 3
  • 24
  • 29
0

I recommend using the Combination Lookup step.

You specify that source_name is the key to match, source_id is the key to retrieve. If the source exists in the table the key is returned, if not, it's inserted in the table and the new key is returned.

You can specify whether you want to use a db counter, an auto_increment field or if you want to use tableMax+1 as your new key.

It allows you to cache the whole table, which can improve performance tremendously.

nsousa
  • 4,448
  • 1
  • 10
  • 15