Best way to cache data to use among different transformations

Question

I have a transformation that reads millions of ids from a mega data store.

I would like to somehow store those ids in a list or hashmap.

I have about a dozen other transformations. Each of those transformations gets input data (ids) from other distinct child data stores.

What I would like to do, in a User Defined Java Class (UDJC) step, is check each id as it arrives from the child data stores to see whether it is already in the mega id list.

For performance reasons, I can't query the mega store from every transformation.

How can I create/incorporate a list of mega ids that I can use in my subsequent UDJCs?

Thanks

Are you sure you are not duplicating the `Unique rows (HashSet)` functionality? – AlainD, Jul 13 '17 at 08:30

score 0 · Answer 1 · answered Jul 17 '17 at 18:41

I ended up serializing the ids to a file in one transformation and de-serializing the file in subsequent transformations.
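A minimal sketch of that approach, not the exact code used: it assumes the ids fit in memory as `Long` values and uses plain Java serialization. The class name `MegaIdCache` and its `save`/`load` methods are hypothetical names chosen for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: persists a set of ids to a file and reads it back.
public class MegaIdCache {

    // Write the ids to disk. Intended to run once, in the transformation
    // that reads the mega data store.
    public static void save(Set<Long> ids, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            // Copy into a HashSet so the stored type is known when loading.
            out.writeObject(new HashSet<>(ids));
        }
    }

    // Read the ids back. Intended to run once per subsequent transformation,
    // after which membership checks are in-memory HashSet lookups.
    @SuppressWarnings("unchecked")
    public static Set<Long> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (HashSet<Long>) in.readObject();
        }
    }
}
```

In each later transformation, the set would be loaded once (for example when the first row arrives) and then each incoming child id checked with `cached.contains(id)`, so the mega store is never queried again.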

eych

At the risk of being repetitive, are you sure you are not duplicating the functionality of the `Unique rows` step? If the number of ids is really large, precede the `Unique rows` step with a `Sort rows` step that stores values in temp files (it is the same strategy as yours, but it is hard to beat Kettle at that). Also, if you need to build one id by concatenating multiple keys, use the `Combination lookup/update` step. – AlainD Jul 24 '17 at 16:09
