My processing has a "condense" step before the data needs any further processing:
Source: Raw event/analytics logs of various users.
Transform: Insert each row into a hash, keyed by UserID.
Destination / Output: An in-memory hash like:
{
  "user1" => [event, event, ...],
  "user2" => [event, event, ...]
}
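To make the condense semantics concrete, here is a plain-Ruby equivalent outside of Kiba entirely (the event rows are made up for illustration):

events = [
  { user_id: "user1", type: "click" },
  { user_id: "user1", type: "view" },
  { user_id: "user2", type: "view" }
]

# Append each raw event to its user's bucket.
users = events.each_with_object(Hash.new { |h, k| h[k] = [] }) do |event, hash|
  hash[event[:user_id]] << event
end

# users == { "user1" => [<2 events>], "user2" => [<1 event>] }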
Now, I've got no need to store these user groups anywhere; I'd just like to carry on processing them. Is there a common pattern with Kiba for using an intermediate destination? E.g.
# First pass
@users = Hash.new

source EventSource # 10,000 rows of single events
transform { |row| insert_into_user_hash(row) }
destination UserDestination, users: @users

# Second pass
source UserSource, users: @users # 100 rows of grouped events, created in the previous step
transform { |row| analyse_user(row) }
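For what it's worth, the pieces named above could plausibly follow Kiba's documented source/destination contracts (a source exposes each and yields rows; a destination exposes write(row)). A minimal sketch, where UserDestination and UserSource are the hypothetical classes from my pseudocode:

# Hypothetical destination: buckets each incoming event by user.
class UserDestination
  def initialize(users:)
    @users = users
  end

  def write(row)
    (@users[row[:user_id]] ||= []) << row
  end
end

# Hypothetical source: yields one grouped row per user.
class UserSource
  def initialize(users:)
    @users = users
  end

  def each
    @users.each do |user_id, events|
      yield(user_id: user_id, events: events)
    end
  end
end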
I've been digging around the code, and it appears that all transforms in a file are applied to the source, so I was wondering how other people have approached this, if at all. I could save to an intermediate store and run another ETL script, but I was hoping for a cleaner way, since we're planning lots of these "condense" steps.
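One way I can imagine keeping it in a single script, with no intermediate store, is to define the two passes as separate jobs and run them back to back in the same process via Kiba.parse / Kiba.run, sharing the hash between them. This is only a sketch under that assumption, reusing the hypothetical classes above:

require 'kiba'

users = {}

condense = Kiba.parse do
  source EventSource                        # 10,000 single-event rows
  destination UserDestination, users: users
end

analyse = Kiba.parse do
  source UserSource, users: users           # ~100 grouped rows from the first pass
  transform { |row| analyse_user(row) }     # analyse_user is hypothetical
end

Kiba.run(condense) # fills the in-memory users hash
Kiba.run(analyse)  # second pass consumes it directly

Each "condense" step would then get its own small job definition while everything stays in memory, which seems closer to what I'm after than writing out an intermediate store. Is there a more idiomatic pattern than this?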