
I'm looking at writing one of our ETL (or ETL-like) processes in kiba and I wonder how to structure it. My main question is about the overall architecture. The process works roughly like this:

  1. Fetch data from an HTTP endpoint.
  2. For each item returned from that API, make one more HTTP call.
  3. Do some transformations on each of the items returned from step 2.
  4. Send each item somewhere else.

Now my question is: is it OK if only step one is a source and everything after that is a transform? Or would it be better to have each HTTP call be a source and then combine these somehow, maybe using multiple jobs?
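To make that concrete, here is a rough sketch of the first option as I picture it (ListingSource, DeliveryDestination and the URL are just placeholders, not real classes):

# rough sketch of the "single source + transforms" layout
# (ListingSource, DeliveryDestination and the URL are placeholders)
job = Kiba.parse do
  source ListingSource, url: "https://api.example.com/items"   # step 1: main HTTP fetch

  transform do |row|                                            # step 2: secondary HTTP call per item
    # make the per-item HTTP call here and merge the result into the row
    row
  end

  transform do |row|                                            # step 3: reshape the data
    row
  end

  destination DeliveryDestination                               # step 4: send each item somewhere else
end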

ujh

1 Answer


It is indeed best to use a single source, which you will use to fetch the main stream of the data.

General advice: try to work in batches as much as you can (e.g. pagination in the source, but also bulk HTTP lookups in step 2 if the API supports them).

Source section

The source in your case could be a paginating HTTP resource, for instance.

A first option to implement it would be to write a dedicated class, as explained in the documentation.
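To give a rough idea (the endpoint, JSON shape and pagination scheme below are made up), such a class only needs a constructor and an each method that yields rows:

require 'json'
require 'net/http'

class PaginatedHTTPSource
  def initialize(base_url:)
    @base_url = base_url
  end

  def each
    page = 1
    loop do
      # hypothetical pagination scheme: ?page=N returns a JSON array, empty when done
      body = Net::HTTP.get(URI("#{@base_url}?page=#{page}"))
      items = JSON.parse(body)
      break if items.empty?
      items.each { |item| yield item }
      page += 1
    end
  end
end

# then, in the job declaration
source PaginatedHTTPSource, base_url: "https://api.example.com/items"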

A second option is to use Kiba::Common::Sources::Enumerable (https://github.com/thbar/kiba-common#kibacommonsourcesenumerable) like this:

source Kiba::Common::Sources::Enumerable, -> {
  Enumerator.new do |y|
    # do your pagination & splitting here
    y << your_item
  end
}
# then
transform Kiba::Common::Transforms::EnumerableExploder
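To make that more concrete (reusing the same made-up endpoint and pagination scheme as above), the Enumerator body could yield whole pages (arrays of items), which EnumerableExploder then splits into individual rows:

require 'json'
require 'net/http'
require 'kiba-common/sources/enumerable'
require 'kiba-common/transforms/enumerable_exploder'

source Kiba::Common::Sources::Enumerable, -> {
  Enumerator.new do |y|
    page = 1
    loop do
      # hypothetical pagination scheme: ?page=N returns a JSON array, empty when done
      body = Net::HTTP.get(URI("https://api.example.com/items?page=#{page}"))
      items = JSON.parse(body)
      break if items.empty?
      y << items # yield the whole page as a single row
      page += 1
    end
  end
}
# each page (an array) is then exploded into one row per item
transform Kiba::Common::Transforms::EnumerableExploder

Yielding whole pages like this also keeps the door open for the bulk lookups mentioned further down.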

Join with secondary HTTP source

It can be done this way:

transform do |r|
  # here make secondary HTTP query
  result = my_query(...)
  # then merge the result
  r.merge(secondary_data: ...)
end

There is support for parallelisation of the queries in that step via Kiba Pro's ParallelTransform (https://github.com/thbar/kiba/wiki/Parallel-Transform):

parallel_transform(max_threads: 10) do |r|
  # this code will run in its own thread
  extra_data = get_extra_json_hash_from_http!(r.fetch(:extra_data_url))
  r.merge(extra_data: extra_data)
end

Note too that if you can structure your HTTP calls to handle N rows at a time (provided the HTTP backend is flexible enough), things will be even faster.
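As a sketch only (bulk_lookup is a hypothetical helper, and this assumes the backend accepts several ids per call), you could keep the rows as whole pages, do one bulk query per page, and only explode afterwards:

# sketch only: rows are still whole pages (arrays of items) at this point,
# and bulk_lookup is a hypothetical helper doing one HTTP call for many ids
transform do |page|
  ids = page.map { |item| item.fetch("id") }
  extra = bulk_lookup(ids) # e.g. returns a hash keyed by id
  page.map { |item| item.merge("extra" => extra[item.fetch("id")]) }
end
# only now split each enriched page into individual rows
transform Kiba::Common::Transforms::EnumerableExploder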

Step 3 does not need specific advice.

Send each item somewhere else

I would most likely implement a destination for that (though it could also be implemented as a transform, and still parallelized with parallel_transform if needed).
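As a rough sketch (the delivery endpoint is made up), a destination is simply a class with a write method and an optional close:

require 'json'
require 'net/http'

class HTTPDestination
  def initialize(url:)
    @uri = URI(url)
  end

  def write(row)
    # one POST per row; batch here too if the target system allows it
    Net::HTTP.post(@uri, JSON.generate(row), "Content-Type" => "application/json")
  end

  def close
    # nothing to flush in this simple version
  end
end

# then, in the job declaration
destination HTTPDestination, url: "https://example.com/ingest"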

Thibaut Barrère
  • The issue I have is that the second HTTP request depends on data from the first one. I ended up putting the second one into a transform as it sort of transforms the data from the source. – ujh Mar 12 '21 at 15:08
  • The other thing I've been thinking of was using Sidekiq as a destination and just chaining various jobs like that, but that would be a bit too involved for a first try. – ujh Mar 12 '21 at 15:10
  • 1
    I do not recommend putting Sidekiq directly as a destination, unless you really have to, because it will make it harder for you to keep an eye on the outcome of the jobs. At least initially, I recommend working as a single job. – Thibaut Barrère Mar 14 '21 at 17:05