It is indeed best to use a single source, which you will use to fetch the main stream of the data.
General advice: try to work in batches as much as you can (e.g. pagination in the source, but also bulk HTTP lookups in step 2 if the API supports them).
Source section
The source in your case could be a paginating HTTP resource, for instance.
A first option to implement it would be to write a dedicated class, as explained in the documentation.
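For illustration, here is a minimal sketch of such a class, assuming a hypothetical JSON API paginated via a page parameter that returns an empty array past the last page (the URL scheme and payload shape are assumptions):

require 'net/http'
require 'json'

class PaginatedHTTPSource
  def initialize(base_url:)
    @base_url = base_url
  end

  # Kiba calls each and expects one yield per row
  def each
    page = 1
    loop do
      body = Net::HTTP.get(URI("#{@base_url}?page=#{page}"))
      items = JSON.parse(body)
      break if items.empty?
      items.each { |item| yield(item) }
      page += 1
    end
  end
end

# in the job:
# source PaginatedHTTPSource, base_url: "https://example.com/api/items"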
A second option is to use Kiba::Common::Sources::Enumerable
(https://github.com/thbar/kiba-common#kibacommonsourcesenumerable) like this:
source Kiba::Common::Sources::Enumerable, -> {
  Enumerator.new do |y|
    # do your pagination & splitting here
    y << your_item
  end
}

# then
transform Kiba::Common::Transforms::EnumerableExploder
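To make that snippet concrete, here is a sketch under the same assumptions as above (a hypothetical page-based JSON API), this time yielding one whole page per iteration and letting the EnumerableExploder split each page into rows:

require 'net/http'
require 'json'

source Kiba::Common::Sources::Enumerable, -> {
  Enumerator.new do |y|
    page = 1
    loop do
      items = JSON.parse(Net::HTTP.get(URI("https://example.com/api/items?page=#{page}")))
      break if items.empty?
      y << items # one row per page here
      page += 1
    end
  end
}
# the EnumerableExploder above then turns each page into N individual rows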
Join with secondary HTTP source
It can be done this way:
transform do |r|
  # here make secondary HTTP query
  result = my_query(...)
  # then merge the result
  r.merge(secondary_data: result)
end
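As a concrete (non-parallel) sketch, assuming each row carries a :detail_url key pointing at a JSON document (both the key and the payload shape are assumptions):

require 'net/http'
require 'json'

transform do |r|
  # one secondary HTTP query per row
  response = Net::HTTP.get(URI(r.fetch(:detail_url)))
  r.merge(secondary_data: JSON.parse(response))
end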
There is support for parallelisation of the queries in that step via Kiba Pro's ParallelTransform
(https://github.com/thbar/kiba/wiki/Parallel-Transform):
parallel_transform(max_threads: 10) do |r|
  # this code will run in its own thread
  extra_data = get_extra_json_hash_from_http!(r.fetch(:extra_data_url))
  r.merge(extra_data: extra_data)
end
Note too that if you can structure your HTTP calls to handle N rows at a time (provided the HTTP backend is flexible enough), things will be even faster.
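For instance, if the source yields whole pages (as in the Enumerable sketch above), you can enrich a full page with a single call before exploding it. Here bulk_lookup and the :id key are assumptions about your backend and row format:

# here each "row" is still a full page (an array of items)
transform do |rows|
  ids = rows.map { |row| row.fetch(:id) }
  # hypothetical helper: one HTTP call returning a hash keyed by id
  extra = bulk_lookup(ids)
  rows.map { |row| row.merge(extra_data: extra.fetch(row.fetch(:id))) }
end

# then split pages into individual rows, as earlier
transform Kiba::Common::Transforms::EnumerableExploder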
Step 3 does not need specific advice.
Send each item somewhere else
I would most likely implement a destination for that (although it could also be implemented as a transform, and still parallelised with parallel_transform if needed).
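A minimal sketch of such a destination, assuming each row should be POSTed as JSON to some endpoint (the URL and payload shape are assumptions):

require 'net/http'
require 'json'

class HTTPDestination
  def initialize(url:)
    @uri = URI(url)
  end

  # Kiba calls write once per row
  def write(row)
    Net::HTTP.post(@uri, JSON.generate(row), 'Content-Type' => 'application/json')
  end

  def close
    # free any resources kept open here
  end
end

# in the job:
# destination HTTPDestination, url: "https://example.com/items"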