
I've got a long-running pipeline with some failing items: items that, at the end of the process, are not loaded because they fail database validation or something similar.

I want to rerun the pipeline, but only process the items that failed the import on the last run.

I have a system in place where I check each item ID (received from the external source). I do this check in my loader: if I already have that item ID in the database, I skip loading/inserting that item.

This works great. However, it's slow, since I run the full extract-transform-load for each of these items, and only then, at the load step, do I query the database (one query per item) and compare item IDs.
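
For illustration, a stripped-down version of that per-item check in a Kiba destination might look like this (placeholder names; the real loader is more involved):

```rb
# Hypothetical destination doing the per-row existence check.
# Items/:item_id are placeholders for the actual model/column.
class ItemDestination
  def write(row)
    # One SELECT per row -- this is what makes the run slow.
    return if Items.exists?(item_id: row[:item_id])

    Items.create!(row)
  end
end
```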

I'd like to filter out these records sooner. If I do it in the transformer, I can still only do it per item. It looks like the extractor could be the place, or I could pass records to the transformer in batches and then filter + explode the items in the (first) transformer.

What would be the better approach here?

I'm also thinking about the reusability of my extractor, but I guess I could live with the fact that one extractor does both extracting and filtering. I think the best solution would be the ability to chain multiple extractors: one that extracts the data and another that filters it.

EDIT: Maybe I could do something like this:

already_imported_item_ids = Items.pluck(:item_id)

Kiba.run(
  Kiba.parse do
    source(...)

    transform do |item|
      # assuming the row exposes its ID under :item_id, matching the pluck above
      next if already_imported_item_ids.include?(item[:item_id])

      item
    end

    transform(...)
    destination(...)
  end
)

I guess that could work?

Viktor

1 Answer


A few hints:

  1. The higher (sooner) in the pipeline you filter, the better. If you can find a way to filter out right at the source, the cost will be lowest, because you do not have to manipulate the unwanted data at all.

  2. If your scale is small enough, you could load the full list of already-imported ids at the start in a pre_process block (mostly what you have in mind in your code sample), then compare right after the source (see the first sketch after this list). Obviously this doesn't scale infinitely, but it can work for a long time depending on your dataset size.

  3. If you need higher scale, I would advise either working with a buffering transform (grouping N rows) that issues a single SQL query to verify the existence of all N row ids in the target database (see the second sketch after this list), or working with groups of rows and then exploding them back into individual rows.
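
A minimal sketch of the second hint, reusing the placeholders from your sample (a Set only makes the membership checks faster than an Array):

```rb
require "set"

job = Kiba.parse do
  already_imported_item_ids = nil

  # Runs once, before the source is read
  pre_process do
    already_imported_item_ids = Items.pluck(:item_id).to_set
  end

  source(...)

  # First transform, right after the source: drop already-imported rows early
  transform do |item|
    next if already_imported_item_ids.include?(item[:item_id])

    item
  end

  transform(...)
  destination(...)
end

Kiba.run(job)
```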
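
To make the third hint concrete, here is the core of what such a buffering transform would do with each group of N rows (again with the Items/:item_id placeholders): a single query fetching the ids that already exist, then rejecting the matching rows. How the buffering itself is wired (accumulating N rows, flushing them, and handling the last partial batch when the transform is closed) depends on your Kiba version, so check the Kiba documentation on buffering/aggregating transforms for the exact plumbing.

```rb
require "set"

# Core of the batch check: one SQL query for a whole group of buffered rows,
# instead of one query per row. `rows` is the array of rows the transform buffered.
def reject_already_imported(rows)
  ids = rows.map { |row| row[:item_id] }

  # Single query: which of these ids already exist in the target table?
  existing_ids = Items.where(item_id: ids).pluck(:item_id).to_set

  rows.reject { |row| existing_ids.include?(row[:item_id]) }
end
```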

Thibaut Barrère
  • We ended up "inlining" the filtering within the Kiba parse block:

    ```rb
    already_imported_item_ids = Document.pluck(:external_item_id)

    transform do |item|
      next if already_imported_item_ids.include?(item['ID'])

      item
    end
    ```
    – Viktor Jul 04 '20 at 11:35