I need to check that:

  • the header line is present
  • the header contains a specific set of headers

What's the best place to do that? I have some possible solutions but don't know which is the most idiomatic one:

  • Check before running the full ETL, for example before the Kiba.parse block
  • Check in a pre_process block inside the ETL
  • Check in the ETL source. I tend to prefer this one since it will be more reusable (need to pass the mandatory fields as params)

Note that even though I can check in a transform block which fields are available on the row, this solution does not seem very efficient since it would run for each line.

Any hints appreciated.

djtal64

1 Answer

There are various & all quite idiomatic ways to achieve this:

At the source level (passing an array of headers)

You can use CSV without headers: true, which offers the opportunity to finely check the headers:

require 'csv'

class CSVSource
  def initialize(filename:, csv_options:, expected_headers:)
    @filename = filename
    @csv_options = csv_options
    @expected_headers = expected_headers
  end

  def each
    CSV.foreach(@filename, **@csv_options).with_index do |row, file_row_index|
      if file_row_index == 0
        check_headers!(actual: row.to_a, expected: @expected_headers)
        next # do not propagate the headers row
      else
        yield(Hash[@expected_headers.zip(row.to_a)])
      end
    end
  end

  def check_headers!(actual:, expected:)
    # verify uniqueness and presence - raise a clear message if needed
    raise "Duplicate headers found: #{actual}" unless actual.uniq.size == actual.size
    missing = expected - actual
    raise "Missing headers: #{missing}" unless missing.empty?
  end
end
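
Declared inside your Kiba.parse block, this could then look like the following (the filename, CSV options and header names are placeholders):

source CSVSource,
  filename: 'input.csv',
  csv_options: { col_sep: ';' },
  expected_headers: %w[id name email]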

At the source level (letting the caller define the behaviour using a lambda)

class CSVSource
  def initialize(filename:, csv_options:, after_headers_read_callback:)
    @filename = filename
    @csv_options = csv_options
    @after_headers_read_callback = after_headers_read_callback
  end

  def each
    CSV.foreach(@filename, **@csv_options).with_index do |row, file_row_index|
      if file_row_index == 0
        @after_headers_read_callback.call(row.to_a)
        next
      end
      # ... yield hash rows as in the previous example
    end
  end
end

The lambda will let the caller define their own checks, raise if needed, etc., which is better for reuse.
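
For instance, the caller could wire it up like this (a minimal sketch - the expected header names here are made up):

source CSVSource,
  filename: 'input.csv',
  csv_options: { col_sep: ';' },
  after_headers_read_callback: ->(headers) {
    missing = %w[id name] - headers
    raise "Missing headers: #{missing}" unless missing.empty?
  }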

At the transform level

If you want to further decouple the components (e.g. separate the headers handling from the fact that rows come from a CSV source), you can use a transform.

I commonly use this design, which allows for better reuse (here with a CSV source which will yield a bit of meta-data):

# To be used as a DSL extension (e.g. by extend-ing a module containing this
# method inside Kiba.parse); it expects the upstream source to yield rows
# shaped like { filename:, file_row_index:, row: }.
def transform_array_rows_to_hash_rows(after_headers_read_callback:)
  transform do |row|
    if row.fetch(:file_row_index) == 0
      @headers = row.fetch(:row)
      after_headers_read_callback.call(@headers)
      nil # returning nil dismisses the headers row
    else
      Hash[@headers.zip(row.fetch(:row))].merge(
        filename: row.fetch(:filename),
        file_row_index: row.fetch(:file_row_index)
      )
    end
  end
end
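
For context, the upstream source in that design yields array rows plus a bit of metadata. A minimal sketch of such a source could look like this (the class name is illustrative):

require 'csv'

class CSVArrayRowsSource
  def initialize(filename:, csv_options: {})
    @filename = filename
    @csv_options = csv_options
  end

  def each
    CSV.foreach(@filename, **@csv_options).with_index do |row, file_row_index|
      yield(filename: @filename, file_row_index: file_row_index, row: row.to_a)
    end
  end
end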

What's not recommended

In all cases, avoid doing any processing in Kiba.parse itself. It's a better design to ensure IO will only occur when you are calling Kiba.run (since it will be more future-proof and will support introspection features in later versions of Kiba).
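
To make that split concrete (filenames and headers are placeholders):

job = Kiba.parse do
  # declarations only - no IO should happen at this point
  source CSVSource,
    filename: 'input.csv',
    csv_options: {},
    expected_headers: %w[id name]
end

Kiba.run(job) # IO (including the headers check) only happens here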

Also, using pre_process isn't recommended (although it will work), because it will lead to a bit of duplication, etc.

Hope this helps, and let me know if this isn't clear!

Thibaut Barrère
  • Thanks a lot for the response. Will go down the lambda road on this one. A little bit off-topic, but cool idea to have a method that adds the `transform` block like in your `transform_array_rows_to_hash_rows` example – djtal64 Feb 18 '19 at 09:02
  • Thanks! Yes, this way to do things works well for "meta transforms" of sorts. I will share more examples of use in the future, because it's quite powerful! – Thibaut Barrère Feb 19 '19 at 10:28