
I have a Kiba job that takes a CSV file (with Kiba::Common::Sources::CSV), enriches its data, merges some rows (with the ChainableAggregateDestination destination described here) and saves it all to another CSV file (with Kiba::Common::Destinations::CSV).
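
For context, a minimal sketch of such a job is shown below (input.csv, output.csv and the enrichment step are hypothetical placeholders; the merging destination is elided):

require 'kiba'
require 'kiba-common/sources/csv'
require 'kiba-common/destinations/csv'

job = Kiba.parse do
  source Kiba::Common::Sources::CSV,
    filename: 'input.csv',
    csv_options: { headers: true, header_converters: :symbol }

  # Convert CSV::Row objects into plain hashes
  transform { |row| row.to_hash }

  # Hypothetical enrichment step
  transform do |row|
    row[:enriched] = true
    row
  end

  # The row-merging step (ChainableAggregateDestination) would sit here

  destination Kiba::Common::Destinations::CSV, filename: 'output.csv'
end

Kiba.run(job)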

Now I want to sort the rows differently (based on the first column) in the destination CSV. I can't find a way to write a transform that does this. I could use post_process to reopen the destination CSV, sort it and rewrite it, but I suspect there is a cleaner way...

Can someone point me in the right direction?

Spone

1 Answer


To sort rows, a good strategy is to use an "aggregating transform", as explained in this article: store all the rows in memory (although you could also do it out of memory), then, at transform "close" time, sort them and re-emit them into the pipeline.

This is the most flexible design IMO.

class SortingTransform
  # One possible configuration: the key to sort on
  def initialize(sort_key:)
    @sort_key = sort_key
    @rows = []
  end

  def process(row)
    @rows << row
    nil # do not emit rows right away, buffer them instead
  end

  def close
    # Sort the buffered rows, using the configuration
    # passed at init time, then re-emit them all
    @rows.sort_by { |row| row[@sort_key] }.each do |row|
      yield row
    end
  end
end
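
Assuming rows are hashes and a keyword-style configuration (the sort_key parameter and the :first_column name are illustrative, not part of Kiba itself), you would then declare the transform in the pipeline like any other:

transform SortingTransform, sort_key: :first_column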

You could also indeed re-open the output and sort it, in a secondary ETL job, but the first solution usually has my preference if it can work for you.
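
For completeness, here is a sketch of that second approach as a post_process block (as the question suggests), using Ruby's standard CSV library; the output.csv name and sorting on the first column are assumptions:

require 'csv'

# At the end of the Kiba job definition:
post_process do
  table = CSV.read('output.csv', headers: true)
  sorted = table.sort_by { |row| row[0].to_s } # sort on the first column
  CSV.open('output.csv', 'w') do |csv|
    csv << table.headers
    sorted.each { |row| csv << row }
  end
end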

Thibaut Barrère