Transpose CSV rows and columns during ETL process using Kiba (or plain Ruby)

Question

A third party system produces an HTML table of parent teacher bookings:

 Blocks    Teacher 1   Teacher 2   Teacher 3
3:00 pm      Stu A       Stu B
3:10 pm      Stu B                   Stu C
...
5:50 pm      Stu D       Stu A       Stu E

The number of columns changes depending on how many teachers have bookings. The number of rows changes depending on how many slots we create.

The end result needs to be a hash for each teacher like:

{ name: "Teacher 1", email: "teacher.1@school.edu", appointments: [
  { start: "15:00", end: "15:08", attendees: [
    { name: "Stu A Parent 1", email: "stuap1@example.com" },
    { name: "Stu A Parent 2", email: "stuap2@example.com" }
  ] },
  { start: "15:10", end: "15:18", attendees: [
    { name: "Stu B Parent", email: "stubp@example.com" }
  ] },
  ...
  { start: "17:50", end: "17:58", attendees: [
    { name: "Stu D Parent 1", email: "studp1@example.com" },
    { name: "Stu D Parent 2", email: "studp2@example.com" }
  ] },
] },

I think it makes the most sense to process each teacher as a row in the ETL, so this time I transposed the rows and columns in Numbers and saved the result as a CSV:

Blocks,3:00 pm,3:10 pm,...,5:50 pm
Teacher 1,Stu A,Stu B,...,Stu D
Teacher 2,Stu B,,...,Stu A
Teacher 3,,Stu C,...,Stu E

I'm trying to make the whole process as simple as possible for the office staff. Is it possible to transpose the rows and columns in Kiba (or plain Ruby) instead? In Kiba, I assume I'd have to process all the rows, accumulating a hash per teacher, and then output each teacher's hash at the end?

When you give an example it's best to make it complete (no ...., for example), so that those giving answers can demonstrate the result their suggested code provides for the example. Here you need only three or four rows. — Cary Swoveland, Mar 10 '21 at 06:59

score 2 · Accepted Answer · edited Mar 10 '21 at 08:52

Kiba author here!

I see at least two ways of doing this (whether you work with plain Ruby or with Kiba):

1. Convert your HTML to a table, then work from that data.
2. Work directly with the HTML table (using Nokogiri and selectors), which is applicable only if the HTML is mostly clean.

In all cases, because you are doing some scraping, I recommend writing very defensive code (the HTML can change, and can contain bugs or corner cases later), e.g. strong assertions that the lines and columns contain what you expect, verifications, etc.
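As a rough illustration of the kind of assertion meant here, the sketch below checks an already-extracted table (an array of rows) before any further processing. The helper name `validate_table!` and the specific rules (first header cell is "Blocks", every row is as wide as the header, time labels look like "3:00 pm") are assumptions made for the example, not something prescribed by Kiba or Nokogiri.

```ruby
# Hypothetical helper: raises if the scraped table does not have the expected shape.
def validate_table!(rows)
  raise ArgumentError, "table is empty" if rows.empty?

  header, *body = rows
  unless header.first == "Blocks"
    raise ArgumentError, "unexpected first header: #{header.first.inspect}"
  end
  raise ArgumentError, "no teacher columns found" if header.size < 2

  body.each_with_index do |row, i|
    # Every row must have exactly one cell per header column.
    unless row.size == header.size
      raise ArgumentError, "row #{i + 1} has #{row.size} cells, expected #{header.size}"
    end
    # Assumed label format: "3:00 pm", "11:50 am", etc.
    unless row.first.match?(/\A\d{1,2}:\d{2} (am|pm)\z/i)
      raise ArgumentError, "row #{i + 1} has an unexpected time label: #{row.first.inspect}"
    end
  end

  rows
end

rows = [
  ["Blocks",  "Teacher 1", "Teacher 2", "Teacher 3"],
  ["3:00 pm", "Stu A",     "Stu B",     ""],
  ["3:10 pm", "Stu B",     "",          "Stu C"]
]
validate_table!(rows)
```

Failing loudly at this point is usually cheaper than discovering a shifted column after the appointments have been sent out.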

If you go with plain Ruby, you could do something like this (here modeling your data as comma-separated text to keep things clear):

task :default do
  data = <<DOC
  Blocks  ,  Teacher 1  , Teacher 2  , Teacher 3
  3:00 pm  ,    Stu A   ,    Stu B   ,          
  3:10 pm   ,   Stu B   ,            ,    Stu C
DOC
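  # A negative limit keeps trailing empty cells, so every row has the same width for transpose
  data = data.split("\n").map { |line| line.split(",", -1).map(&:strip) }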
  blocks, *teachers = data.transpose
  teachers.each do |teacher|
    pp blocks.zip(teacher)
  end
end

This will output:

[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]
[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]
[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]

This gives you something you can massage into the structure you expect (but again: be very defensive and put assertions on all the data, including the number of cells in each row, or you'll get off-by-one errors, incorrect schedules, etc.).
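One possible way to do that massaging is sketched below. It takes the pairs printed above for a single teacher and builds an appointments list with 24-hour start and end times. The helper name `appointments_for` is hypothetical, and the 8-minute slot length is an assumption taken from the end times in the question's example. Looking up teacher and parent emails is left out.

```ruby
require "time"

# Hypothetical helper: turns [["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ...]
# into { name:, appointments: [...] }, skipping slots with no booking.
def appointments_for(pairs, slot_minutes: 8)
  (_, teacher), *slots = pairs

  appointments = slots.reject { |_, student| student.empty? }.map do |label, student|
    # "%I:%M %p" parses 12-hour labels such as "3:00 pm"
    start = Time.strptime(label, "%I:%M %p")
    { start: start.strftime("%H:%M"),
      end: (start + slot_minutes * 60).strftime("%H:%M"),
      student: student }
  end

  { name: teacher, appointments: appointments }
end

pairs = [["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]
pp appointments_for(pairs)
```

For the sample pairs this yields two appointments for Teacher 1: 15:00 to 15:08 for Stu A and 15:10 to 15:18 for Stu B.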

If you want to use Kiba and CSS selectors, you could do something like this:

task :default do
  html = <<HTML
    <table>
      <tr>
        <th>Blocks</th>
        <th>Teacher 1</th>
        <th>Teacher 2</th>
        <th>Teacher 3</th>
      </tr>
      <tr>
        <td>3:00 pm</td>
        <td>Stu A</td>
        <td>Stu B</td>
        <td></td>
      </tr>
      <tr>
        <td>3:10 pm</td>
        <td>Stu B</td>
        <td></td>
        <td>Stu C</td>
      </tr>
    </table>
HTML
  require 'nokogiri'
  require 'kiba'
  require 'kiba-common/sources/enumerable'
  require 'kiba-common/transforms/enumerable_exploder'
  Kiba.run do
    # just one doc here, but we could have a sequence instead
    source Kiba::Common::Sources::Enumerable, -> { [html] }

    transform { |r| Nokogiri::HTML(r) }

    transform do |doc|
      Enumerator.new do |y|
        blocks, *teachers = doc.search("table tr:first th").map(&:text)
        # you'd have to add more defensive checks here!!! important!
        teachers.each_with_index do |t, i|
          headers = doc.search("table>tr>:nth-child(1)").map(&:text)
          data = doc.search("table>tr>:nth-child(#{i + 2})").map(&:text)
          y << { teacher: t, data: headers.zip(data) }
        end
      end
    end

    transform Kiba::Common::Transforms::EnumerableExploder

    transform { |r| pp r }
  end
end

Which would give:

{:teacher=>"Teacher 1",
 :data=>[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]}
{:teacher=>"Teacher 2",
 :data=>[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]}
{:teacher=>"Teacher 3",
 :data=>[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]}

I think I would prefer a blend of the two methods: first convert the HTML to a proper CSV file or in-memory table, then transpose from there in a second step.
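The second step of that blend could look roughly like the sketch below, which uses Ruby's standard `csv` library to read a CSV version of the original (untransposed) table and transpose it in memory. The inline `csv_text` is a stand-in for whatever the first step writes out; with a real file, `CSV.read(path)` would take the place of `CSV.parse`.

```ruby
require "csv"

# Stand-in for the CSV produced from the HTML table
# (assumed layout: one row per time block, one column per teacher).
csv_text = <<~TABLE
  Blocks,Teacher 1,Teacher 2,Teacher 3
  3:00 pm,Stu A,Stu B,
  3:10 pm,Stu B,,Stu C
TABLE

# CSV.parse returns nil for empty cells, so normalise them to "" first.
rows = CSV.parse(csv_text).map { |row| row.map { |cell| cell.to_s.strip } }

# Array#transpose raises IndexError on ragged rows, which doubles as a basic sanity check.
blocks, *teachers = rows.transpose

# Map each teacher name to [time label, student] pairs, dropping the "Blocks" header cell.
by_teacher = teachers.to_h do |name, *students|
  [name, blocks.drop(1).zip(students)]
end

pp by_teacher
```

Unlike a plain `split(",")`, the CSV parser keeps trailing empty cells and handles quoted values, which matters if a student name ever contains a comma.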

score 1 · Answer 2 · answered Mar 10 '21 at 09:16

Suppose we are given the following schedule.

schedule = <<~END
Blocks,15:00,15:10,15:55
Teacher 1,Stu A,Stu B,Stu C
Teacher 2,Stu B,Stu C,Stu A
Teacher 3,Stu C,Stu A,Stu B
END

To produce the desired array of hashes we need additional information. Suppose we are also given the following.

teacher_emails = {
  "Teacher 1"=>"teacher.1@school.edu",
  "Teacher 2"=>"teacher.2@school.edu",
  "Teacher 3"=>"teacher.3@school.edu"
}

parent_emails = {
  "Stu A"=> { "Parent 1"=>"stuap1@example.com",
              "Parent 2"=>"stuap2@example.com" },
  "Stu B"=> { "Parent"=>"stubp@example.com" },
  "Stu C"=> { "Parent 1"=>"stuapc@example.com",
              "Parent 2"=>"stuapc@example.com" }
}

mins_per_meeting = 8

We may then proceed as follows.

blks, *sched = schedule.split(/\n/)
blks
  #=> "Blocks,15:00,15:10,15:55"
sched
  #=> ["Teacher 1,Stu A,Stu B,Stu C",
  #    "Teacher 2,Stu B,Stu C,Stu A",
  #    "Teacher 3,Stu C,Stu A,Stu B"]

time_blocks = blks.scan(/\d{1,2}:\d{2}/).map do |s|
  hr, min = s.split(':')
  mins_from_midnight = 60*(hr.to_i) + min.to_i
  { start: "%d:%02d" % mins_from_midnight.divmod(60),
    end: "%d:%02d" % (mins_from_midnight + mins_per_meeting).divmod(60) }
end
  #=> [{:start=>"15:00", :end=>"15:08"},
  #    {:start=>"15:10", :end=>"15:18"},
  #    {:start=>"15:55", :end=>"16:03"}]

sched.map do |s|
  teacher, *students = s.split(',')
  { name: teacher,
    email: teacher_emails[teacher],
    appointments: time_blocks.zip(students).map do |tb,stud|
      tb.merge(
        { student: stud,
          attendees: parent_emails[stud].map do |par_name, par_email|
            { name: par_name, email: par_email }
          end
        }
      )
    end
  }
end
  #=> [{:name=>"Teacher 1", :email=>"teacher.1@school.edu",
  #     :appointments=>[
  #       {:start=>"15:00", :end=>"15:08",
  #        :student=>"Stu A",
  #        :attendees=>[
  #          {:name=>"Parent 1", :email=>"stuap1@example.com"},
  #          {:name=>"Parent 2", :email=>"stuap2@example.com"}
  #        ]
  #       },
  #       {:start=>"15:10", :end=>"15:18",
  #        :student=>"Stu B",
  #        :attendees=>[
  #          {:name=>"Parent", :email=>"stubp@example.com"}
  #        ]
  #       },
  #       {:start=>"15:55", :end=>"16:03",
  #        :student=>"Stu C",
  #        :attendees=>[
  #          {:name=>"Parent 1", :email=>"stuapc@example.com"},
  #          {:name=>"Parent 2", :email=>"stuapc@example.com"}
  #        ]
  #       }
  #     ]
  #    },

  #    {:name=>"Teacher 2", :email=>"teacher.2@school.edu",
  #     :appointments=>[
  #       {:start=>"15:00", :end=>"15:08",
  #        :student=>"Stu B",
  #        :attendees=>[
  #          {:name=>"Parent", :email=>"stubp@example.com"}
  #        ]
  #       },
  #       ....
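The schedule above has a student in every slot, whereas the table in the question contains empty cells, for which `parent_emails[stud]` would return `nil`. One possible adjustment, sketched below with a trimmed-down copy of the data, is to split with a negative limit so trailing empty cells are kept, skip blank students, and use `Hash#fetch` so an unknown student raises instead of silently producing `nil`. The method name `appointments_for_row` is made up for this example.

```ruby
parent_emails = {
  "Stu A" => { "Parent 1" => "stuap1@example.com", "Parent 2" => "stuap2@example.com" },
  "Stu B" => { "Parent" => "stubp@example.com" }
}
time_blocks = [{ start: "15:00", end: "15:08" },
               { start: "15:10", end: "15:18" },
               { start: "15:55", end: "16:03" }]

# Hypothetical helper: builds the appointments for one "Teacher,Stu,Stu,..." row.
def appointments_for_row(row, time_blocks, parent_emails)
  # limit -1 keeps trailing empty cells, e.g. "Teacher 1,Stu A,,"
  teacher, *students = row.split(",", -1).map(&:strip)
  unless students.size == time_blocks.size
    raise ArgumentError, "#{teacher}: #{students.size} cells for #{time_blocks.size} blocks"
  end

  time_blocks.zip(students).reject { |_, stud| stud.empty? }.map do |tb, stud|
    # fetch raises KeyError for a student with no known parents
    attendees = parent_emails.fetch(stud).map { |name, email| { name: name, email: email } }
    tb.merge(student: stud, attendees: attendees)
  end
end

pp appointments_for_row("Teacher 2,Stu B,,Stu A", time_blocks, parent_emails)
```

Here the empty 15:10 slot is dropped, leaving appointments at 15:00 (Stu B) and 15:55 (Stu A).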
