I've been playing with Ruby's EventMachine for some time now, and I think I understand its basics.

However, I am not sure how to read a large file (120 MB) efficiently. My goal is to read the file line by line and write every line into a Cassandra database (the same should work with MySQL, PostgreSQL, MongoDB, etc.; the Cassandra client supports EM explicitly). The simple snippet below blocks the reactor, right?

require 'rubygems'
require 'cassandra'
require 'thrift_client/event_machine'

EM.run do
  Fiber.new do
    rm = Cassandra.new('RankMetrics', "127.0.0.1:9160", :transport => Thrift::EventMachineTransport, :transport_wrapper => nil)
    rm.clear_keyspace!
    begin
      file = File.new("us_100000.txt", "r")
      while (line = file.gets)
        domain = line.chomp  # drop the trailing newline before storing
        rm.insert(:Domains, domain.downcase, {'domain' => domain})
      end
    rescue => err
      puts "Exception: #{err}"
    ensure
      file.close if file  # close the file even if an insert raised
    end
    EM.stop
  end.resume
end

But what's the right way to read a file asynchronously?

halfer
ctp
  • possible duplicate of [What is the best way to read files in an EventMachine-based app?](http://stackoverflow.com/questions/2749503/what-is-the-best-way-to-read-files-in-an-eventmachine-based-app) – Theo Oct 14 '11 at 18:46

2 Answers

EventMachine has no asynchronous file I/O support. The best way to achieve what you're trying to do is to read a couple of lines on each tick and send them off to the database. The most important thing is not to read chunks that are too large, since that would block the reactor.

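EM.run do
  io = File.open('path/to/file')
  read_chunk = proc do
    # Read up to ten lines; fewer if the file ends first
    lines = []
    while lines.size < 10 && (line = io.gets)
      lines << line
    end
    if lines.empty?
      # End of file and no DB calls outstanding
      io.close
      EM.stop
    else
      pending = lines.size
      lines.each do |l|
        # send_to_db is a placeholder for your asynchronous DB driver call
        send_to_db(l) do
          # when the DB call is done
          pending -= 1
          EM.next_tick(read_chunk) if pending == 0
        end
      end
    end
  end
  EM.next_tick(read_chunk)
end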

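See "What is the best way to read files in an EventMachine-based app?" (http://stackoverflow.com/questions/2749503/what-is-the-best-way-to-read-files-in-an-eventmachine-based-app)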
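To make the chunk-per-tick idea easier to try in isolation, here is a minimal sketch in plain Ruby. The names `read_lines` and `pump` are made up for this illustration, and `schedule` is an assumption: any callable that defers a proc to a later reactor iteration.

```ruby
require 'stringio'

# Read at most `limit` lines from `io`; returns [] at end of file.
def read_lines(io, limit)
  lines = []
  while lines.size < limit && (line = io.gets)
    lines << line.chomp
  end
  lines
end

# Hypothetical driver loop. `schedule` defers a proc to a later tick,
# `on_eof` runs once the file is exhausted, and the block receives each
# chunk plus a `resume` proc that it must call when its writes are done.
def pump(io, limit, schedule, on_eof, &on_chunk)
  step = proc do
    chunk = read_lines(io, limit)
    if chunk.empty?
      on_eof.call
    else
      on_chunk.call(chunk, proc { schedule.call(step) })
    end
  end
  schedule.call(step)
end
```

Inside `EM.run` you would presumably pass `EM.method(:next_tick)` as `schedule` and `-> { EM.stop }` as `on_eof`, and call `resume` from the DB driver's completion callback. Because the next chunk is only read after `resume`, at most one chunk is in flight at a time.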

Theo
  • Many thanks for your prompt response ;-) I didn't know that there's no way to do async filesystem I/O, so thanks for the hint. I tried your snippet: http://pastie.org/2696497. But I got a new error in this case: http://pastie.org/2696500. The Cassandra client is EM-aware. – ctp Oct 14 '11 at 19:06
  • I have no idea what causes that error, but I changed two details about my example: it needs to wait for the DB call to return (yield, actually) before reading the next chunk, and also wait for all the ten lines... – Theo Oct 14 '11 at 19:22
  • Hm, still some trouble with the async reading of files. It seems the send_to_db call blocks the reactor. Maybe that's what you mean by "wait for the DB call to return"? Do you have a code snippet showing exactly what you changed? – ctp Oct 21 '11 at 10:37
  • You must replace `send_to_db` with a call to your DB driver. If your DB driver is asynchronous, it must have some way to specify a callback; in that callback, do what the block in my example does. – Theo Oct 21 '11 at 14:17
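As a rough illustration of the callback advice above, the following sketch counts completion callbacks and fires a block once a whole chunk has been written. `Countdown`, `FakeDriver`, and `send_chunk` are hypothetical names invented for this example; a real asynchronous driver will have its own way of registering a callback.

```ruby
# Fires `on_zero` once `tick` has been called `count` times.
# Note: with a count of 0 the block never fires.
class Countdown
  def initialize(count, &on_zero)
    @remaining = count
    @on_zero = on_zero
  end

  def tick
    @remaining -= 1
    @on_zero.call if @remaining.zero?
  end
end

# Hypothetical stand-in for an asynchronous driver: it only records the
# completion callbacks so the example can fire them later by hand.
class FakeDriver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  def insert(_line, &on_done)
    @callbacks << on_done
  end
end

# Send every line, then run the block after the last insert completes.
def send_chunk(driver, lines, &on_chunk_done)
  latch = Countdown.new(lines.size, &on_chunk_done)
  lines.each { |line| driver.insert(line) { latch.tick } }
end
```

In the chunked reader, the block given to `send_chunk` would be the place to schedule the next read.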

If you haven't already, you might take a look at EM::FileStreamer. For one thing, FileStreamer uses a C++ based 'fast file reader'. Couldn't you stream the file over a local socket/pipe and handle the sending to db in a separate process that's listening on the other end?
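One thing to keep in mind with the streaming approach: the process on the other end receives raw chunks that need not end on line boundaries, so it has to reassemble lines itself. EventMachine ships `EM::Protocols::LineText2` for this purpose; the plain-Ruby sketch below only illustrates the idea, and `LineBuffer` is a made-up name.

```ruby
# Collects arbitrary chunks of data and hands back complete lines.
class LineBuffer
  def initialize
    @pending = String.new
  end

  # Feed one chunk; returns the lines completed by it (without newlines).
  def feed(data)
    @pending << data
    lines = []
    while (index = @pending.index("\n"))
      # slice! removes the line, including its newline, from the buffer
      lines << @pending.slice!(0..index).chomp
    end
    lines
  end

  # Returns whatever is left when the stream ends without a final newline.
  def flush
    rest = @pending
    @pending = String.new
    rest.empty? ? [] : [rest]
  end
end
```

A receiver would call `feed` from its data callback (`receive_data` in an EM connection), write each returned line to the database, and call `flush` when the connection closes.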

There is also a non-Fiber-based example of handling synchronous DB connections gracefully in EM::ThreadedResource, in case that's helpful; it specifically mentions Cassandra. It sounds like your Cassandra library is Fiber-based, though.
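The general pattern behind that approach is a dedicated thread that owns the blocking client and takes work from a queue. The sketch below shows this in plain Ruby; it is not the `EM::ThreadedResource` API, and `BlockingWorker` is a hypothetical name.

```ruby
require 'thread'

# Owns a blocking resource on its own thread and runs queued jobs on it.
class BlockingWorker
  def initialize(resource)
    @jobs = Queue.new
    @thread = Thread.new do
      # A nil job is the shutdown signal
      while (job = @jobs.pop)
        work, on_done = job
        on_done.call(work.call(resource))
      end
    end
  end

  # Queue a block to run against the resource; `on_done` receives its result.
  # Note: `on_done` runs on the worker thread, not on the caller's thread.
  def dispatch(on_done, &work)
    @jobs << [work, on_done]
  end

  # Finish the queued jobs, then stop the thread.
  def shutdown
    @jobs << nil
    @thread.join
  end
end
```

In an EventMachine app, `on_done` would typically hand the result back to the reactor thread, for example with `EM.next_tick`, instead of touching reactor state directly. This sketch does not handle exceptions raised by a job.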

Eric G