11

I have two scripts that use Mechanize to fetch the Google index page. I assumed the EventMachine version would be faster than the one using Ruby threads, but it's not.

The EventMachine code takes: "0.24s user 0.08s system 2% cpu 12.682 total"

The Ruby thread code takes: "0.22s user 0.08s system 5% cpu 5.167 total"

Am I using EventMachine in the wrong way?

EventMachine:

require 'rubygems'
require 'mechanize'
require 'eventmachine'

trap("INT") {EM.stop}

EM.run do 
  num = 0
  operation = proc {
    agent = Mechanize.new
    sleep 1
    agent.get("http://google.com").body.to_s.size
  }
  callback = proc { |result|
    sleep 1
    puts result
    num+=1
    EM.stop if num == 9
  }

  10.times do 
    EventMachine.defer operation, callback
  end
end

Ruby Thread:

require 'rubygems'
require 'mechanize'


threads = []
10.times do 
  threads << Thread.new do 
    agent = Mechanize.new
    sleep 1
    puts agent.get("http://google.com").body.to_s.size
    sleep 1
  end
end


threads.each do |aThread| 
  aThread.join
end
the Tin Man
allenwei
  • What version and implementation of Ruby are you running? On implementations with a GIL (global interpreter lock), threads may not actually run fully concurrently. You might want to try running the example in JRuby or Rubinius to confirm your observed behavior – Jerry C. Apr 02 '12 at 22:57

4 Answers

26

All of the answers in this thread are missing one key point: your callbacks are being run inside the reactor thread instead of in a separate deferred thread. Running Mechanize requests in a defer call is the right way to keep from blocking the loop, but you have to be careful that your callback does not also block the loop.

When you run EM.defer operation, callback, the operation runs inside a thread from EventMachine's pool, which does the work, and then the callback is invoked inside the main loop. Therefore, the sleep 1 in operation runs in parallel, but the sleep 1 in callback runs serially, once per request. This accounts for most of the roughly 7.5-second gap between your two timings (12.682 s versus 5.167 s).

Here's a simplified version of the code you are running.

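require 'eventmachine'

EM.run {
  times = 0

  # Runs on a thread from EM's pool, so the ten sleeps overlap.
  work = proc { sleep 1 }

  # Runs on the reactor thread, so the ten sleeps happen one after another.
  callback = proc {
    sleep 1
    EM.stop if (times += 1) >= 10
  }

  10.times { EM.defer work, callback }
}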

This takes about 12 seconds, which is 1 second for the parallel sleeps, 10 seconds for the serial sleeps, and 1 second for overhead.
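
The same arithmetic can be reproduced without EventMachine. The following is a minimal sketch using only the standard library, assuming a simplified model in which work runs on its own threads and a single loop runs the callbacks one at a time. `run_model` and its parameters are hypothetical names, the sleeps are scaled down to 0.1 s, and this is not how EventMachine is implemented internally.

```ruby
# Simplified model of defer: work runs on separate threads, callbacks run
# one at a time on a single "reactor" loop. Not EventMachine's real code.
def run_model(jobs:, work_time:, callback_time:)
  results = Queue.new
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  # Each job gets its own worker thread, so the work sleeps overlap.
  workers = Array.new(jobs) do |i|
    Thread.new do
      sleep work_time
      results << i
    end
  end

  # The "reactor": pops finished results and runs the callbacks serially.
  handled = []
  jobs.times do
    value = results.pop
    sleep callback_time # a blocking callback stalls the whole loop
    handled << value
  end

  workers.each(&:join)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [handled, elapsed]
end

handled, elapsed = run_model(jobs: 10, work_time: 0.1, callback_time: 0.1)
puts format("handled %d callbacks in %.2fs", handled.size, elapsed)
```

With these values the run should take a little over one second: about 0.1 s of overlapped work plus ten serial 0.1 s callbacks.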

To run the callback code in parallel, you have to spawn new threads for it using a proxy callback that uses EM.defer like so:

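require 'eventmachine'

EM.run {
  times = 0

  work = proc { sleep 1 }

  callback = proc {
    sleep 1
    EM.stop if (times += 1) >= 10
  }

  # Runs on the reactor thread but returns at once: it only hands the
  # real callback to another pool thread.
  proxy_callback = proc { EM.defer callback }

  10.times { EM.defer work, proxy_callback }
}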

However, this can cause problems if your callback needs to execute code within the event loop, because it now runs inside a separate, deferred thread. If that happens, move the problem code into the callback of the EM.defer call made by the proxy_callback proc.

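require 'eventmachine'

EM.run {
  times = 0

  work = proc { sleep 1 }

  callback = proc {
    sleep 1
    EM.stop_event_loop if (times += 1) >= 5
  }

  # The second proc passed to EM.defer runs back on the reactor thread;
  # code that must run inside the event loop goes there.
  proxy_callback = proc { EM.defer callback, proc { "do_eventmachine_stuff" } }

  10.times { EM.defer work, proxy_callback }
}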

This version ran in about 3 seconds: 1 second for the work sleeps running in parallel, 1 second for the callback sleeps running in parallel, and 1 second of overhead.
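
One side effect of deferring the callbacks is that shared state such as times is now updated from several threads at once, so the increment may race. Below is a minimal sketch with plain Ruby threads, assuming a Mutex-guarded counter is acceptable for your case. `CompletionCounter` and `tick` are hypothetical names and not part of EventMachine.

```ruby
# Shared counter guarded by a Mutex, for callbacks that run on several threads.
class CompletionCounter
  def initialize(target)
    @target = target
    @count = 0
    @lock = Mutex.new
  end

  # Returns true exactly once: for the call that reaches the target.
  def tick
    @lock.synchronize do
      @count += 1
      @count == @target
    end
  end
end

counter = CompletionCounter.new(10)
finished = Queue.new

threads = Array.new(10) do
  Thread.new do
    sleep 0.05                        # stands in for the deferred callback body
    finished << :done if counter.tick # where the stop condition would be checked
  end
end
threads.each(&:join)

puts "stop signalled #{finished.size} time(s)"
```

Because the comparison happens inside the lock, only one thread sees the count reach the target, so the stop condition fires once.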

Benjamin Manns
  • Thanks Ben! I took your proxy example a little further and created an unBlocking function. You can see my implementation here: http://goo.gl/8kbc6y – AXE Labs Mar 19 '15 at 15:37
9

Yep, you're using it wrong. EventMachine works by making asynchronous IO calls that return immediately and notify the "reactor" (the event loop started by EM.run) when they are completed. You have two blocking calls that defeat the purpose of the system, sleep and Mechanize.get. You have to use special asynchronous/non-blocking libraries to derive any value from EventMachine.
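
To illustrate the "notify when ready" idea, here is a toy reactor built only on the standard library's IO.pipe and IO.select. It is a sketch of the general pattern, not EventMachine's implementation; the source names and payloads are made up for the example.

```ruby
# Toy reactor: wait on several pipes at once and react to whichever is readable.
readers = {}
writers = []

3.times do |i|
  r, w = IO.pipe
  readers[r] = "source-#{i}"
  writers << Thread.new do
    sleep 0.1 * (3 - i) # higher-numbered sources answer sooner
    w.write("payload #{i}")
    w.close
  end
end

order = []
until readers.empty?
  ready, = IO.select(readers.keys) # blocks until at least one pipe is readable
  ready.each do |io|
    data = io.read_nonblock(1024, exception: false)
    next if data == :wait_readable

    if data.nil? # EOF: the writer closed its end
      readers.delete(io)
      io.close
    else
      order << readers[io]
      puts "#{readers[io]} -> #{data}"
    end
  end
end
writers.each(&:join)
```

The loop never waits on one particular source. It handles data in the order it arrives, which is the property that a blocking call inside the loop would destroy.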

Ben Hughes
  • You're right that the example he posed can be rewritten with an async http library, but the point of the #defer method is specifically so you can spawn a new thread that does a blocking operation without affecting the reactor run loop. So theoretically, his example is not blocking the run loop. My guess with the time difference is how the threads are scheduled. – Jerry C. Apr 02 '12 at 23:01
  • In general EventMachine works exactly how you said, but your answer doesn't apply to his use of the `defer` call. – Martin Konecny Jan 29 '15 at 00:12
7

You should use something like em-http-request http://github.com/igrigorik/em-http-request

2

EventMachine "defer" actually spawns Ruby threads from a threadpool it manages to handle your request. Yes, EventMachine is designed for non-blocking IO operations, but the defer command is an exception - it's designed to allow you to do long running operations without blocking the reactor.

So, it's going to be a little slower than naked threads, because really it's just launching threads with the overhead of EventMachine's threadpool manager.
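
As a rough illustration of a defer-style pool, here is a minimal fixed-size thread pool written with the standard library. `TinyPool`, `timed`, and the pool size of 5 are assumptions made for the sketch (the size is chosen only so that queueing is visible) and do not reflect EventMachine's actual code or settings.

```ruby
# Minimal fixed-size thread pool, loosely modelling a defer-style pool.
class TinyPool
  def initialize(size)
    @jobs = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Each worker pulls jobs until it sees the :stop marker.
        while (job = @jobs.pop) != :stop
          job.call
        end
      end
    end
  end

  def defer(&block)
    @jobs << block
  end

  def shutdown
    @workers.size.times { @jobs << :stop }
    @workers.each(&:join)
  end
end

# Measures how long the given block takes, in seconds.
def timed
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

pool_time = timed do
  pool = TinyPool.new(5)
  10.times { pool.defer { sleep 0.1 } }
  pool.shutdown
end

bare_time = timed do
  Array.new(10) { Thread.new { sleep 0.1 } }.each(&:join)
end

puts format("pool of 5: %.2fs, bare threads: %.2fs", pool_time, bare_time)
```

When there are more jobs than pool threads, the extra jobs wait in the queue, so the pooled run takes about two rounds of sleeping while the bare threads all sleep at once.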

You can read more about defer here: http://eventmachine.rubyforge.org/EventMachine.html#M000486

That said, fetching pages is a great use of EventMachine, but as other posters have said, you need to use a non-blocking IO library, and then use next_tick or similar to start your tasks, rather than defer, which breaks your task out of the reactor loop.

Joshua