
I need to process jobs off a queue within a single process, with the IO performed asynchronously. That part is pretty straightforward. The gotcha is that those jobs can add additional items to the queue.

I think I've been fiddling with this problem too long so my brain is cloudy — it shouldn't be too difficult. I keep coming up with an either-or scenario:

  1. The queue can perform jobs asynchronously and results can be joined in afterward.
  2. The queue can synchronously perform jobs until the last finishes and the queue is empty.

I've been fiddling with everything from EventMachine and Goliath (both of which can use EM::HttpRequest) to Celluloid (never actually got around to building something with it though), and writing Enumerators using Fibers. My brain is fried though.

What I'd like, simply, is to be able to do this:

items = [1,2,3]
items.each do |item|
  if item.has_particular_condition? 
    items << item.process_one_way
  elsif item.other_condition?
    items << item.process_another_way
  # ...
  end
end

#=> [1,2,3,4,5,6,7,8,9]

...where 4, 5, and 6 were all results of processing the original items in the set, and 7, 8, and 9 are results from processing 4, 5, and 6. I don't need to worry about indefinitely processing the queue because the data I'm processing will end after a couple of iterations.

High-level guidance, comments, links to other libraries, etc are all welcome, as well as lower-level implementation code examples.

coreyward
  • yeah, async systems can be tough to build. You have my moral support :) – Sergio Tulentsev Nov 12 '12 at 20:59
  • My first suggestion is: sleep on it. When you start to bang your head against the wall and things start to seem impossibly complex... from experience I know it's time to call it a day. The next day the problem usually solves itself surprisingly easily. I love to solve problems in my sleep - it requires no "effort", and the next day you usually feel like a genius. – Casper Nov 12 '12 at 21:08
  • @Casper I guess I didn't specify that "too long" entails working on this for several weeks, here and there. :/ – coreyward Nov 12 '12 at 21:37

2 Answers


I have had similar requirements in the past, and from the sounds of it what you need is a solid, high-performance work queue. I recommend you check out beanstalkd, which I discovered over a year ago and have since used to process many thousands of jobs reliably in Ruby.

I have also been developing Ruby libraries around beanstalkd. In particular, check out backburner, a production-ready work queue for Ruby built on beanstalkd. The syntax and setup are easy, defining how jobs are processed is quick, and handling of job failures, retries, scheduling, and a lot more is built in.
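
To give a rough feel for it, here's a hypothetical sketch of a Backburner job (the ProcessItemJob name and the numbers are purely illustrative, and it assumes a beanstalkd server running at the default address):

require 'backburner'

# Hypothetical job class for illustration only
class ProcessItemJob
  include Backburner::Queue

  def self.perform(item_id)
    # Do the real work for one item here; a job is free to enqueue follow-up
    # jobs, which covers the "jobs can add items to the queue" case
    Backburner.enqueue(ProcessItemJob, item_id + 3) if item_id <= 6
  end
end

# Seed the queue with the initial items...
[1, 2, 3].each { |id| Backburner.enqueue(ProcessItemJob, id) }

# ...then process jobs in a worker (or run `bundle exec backburner` instead)
Backburner.work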

Let me know if you have any questions, but I think beanstalkd and backburner would fit your requirements quite well.

Nathan
  • This might work, but I'm performing work during an HTTP request lifecycle, and I need to parse and return a result to the client when all jobs are finished. Any tips on how to manage that with Backburner? – coreyward Nov 15 '12 at 18:10

I wound up implementing something a little less ideal — basically just wrapping an EM Fiber Iterator in a loop that terminates once no new results are queued.

require 'set'

class SetRunner
  def initialize(seed_queue)
    @results = seed_queue.to_set
  end

  def run
    # Keep yielding batches of unprocessed items (along with a bucket the
    # block can push new items into) until a pass produces nothing new
    begin
      yield last_loop_results, result_bucket
    end until new_loop_results.empty?

    return @results
  end

  def last_loop_results
    # Empty the bucket, returning whatever the previous pass queued up
    result_bucket.shift(result_bucket.count)
  end

  def result_bucket
    @result_bucket ||= @results.to_a
  end

  def new_loop_results
    # .add? returns nil if the item is already in the set, so this returns
    # only the items that haven't been seen on a previous pass
    result_bucket.map { |item| @results.add? item }.compact
  end
end

Then, to use it with EventMachine and em-synchrony:

require 'em-synchrony'

EM.synchrony do
  queue = [1, 2, 3]
  results = SetRunner.new(queue).run do |set, output|
    EM::Synchrony::FiberIterator.new(set, 3).each do |item|
      output.push(item + 3) if item <= 6
    end
  end

  # results is a Set containing 1 through 9
  EM.stop
end

Each batch then gets run with the concurrency level passed to the FiberIterator, and any results it queues up are processed in the next iteration of the outer SetRunner loop.
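
Incidentally, SetRunner itself has no EventMachine dependency, so if you just want to see the looping behavior you can drive it with a plain synchronous block (a minimal sketch with the same toy numbers):

results = SetRunner.new([1, 2, 3]).run do |batch, output|
  batch.each { |item| output.push(item + 3) if item <= 6 }
end

results.to_a # => [1, 2, 3, 4, 5, 6, 7, 8, 9]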

coreyward