
I need something like PubSub, but instead of broadcasting to all subscribers, each message is sent to only one subscriber (preferably chosen automatically based on the number of messages in its receive buffer; lower is better).

What I'm attempting is to send several hundred thousand http requests using a controlled number of distributed workers.
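
To illustrate, the dispatch rule I'm after would be roughly this (a hypothetical sketch, not existing code; ShortestQueueDispatch is a made-up name, and it assumes all subscriber pids are alive):

defmodule ShortestQueueDispatch do
  # Send the message to the single subscriber whose mailbox
  # currently holds the fewest messages.
  def dispatch(subscribers, message) do
    {pid, _len} =
      subscribers
      |> Enum.map(fn pid ->
           {:message_queue_len, len} = Process.info(pid, :message_queue_len)
           {pid, len}
         end)
      |> Enum.min_by(fn {_pid, len} -> len end)

    send(pid, message)
  end
end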

Krut
  • Not sure if this is what you're looking for, but give it a look: https://github.com/devinus/poolboy – Onorio Catenacci Jan 06 '15 at 01:37
  • @OnorioCatenacci poolboy will only be one component of such a system. It will handle a pool of workers, but under heavy load the workers will start to fail. You cannot simply request an arbitrary number of workers; the checkout requests will eventually time out. Somebody correct me if I am wrong, but setting the timeout to `:infinity` doesn't seem like a particularly good idea either. I suppose you could build a message queue with a run loop that tries to request a worker from poolboy in a non-blocking manner, then either uses the worker or, if none is available, waits for a certain amount of time and then retries. – Patrick Oscity Jan 06 '15 at 09:37

2 Answers


To solve this, the first thing I'd try would be to have the workers pull the requests they should make, rather than having requests pushed to them.

So I'd have a globally registered Agent that holds the list of HTTP requests to be performed, with an API for adding and retrieving a request. I'd then start N workers under a Supervisor with the one_for_one strategy, defined via worker(Task, ...), rather than adding poolboy at this stage. Each worker would ask the Agent for an HTTP request to make, do whatever work is necessary, terminate normally, be restarted by the supervisor, and ask for a new URL.

Having the workers pull HTTP tasks from the list in the Agent, rather than having tasks pushed to them, ensures that an available worker is always busy whenever there's work to do.
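
A minimal sketch of this approach (the module names UrlQueue and UrlWorker, the pool size of 10, and the fetch stub are made up for illustration):

defmodule UrlQueue do
  # Globally registered Agent holding the URLs still to be fetched.
  def start_link(urls) do
    Agent.start_link(fn -> urls end, name: {:global, __MODULE__})
  end

  # Pop the next URL, or nil when the queue is empty.
  def pop do
    Agent.get_and_update({:global, __MODULE__}, fn
      [] -> {nil, []}
      [url | rest] -> {url, rest}
    end)
  end
end

defmodule UrlWorker do
  def run do
    case UrlQueue.pop() do
      nil -> :timer.sleep(1000) # queue empty, idle briefly before exiting
      url -> fetch(url)         # the actual HTTP request goes here
    end
    # Terminating normally makes the supervisor restart the worker,
    # which then asks for the next URL.
  end

  defp fetch(_url), do: :ok
end

# Start N permanent Task workers under a one_for_one supervisor.
# The restart intensity must be generous, because every completed
# request counts as a restart.
import Supervisor.Spec

children = for i <- 1..10 do
  worker(Task, [UrlWorker, :run, []], id: i, restart: :permanent)
end

Supervisor.start_link(children,
  strategy: :one_for_one, max_restarts: 1_000_000, max_seconds: 1)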

If this solution looked good, I'd then look into adding poolboy. You'd need to be careful with the supervisor options so that a bunch of bad URLs crashing your workers wouldn't trigger the supervisor to take everything else down.

chrismcg
  • So (for distributed operation) would there be an Agent per node, with the workers checking other nodes' Agents in a round-robin fashion? – Krut Jan 07 '15 at 02:26
  • You could do that. In my head I'd just use one Agent, but that would be a single point of failure. You could store the data distributed in mnesia and have a supervised Agent or plain GenServer on each node that the local workers could talk to. There's an Elixir mnesia wrapper called Amnesia, but it's also easy enough to use the Erlang library directly. – chrismcg Jan 07 '15 at 17:43

As stated in my comment, my approach would be to use poolboy to handle the workers, but it is not possible to simply check out N workers (N being the number of requested URLs), because that would quickly exceed the process limit and cause the checkout requests to time out. Instead, we need a loop that checks whether a worker is available and, if so, requests the URL asynchronously. If no workers are free, it should sleep for a while and then retry.

For this purpose, poolboy provides the :poolboy.checkout/2 function; the second parameter specifies whether the call should block. If no workers are available, it returns :full; otherwise you get back a worker pid.

Example:

def crawl_parallel(urls) do
  urls
  |> Enum.map(&crawl_task/1)
  # Note: Task.await/1 uses a default timeout of 5 seconds; pass a
  # longer timeout as the second argument if requests may be slow.
  |> Enum.map(&Task.await/1)
end

defp crawl_task(url) do
  case :poolboy.checkout(Crawler, false) do
    :full ->
      # No free workers, wait a bit and retry
      :timer.sleep(100)
      crawl_task(url)
    worker_pid ->
      # We have a worker, asynchronously crawl the url
      Task.async(fn ->
        Crawler.Worker.crawl(worker_pid, url)
        :poolboy.checkin(Crawler, worker_pid)
      end)
  end
end
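
The example above assumes a poolboy pool registered under the name Crawler. A possible setup might look like the sketch below (the pool size is arbitrary, and Crawler.Worker must export start_link/1, as required of a poolboy worker_module):

defmodule Crawler.Supervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, [])
  end

  def init([]) do
    pool_args = [
      name: {:local, Crawler},       # the name used by checkout/checkin
      worker_module: Crawler.Worker, # must export start_link/1
      size: 50,                      # number of workers in the pool
      max_overflow: 0
    ]

    children = [
      :poolboy.child_spec(Crawler, pool_args, [])
    ]

    supervise(children, strategy: :one_for_one)
  end
end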
Patrick Oscity
  • I haven't tried this yet. It seems to be possible to register the pool with a global name; you could then request workers from different nodes. However, I am quite sure that the GenServer calls to the workers end up being executed on the node that spawned the pool. Maybe you need a more sophisticated approach using RabbitMQ or the like. – Patrick Oscity Jan 07 '15 at 05:54
  • I am reading from a fixed list of URLs here, but you could swap that for a loop that pops from a queue, and then have one such loop and one pool per node. – Patrick Oscity Jan 07 '15 at 05:56
  • You should have a look at Pooler - https://github.com/seth/pooler. It seems to be able to manage multiple pools on different nodes. – Patrick Oscity Jan 07 '15 at 06:37
  • I'm currently using http://sidekiq.org/ with my Rails projects. I'm just getting into Elixir and was hoping for a built-in way to distribute tasks across a cluster (and eliminate dependencies). – Krut Jan 07 '15 at 21:03
  • That would be nice, although there are so many details to such a system that vary depending on your needs. It is very hard, maybe even impossible, to come up with a general solution that ships with the standard library. There isn't even a pmap in Elixir (yet) because of the many possible implementations. – Patrick Oscity Jan 07 '15 at 22:05