I have a project that needs to parse literally hundreds of thousands of HTML and XML documents.

I thought this would be a perfect opportunity to learn Ruby fibers and the new Goliath framework.

Goliath falls flat if you use blocking libraries, though, and I don't know how to tell whether a library is "thread safe" (if that's even the correct term for Goliath).

So my question is, is Nokogiri going to cause any issues with Goliath or multi-threading/fibers in general?

If so, is there something safer to use than Nokogiri?

Thanks

cbmeeks
  • I'd recommend taking the question directly to the developers at [Nokogiri-Talk](http://groups.google.com/group/nokogiri-talk). – the Tin Man Apr 11 '11 at 21:31

1 Answer

Goliath is a web framework, so I'm assuming you're planning to "ingest" these documents via HTTP? Each request gets mapped into a Ruby fiber, but effectively, the server runs in a single reactor thread.
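
To make the fiber-per-request model concrete, here is a minimal sketch using only core Ruby. It is not Goliath code: the `while` loop is a toy stand-in for the reactor, and `Fiber.yield` stands in for waiting on non-blocking I/O. It shows several "requests" interleaving while all of them run on one thread.

```ruby
# Three simulated "requests", each running in its own fiber.
log = []
thread_ids = []

requests = 3.times.map do |i|
  Fiber.new do
    thread_ids << Thread.current.object_id
    log << "request #{i}: start"
    Fiber.yield # simulate waiting on non-blocking I/O
    log << "request #{i}: finish"
  end
end

# Toy stand-in for the reactor loop: resume each live fiber in turn.
while requests.any?(&:alive?)
  requests.each { |f| f.resume if f.alive? }
end

puts log
puts "distinct threads used: #{thread_ids.uniq.size}"
```

Every fiber reports the same thread, and a fiber only gives up control when it yields. That is why thread safety matters less here than whether a call blocks.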

So, to answer your question: Nokogiri is thread safe to the best of my knowledge, but that shouldn't even really matter here. The thing you will have to look out for: while a document is being parsed, the CPU is pinned, and Goliath won't accept any new requests in the meantime. So, you'll have to implement the right logic for your specific case (e.g. you could do a stream parse on chunks of data arriving from the socket, or load balance between multiple Goliath servers, or both ... :-))
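
The chunked approach could look roughly like the sketch below. Everything in it is an assumption for illustration: `ItemCounter` is a hypothetical stand-in for a real streaming parser (Nokogiri ships a SAX push parser, `Nokogiri::XML::SAX::PushParser`, that also accepts data incrementally via `<<`), the chunks are hard-coded instead of arriving from a socket, and the loop at the end is a toy scheduler, not Goliath's reactor. The point is only that yielding between chunks lets other requests run while a document is still being parsed.

```ruby
# Hypothetical stand-in for a streaming parser: counts <item> start tags in
# data fed chunk by chunk, buffering any tag cut off at a chunk boundary.
class ItemCounter
  attr_reader :count

  def initialize
    @count = 0
    @buffer = +""
  end

  def <<(chunk)
    @buffer << chunk
    # If the last "<" has no closing ">", keep that partial tag for later.
    cut = @buffer.rindex("<")
    keep_from = (cut && @buffer.index(">", cut).nil?) ? cut : @buffer.size
    complete = @buffer.slice!(0, keep_from)
    @count += complete.scan(/<item[\s>]/).size
    self
  end
end

log = []
chunks = ["<items><item>a</it", "em><item>b</item>", "<item>c</item></items>"]
counter = ItemCounter.new

# The parse job handles one chunk, then hands control back.
parse_job = Fiber.new do
  chunks.each_with_index do |chunk, i|
    counter << chunk
    log << "parsed chunk #{i}"
    Fiber.yield
  end
end

# Another request that wants to be served in the meantime.
other_request = Fiber.new do
  2.times do |i|
    log << "served other request #{i}"
    Fiber.yield
  end
end

# Toy scheduler: resume each live fiber in turn.
fibers = [parse_job, other_request]
while fibers.any?(&:alive?)
  fibers.each { |f| f.resume if f.alive? }
end

puts log
puts "items found: #{counter.count}"
```

If the whole document were parsed in one call instead, the other request would not run until parsing finished. For documents that are expensive even per chunk, spreading the load across several server processes is still the more robust option.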

igrigorik