I'm trying to scrape a group of pages with Mechanize on JRuby. I'm using JRuby to get real multithreading, since the program is a little slow on MRI. However, I've been running into problems with what seem to be non-thread-safe data structures in Mechanize and the http-cookie gem. In particular, I'm getting errors like this:
RuntimeError: can't add a new key into hash during iteration
[]= at org/jruby/RubyHash.java:991
push at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/history.rb:28
add_to_history at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1290
get at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:441
(root) at main.rb:82
open_uri at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:150
open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:678
open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:33
(root) at main.rb:80
And the seemingly offending code in Mechanize is here:
def push(page, uri = nil)
  super page
  index = uri ? uri : page.uri
  @history_index[index.to_s] = page # offending line
  shift while length > @max_size if @max_size
  self
end
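For reference, the same message can be reproduced in plain Ruby without Mechanize or threads. This is only a minimal sketch of the error class, not Mechanize's code: the method name `add_key_during_iteration` and the example URLs are made up for illustration.

```ruby
# Minimal sketch (not Mechanize code): adding a new key to a Hash while that
# same Hash is being iterated raises a RuntimeError, even on a single thread.
def add_key_during_iteration
  history_index = { "http://example.com/a" => :page_a }
  history_index.each do |_uri, _page|
    # Inserting a key that does not exist yet, mid-iteration, is what fails.
    history_index["http://example.com/b"] = :page_b
  end
  nil
rescue RuntimeError => e
  e.message
end

message = add_key_during_iteration
puts message
```

With threads, the iteration and the insertion would presumably happen on different threads that share the same hash, which would also explain why the failure only shows up some of the time.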
When I comment out the code in lib/mechanize.rb that adds the visited page to the history, that specific error goes away and is replaced by a very similar error from the http-cookie gem:
RuntimeError: can't add a new key into hash during iteration
[]= at org/jruby/RubyHash.java:991
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar/hash_store.rb:56
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:108
add at (eval):3
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie_jar.rb:22
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:192
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:322
scan_set_cookie at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:212
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:281
tap at org/jruby/RubyKernel.java:1886
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:280
parse at (eval):3
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie.rb:37
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:191
save_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:857
response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:845
each at org/jruby/RubyArray.java:1613
response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:844
fetch at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:282
post_form at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1281
submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:548
submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/form.rb:223
(root) at main.rb:92
And something very similar is going on in http-cookie:
def add(cookie)
  path_cookies = ((@jar[cookie.domain] ||= {})[cookie.path] ||= {})
  path_cookies[cookie.name] = cookie # offending line
  cleanup if (@gc_index += 1) >= @gc_threshold
  self
end
And again, when I comment out the code in http-cookie that adds a cookie, the error goes away. But then my program stops scraping the data properly, presumably because I've removed functionality that the gems rely on. The oddest thing about all this is that the program only errors out after scraping a certain number of pages, so I'm wondering whether I'm doing something wrong on my end. I would share my code, but it's a private program and I'd rather share only the parts that are needed. By the way, the program works properly on MRI, albeit somewhat slowly.
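To make the suspected cause concrete, here is a rough sketch of how a hash shared between threads could be guarded, assuming the problem really is unsynchronized access to shared state. `GuardedIndex` is a hypothetical class written for illustration; it is not part of Mechanize or http-cookie, and it is not taken from my program.

```ruby
require "thread"

# Hypothetical wrapper (not part of Mechanize or http-cookie): every read and
# write of the underlying Hash goes through a single Mutex.
class GuardedIndex
  def initialize
    @lock = Mutex.new
    @index = {}
  end

  # Writes are serialized, so no thread can insert while another one reads.
  def store(key, value)
    @lock.synchronize { @index[key] = value }
  end

  def size
    @lock.synchronize { @index.size }
  end

  # Iterate over a copy taken under the lock, so the caller's block never
  # runs against a Hash that another thread is modifying.
  def each_pair(&block)
    snapshot = @lock.synchronize { @index.dup }
    snapshot.each_pair(&block)
  end
end

index = GuardedIndex.new
threads = 4.times.map do |t|
  Thread.new do
    100.times do |i|
      index.store("thread-#{t}-page-#{i}", i)
      index.each_pair { |_key, _value| } # concurrent iteration
    end
  end
end
threads.each(&:join)
puts index.size
```

Whether something like this can be applied to the gems' internal hashes from the outside, or whether each thread simply needs its own agent, is exactly the part I'm unsure about.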
So my question is: are Mechanize and its dependencies incompatible with multithreading on JRuby, or am I doing something wrong on my end?