I'm trying to scrape a group of pages with Mechanize and JRuby. I'm using JRuby for multithreading, since the program is a little slow on MRI. However, I've been running into problems with what seem to be non-thread-safe data structures in Mechanize and the http-cookie gem. In particular, I'm getting errors like this:

RuntimeError: can't add a new key into hash during iteration
             []= at org/jruby/RubyHash.java:991
            push at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/history.rb:28
  add_to_history at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1290
             get at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:441
          (root) at main.rb:82
        open_uri at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:150
            open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:678
            open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:33
          (root) at main.rb:80

And the seemingly offending code in Mechanize is here:

def push(page, uri = nil)
  super page

  index = uri ? uri : page.uri
  @history_index[index.to_s] = page # offending line

  shift while length > @max_size if @max_size

  self
end
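To illustrate the class of error involved: Ruby's Hash raises a RuntimeError whenever a new key is added while the Hash is being iterated. The snippet below is a minimal, single-threaded demonstration of that rule (no Mechanize or threads needed); under JRuby without a GIL, one thread iterating Mechanize's internal history Hash while another thread adds to it can trigger the same error.

```ruby
# Ruby forbids adding a key to a Hash during iteration over it.
h = { "a" => 1 }

message = begin
  h.each { h["b"] = 2 } # write during iteration
  nil
rescue RuntimeError => e
  e.message
end

puts message # => can't add a new key into hash during iteration
```

With two threads and a shared Hash, the write and the iteration come from different threads, but the effect on the Hash is the same.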

When I comment out the code in lib/mechanize.rb that adds the visited page to the history, that specific error goes away and gets replaced by a very similar error regarding the http-cookie gem:

RuntimeError: can't add a new key into hash during iteration
               []= at org/jruby/RubyHash.java:991
               add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar/hash_store.rb:56
               add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:108
               add at (eval):3
               add at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie_jar.rb:22
             parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:192
             parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:322
   scan_set_cookie at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:212
             parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:281
               tap at org/jruby/RubyKernel.java:1886
             parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:280
             parse at (eval):3
             parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie.rb:37
             parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:191
      save_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:857
  response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:845
              each at org/jruby/RubyArray.java:1613
  response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:844
             fetch at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:282
         post_form at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1281
            submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:548
            submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/form.rb:223
            (root) at main.rb:92

And there is a very similar thing going on in http-cookie:

def add(cookie)
  path_cookies = ((@jar[cookie.domain] ||= {})[cookie.path] ||= {})
  path_cookies[cookie.name] = cookie # offending line
  cleanup if (@gc_index += 1) >= @gc_threshold
  self
end

And again, when I comment out the code in http-cookie that adds a cookie, the error goes away. But then my program stops scraping data properly, presumably because I've disabled that functionality in the gems I'm using. The oddest thing about all this is that the program only errors out after scraping a certain number of pages, so I'm wondering if I'm doing something wrong on my end. I would share the code that I have, but it's a private program and I'd rather only share parts of it as needed. For what it's worth, the program works properly on MRI, albeit somewhat slowly.
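One pattern that sidesteps the shared-Hash problem entirely is giving each thread its own agent, so no internal Hash is ever touched by two threads. Below is a hedged sketch of that pattern; `FakeAgent` is a stand-in I made up to mirror Mechanize's history bookkeeping, so the example runs without the gem installed.

```ruby
# Hypothetical sketch: one agent per thread, so each agent's internal
# history Hash is only ever mutated by its owning thread.
class FakeAgent
  attr_reader :history

  def initialize
    @history = {} # stand-in for Mechanize's @history_index
  end

  def get(url)
    @history[url] = :page # mirrors the history bookkeeping in push
    :page
  end
end

urls = (1..20).map { |i| "http://example.com/page-#{i}" }

threads = urls.each_slice(5).map do |batch|
  Thread.new do
    agent = FakeAgent.new # created inside the thread, never shared
    batch.each { |u| agent.get(u) }
    agent.history.size
  end
end

sizes = threads.map(&:value)
puts sizes.inspect # => [5, 5, 5, 5]
```

With real Mechanize, the same shape applies: construct `Mechanize.new` inside each `Thread.new` block rather than sharing one instance across threads.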

So, I guess that my question is: are Mechanize and its dependencies incompatible with multithreading in JRuby or am I doing something wrong on my end?

GDP2
  • Java's Collections throw `ConcurrentModificationException` when you modify a collection during iteration with an `Iterator`. It seems more a JRuby problem than a concurrency problem. – sschmeck Nov 26 '15 at 19:54
  • @sschmeck What do you mean? I'm not seeing any `ConcurrentModificationException` errors. – GDP2 Nov 29 '15 at 04:58
  • @sschmeck Oh, never mind, now I get what you mean. I previously wasn't aware of the `ConcurrentModificationException` from Java. – GDP2 Dec 28 '15 at 15:50

1 Answer

It seems you've run into concurrent-modification issues with some Hash instances. It's hard to blame you or the gems at this point, but gems such as http-cookie are likely not "truly" thread-safe (only MRI/GIL thread-safe), especially since there is synchronization code to be found, e.g. in each.

It's likely a bug, although you may be able to work around it in your own code by introducing some locking (hopefully it won't affect concurrent performance much); it really depends on the use case. If you can come up with a simple reproducible multi-threaded .rb test case, I would report an issue with http-cookie (I have not examined the other gem).
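The locking workaround above can be sketched as follows. This is a minimal, hypothetical illustration: the shared Hash stands in for Mechanize's `@history_index` or http-cookie's `@jar`, and a single `Mutex` serializes every mutation so no thread can observe the Hash mid-write.

```ruby
# Sketch: guard all mutations of a shared Hash with one Mutex.
shared_history = {} # stand-in for the gem's internal Hash
lock = Mutex.new

threads = (1..8).map do |i|
  Thread.new do
    10.times do |j|
      # Only one thread at a time may touch the shared Hash.
      lock.synchronize { shared_history["page-#{i}-#{j}"] = :page }
    end
  end
end
threads.each(&:join)

puts shared_history.size # => 80
```

In a real scraper you would similarly wrap each call that mutates the shared agent's state (e.g. `agent.get`) in `lock.synchronize`, at the cost of serializing those calls.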

kares
  • Thanks for the answer. I'll try your suggestions, and try to get a reproducible demo for you as well. – GDP2 Nov 30 '15 at 15:33