I have a file with a million URLs. The data file looks like this:

http://wonderland.cjfallon.ie/
http://www.youtube.com/
http://www.starfall.com/
http://education.scholastic.co.uk/
http://www.scoilnet.ie/
http://www.nessy.com/
http://www.senteacher.org/
http://scoop.it/
http://www.moviemaker.com/
http://learni.st/
http://www.twitter.com/
http://www.facebook.com/
http://www.gutenberg.org/
http://www.gutenberg.org/cache/epub/42361/pg42361.txt

I want to crawl them. The bottleneck is network I/O, so I want to use multiple threads or gevent to tackle it.

My multithreaded code works well: https://gist.github.com/young001/5449751

But the gevent version (https://gist.github.com/young001/baa3eebbf7342c5ac077) always goes wrong:

status is 200
status is 200
Internal error in evhttp
the url is down http://web2.socialcomputingmagazine.com/the_social_graph_issues_and_strategies_in_2008.htm

the reason 
status is 200
status is 200
status is 200
status is 200
status is 200
status is 200
status is 301
status is 200
status is 301
status is 200
status is 200
Internal error in evhttp

Then it stalls, and I don't know why.

Any help?

It seems like everything should work, but it doesn't, and it's driving me crazy.

poolie
kuafu
  • Please cut out the irrelevant code and add the correct imports to your sample, so that people can actually run it. – poolie Apr 24 '13 at 05:17
  • By the way, I hope that before you really run this on a million URLs, you make it [respect `robots.txt`](http://robotstxt.org). – poolie Apr 24 '13 at 05:29

1 Answer

I can reproduce it here after fixing up your sample.

Basically this seems to be a gevent bug: it sometimes reports "Internal error in evhttp" for no obvious reason.

The source code says:

# sometimes this happens, don't know why
sys.stderr.write("Internal error in evhttp\n")

You'll have to either debug that or use something else, or just retry when it fails.
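To illustrate the "use something else" and "retry when it fails" options together, here is a minimal sketch using only the Python 3 standard library instead of gevent. The names `fetch_status`, `fetch_with_retry`, and `crawl` are hypothetical helpers made up for this example, not part of your gist or of gevent, and the worker count, timeout, and backoff values are assumptions you would need to tune.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url (hypothetical helper).

    The explicit timeout keeps one hung connection from blocking a worker forever.
    Note that urlopen follows redirects and raises on 4xx/5xx responses.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.status


def fetch_with_retry(url, fetch=fetch_status, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any exception up to `attempts` times.

    Returns (url, status) on success, or (url, None) once every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return url, fetch(url)
        except Exception:
            if attempt < attempts:
                time.sleep(delay * attempt)  # simple linear backoff
    return url, None


def crawl(urls, workers=20, **kwargs):
    """Check urls concurrently with a bounded thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_with_retry(u, **kwargs), urls))
```

Calling `crawl(urls)` with a list of URLs returns `(url, status)` pairs, with `None` as the status for URLs that never succeeded. Because `fetch` is a parameter, the same retry wrapper could in principle be put around a gevent-based fetch function too, though that would not help if the failure also leaves the pool stalled.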

poolie
  • Oh, thanks poolie, but why did it stall? The traceback says: "CTraceback (most recent call last): File "test_urls_from_file.py", line 55, in print "good url",url File "/usr/local/lib/python2.7/dist-packages/gevent/pool.py", line 277, in spawn self._semaphore.acquire() File "/usr/local/lib/python2.7/dist-packages/gevent/coros.py", line 110, in acquire result = get_hub().switch() File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 164, in switch return greenlet.switch(self) KeyboardInterrupt " – kuafu Apr 24 '13 at 06:05
  • Yes, I see that too. I guess it's some other consequence of the bug, and it's probably going to make it hard to retry. I suggest you test against gevent head, and if you still see it there [file a bug](https://github.com/surfly/gevent/issues/) (and put the url here.) – poolie Apr 24 '13 at 07:22