
I have almost 300 URLs in my start_urls list, but Scrapy only crawls about 200 of them, not all of the URLs listed. I don't know why, or how to deal with it. I need to crawl more items from the website.

Another thing I don't understand: how can I see the error log after Scrapy finishes? From the terminal, or do I have to write code to see it? I think logging is enabled by default.
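Logging is enabled by default, but it only goes to the console. As a minimal sketch (the log file name here is just a placeholder), Scrapy's LOG_FILE and LOG_LEVEL settings keep the output in a file you can read after the crawl finishes:

    # settings.py of the Scrapy project
    LOG_ENABLED = True       # already the default
    LOG_LEVEL = 'DEBUG'      # log everything, including filtered/dropped requests
    LOG_FILE = 'yhd2.log'    # write the log to this file instead of the console

The same can be done for a single run with `scrapy crawl yhd2 --loglevel DEBUG --logfile yhd2.log`.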

Thanks for your answers.


Update:

The output is below. I don't know why only 2829 items were scraped, when there are actually 600 URLs in my start_urls.

But when I give only 400 URLs in start_urls, it can scrape 6000 items. I expect to scrape almost the whole website, www.yhd.com. Could anyone give more suggestions?

2014-12-08 12:11:03-0600 [yhd2] INFO: Closing spider (finished)
2014-12-08 12:11:03-0600 [yhd2] INFO: Stored csv feed (2829 items) in myinfoDec.csv        
2014-12-08 12:11:03-0600 [yhd2] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 142586,
'downloader/request_count': 476,
'downloader/request_method_count/GET': 476,
'downloader/response_bytes': 2043856,
'downloader/response_count': 475,
'downloader/response_status_count/200': 474,
'downloader/response_status_count/504': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 8, 18, 11, 3, 607101),
'item_scraped_count': 2829,
'log_count/DEBUG': 3371,
'log_count/ERROR': 1,
'log_count/INFO': 14,
'response_received_count': 474,
'scheduler/dequeued': 476,
'scheduler/dequeued/memory': 476,
'scheduler/enqueued': 476,
'scheduler/enqueued/memory': 476,
'start_time': datetime.datetime(2014, 12, 8, 18, 4, 19, 698727)}
2014-12-08 12:11:03-0600 [yhd2] INFO: Spider closed (finished)
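One thing to check against these stats: Scrapy filters duplicate requests by default, so 600 entries in start_urls can collapse to the 476 requests shown above if some of them repeat. A quick sketch to count them (the import path and class name are assumptions, adjust them to your project):

    # check_duplicates.py -- hypothetical helper script
    from myproject.spiders.yhd2_spider import Yhd2Spider  # assumed module/class names

    urls = Yhd2Spider.start_urls
    print('total urls:  %d' % len(urls))
    print('unique urls: %d' % len(set(urls)))
    print('duplicates:  %d' % (len(urls) - len(set(urls))))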
  • about the logs, maybe you should set the log level to DEBUG? – Elias Dorneles Dec 06 '14 at 00:09
  • about the urls, are you sure not one of them is repeated? scrapy filters duplicate requests. – Elias Dorneles Dec 06 '14 at 00:10
  • 1
    although the code seems to enable dont_filter option for the urls in start_urls: https://github.com/scrapy/scrapy/blob/master/scrapy/spider.py#L60 – Elias Dorneles Dec 06 '14 at 00:11
  • 1
    For that many start urls, you should consider using `start_requests()` and yield the Requests from there. And for the logs, consider running your spider like this: `scrapy crawl myspider -o out.jl > myspider.log 2>&1` - this way you will get all the output in the log file. From there you might find out the reason some URL's are being dropped. Could it be that some of them are malformed? Like not having the "http" part? – bosnjak Dec 08 '14 at 09:09
  • @elias Thanks. My URLs in start_urls may include duplicates, but I don't think the duplicates make up a high proportion. I will set the log level and use the command line to show the log. – mootvain Dec 08 '14 at 18:37
  • What is the meaning of out.jl in your `scrapy crawl myspider -o out.jl > myspider.log 2>&1`? When I type `crawl yhd2 -o out.jl > yhd2.log 2>&1` into the command line, it shows: "crawl: error: running 'scrapy crawl' with more than one spider is no longer supported. Process finished with exit code 137" – mootvain Dec 10 '14 at 19:26
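Following the start_requests() suggestion in the comments, here is a minimal sketch with an errback so that every start URL that fails is logged instead of disappearing silently (the class below is a stripped-down placeholder, not the real yhd2 spider):

    import scrapy

    class Yhd2Spider(scrapy.Spider):
        name = 'yhd2'
        start_urls = ['http://www.yhd.com/']  # plus the rest of the list

        def start_requests(self):
            for url in self.start_urls:
                # duplicates are still filtered; pass dont_filter=True to keep them
                yield scrapy.Request(url, callback=self.parse,
                                     errback=self.on_error)

        def on_error(self, failure):
            # called when the download itself fails (DNS errors, timeouts, ...)
            self.log('start request failed: %r' % failure)

        def parse(self, response):
            # existing item extraction goes here
            pass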

1 Answer


Finally, I solved the problem.

First, the reason it did not crawl all the URLs listed in start_urls is that I had a typo in one of them: one "http://..." was mistakenly written as "ttp://...", with the first 'h' missing. The spider then seems to have stopped looking at the rest of the URLs listed after it. Horrifying.
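If anyone runs into the same thing: one malformed URL can raise an exception while the start requests are being generated, so the URLs after it never get scheduled. A quick sketch to spot such typos before running the spider (start_urls.txt is a made-up name, read the list from wherever yours lives):

    # find_bad_urls.py -- hypothetical one-off check, one URL per line in the file
    with open('start_urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    bad = [u for u in urls if not u.startswith(('http://', 'https://'))]
    print('%d malformed urls' % len(bad))
    for u in bad:
        print(u)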

Second, I solved the log file problem through the run configuration panel of PyCharm, which provides a panel for showing the log file. By the way, my Scrapy project runs inside the PyCharm IDE. It works great for me. Not an advertisement.

Thanks for all the comments and suggestions.
