
I have an issue: I want to parse a website and crawl each article's link from it, but the problem is that Scrapy does not crawl all the links, and it crawls some of them a seemingly random number of times.

import scrapy

from tutorial.items import GouvItem


class GouvSpider(scrapy.Spider):
    name = "gouv"
    allowed_domains = ["legifrance.gouv.fr"]
    start_urls = [
        "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"
    ]

    def parse(self, response):
        # Follow every link found in a <span> to its article page
        for href in response.xpath('//span/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        # One item per article body on the page
        for art in response.xpath("//div[@class='corpsArt']"):
            item = GouvItem()
            item['article'] = art.xpath('p/text()').extract()
            yield item




And this is the GouvItem:

import scrapy

class GouvItem(scrapy.Item):
    title1 = scrapy.Field()
    title2 = scrapy.Field()
    title3 = scrapy.Field()
    title4 = scrapy.Field()
    title5 = scrapy.Field()
    title6 = scrapy.Field()
    link = scrapy.Field()
    article = scrapy.Field()

Looking at the lines of the JSON file, we can see that some articles are missing while others appear many times.

The problem is that each article of the law should be there, and exactly once. On the website, each article appears only once.

Thanks a lot !

  • Please edit your post and paste your code here so that people can copy-paste it in their editors – Yannis P. Feb 02 '16 at 16:51
  • Include the definition for `GouvItem`, too – Yannis P. Feb 02 '16 at 16:52
  • well... I just realized that if I execute the same script two times, both results aren't the same... I don't understand that... – Aurelien.Farcy Feb 05 '16 at 10:10
  • I can't spot any obvious errors in your code (except that GouvItem has no field named 'article'). Can you specify what you expect and how that is different from what you get? Because scrapy sends/receives multiple requests/responses in parallel there is no predictable order of results. It can and probably will be different every time you run your script. If you need order add a field to your item so that you can sort the results after running your script. – Frank Martin Feb 11 '16 at 19:45
  • I have edited my post to show you what's going wrong. Thank you so much for your help! Someone told me to implement a timer; what do you think about that? – Aurelien.Farcy Feb 13 '16 at 11:21
  • I'm still diving into this problem. Seems to me like the response is determined in first order by the sessionID so that you don't get always the requested document but the document for the last request of your sessionID. And because Scrapy sends multiple requests things get out of sync. I'm still analyzing and will give you more feedback. – Frank Martin Feb 16 '16 at 13:12

1 Answer


The links to the sub-pages of the website contain a sessionID. It looks like the response to a request depends on that sessionID in a way that does not work well with Scrapy sending multiple concurrent requests.

One way to fix this is to set CONCURRENT_REQUESTS to 1 in settings.py. Scraping will of course take longer with this setting.
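For example, in the project's settings.py (CONCURRENT_REQUESTS is a standard Scrapy setting):

    # settings.py
    # One request at a time, so the session-bound responses
    # cannot get out of sync (slower, but deterministic).
    CONCURRENT_REQUESTS = 1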

Another approach would be to control the requests manually with a list. See this answer on SO.
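Here is a minimal sketch of that idea, reusing the spider from the question; the `pending` meta key and the sequential hand-off are illustrative, not taken from the linked answer:

    def parse(self, response):
        # Collect all article links first instead of requesting them at once
        urls = [response.urljoin(href.extract())
                for href in response.xpath('//span/a/@href')]
        if urls:
            yield scrapy.Request(urls[0], callback=self.parse_article,
                                 meta={'pending': urls[1:]})

    def parse_article(self, response):
        for art in response.xpath("//div[@class='corpsArt']"):
            item = GouvItem()
            item['article'] = art.xpath('.//text()').extract()
            yield item
        # Fire the next request only after this page is done, so a single
        # session-bound request is in flight at any time.
        pending = response.meta.get('pending', [])
        if pending:
            yield scrapy.Request(pending[0], callback=self.parse_article,
                                 meta={'pending': pending[1:]})

This also keeps the results in link order, since each response is fully handled before the next request goes out.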

To prevent empty results, use a relative XPath (note the leading dot) and extract all the text:

item['article'] = art.xpath('.//text()').extract()
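The original `p/text()` only returns text from direct `<p>` children, so articles whose text sits in nested elements come back empty; `.//text()` collects the text of all descendants of the div.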

Hope this helps.

Frank Martin
  • Thank you so much! It seems to work better, but the laws aren't in the right order. Does it mean that the crawler takes all ul/li text, then all ul/li/ul/li, etc.? I'm going to test with the entire page to understand. – Aurelien.Farcy Feb 21 '16 at 18:59
  • It works!!! Thank you so much!!! I got everything! The only issue I have right now is that the laws are still not in the right order... Do you have an idea about that? – Aurelien.Farcy Feb 21 '16 at 20:31
  • Save the article section text as an additional field on your item. Then you can sort the resulting JSON file by that field. I have no idea how that could be done directly with Scrapy, sorry! – Frank Martin Feb 22 '16 at 17:42
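As the last comment suggests, the sorting can happen after the crawl. A minimal post-processing sketch, assuming the items were exported to articles.json and carry a hypothetical `order` field (neither name is from the thread):

    import json

    # Load the exported items, sort them by the stored 'order' field,
    # and write the sorted list back out.
    with open('articles.json') as f:
        items = json.load(f)

    items.sort(key=lambda item: item['order'])

    with open('articles_sorted.json', 'w') as f:
        json.dump(items, f, ensure_ascii=False, indent=2)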