
The scraper I deployed on Scrapy Cloud is producing an unexpected result compared to the local version. My local version can easily extract every field of a product item (from an online retailer), but on Scrapy Cloud the "ingredients" field and the "list of prices" field always come back empty. The picture attached below shows the two elements that always end up empty, whereas everything works perfectly on my local run. I'm using Python 3, and the stack was configured with a scrapy:1.3-py3 configuration. At first I thought it was an issue with regex and Unicode, but it seems not. So I tried everything: `ur` string prefixes, re-encoding the strings, etc., and nothing worked.

For the ingredients part, my code is the following:

    # Note: requires `import re` at the top of the spider module
    # Grab every text node of the ingredients tab and flatten it into one string
    data_box = response.xpath('//*[@id="ingredients"]').css('div.information__tab__content *::text').extract()
    data_inter = ''.join(data_box).strip()

    # Look for an "Ingrédients:" or "Composition:" label followed by the actual list
    match1 = re.search(r'([Ii]ngr[ée]dients\s*\:{0,1})\s*(.*)\.*', data_inter)
    match2 = re.search(r'([Cc]omposition\s*\:{0,1})\s*(.*)\.*', data_inter)

    if match1:
        result_matching_ingredients = match1.group(2).replace('"', '').replace(".", "").replace(";", ",").strip()
    elif match2:
        result_matching_ingredients = match2.group(2).replace('"', '').replace(".", "").replace(";", ",").strip()
    else:
        result_matching_ingredients = ''

    ingredients = result_matching_ingredients

It seems that the match never occurs on Scrapy Cloud.

For the prices, my code is the following:

    list_prices = []

    # list_packaging holds the packaging-variant selectors extracted earlier in the parser
    for package in list_packaging:
        tonnage = package.css('div.product__varianttitle::text').extract_first().strip()
        # Pull the "(x,xx € / kg)" fragment, then strip it down to a plain decimal number
        prix_inter = ''.join(package.css('span.product__smallprice__text').re(r'\(\s*\d+\,\d*\s*€\s*\/\s*kg\)'))
        prix = prix_inter.replace("(", "").replace(")", "").replace("/", "").replace("€", "").replace("kg", "").replace(",", ".").strip()

        list_prices.append(prix)

That's the same story. Still empty.

I repeat: it works fine on my local version. These two fields are the only ones causing issues: I'm extracting a bunch of other data (with regex too) on Scrapy Cloud and I'm very satisfied with it.

Any ideas, guys?

[Screenshot: the "ingredients" and "list of prices" fields coming back empty on Scrapy Cloud]

BoobaGump

2 Answers


I work with ScrapingHub really often, and the way I usually debug is:

  1. Check the job requests (through the ScrapingHub interface)

This is to check whether a redirection makes the page slightly different, e.g. via a query string like `?lang=en`.

  2. Check the job logs (through the ScrapingHub interface)

You can either print or use a logger to check everything you want throughout your parser. So if you really want to be sure the scraper sees the same page on your local machine and on ScrapingHub, you can print(response.body) and compare the two to find what causes the difference.
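For instance, a minimal sketch of that kind of logging inside a parse callback (the spider name, URL and log levels here are purely illustrative, not taken from the original spider):

    import scrapy


    class DebugSpider(scrapy.Spider):
        # Hypothetical spider used only to compare a local run with a Scrapy Cloud run
        name = 'debug_spider'
        start_urls = ['https://example.com/some-product']  # placeholder URL

        def parse(self, response):
            # The final URL reveals redirections (e.g. an added ?lang=... query string)
            self.logger.info('Final URL: %s', response.url)
            # Log the body (or at least its length) so it can be diffed against a local run
            self.logger.info('Body length: %d', len(response.body))
            self.logger.debug(response.body)

The same lines can simply be dropped into the existing parse method; the job log on ScrapingHub will then show exactly what the cloud spider received.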

If you can't find it, I'll try to deploy a little spider on ScrapingHub and edit this post if I manage to have some time left today!

Sewake
  • Thanks for your input. The requests seem OK; no language issue so far. One thing: I'm using Python 3 locally, yet even though I put `stacks: default: scrapy:1.3-py3` in the yml, the log is showing me a 2.7 setup. Concerning response.body, it seems okay and comparable. We just have an "encoding issue": ScrapingHub is displaying some Unicode characters, but that shouldn't be a problem for the regex. – BoobaGump Jul 29 '18 at 18:05
  • Sewake, would you have any other idea to solve this headache-giving issue? – BoobaGump Jul 30 '18 at 17:53
  • Mhh, if you don't see anything strange in response.body, I don't have many ideas right now... If we focus on the first field, what is the output of `data_inter`? Is it what you expect? The encoding issue could be the reason, because of the `é` accent in `Ingrédients` and the `€` special character in the price field. (Sorry, I didn't have time to look in more detail.) – Sewake Jul 31 '18 at 08:35
  • 1
    Okay it was very simple. It was only the python 3.6 stack which wasn’t launching. It was no coding issue or scraping difficulty. Just the stack. – BoobaGump Jul 31 '18 at 21:27

Check that ScrapingHub's job logs display the expected version of Python, even if the stack is correctly set up in the project's yml file.
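As a minimal sketch (not the original spider), assuming the stack is declared in the project's yml the way the comments above show (`stacks: default: scrapy:1.3-py3`), you can confirm which interpreter actually runs on Scrapy Cloud by logging `sys.version` from inside a spider:

    import sys

    import scrapy


    class VersionCheckSpider(scrapy.Spider):
        # Hypothetical spider used only to verify the interpreter on Scrapy Cloud
        name = 'version_check'
        start_urls = ['https://example.com']  # placeholder URL

        def parse(self, response):
            # Shows up in the Scrapy Cloud job log; a -py3 stack should report 3.x here
            self.logger.info('Running under Python %s', sys.version)

If the job log reports 2.7 here, the py3 stack was not actually applied to the deployment, which is exactly what happened in this case.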

BoobaGump