Empty .json File in Scrapy

Question

I've written this very short spider to go to a U.S. News link and take the names of the colleges listed there.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import scrapy

class CollegesSpider(scrapy.Spider):
    name = "colleges"
    start_urls = [
        'http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities?_mode=list&acceptance-rate-max=20'
    ]

    def parse(self, response):
        for school in response.css('div.items'):
            yield {
                'name': school.xpath('//*[@id="view-1c4ddd8a-8b04-4c93-8b68-9b7b4e5d8969"]/div/div[1]/div[1]/h3/a').extract_first(),
            }

However, when I run this spider and ask for the names to be stored in a file named schools.json, the file comes out blank. What am I doing wrong?

@Umair what do you mean? My terminal output showed no errors. — ch1maera, Jan 28 '17 at 20:22
@Umair I did get this though "HTTP status code is not handled or not allowed" — ch1maera, Jan 28 '17 at 20:23
@ch1maera yeah, I replicated that. My guess is that the site's automated bot detection is blocking the request. You need to set some headers and pretend to be a browser — Bobby, Jan 28 '17 at 20:24
@Bobby so basically this: http://stackoverflow.com/questions/18920930/scrapy-python-set-up-user-agent ? — ch1maera, Jan 28 '17 at 20:35
@ch1maera yes! I just tested it out using the lighter-weight requests module. It worked. It should work like a charm after you set up the header. See answer below — Bobby, Jan 28 '17 at 20:36

score 1 · Accepted Answer · answered Jan 28 '17 at 20:34

Got it! It is because of the site's robot detection.

Send a browser-like User-Agent header:

>>> import requests
>>> r = requests.get('http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities?_mode=list&acceptance-rate-max=20', headers={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'})
>>> r.status_code
200

Then you will have all the content you need, and you can do whatever parsing or extraction you like. Setting the header in Scrapy should work in much the same way.

scrapy doc for request with headers

User agent for Chrome

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36

score 0 · Answer 2 · answered Jan 28 '17 at 20:26

I am on my mobile so I don't remember the exact variable name, but it should be something like robots_follow (the actual Scrapy setting is ROBOTSTXT_OBEY).

Set it to False
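For reference, a sketch of what that looks like, assuming a standard Scrapy project with a settings.py file. Note that when robots.txt is the cause, Scrapy normally logs "Forbidden by robots.txt" rather than "HTTP status code is not handled or not allowed", so this setting alone may not fix the error quoted in the question.

```python
# settings.py (project-level Scrapy settings)

# Do not download robots.txt or filter requests against it.
ROBOTSTXT_OBEY = False
```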

Umair Ayub


score 0 · Answer 3 · answered Jan 28 '17 at 20:26

The page you use as the start URL doesn't contain any element with the id view-1c4ddd8a-8b04-4c93-8b68-9b7b4e5d8969. That id looks very specific, so it is not a good basis for a reusable XPath expression. I'd recommend using something like school.xpath('.//div[@data-view="colleges-search-results-card"]//h3/a/text()').extract()

mizhgun


I tried that but I'm still getting "HTTP status code is not handled or not allowed" – ch1maera Jan 28 '17 at 20:29
