My file Reddit.py contains this Spider:
import scrapy


class RedditSpider(scrapy.Spider):
    name = 'Reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://old.reddit.com']

    def parse(self, response):
        for link in response.css('li.first a.comments::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        topics = {}
        topics["title"] = response.css('a.title::text').extract_first()
        topics["author"] = response.css('p.tagline a.author::text').extract_first()
        if response.css('div.score.likes::attr(title)').extract_first() is not None:
            topics["score"] = response.css('div.score.likes::attr(title)').extract_first()
        else:
            topics["score"] = "0"
        if int(topics["score"]) > 10000:
            author_url = response.css('p.tagline a.author::attr(href)').extract_first()
            yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
        else:
            yield topics

    def parse_user(self, response):
        topics = response.meta.get('topics')
        users = {}
        users["name"] = topics["author"]
        users["karma"] = response.css('span.karma::text').extract_first()
        yield users
        yield topics
It collects all the URLs from the main page of old.reddit.com, then scrapes each URL's title, author, and score.
I've added a second part that checks whether the score is higher than 10000. If it is, the Spider goes to the user's page and scrapes their karma from it.
I understand that I could scrape the karma from the topic's page, but I would like to do it this way because I also scrape other parts of the user's page that don't exist on the topic's page.
What I want to do is export the topics list, which contains title, author, and score, into a JSON file named topics.json. Then, if the topic's score is higher than 10000, I want to export the users list, which contains name and karma, into a JSON file named users.json.
I only know how to use the command line:
scrapy runspider Reddit.py -o Reddit.json
This exports all the lists into a single JSON file named Reddit.json, but in a bad structure like this:
[
{"name": "Username", "karma": "00000"},
{"title": "ExampleTitle1", "author": "Username", "score": "11000"},
{"name": "Username2", "karma": "00000"},
{"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
{"name": "Username3", "karma": "00000"},
{"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
{"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
....
]
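For what it's worth, a mixed file like the one above could probably be separated after the crawl has finished. The following is only a minimal sketch under my own assumptions: `split_items` and `split_file` are hypothetical helper names, user items are assumed to be the ones that carry a "karma" field, and "duplicate" is taken to mean an item whose fields and values are all identical to an earlier one.

```python
import json


def split_items(items):
    """Separate mixed scraped dicts into (topics, users), dropping exact duplicates."""
    topics, users = [], []
    seen = set()
    for item in items:
        # Identical dicts produce the same sorted tuple of (field, value) pairs.
        key = tuple(sorted(item.items()))
        if key in seen:
            continue
        seen.add(key)
        # Assumption: only user items have a "karma" field.
        if "karma" in item:
            users.append(item)
        else:
            topics.append(item)
    return topics, users


def split_file(src="Reddit.json", topics_path="topics.json", users_path="users.json"):
    """Read the mixed export and write the two separate JSON files."""
    with open(src, encoding="utf-8") as f:
        topics, users = split_items(json.load(f))
    with open(topics_path, "w", encoding="utf-8") as f:
        json.dump(topics, f, indent=2)
    with open(users_path, "w", encoding="utf-8") as f:
        json.dump(users, f, indent=2)
```

Calling `split_file()` after `scrapy runspider Reddit.py -o Reddit.json` would then produce the two files, but it is a post-processing workaround rather than something Scrapy does during the crawl.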
I have no knowledge at all of Scrapy's Item Pipeline, Item Exporters, or Feed Exporters, neither how to implement them in my Spider nor how to use them in general. I tried to understand them from the documentation, but I still don't see how to apply them to my Spider.
The final result I want is two files:
topics.json
[
{"title": "ExampleTitle1", "author": "Username", "score": "11000"},
{"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
{"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
{"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
....
]
users.json
[
{"name": "Username", "karma": "00000"},
{"name": "Username2", "karma": "00000"},
{"name": "Username3", "karma": "00000"},
....
]
Duplicates in each list should also be removed.
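From what I can tell from the documentation, an item pipeline is a plain class with `process_item(self, item, spider)` and optional `open_spider` / `close_spider` methods, so my rough guess at the shape of a solution is something like the sketch below. The class name `SplitJsonPipeline` and its attributes are hypothetical, it assumes user items are the ones with a "karma" field, and it uses the standard `json` module instead of Scrapy's Item Exporters, so it may not be the idiomatic way to do this.

```python
import json


class SplitJsonPipeline:
    """Hypothetical pipeline: routes items into two lists and writes two JSON files."""

    # Output paths; assumed names taken from the desired result.
    topics_path = "topics.json"
    users_path = "users.json"

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.topics = []
        self.users = []
        self.seen = set()

    def process_item(self, item, spider):
        data = dict(item)
        # Skip items whose fields and values all match one already stored.
        key = tuple(sorted(data.items()))
        if key not in self.seen:
            self.seen.add(key)
            # Assumption: only user items have a "karma" field.
            target = self.users if "karma" in data else self.topics
            target.append(data)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; write both files.
        for path, rows in ((self.topics_path, self.topics), (self.users_path, self.users)):
            with open(path, "w", encoding="utf-8") as f:
                json.dump(rows, f, indent=2)
```

As far as I understand, the class would have to be enabled through the `ITEM_PIPELINES` setting, for example `{'myproject.pipelines.SplitJsonPipeline': 300}`, where `myproject.pipelines` is a placeholder for wherever the class actually lives, and the spider would then be run without the `-o` option.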