
My file Reddit.py contains this Spider:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'Reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://old.reddit.com']

    def parse(self, response):
        # Follow every comments link on the front page
        for link in response.css('li.first a.comments::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        topics = {}
        topics["title"] = response.css('a.title::text').extract_first()
        topics["author"] = response.css('p.tagline a.author::text').extract_first()

        score = response.css('div.score.likes::attr(title)').extract_first()
        if score is not None:
            topics["score"] = score
        else:
            topics["score"] = "0"

        if int(topics["score"]) > 10000:
            # High score: go to the author's page, carrying the topic along in meta
            author_url = response.css('p.tagline a.author::attr(href)').extract_first()
            yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
        else:
            yield topics

    def parse_user(self, response):
        topics = response.meta.get('topics')

        users = {}
        users["name"] = topics["author"]
        users["karma"] = response.css('span.karma::text').extract_first()

        # Emit both the user and the topic it came from
        yield users
        yield topics

What it does is fetch the comments URL of every topic on the front page of old.reddit.com, then scrape each topic's title, author and score.

What I've added is a second part that checks whether the score is higher than 10000; if it is, the Spider goes to the user's page and scrapes their karma from it.

I do understand that I could scrape the karma from the topic's page, but I would like to do it this way, since there are other parts of the user's page I scrape that don't exist on the topic's page.

What I want to do is export the topics data (title, author, score) into a JSON file named topics.json, and, when the topic's score is higher than 10000, export the users data (name, karma) into a JSON file named users.json.

I only know how to use the command line:

scrapy runspider Reddit.py -o Reddit.json

which exports all the items into a single JSON file named Reddit.json, but in a bad structure like this:

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
  ....
]

I have no knowledge at all of Scrapy's Item Pipelines, Item Exporters, or Feed Exporters, or of how to implement them in my Spider. I tried to understand them from the documentation, but I couldn't work out how to use them in my Spider.


The final result I want is two files:

topics.json

[
 {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
 {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
 {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
 {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
 ....
]

users.json

[
  {"name": "Username", "karma": "00000"},
  {"name": "Username2", "karma": "00000"},
  {"name": "Username3", "karma": "00000"},
  ....
]

while getting rid of duplicates in the lists.

Toleo
  • What's the desired output format? Also, I understand that you want to output (yield) at most one item per topic found. – Apalala Jun 16 '18 at 19:26
  • @Apalala I actually want each `yield` output to have its own `JSON` file, instead of all in a single file. – Toleo Jun 16 '18 at 19:50

2 Answers


Applying the approach from the SO thread below:

Export scrapy items to different files

I created a sample scraper:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {"type": "unknown item"}
        yield {"title": "ExampleTitle1", "author": "Username", "score": "11000"}
        yield {"name": "Username", "karma": "00000"}
        yield {"name": "Username2", "karma": "00000"}
        yield {"someothertype": "unknown item"}

        yield {"title": "ExampleTitle2", "author": "Username2", "score": "12000"}
        yield {"title": "ExampleTitle3", "author": "Username3", "score": "13000"}
        yield {"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
        yield {"name": "Username3", "karma": "00000"}

And then in exporters.py:

from scrapy.exporters import JsonItemExporter
from scrapy.extensions.feedexport import FileFeedStorage


class JsonMultiFileItemExporter(JsonItemExporter):
    # Item types that get their own output file
    types = ["topics", "users"]

    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        self.files = {}
        self.kwargs = kwargs

        # Open one extra exporter per item type, writing to <type>.json
        for itemtype in self.types:
            storage = FileFeedStorage(itemtype + ".json")
            file = storage.open(None)
            self.files[itemtype] = JsonItemExporter(file, **self.kwargs)

    def start_exporting(self):
        super().start_exporting()
        for exporter in self.files.values():
            exporter.start_exporting()

    def finish_exporting(self):
        super().finish_exporting()
        for exporter in self.files.values():
            exporter.finish_exporting()
            exporter.file.close()

    def export_item(self, item):
        # Route each item by its fields: topics have a title, users have karma
        if "title" in item:
            itemtype = "topics"
        elif "karma" in item:
            itemtype = "users"
        else:
            itemtype = "self"

        # Anything unrecognised goes to the default feed (the -o file)
        if itemtype == "self" or itemtype not in self.files:
            super().export_item(item)
        else:
            self.files[itemtype].export_item(item)

Add the following to settings.py:

FEED_EXPORTERS = {
    'json': 'testing.exporters.JsonMultiFileItemExporter',
}
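
The exporter is registered for the default json feed format, so the spider still has to be run with an output file. In my sample project (project name testing, spider name example) that is the following; substitute your own project and spider names:

scrapy crawl example -o example.json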

Running the scraper, I get 3 files generated:

example.json

[
{"type": "unknown item"},
{"someothertype": "unknown item"}
]

topics.json

[
{"title": "ExampleTitle1", "author": "Username", "score": "11000"},
{"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
{"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
{"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
]

users.json

[
{"name": "Username", "karma": "00000"},
{"name": "Username2", "karma": "00000"},
{"name": "Username3", "karma": "00000"}
]
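
The question also asks for duplicates to be removed, which neither exporter does by itself. A minimal item-pipeline sketch along the lines of the DuplicatesPipeline example in the Scrapy docs, keyed on the user's name, would be:

from scrapy.exceptions import DropItem


class DuplicateUsersPipeline:
    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        # Only user items carry "karma"; everything else passes through
        if "karma" in item:
            if item["name"] in self.names_seen:
                raise DropItem("Duplicate user: %s" % item["name"])
            self.names_seen.add(item["name"])
        return item

and enable it in settings.py (the module path assumes my testing project; adjust to yours):

ITEM_PIPELINES = {
    'testing.pipelines.DuplicateUsersPipeline': 300,
}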
Tarun Lalwani
  • So I've made these changes in my files: exporters.py => https://ghostbin.com/paste/wb3qa - settings.py => https://ghostbin.com/paste/td2qn - Reddit.py => https://ghostbin.com/paste/4yyhs | Then I ran `scrapy runspider Reddit.py`, but nothing happened. Did I miss something, or am I supposed to move these files into a single folder? – Toleo Jun 19 '18 at 13:31
  • You need to use `-o example.json` – Tarun Lalwani Jun 19 '18 at 14:29
  • Got this error `ModuleNotFoundError: No module named 'testing'` – Toleo Jun 19 '18 at 17:39
  • That is because testing was the scrapy project name I had created. You need to update that to your project name – Tarun Lalwani Jun 20 '18 at 01:20

The spider is yielding two items when it crawls a user page. Perhaps it would work if:

def parse_user(self, response):
    topics = response.meta.get('topics')

    users = {}
    users["name"] = topics["author"]
    users["karma"] = response.css('span.karma::text').extract_first()
    topics["users"] = users

    yield topics

You can then post-process the JSON as you need.
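
For instance, a minimal sketch (assuming the combined feed was written to Reddit.json and each topic carries the nested "users" dict from parse_user above) that splits it into the two requested files and drops duplicate users:

import json

with open('Reddit.json') as f:
    records = json.load(f)

topics, users, seen = [], [], set()
for record in records:
    user = record.pop('users', None)  # detach the nested user, if any
    topics.append(record)
    if user and user['name'] not in seen:  # skip duplicate users
        seen.add(user['name'])
        users.append(user)

with open('topics.json', 'w') as f:
    json.dump(topics, f)
with open('users.json', 'w') as f:
    json.dump(users, f)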

BTW, I don't understand why you use the plural ("topics") when dealing with single elements (a single "topic").

Apalala
  • I aim to put each in a different file with the plural names `topics` and `users`; that's why I used the plural, just to indicate which file each list belongs to. – Toleo Jun 18 '18 at 12:16
  • Scrapy will produce a single stream of JSON records. It's easy to post-process that stream using tools like `jq`. If you want two types of record in the stream, add a `type` field to each to make them easy to differentiate. – Apalala Jun 20 '18 at 17:09