
I'm having difficulty finalizing a crawler (more specifically, with the txt output file). It must have a header (h) and a footer (p) that should each be written only once, plus variable data (col) generated by Scrapy. Currently I include the header and footer manually, and I'm looking for a way to automate the process. I know that a plain text file doesn't have a header and a footer, but is there any way to simulate this without resorting to external modules?

filename = item['cat'] + '.txt'

f = open(filename, 'a')
h = 'As últimas notícias'
p = 'Você só encontra aqui'
col = item['title'] + '\n' + item['author'] + '\n' + item['img'] + '\n' + item['news']
f.write(h + '\n' + col + '\n' + p)
f.close()

Desired output:

As últimas notícias

title here
author here
img link here
news here

title here
author here
img link here
news here

title here
author here
img link here
news here

title here
author here
img link here
news here

Você só encontra aqui

1 Answer

Maybe you can use item pipelines, as shown here: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file. In open_spider you create the file descriptor and write the header, in close_spider you write the footer and close the file descriptor, and in process_item you write your content.
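
A minimal sketch of that pipeline approach, assuming a single fixed output file (the class name TxtExportPipeline and the file name news.txt are made up here) and the item fields from the question:

class TxtExportPipeline:
    def open_spider(self, spider):
        # Open the file once and write the header.
        self.file = open('news.txt', 'w', encoding='utf-8')
        self.file.write('As últimas notícias\n\n')

    def close_spider(self, spider):
        # Write the footer once and close the descriptor.
        self.file.write('Você só encontra aqui\n')
        self.file.close()

    def process_item(self, item, spider):
        # Write one block per scraped item.
        self.file.write(item['title'] + '\n' + item['author'] + '\n'
                        + item['img'] + '\n' + item['news'] + '\n\n')
        return item

Remember to enable the pipeline in the ITEM_PIPELINES setting for it to be called.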

You can also check this related question: Scrapy pipeline spider_opened and spider_closed not being called

UPD:

class MySpider(Spider):
    files = {}

    def parse(self, response):
        # build your item here, then:
        if item['cat'] in self.files:
            f = self.files[item['cat']]
        else:
            f = open(item['cat'] + '.txt', 'a')
            f.write('As últimas notícias\n\n')  # header, written once per category
            self.files[item['cat']] = f

        f.write(item['title'] + '\n' + item['author'] + '\n'
                + item['img'] + '\n' + item['news'] + '\n\n')

Then, in spider_closed, iterate over self.files, write the footers, and close the descriptors.
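
A sketch of that closing step, continuing the MySpider class above and assuming the spider_closed signal is connected via from_crawler, as in the Scrapy signals documentation:

from scrapy import Spider, signals

class MySpider(Spider):
    files = {}

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # Write the footer once per category file and close all descriptors.
        for f in self.files.values():
            f.write('Você só encontra aqui\n')
            f.close()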

vezunchik
  • For this specific example it doesn't work, for a reason: **spider_opened** and **spider_closed** don't receive the **item**. That option only exists in **process_item**. I don't know if you've noticed, but the file name is **item['cat'] + '.txt'**. I used your suggestion on another spider, in which the output name is fixed. This crawler fetches information from a news site and automatically saves a separate **txt** per category. This runs once a day, and each category's **txt** is emailed to a group related to that category. – Antonio Oliveira Mar 10 '19 at 14:29
  • I was trying to find a way, with **close_spider**, to open all the files **(*.txt)** and add the **header** (at the beginning) and the **footer** (at the end) of each file. In **open_spider** there is no way, because the filename is dynamic. Any suggestion? – Antonio Oliveira Mar 10 '19 at 14:44
  • Do you know the list of categories before the crawl? Maybe create the files from the full list (if it isn't huge) on `spider_opened` and remove the empty ones on `spider_closed`? Still not an elegant solution, though... – vezunchik Mar 10 '19 at 14:52
  • The categories are dynamic. They change periodically. And the quantity is not small. – Antonio Oliveira Mar 10 '19 at 14:59
  • I updated my post with a suggestion. We can store a dict of categories mapped to file descriptors: use the existing one or create a new one and store it in the dictionary, then write the content. Maybe it will fit your case. – vezunchik Mar 10 '19 at 15:10
  • Thank you! I think that's a great possibility. – Antonio Oliveira Mar 10 '19 at 15:53
  • I tested your dictionary suggestion (**files = {}**), but that way the header ends up at the end of the file, not at the beginning. I tried to use **seek(0)**, but it did not work. I'll keep searching. – Antonio Oliveira Mar 10 '19 at 18:31