I have the following code that is supposed to scrape the table-of-contents headings from a Wikipedia page. In the output CSV, I expect the main headings in column A and the subheadings in column B. My problem is with the subheadings: I get all the subheadings in one line, and I need each subheading in its own row. Here's my attempt (but I got only the first subheading, not all of them):

import scrapy

class WikipediaTocSpider(scrapy.Spider):
    name = 'wikipedia_toc'
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']

    def parse(self, response):
        for toc in response.css('.toclevel-1'):
            yield {
                'heading': toc.css('span.toctext::text').get(),
                'sub_headings': '\n'.join(toc.css('li.toclevel-2 a span.toctext::text').getall())
            }

I run this code from PowerShell like this: scrapy runspider wikipedia_toc.py -o output.csv

How can I remove the empty lists?


YasserKhalil

1 Answer


Unfortunately, CSV is not well suited to exporting nested values; JSON is much handier in this case. Otherwise, if you really do need CSV, you'll have to write a custom CSV exporter for your items. Take a closer look here: How to create custom Scrapy Item Exporter?
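
If a custom exporter feels like too much, another option is to flatten the nested list before yielding, so that every item is already one CSV row. The following is only a minimal sketch of that idea: `rows_for_section` is a hypothetical helper name, not part of Scrapy, and it assumes you want a section without subheadings to still appear once, with an empty second column.

```python
def rows_for_section(heading, sub_headings):
    """Yield one flat dict per subheading (hypothetical helper, not a Scrapy API).

    Both keys are always present, so every row has the same two columns.
    """
    if not sub_headings:
        # Assumption: keep sections without subheadings as a single row.
        yield {'heading': heading, 'sub_heading': ''}
        return
    for sub in sub_headings:
        yield {'heading': heading, 'sub_heading': sub}


# Standalone demonstration with made-up section names:
for row in rows_for_section('Syntax', ['Indentation', 'Statements']):
    print(row)
```

Inside the spider's `parse` loop this could be used as `yield from rows_for_section(toc.css('span.toctext::text').get(), toc.css('li.toclevel-2 a span.toctext::text').getall())`. Keeping both keys on every item should also help with the CSV header, since the column set is typically taken from the first exported item unless the fields are configured explicitly.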


To avoid exporting empty lists, just add a condition:

import scrapy

class WikipediaTocSpider(scrapy.Spider):
    name = 'wikipedia_toc'
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']

    def parse(self, response):
        for toc in response.css('.toclevel-1'):
            item = {}
            item['heading'] = toc.css('span.toctext::text').get()
            sub_headings = toc.css('li.toclevel-2 a span.toctext::text').getall()
            if sub_headings:
                item['sub_headings'] = sub_headings
            yield item

Michael Savchenko
  • I tried to export as JSON but didn't get the required output either. How can this be exported to JSON properly? – YasserKhalil Nov 03 '20 at 14:12
  • I think using this line as it is should be OK: `'sub_headings': toc.css('li.toclevel-2 a span.toctext::text').getall()`. But how can I avoid the empty lists? I have attached a snapshot to the question. – YasserKhalil Nov 03 '20 at 14:18
  • Thanks a lot. Can you post the whole code, as I am confused? Forgive me, I am a newbie at Python. – YasserKhalil Nov 04 '20 at 12:24
  • Thanks a lot. I tried but I got one column only in the CSV output. The second column is supposed to be for the subheadings. – YasserKhalil Nov 05 '20 at 02:41