
I am scraping news from a page with Scrapy; each news item consists of a title, meta text and a text summary. The code runs fine, but I have a problem with the dictionary output: it lists all titles first, then all meta texts, and finally all text summaries. What I need is one news item after another, each with its own title, meta text and text summary. I guess something is wrong with the for loop or the selectors?

Thanks for any help!

My Code:

import scrapy
class testspider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://oilprice.com/Latest-Energy-News/World-News']    

    def parse(self, response):
        all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')

        for singlenews in all_news:         
            title_item = singlenews.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()
            meta_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__meta"]//text()').extract()
            extract_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__excerpt"]//text()').extract()      

            yield {
                'title_data' : title_item,
                'meta_data' :  meta_item,
                'extract_data' : extract_item        
            }

Output:

{'title_data': ['Global Energy-Related CO2 Emissions Stopped Rising In 2019',
                'BHP Is Now The World’s Top Copper Miner',
                'U.S. Budget Proposal Includes Sale Of 15 Mln Barrels Strategic Reserve Oil',
                ...],
 'meta_data': ['Feb 11, 2020 at 12:02 | Tsvetana Paraskova',
               'Feb 11, 2020 at 11:27 | MINING.com ',
               'Feb 11, 2020 at 09:59 | Irina Slav',
               ...],
 'extract_data': ['The world’s energy-related carbon dioxide (CO2) emissions remained flat in 2019, halting two years of emissions increases, as lower emissions in advanced economies offset growing emissions elsewhere, the International Energy…',
                  'BHP Group on Monday became the world’s largest copper miner based on production after Chile’s copper commission announced a slide in output at state-owned Codelco.\r\nHampered by declining grades Codelco…',
                  'The budget proposal President Trump released yesterday calls for the sale of 15 million barrels of oil from the Strategic Petroleum Reserve of the United States.\r\nThe proceeds from the…',
                  ...]}
Jonathan Hall
Kate

2 Answers


From your output, it looks like your code extracts every title, meta text and excerpt at once and stores them all in a single dictionary. If you want one dictionary per news item on the website you are scraping, first collect all the data you need and then combine it into dictionaries. Your code would look something like this:

def parse(self, response):
    all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')  
    titles = all_news.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()
    meta_items = all_news.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__meta"]//text()').extract()
    extract_items = all_news.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__excerpt"]//text()').extract()      

    # at this point titles, meta_items and extract_items should be three parallel lists of the same length, so they can be combined by index

    news_items = []
    for i in range(len(titles)): 
        news = { 'title': titles[i], 'meta_data': meta_items[i], 'extract_data': extract_items[i] }
        news_items.append(news)
    return news_items

This should return the news posts as you desire.
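
The index loop above can also be written with zip. The following is a minimal sketch, not part of the original answer: pair_fields is a hypothetical helper name, and the excerpt strings are placeholders. It assumes each news item contributes exactly one string per field; since //text() can return several text nodes for a single element, the sketch checks the list lengths before pairing by position.

```python
def pair_fields(titles, meta_items, extract_items):
    """Combine three parallel lists into one dict per news item.

    pair_fields is a hypothetical helper used only for illustration.
    """
    # Pairing by position is only meaningful if the lists line up.
    if not (len(titles) == len(meta_items) == len(extract_items)):
        raise ValueError("field lists differ in length; positions would not line up")
    return [
        {"title": t, "meta_data": m, "extract_data": e}
        for t, m, e in zip(titles, meta_items, extract_items)
    ]


# Sample values: titles and meta text taken from the question's output,
# excerpts replaced by placeholders.
titles = [
    "Global Energy-Related CO2 Emissions Stopped Rising In 2019",
    "BHP Is Now The World’s Top Copper Miner",
]
meta_items = [
    "Feb 11, 2020 at 12:02 | Tsvetana Paraskova",
    "Feb 11, 2020 at 11:27 | MINING.com ",
]
extract_items = ["placeholder excerpt 1", "placeholder excerpt 2"]

news_items = pair_fields(titles, meta_items, extract_items)
for item in news_items:
    print(item)
```

In a spider, the resulting dictionaries could be yielded one by one instead of being collected in a list.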

sudo97
  • Thank you so much, that worked! I think I have to go back to the for loop lessons. My code worked on another webpage, so I'm still a bit puzzled why it worked there. Anyhow, I can now move on. Thank you! – Kate Feb 12 '20 at 11:57

When an XPath expression starts with //, the search runs over the entire document, even when it is called on a sub-selector such as singlenews. So the line

title_item = singlenews.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()

returns a list of the link text inside every div in the document that matches the filter div[@class="categoryArticle__content"], not only the one inside singlenews.

What you need is a path relative to singlenews. Try something like this:

title_item = singlenews.xpath('./div[@class="categoryArticle__content"]//a//text()').extract()

Ref: https://devhints.io/xpath
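
To illustrate the difference between searching the whole document and searching relative to one item, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in so that it runs without Scrapy, and the markup is a hypothetical, simplified version of the page (the real nesting is not shown in the question). In Scrapy, the corresponding relative call would be singlenews.xpath('.//a/text()').get(). Note that ./div matches only direct children of the current node, whereas .//div matches descendants at any depth, so which one fits depends on how the page is nested.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is assumed to be more deeply nested.
HTML = """
<div class="list">
  <div class="categoryArticle__content">
    <a>Title A</a>
    <p class="categoryArticle__meta">Meta A</p>
    <p class="categoryArticle__excerpt">Excerpt A</p>
  </div>
  <div class="categoryArticle__content">
    <a>Title B</a>
    <p class="categoryArticle__meta">Meta B</p>
    <p class="categoryArticle__excerpt">Excerpt B</p>
  </div>
</div>
"""

root = ET.fromstring(HTML)

# Searching from the document root returns every title at once.
all_titles = [a.text for a in root.findall(".//a")]

# Searching relative to each article node returns only that article's fields.
items = []
for article in root.findall(".//div[@class='categoryArticle__content']"):
    items.append({
        "title_data": article.findtext(".//a"),
        "meta_data": article.findtext(".//p[@class='categoryArticle__meta']"),
        "extract_data": article.findtext(".//p[@class='categoryArticle__excerpt']"),
    })

print(all_titles)
for item in items:
    print(item)
```

The loop only produces one dictionary per news item if the outer selection matches one node per article; if it matches a single container holding all articles, a relative query will still return everything inside that container.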

Breno Silva
  • Thank you for that hint. I tested it, but unfortunately there was no output at all anymore. I definitely need to look further into the cheat sheet you referenced. – Kate Feb 12 '20 at 11:59