
I've already crawled the descriptions of the articles. Now I'm trying to scrape the description of a video from the BBC News website, but it returns an empty string. Any advice?

This is my code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcNewsSpider(CrawlSpider):
    name = 'BBCNews'
    start_urls = ['https://www.bbc.com/']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//li[contains(@class,'orb-nav-home')]//a",
                           process_value=lambda x: x[0:16] + 'com'),
             callback='parse_home'),
        Rule(LinkExtractor(allow='bbc.com',
                           restrict_xpaths='//div[contains(@class,"module__content")]'
                                           '//div[contains(@class,"media") and not(contains(@class,"media--icon"))]'
                                           '//a[contains(@class,"block-link__overlay-link")]',
                           process_value=lambda x: 'https://www.bbc.com' + x if x[0:1] == '/' else x),
             callback='parse_item'),
    )

This is the function I'm using:

    # requires: from pyquery import PyQuery as pq
    def parse_home(self, response):
        if response.status == 200:
            doc = pq(response.text)
            for media in doc('div.media--video').items():
                item = BbcmediaItem()
                item['url'] = media.find('a.media__link').attr('href')
                item['title'] = media.find('a.media__link').text().strip()
                item['Type'] = media.find('a.media__tag').text()
                item['description'] = media.find('p.story-body__introduction').text().strip()
                yield item
MarMarhoun
  • First check whether the page uses JavaScript to add elements: turn off JavaScript in your web browser and reload the page to see what you get without it. If the elements disappear, you will have to use Selenium to control a real web browser that can run JavaScript. – furas May 13 '20 at 06:16
  • I don't see `p.story-body__introduction` in the HTML on the main page; maybe you are using the wrong name. I don't see any description for the videos either. Or maybe it is only used for some devices (e.g. mobile phones) or some countries. – furas May 13 '20 at 06:29
  • Thank you for your help, I will try it. About `p.story-body__introduction`: I did use it to scrape the articles, but when I apply it to videos it doesn't work. I even tried `p.media__summary`, with the same result! Do you have any other advice that could help? – MarMarhoun May 13 '20 at 20:42
  • Why do you think `p.story-body__introduction` exists for videos? I don't see this element in the HTML. First check (manually) in the HTML what you can actually get; don't try to guess. By the way, when I visit the page I don't see any description or summary for the videos. It seems you are trying to get an element that never existed. – furas May 14 '20 at 01:31
  • The only `media__summary` belongs to the main video, but it is not in class `media--video`; it is in class `video__player`. All the videos in class `media--video` are without a `summary`. – furas May 14 '20 at 01:40

2 Answers


I myself have made a scraper that scrapes titles from Yahoo News. Your code is okay; the problem may be that BBC News is not serving you the descriptions of the videos.

Try using a proxy.

OR

scrape Yahoo News instead, because it is easy to scrape.

This is my code that scrapes all paragraphs from Yahoo News; you can adapt it to whatever you like:

import bs4
import requests
import re
import unicodedata
import os
import datetime

# Note: ':' is not allowed in Windows file names, so use '-' in the time part
time = datetime.datetime.today().strftime('%d-%b-%Y -- %H-%M')

filename = "Yahoo World News " + time
filename = r"D:\Huzefa\Desktop\News\World" + "\\" + filename + ".txt"

url = "https://news.yahoo.com/"
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "lxml")

with open(filename, 'wb') as file:
    for i in soup.select("p"):
        # strip bracketed asides like "(...)" or "[...]" and normalize to ASCII
        text = re.sub(r"[\(\[].*?[\)\]]", "", i.text)
        file.write(unicodedata.normalize('NFD', text).encode('ascii', 'ignore'))
        file.write(os.linesep.encode('ascii') * 2)

Hope this works for you =)

huzefausama
  • Thanks so much, Usama, for your advice and for sharing your code with me, I appreciate it. Unfortunately, I have a specific task I must do, and the major part of it is scraping the BBC News website! – MarMarhoun May 13 '20 at 20:51
  • Do you have any other advice that could help? – MarMarhoun May 13 '20 at 20:53

Do you have any other advice that could help? – MarMarhoun

You could download the application Scrape-Storm, an AI-powered visual web scraping tool built by an ex-Google crawler team. No programming is needed: it offers visual operation and is easy to use. It can scrape a whole page, you can select the tags you want to scrape, and you can export your data in different formats. I hope this helps you.

I am sorry if I am not allowed to post this; I am kind of new to Stack Overflow.

My intention is just to help people.

huzefausama