
I've already crawled the descriptions of the articles. Now I'm trying to scrape the description of a video from the BBC News website, but it returns an empty string. Any advice?

This is my code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcNewsSpider(CrawlSpider):
    name = 'BBCNews'
    start_urls = ['https://www.bbc.com/']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//li[contains(@class,'orb-nav-home')]//a",
                           process_value=lambda x: x[0:16] + 'com'),
             callback='parse_home'),
        Rule(LinkExtractor(allow='bbc.com',
                           restrict_xpaths='//div[contains(@class,"module__content")]'
                                           '//div[contains(@class,"media") and not(contains(@class,"media--icon"))]'
                                           '//a[contains(@class,"block-link__overlay-link")]',
                           process_value=lambda x: 'https://www.bbc.com' + x if x[0:1] == '/' else x),
             callback='parse_item'),
    )

This is the function I'm using:

    # requires: from pyquery import PyQuery as pq
    def parse_home(self, response):
        if response.status == 200:
            doc = pq(response.text)
            for media in doc('div.media--video').items():
                item = BbcmediaItem()
                item['url'] = media.find('a.media__link').attr('href')
                item['title'] = media.find('a.media__link').text().strip()
                item['Type'] = media.find('a.media__tag').text()
                item['description'] = media.find('p.story-body__introduction').text().strip()
                yield item
MarMarhoun
  • First check whether the page uses JavaScript to add elements: turn off JavaScript in your web browser and reload the page to see what you get without it. If the elements disappear, you will have to use Selenium to control a real web browser that can run JavaScript. – furas May 13 '20 at 06:16
  • I don't see `p.story-body__introduction` in the HTML on the main page; maybe you are using the wrong name. I don't see any description for the videos either. Or maybe it is only used for some devices (e.g. mobile phones) or some countries. – furas May 13 '20 at 06:29
  • Thank you for your help, I will try it. About `p.story-body__introduction`: I did use it to scrape the articles, but when I apply it to videos it doesn't work. I even tried `p.media__summary`, with the same result! Do you have any other advice that could help? – MarMarhoun May 13 '20 at 20:42
  • Why do you think `p.story-body__introduction` exists for videos? I don't see this element in the HTML. First check (manually) in the HTML what you can actually get; don't try to guess. By the way, when I visit the page I don't see any description or summary for the videos. It seems you are trying to get an element that never existed. – furas May 14 '20 at 01:31
  • The only `media__summary` belongs to the main video, but it is not in class `media--video`; it is in class `video__player`. All the videos in class `media--video` are without a `summary`. – furas May 14 '20 at 01:40

2 Answers


I myself have made a scraper that scrapes titles from Yahoo News. Your code is okay; the problem may be that BBC News is not serving you the descriptions of the videos.

Try using a proxy.

OR

scrape Yahoo News instead, because it is easy to scrape.

This is my code that scrapes all paragraphs from Yahoo News; you can adapt it to whatever you like:

import bs4
import requests
import re
import unicodedata
import os
import datetime

# Note: ':' is not allowed in Windows file names, so use '-' in the time part
time = datetime.datetime.today().strftime('%d-%b-%Y -- %H-%M')

filename = "Yahoo World News " + time
filename = r"D:\Huzefa\Desktop\News\World" + "\\" + filename + ".txt"

url = "https://news.yahoo.com/"
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "lxml")

with open(filename, 'wb') as file:
    for i in soup.select("p"):
        # strip bracketed asides like "(...)" or "[...]" and normalize to ASCII
        text = re.sub(r"[\(\[].*?[\)\]]", "", i.text)
        file.write(unicodedata.normalize('NFD', text).encode('ascii', 'ignore'))
        file.write(os.linesep.encode('ascii') * 2)

Hope this works for you =)

huzefausama
  • Thanks so much, Usama, for your advice and for sharing your code with me, I appreciate it. Unfortunately, I have a specific task I must do, and the major part of it is scraping the BBC News website! – MarMarhoun May 13 '20 at 20:51
  • Do you have any other advice that could help? – MarMarhoun May 13 '20 at 20:53

Do you have any other advice that could help? – MarMarhoun

You could download the application Scrape-Storm, an AI-powered visual web scraping tool built by an ex-Google crawler team. No programming is needed: it offers visual operation and is easy to use. It can scrape a whole page, you can select the tags you want to scrape, and you can export your data in different formats. I hope this helps you.

I am sorry if I am not allowed to post this; I am kind of new to Stack Overflow.

My intention is just to help people.

huzefausama