
I want to crawl around 500 articles from the Al Jazeera website and collect four fields for each article:

  • URL
  • Title
  • Tags
  • Author

I have written a script that collects data from the home page, but it only picks up a couple of articles; the rest are spread across different categories. How can I iterate through 500 articles? Is there an efficient way to do it?

import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.aljazeera.com/')
soup = BeautifulSoup(page.content, "html.parser")

# Only the "More top stories" section of the home page is parsed here
article = soup.find(id='more-top-stories')
inside_articles = article.find_all(class_='mts-article mts-default-article')

article_title = [a.find(class_='mts-article-title').get_text() for a in inside_articles]
article_dec = [a.find(class_='mts-article-p').get_text() for a in inside_articles]
tag = [a.find(class_='mts-category').get_text() for a in inside_articles]
link = [a.find(class_='mts-article-title').find('a') for a in inside_articles]
  • on the website there are only 6 articles under "more top stories"; there are not 500 articles there, and BeautifulSoup only extracts data from the parsed HTML – Manali Kagathara Jan 07 '20 at 11:59
  • the classes are different for different sections of the website. What is a better way to approach this problem? – MUK Jan 07 '20 at 12:27
  • yes, you can get articles from different categories, but there still are not 500 articles. – Manali Kagathara Jan 07 '20 at 12:57
  • is BeautifulSoup a better way, or should I explore other libraries as well? Can you suggest any? – MUK Jan 07 '20 at 12:59
  • if you want to scrape static or relatively simple websites, you should use BeautifulSoup. When you need to deal with heavily JavaScript-based web applications, Selenium would be a great choice. If you are dealing with a complex scraping operation that requires high speed and low resource consumption, Scrapy would be a great choice. – Manali Kagathara Jan 07 '20 at 13:01
  • one more thing: if you print the link variable, its output is something like – MUK Jan 07 '20 at 13:01
  • I only want the URL, but I could not work it out. – MUK Jan 07 '20 at 13:02
  • use the get_text() method to extract text from an element. – Manali Kagathara Jan 07 '20 at 13:06
  • get_text() (and the text attribute) extracts the text, which is the title in this case. I just want to extract the href out of it – MUK Jan 07 '20 at 13:12
  • `link = [inside_articles.find(class_='mts-article-title').find('a')['href'] for inside_articles in inside_articles]` try this – Manali Kagathara Jan 07 '20 at 13:16
  • it's my pleasure, you can upvote the comment if it helps. – Manali Kagathara Jan 08 '20 at 04:22
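
Putting the comment thread together with the original script: a minimal sketch of the home-page scraper with the href extraction applied (the selectors are taken from the question and may have changed on the live site):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.aljazeera.com/')
soup = BeautifulSoup(page.content, "html.parser")

# Only the "More top stories" section of the home page, as in the question
section = soup.find(id='more-top-stories')
articles = section.find_all(class_='mts-article mts-default-article')

# ['href'] pulls the URL out of the <a> tag instead of its text
links = [a.find(class_='mts-article-title').find('a')['href'] for a in articles]
print(links)

If the hrefs come back relative, urllib.parse.urljoin with the site's base URL turns them into absolute URLs.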

1 Answer

You can use Scrapy for this purpose.

import scrapy
import json

class BlogsSpider(scrapy.Spider):
    name = 'blogs'
    start_urls = [
        'https://www.aljazeera.com/news/2020/05/fbi-texas-naval-base-shooting-terrorism-related-200521211619145.html',
    ]

    def parse(self, response):
        for data in response.css('body'):
            # The article metadata is embedded in a JSON-LD <script> block;
            # grab the one containing 'mainEntityOfPage' and parse it.
            current_script = data.xpath("//script[contains(., 'mainEntityOfPage')]/text()").extract_first()
            json_data = json.loads(current_script)
            yield {
                'name': json_data['headline'],
                'author': json_data['author']['name'],
                'url': json_data['mainEntityOfPage'],
                'tags': data.css('div.article-body-tags ul li a::text').getall(),
            }

Save this spider to a file (e.g. file.py) inside your Scrapy project's spiders directory and run it with

$ scrapy crawl blogs -o output.json

But set up the Scrapy project structure first.
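
The spider above parses only one hard-coded article URL. To get to roughly 500 articles you also need to follow links from listing pages. The sketch below is one possible extension, not part of the original answer: the start URLs and the link filter are assumptions that will need adjusting to Al Jazeera's actual markup, and CLOSESPIDER_ITEMCOUNT (Scrapy's CloseSpider extension) stops the crawl after about 500 items.

import scrapy
import json

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    # Assumed listing pages; replace with the categories you actually want to crawl
    start_urls = [
        'https://www.aljazeera.com/news/',
        'https://www.aljazeera.com/sports/',
    ]
    # Stop the crawl once roughly 500 items have been scraped
    custom_settings = {'CLOSESPIDER_ITEMCOUNT': 500}

    def parse(self, response):
        # Assumed heuristic: treat links under /news/ or /sports/ as article links
        for href in response.css('a::attr(href)').getall():
            if '/news/' in href or '/sports/' in href:
                yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Same JSON-LD extraction as in the answer above
        script = response.xpath("//script[contains(., 'mainEntityOfPage')]/text()").get()
        if not script:
            return
        json_data = json.loads(script)
        yield {
            'name': json_data.get('headline'),
            'author': (json_data.get('author') or {}).get('name'),
            'url': json_data.get('mainEntityOfPage'),
            'tags': response.css('div.article-body-tags ul li a::text').getall(),
        }

Scrapy de-duplicates requests by default, so revisited links are not scraped twice; seeding more categories or also following pagination links widens the crawl if 500 items are not reached.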

Shahzaib Butt