I wrote some code to scrape data from a site, but I have run into a problem. The site is a news portal.

        articleIndex = 0
        for div in mainPage_soup.findAll('div', attrs={'class':'title'}):
            if(articleIndex<2):                
                article = requests.get(article_url)
                article_soup = BeautifulSoup(article.content, "html.parser") 
           
                d=""
                date_soup = BeautifulSoup(html)
                d=date_soup.find('time', class_='article-datetime').get_text()
                print(d)

                article_content_str = ""
                text = article_soup.find('div', class_='article-content entry-content')
                for item in text.find_all('p'):
                    text = "#" + item.text
                    article_content_str += text                

The site is hvg.hu.
I get a NoneType error for both the date and the p elements. The date is the article's release date, and the p elements hold the article text, sentence by sentence.

I tried several things for the date (plain .text, get_text), but nothing works.

The same approach works on other sites (if I fill in their class names).

I don't know where the problem is.
Maybe I chose the wrong divs?

1 Answer

There is no one-size-fits-all solution, so you have to decide case by case.
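
For context, the NoneType error in the question most likely comes from `find()` returning `None` when nothing matches, so the chained `.get_text()` call fails. A minimal sketch of that failure, using a made-up HTML fragment (not the real hvg.hu markup) that has no `time` tag:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration: there is no <time> element in it.
html = '<div class="article"><p>Some text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when no element matches.
tag = soup.find('time', class_='article-datetime')
print(tag)

# Calling a method on None raises AttributeError.
try:
    tag.get_text()
except AttributeError as error:
    print(error)
```

This is why each lookup should be checked before its text is read.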

  • Check which links you want to scrape and how they differ:

    mainPage_soup.select('h1 a[title][href]')
    
  • Check on each article page whether the elements you expect are there (the walrus operator needs Python 3.8 or later):

    if (t := article_soup.find('time', class_='article-datetime')):
        time = t.get_text(strip=True)
    elif (t := article_soup.select_one('label:-soup-contains("megjelent") + p')):
        time = t.get_text(strip=True)
    else:
        time = None
    

    Otherwise, use a regular if/elif/else statement:

    if article_soup.find('time', class_='article-datetime'):
        time = article_soup.find('time', class_='article-datetime').get_text(strip=True)
    elif article_soup.select_one('label:-soup-contains("megjelent") + p'):
        time = article_soup.select_one('label:-soup-contains("megjelent") + p').get_text(strip=True)
    else:
        time = None
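
The same fallback chain can also be written as a small helper that works on Python 3.7 as well. This is only a sketch: `first_text` and `DATE_SELECTORS` are hypothetical names, the `find('time', class_=...)` check is rewritten as the equivalent CSS selector, and the HTML fragments are made up to stand in for the different article layouts.

```python
from bs4 import BeautifulSoup


def first_text(soup, selectors):
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        tag = soup.select_one(selector)
        if tag is not None:
            return tag.get_text(strip=True)
    return None


# The two date layouts checked above, both expressed as CSS selectors.
DATE_SELECTORS = [
    'time.article-datetime',
    'label:-soup-contains("megjelent") + p',
]

# Made-up fragments standing in for three different article pages.
layout_a = BeautifulSoup('<time class="article-datetime"> 2022. 04. 05. </time>', 'html.parser')
layout_b = BeautifulSoup('<div><label>megjelent</label><p>2022.04.05.</p></div>', 'html.parser')
layout_c = BeautifulSoup('<p>no date on this page</p>', 'html.parser')

for layout in (layout_a, layout_b, layout_c):
    print(first_text(layout, DATE_SELECTORS))
```

Adding another layout then only means appending one more selector to the list.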
    
Example

Sliced to the first 20 results; drop the [:20] from the for loop if you want to scrape more, but be gentle and add some delay between iterations:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://hvg.hu'

headers = ({'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept-Language': 'en-US, en;q=0.5'})

r = requests.get(url, headers=headers)
mainPage_soup = BeautifulSoup(r.content, 'html.parser')
data = []

for a in mainPage_soup.select('h1 a[title][href]')[:20]:
    # Relative links need the site prefix; absolute links are used as-is.
    href = a.get('href')
    article_url = href if href.startswith('http') else url + href
    article = requests.get(article_url)

    article_soup = BeautifulSoup(article.content, 'html.parser')

    # Date: try each known layout in turn (walrus operator, Python 3.8+).
    if (t := article_soup.find('time', class_='article-datetime')):
        time = t.get_text(strip=True)
    elif (t := article_soup.select_one('label:-soup-contains("megjelent") + p')):
        time = t.get_text(strip=True)
    else:
        time = None

    # Text: select() returns an empty list when nothing matches.
    if (t := article_soup.select('.article p')):
        text = t
    elif (t := article_soup.select('.article-content p')):
        text = t
    else:
        text = []

    data.append({
        'time': time,
        'text': ' '.join([p.get_text(strip=True) for p in text]),
        'url': article_url
    })

print(data)
#or pd.DataFrame(data).to_csv('yourFile.csv', index=False)
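
The delay mentioned above is not part of the example. One possible way to add it is sketched below; `fetch_politely`, its `delay_seconds` default of one second, and the 10 second timeout are all assumptions chosen for illustration, not values recommended by the site.

```python
import time

import requests


def fetch_politely(urls, delay_seconds=1.0, fetch=None):
    """Yield (url, content) pairs, pausing between consecutive requests."""
    if fetch is None:
        # Default fetcher: a plain GET that raises on HTTP error status codes.
        def fetch(url):
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.content

    for index, url in enumerate(urls):
        if index:
            # Sleep before every request except the first one.
            time.sleep(delay_seconds)
        yield url, fetch(url)
```

It could be used as `for article_url, content in fetch_politely(article_urls):` followed by `BeautifulSoup(content, 'html.parser')`, where `article_urls` is a list of already resolved links. The optional `fetch` argument exists only so the loop can be tried out without touching the network.
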
HedgeHog
  • Thanks a lot. Why do I get invalid syntax at this part `t := `? Forgive me, as I am a beginner at Python. – YasserKhalil Apr 05 '22 at 07:18
  • Maybe a python version issue - Is your version up to date? You can also go with `if article_soup.find('time', class_='article-datetime'): time = article_soup.find('time', class_='article-datetime').get_text(strip=True)` .... – HedgeHog Apr 05 '22 at 07:51
  • I am using python 3.7.5 – YasserKhalil Apr 05 '22 at 08:27
  • Thanks - Edited my answer and added the version requirement as well as a regular if/else statement example. – HedgeHog Apr 05 '22 at 08:39
  • I just wonder why it doesn't work on my side although python is updated to version 3 – YasserKhalil Apr 05 '22 at 08:46
  • It is specific to Python 3.8 and later - So you may have to check how a version [update/upgrade works for your OS](https://levelup.gitconnected.com/a-guide-to-upgrade-your-python-to-3-9-44ccb3eae31a) – HedgeHog Apr 05 '22 at 09:03
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/243610/discussion-between-hedgehog-and-yasserkhalil). – HedgeHog Apr 05 '22 at 09:07