
Hi, I am trying to scrape the website https://www.dawn.com/pakistan, but the Python find() and find_all() methods return empty lists. I have tried the html.parser, html5lib and lxml parsers, still no luck. The classes I am trying to scrape are present in the page source as well as in the soup object, but things don't seem to be working. Any help will be appreciated, thanks!

Code:

from bs4 import BeautifulSoup
import lxml
import html5lib
import urllib.request

url1 = 'https://www.dawn.com/pakistan'

req = urllib.request.Request(
    url1,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
url1UrlContent = urllib.request.urlopen(req).read()
soup1 = BeautifulSoup(url1UrlContent, 'lxml')

url1Section1 = soup1.find_all('h2', class_='story__title-size-five-text-black- font--playfair-display')
print(url1Section1)
Jawad Ahmad Khan
  • Possible duplicate of [BeautifulSoup findAll() given multiple classes?](https://stackoverflow.com/questions/18725760/beautifulsoup-findall-given-multiple-classes) – strippenzieher Dec 12 '18 at 14:22
  • I am trying to get specific classes and then extract data from them by further scraping. What I do not understand is the empty list and None return values when I go after the "div" and "article" tags with specific class names. I tried all the different parsers but no luck. – Jawad Ahmad Khan Dec 12 '18 at 15:00
  • My question is different; it has nothing to do with the marked duplicate. Any help will be appreciated, thanks! – Jawad Ahmad Khan Dec 12 '18 at 15:01

2 Answers


Your call should work as well (I just use a different syntax). The problem is that the class string you pass doesn't match the one in the HTML.

You have: 'story__title-size-five-text-black- font--playfair-display'

The page has: 'story__title size-five text-black font--playfair-display ' (the class names are separated by spaces, not hyphens). It's a very slight difference.

Replace:

url1Section1=soup1.find_all('h2', class_='story__title-size-five-text-black- font--playfair-display')

with:

url1Section1=soup1.find_all('h2', {'class':'story__title size-five text-black font--playfair-display '})

and see if that helps.
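
To illustrate why one string matches and the other does not, here is a minimal sketch on a made-up HTML snippet. The markup and headline text are assumptions for illustration, not the live dawn.com page, and it uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure of the page's headline tags.
html = '<h2 class="story__title size-five text-black"><a href="#">Headline</a></h2>'
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that carries that class.
by_single = soup.find_all("h2", class_="story__title")

# A string with spaces is compared against the whole class attribute,
# so it has to list the same classes in the same order.
by_exact = soup.find_all("h2", class_="story__title size-five text-black")

# Hyphens where the page has spaces: no tag has this as its class value.
by_hyphenated = soup.find_all("h2", class_="story__title-size-five-text-black")

# Same classes in a different order: also no match with find_all.
by_reordered = soup.find_all("h2", class_="size-five story__title text-black")

# The dictionary form behaves the same as the class_ keyword.
by_dict = soup.find_all("h2", {"class": "story__title size-five text-black"})

print(len(by_single), len(by_exact), len(by_hyphenated), len(by_reordered), len(by_dict))
# prints: 1 1 0 0 1
```

So class_ and the {'class': ...} dictionary are interchangeable; what matters is that the string is either one class name or the full class value exactly as the parser sees it.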

chitown88
  • Oh, thanks a lot, it worked. Can you please shed some light on why this worked and mine did not? I have been using my syntax to scrape other sites and it was working fine before, but it didn't work for this site. Also, the syntax you used is nowhere to be found in the bs4 documentation on searching by class name. – Jawad Ahmad Khan Dec 12 '18 at 15:13
  • It's the syntax I've just got accustomed to using. But your way DOES work as well. The issue is the string you have for the class: it's not exactly what's in the source HTML. Anyway, if the answer worked, please accept the answer. Cheers! – chitown88 Dec 12 '18 at 15:17
  • Oh, my mistake with the string. Please have a look at this class name: url1Section1=soup1.find_all('div', class_='col-sm-6 col-12') returns an empty list no matter what. – Jawad Ahmad Khan Dec 12 '18 at 15:29
  • That's weird, I get back 7 objects from that exact code. – chitown88 Dec 12 '18 at 17:57
  • Yes, it's weird. In the inspect element view, class_='col-sm-6 col-12' exists with two spaces between 'col-sm-6' and 'col-12', but in the page source it exists with one space in between. When you search for the version with two spaces using find_all, it returns an empty list, but when you search for the version with one space, it returns 8 objects. I don't know why the HTML looks different in the inspect element window and the page source window. – Jawad Ahmad Khan Dec 12 '18 at 18:23
  • another option `url1Section1=soup1.select('div.col-sm-6.col-12') ` – chitown88 Dec 12 '18 at 18:36
  • Thanks a lot for the help, much appreciated. I narrowed the problem down to this: first look at what a specific parser outputs, and take the class names from that output, since that is how the classes are named after going through the parser. Use the inspect element and page source views only to get an idea of the class name you want to target. – Jawad Ahmad Khan Dec 12 '18 at 18:54
  • url1Section1=soup1.select('div.col-sm-6.col-12') works perfectly, thanks a lot. Can you explain the statement "select('div.col-sm-6.col-12')"? – Jawad Ahmad Khan Dec 12 '18 at 18:56
  • there's plenty of info out there that you can find. a simple google search gets https://stackoverflow.com/questions/38028384/beautifulsoup-is-there-a-difference-between-find-and-select-python-3-x – chitown88 Dec 13 '18 at 06:46
  • if the solution works, can you please accept it above. – chitown88 Dec 13 '18 at 06:47
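
The suggestion in the comments above, checking class names as the parser reports them rather than as the browser shows them, can be sketched roughly as follows. The snippet is hypothetical (it is not taken from dawn.com) and deliberately puts two spaces between the class names.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; note the two spaces between the class names.
html = '<div class="col-sm-6  col-12"><p>story</p></div><div class="col-12">other</div>'
soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup stores class as a list of names, so the extra whitespace is gone.
parsed_classes = [div.get("class") for div in soup.find_all("div")]
print(parsed_classes)
# prints: [['col-sm-6', 'col-12'], ['col-12']]

# Searching with the names joined by one space matches; two spaces does not.
one_space = soup.find_all("div", class_="col-sm-6 col-12")
two_spaces = soup.find_all("div", class_="col-sm-6  col-12")
print(len(one_space), len(two_spaces))
# prints: 1 0
```

Printing tag.get("class") for a few candidate tags is a quick way to see which string find_all will actually compare against.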

I don't think you can pass compound class names like that. These are compound class names, i.e. several classes separated by spaces. I have used CSS selectors as a faster retrieval method; the spaces between the class names are replaced with ".".

If you are after the headers, you can use a slightly different selector combination:

import requests
from bs4 import BeautifulSoup

url = 'https://www.dawn.com/pakistan'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
items = [item.text.strip() for item in soup.select('h2[data-layout=story] a')]
print(items)

To limit to just those on the left you can use:

items = [item.text.strip() for item in soup.select('.story__title.size-five.text-black.font--playfair-display a')]

More broadly,

items = [item.text.strip() for item in soup.select('article [data-layout=story]')] 

As per your comment:

items = [item.text.strip() for item in soup.select('.col-sm-6.col-12')] 
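
As a rough illustration of how these selectors behave, here is a small self-contained sketch. The HTML is invented for the example (the real page structure may differ), and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the selectors used above.
html = """
<article data-layout="story">
  <h2 data-layout="story" class="story__title size-five text-black"><a href="/1">First headline</a></h2>
</article>
<div class="col-12 col-sm-6">
  <h2 data-layout="story" class="story__title"><a href="/2">Second headline</a></h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Dots chain class names: the tag must carry all of them, in any order.
divs = soup.select("div.col-sm-6.col-12")

# Attribute selector plus a descendant: every <a> inside a matching <h2>.
titles = [a.text.strip() for a in soup.select("h2[data-layout=story] a")]

# Requiring more classes narrows the match to the first headline only.
left = [a.text.strip() for a in soup.select(".story__title.size-five a")]

print(len(divs), titles, left)
# prints: 1 ['First headline', 'Second headline'] ['First headline']
```

Unlike a multi-class string passed to find_all, the dotted selector does not depend on the order or spacing of the class names in the source.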
QHarr
  • I am trying to get specific classes and then extract data from them by further scraping. What I do not understand is the empty list and None return values when I go after the "div" and "article" tags with specific class names. I tried all the different parsers but no luck. – Jawad Ahmad Khan Dec 12 '18 at 14:54
  • there is an example of grabbing articles above. Inside of that there isn't content in the divs. Can you give a specific example of div content you are expecting to see? – QHarr Dec 12 '18 at 15:08
  • Whatever class name I try to get using url1Section1=soup1.find_all('h2', class_='story__title-size-five-text-black- font--playfair-display'), it returns an empty list or None. I tried all the parsers but no luck. – Jawad Ahmad Khan Dec 12 '18 at 15:19
  • It works fine, but when I try url1Section1=soup1.find_all('h2', class_='story__title-size-five-text-black- font--playfair-display') for other classes, it returns empty lists or None. I have not been able to understand the problem. For example, my code for this class doesn't work: url1Section1=soup1.find_all('div', class_='col-sm-6 col-12'). – Jawad Ahmad Khan Dec 12 '18 at 15:39
  • I think it is due to the spaces. These are compound class names. I have used css selectors as a faster retrieval method and you fill the spaces with "." in the class names. – QHarr Dec 12 '18 at 15:54
  • No, it's still not working after filling the spaces with ".", which is really annoying. I guess CSS selectors are the way forward, but if you can shed some light on the problem it will be much appreciated, thanks. Or can you please extract the class I mentioned above by any method? class_='col-sm-6 col-12' – Jawad Ahmad Khan Dec 12 '18 at 16:04
  • Did this now answer your question? – QHarr Dec 13 '18 at 17:15