Is this way to get items from a tag which has 2 class attributes with BeautifulSoup correct?

Question

I'd like to get items from a website with BeautifulSoup.

<div class="post item">

This is the target tag. Its class attribute holds two class names separated by a space.

First, I wrote,

roots = soup.find_all("div", "post item")

But, it didn't work. Then I wrote,

html.find_all("div", {'class':['post', 'item']})

I could get items with this, but I am not sure whether it is correct. Is this code correct?

//// Additional ////

I am sorry,

html.find_all("div", {'class':['post', 'item']})

didn't work properly. It also extracts elements whose class is only "item".

It seems I had to write

soup.find_all("div", class_="post item")

with class_= rather than class=. However, this doesn't work for me either... (>_<)

Target url:

https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb

mycode:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

def main():
    target = "https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb"
    html = urlopen(target)
    soup = BeautifulSoup(html, "html.parser")
    roots = soup.find_all("div", class_="post item")
    print(roots)
    # Print a separator for every matched element
    for root in roots:
        print("##################")


if __name__ == '__main__':
    main()

score 3 · Accepted Answer · edited May 23 '17 at 12:07

You could use a CSS selector:

soup.select("div.post.item")

Or use class_:

soup.find_all("div", class_="post item")

The docs suggest that if you want to search for tags that match two or more CSS classes, you should use a CSS selector, as in the first example. They give examples of both uses:

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
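To make the difference concrete, here is a minimal sketch. The markup in SAMPLE is made up for illustration and is not taken from the target site. It assumes bs4 is installed, and it compares the exact-string lookup with the CSS selector when the classes appear in a different order or alongside an extra class:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only (not from the target site).
SAMPLE = """
<div class="post item">a</div>
<div class="item post">b</div>
<div class="post item featured">c</div>
<div class="item">d</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# class_ with a space-separated string matches the attribute value as written,
# so a different class order or an extra class is not matched.
exact = soup.find_all("div", class_="post item")

# The CSS selector matches any div carrying both classes, in any order,
# with or without additional classes.
by_selector = soup.select("div.post.item")

exact_texts = [tag.get_text() for tag in exact]
selector_texts = [tag.get_text() for tag in by_selector]

print(exact_texts)
print(selector_texts)
```

With this sample, the exact-string lookup should pick up only the first div, while the selector should also pick up the second and third, which is why the docs recommend the selector when you care about the classes rather than the literal attribute string.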

Your code fails, and any of the above solutions would fail too, because the class does not exist in the downloaded source. If it were there, they would all work:

In [6]: r = requests.get("https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb")

In [7]: cont = r.content

In [8]: "post item" in cont
Out[8]: False

If you look at the page source in the browser and search for it, you won't find it either. It is generated dynamically and can only be seen if you open a developer console or Firebug. Those elements also only contain some styling and React ids, so I am not sure what you expect to pull from them even if you did get them.
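A small offline sketch of the same effect, assuming bs4 is installed. Both HTML strings and the count_matches helper are hypothetical stand-ins I made up, not content from the Flipboard page: one imitates what a script-driven page might look like when downloaded, the other what the developer tools might show after the scripts have run.

```python
from bs4 import BeautifulSoup


def count_matches(html_text, selector):
    """Return how many elements in html_text match the CSS selector."""
    return len(BeautifulSoup(html_text, "html.parser").select(selector))


# Hypothetical stand-in for the HTML a plain HTTP request returns:
# an empty container that a script fills in later.
DOWNLOADED = """
<html><body>
<div id="root"></div>
<script>/* the page builds its items here at runtime */</script>
</body></html>
"""

# Hypothetical stand-in for the DOM shown in the developer tools
# after the scripts have run.
RENDERED = """
<html><body>
<div id="root"><div class="post item">headline</div></div>
</body></html>
"""

downloaded_count = count_matches(DOWNLOADED, "div.post.item")
rendered_count = count_matches(RENDERED, "div.post.item")

print(downloaded_count, rendered_count)
```

The selector is the same in both calls; only the input differs, which is the situation described above.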

If you want to get the HTML that you see in the browser, you will need something like Selenium.

answered Mar 28 '16 at 01:48

Padraic Cunningham

Thank you for your answer. I used `.find_all("div", class_="post item")`, but for some reason I couldn't extract any items. I added the target URL; could you try it against the site? Thank you. – yamachan Mar 28 '16 at 02:05
where is the link? – Padraic Cunningham Mar 28 '16 at 02:05
@yamachan, there is a good reason you don't get it: it is not in the source. – Padraic Cunningham Mar 28 '16 at 02:08
I am sooooooo sorry. I read the HTML tags with Firebug, and I didn't imagine that the HTML I fetch with urllib and parse with BeautifulSoup would be different... I am sorry to bother you. And thank you for giving me advice. – yamachan Mar 28 '16 at 02:20
@yamachan, no worries. If you right-click and choose "View Page Source", you will see the HTML you actually download. – Padraic Cunningham Mar 28 '16 at 02:21

score 2 · Answer 2 · edited May 23 '17 at 12:24

First of all, note that class is a very special multi-valued attribute and it is a common source of confusion in BeautifulSoup.

html.find_all("div", {'class':['post', 'item']})

This would find all div elements that have either post class or item class (or both, of course). This may produce extra results you don't want to see, assuming you are after div elements with strictly class="post item". If this is the case, you can use a CSS selector:

html.select('div[class="post item"]')
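As a rough illustration of the two behaviours, here is a sketch that assumes bs4 is installed. The SAMPLE markup is invented for the example and is not from the site in question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE = """
<div class="post item">a</div>
<div class="item post">b</div>
<div class="post item featured">c</div>
<div class="item">d</div>
"""

html = BeautifulSoup(SAMPLE, "html.parser")

# A list of class names acts as "any of these": every div that has
# "post" or "item" among its classes is returned.
either = html.find_all("div", {"class": ["post", "item"]})

# The attribute selector compares the whole class attribute value,
# so only class="post item" written exactly that way matches.
strict = html.select('div[class="post item"]')

either_texts = [tag.get_text() for tag in either]
strict_texts = [tag.get_text() for tag in strict]

print(either_texts)
print(strict_texts)
```

Note that the strict form should also skip class="item post" and class="post item featured". If those should count as matches, a selector like div.post.item is the looser alternative.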

There is also some more information in a similar thread:

BeautifulSoup returns empty list when searching by compound class names

answered Mar 28 '16 at 01:48

alecxe

Thank you for your answer. Yes, I was wrong. My code got `item` class, not `post item` class. – yamachan Mar 28 '16 at 02:11
@yamachan anyway, the `div[class="post item"]` selector is exactly what you need. – alecxe Mar 28 '16 at 02:14
I am so sorry, it turned out that I couldn't extract the HTML tags that I saw through `Firebug`. I am sorry to bother you. Thank you for your kindness and advice. – yamachan Mar 28 '16 at 02:23
