extract class name from tag beautifulsoup python

Question

I have the following HTML code:

    <td class="image">
      <a href="/target/tt0111161/" title="Target Text 1">
       <img alt="target img" height="74" src="img src url" title="image title" width="54"/>
      </a>
     </td>
     <td class="title">
      <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
      </span>
      <a href="/target/tt0111161/">
       Other Text
      </a>
      <span class="year_type">
       (2013)
      </span>

I am trying to use Beautiful Soup to parse certain elements into a tab-delimited file. I got some great help and now have:

for td in soup.select('td.title'):
    span = td.select('span.wlb_wrapper')
    if span:
        print(span[0].get('data-tconst'))  # to get e.g. `tt0111161`

Now I want to get "Target Text 1".

I've tried adapting the code above, for example:

for td in soup.select('td.image'):  # trying to select the <td class="image"> tag
    img = td.select('a.title')  # from inside td I now try to look inside the a tag that also has the word title
    if img:
        print(img[2].get('title'))  # if it finds anything, then I want to return the text in class 'title'

another thread here: http://stackoverflow.com/questions/41369344/beautifulsoup4-how-to-retrieve-a-list-of-the-class-name-of-specific-tag/41369459#41369459 — JinSnow, Dec 29 '16 at 14:22

score 13 · Accepted Answer · edited Apr 12 '19 at 16:39

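If you're trying to pick out different td elements based on their class (i.e. td class="image" and td class="title"), you can treat a Beautiful Soup tag like a dictionary to read its class attribute.

The code below finds every td class="image" in the table and prints the title of each link inside it. For every td class="title" it prints the data-tconst value of the wlb_wrapper span.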

from bs4 import BeautifulSoup

page = """
<table>
    <tr>
        <td class="image">
           <a href="/target/tt0111161/" title="Target Text 1">
            <img alt="target img" height="74" src="img src url" title="image title" width="54"/>
           </a>
          </td>
          <td class="title">
           <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
           </span>
           <a href="/target/tt0111161/">
            Other Text
           </a>
           <span class="year_type">
            (2013)
           </span>
        </td>
    </tr>
</table>
"""
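soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid a warning
tbl = soup.find('table')
rows = tbl.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        # col['class'] is a list of class names, e.g. ['image']
        if col.has_attr('class') and col['class'][0] == 'image':
            # print the title attribute of every link in the image cell
            hrefs = col.find_all('a')
            for href in hrefs:
                print(href.get('title'))

        elif col.has_attr('class') and col['class'][0] == 'title':
            # print data-tconst from the wlb_wrapper span in the title cell
            spans = col.find_all('span')
            for span in spans:
                if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                    print(span.get('data-tconst'))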
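
For comparison, here is a shorter sketch of the same lookup using CSS selectors instead of nested loops. It is not part of the original answer. It assumes a trimmed copy of the HTML from the question, and the variable names (html, link_titles, tconsts) are made up for illustration. The key point is that 'a.title' matches an a tag whose class is "title", while 'a[title]' matches an a tag that has a title attribute, which is probably why the attempt in the question found nothing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (assumed structure).
html = """
<table><tr>
  <td class="image">
    <a href="/target/tt0111161/" title="Target Text 1">
      <img alt="target img" src="img src url" title="image title"/>
    </a>
  </td>
  <td class="title">
    <span class="wlb_wrapper" data-tconst="tt0111161"></span>
    <a href="/target/tt0111161/">Other Text</a>
    <span class="year_type">(2013)</span>
  </td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# 'a.title' would look for <a class="title">; 'a[title]' selects <a> tags
# that carry a title attribute.
link_titles = [a["title"] for a in soup.select("td.image a[title]")]

# Same idea for the data-tconst attribute on the wlb_wrapper span.
tconsts = [s["data-tconst"] for s in soup.select("td.title span.wlb_wrapper[data-tconst]")]

print(link_titles)
print(tconsts)
```

Both results are lists, so index them (or loop over them) rather than calling .get() on the list itself.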

Thanks, can I also add in statement to retrieve the value for the "data-tconst" tag? — kegewe, Feb 06 '14 at 01:43
yep, you can add an elif statement that looks for td's with title, pasting code in a comment failed, so I will just update my answer. — Jared Messenger, Feb 06 '14 at 01:47
Thanks, now I just added `def getinfo:` before all that. Can I write getinfo to a CSV? — kegewe, Feb 06 '14 at 01:55
I've personally never written to csv, but you should be able to open a file before the iteration and instead of printing out the values, write them to a file. After the iterator, save the file. — Jared Messenger, Feb 06 '14 at 02:11
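
A minimal sketch of the file-writing idea from the comments, using Python's csv module with a tab delimiter since the question asks for a tab-delimited file. This is an illustration rather than tested advice from the answerer: the function names get_info and write_rows, the column headers, and the filename output.tsv are all hypothetical, and the page string is a trimmed copy of the HTML above. It writes to an in-memory buffer so the result can be inspected; swapping in a real file is shown in the trailing comment.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the answer (assumed structure).
page = """
<table><tr>
  <td class="image">
    <a href="/target/tt0111161/" title="Target Text 1"></a>
  </td>
  <td class="title">
    <span class="wlb_wrapper" data-tconst="tt0111161"></span>
    <a href="/target/tt0111161/">Other Text</a>
  </td>
</tr></table>
"""


def get_info(soup):
    """Yield one (link title, data-tconst) pair per table row."""
    for row in soup.find_all("tr"):
        link = row.select_one("td.image a[title]")
        span = row.select_one("td.title span.wlb_wrapper")
        yield (
            link["title"] if link else "",
            span.get("data-tconst", "") if span else "",
        )


def write_rows(rows, fileobj):
    """Write a header line and the rows as tab-separated values."""
    writer = csv.writer(fileobj, delimiter="\t")
    writer.writerow(["title", "tconst"])
    writer.writerows(rows)


soup = BeautifulSoup(page, "html.parser")
buffer = io.StringIO()
write_rows(get_info(soup), buffer)
print(buffer.getvalue())

# To write to disk instead (output.tsv is a hypothetical filename):
# with open("output.tsv", "w", newline="") as f:
#     write_rows(get_info(soup), f)
```

Opening the file with newline="" is what the csv module documentation recommends, so the writer controls line endings itself.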

hemanth · Answer 2 · 2014-02-06T01:44:41.237

0

span.wlb_wrapper is a selector that matches <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">. The Beautiful Soup documentation on CSS selectors has more information on selectors.

In your Python code, change span = td.select('span.wlb_wrapper') to span = td.select('span'), and also try span = td.select('span.year_type'), then see what each one returns.

If you try the above and inspect what span holds, you should be able to get what you want.
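
To make that inspection concrete, here is a small sketch of what the different selectors return on a trimmed copy of the question's HTML. The variable names are made up for illustration. Note that select() always returns a list of tags, that tag.get("class") returns the class names as a list, and that tag.get() returns None when the attribute is missing on that particular tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the title cell from the question (assumed structure).
html = """
<table><tr>
  <td class="title">
    <span class="wlb_wrapper" data-tconst="tt0111161"></span>
    <a href="/target/tt0111161/">Other Text</a>
    <span class="year_type">(2013)</span>
  </td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
td = soup.select_one("td.title")

all_spans = td.select("span")             # every <span> inside the cell
year_spans = td.select("span.year_type")  # only <span class="year_type">

# The class attribute is multi-valued, so each entry is a list of names.
span_classes = [s.get("class") for s in all_spans]

# Only the wlb_wrapper span carries data-tconst; the other span gives None.
tconst_values = [s.get("data-tconst") for s in all_spans]

year_text = year_spans[0].get_text(strip=True)

print(span_classes)
print(tconst_values)
print(year_text)
```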

edited Feb 06 '14 at 01:44

answered Feb 06 '14 at 01:20

I've edited the body text to show what I attempted to do in my code. I've tried changing span.wlb_wrapper to just span, but it now just returns a value of "None". – kegewe Feb 06 '14 at 01:38
