
I am parsing this page with Beautiful Soup:

https://au.finance.yahoo.com/q/is?s=AAPL

I am attempting to get the total revenue for 27/09/2014 (42,123,000) which is one of the first values on the statement near the top.

I inspected the element in Chrome developer tools and found that the value is in a table with the class name yfnc_tabledata1.

My Python code is as follows:

import requests
import bs4

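# Fetch the web page
page = requests.get("https://au.finance.yahoo.com/q/is?s=AAPL")

# Parse it with Beautiful Soup (parser named explicitly to avoid a warning)
soup = bs4.BeautifulSoup(page.content, "html.parser")

# Select the table by class (select() returns a list of matches)
tag = soup.select("table.yfnc_tabledata1")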

So far so good: this grabs the table that has the data I need, but this is where I am stuck.

The chain that leads to the data I want is as follows:

tag > tbody > tr > td > table > tbody > (then the second tr)

But when I try to use this I get an empty element.

Can anybody help me with this?

Also, for bonus points, can anyone tell me how I can learn to extract data like this in a more general sense? I constantly need to extract data buried deep within an HTML document and can never seem to work out the correct code to get to the data I want.

Thanks a lot, any help is appreciated.

alecxe
Kane

4 Answers


There is no <tbody> tag in the HTML.

If you look at the page with a browser (e.g. with Chrome developer tools) it looks like there is a <tbody> tag, but that's a fake tag inserted into the DOM by Chrome.

Try omitting both tags in your search chain. I am certain the first one isn't there and (although the HTML is hard to read) I'm pretty sure the second isn't there either.

Update: Here is the HTML, beginning with the table you are interested in:

<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
  <TR>
    <TD>
      <TABLE width="100%" cellpadding="2" ...>
        <TR class="yfnc_modtitle1" style="border-top:none;">
          <td colspan="2" style="border-top:2px solid #000;">
            <small><span class="yfi-module-title">Period Ending</span></small>
          </td>
          <th scope="col" style="border-top:2px ...">27/09/2014</th>
          <th scope="col" style="border-top:2px ...">28/06/2014</th>
          ...

So there are no <tbody> tags.
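
To illustrate, here is a minimal offline sketch of the same chain with and without tbody. The HTML string is a hypothetical, trimmed-down stand-in for the page (the Total Revenue row is assumed to sit in plain td cells), and html.parser is used because it does not add elements that are missing from the source:

```python
import bs4

# Hypothetical stand-in for the served HTML: nested tables, no <tbody> anywhere.
# The layout of the Total Revenue row is an assumption made for illustration.
HTML = """
<table class="yfnc_tabledata1">
  <tr>
    <td>
      <table>
        <tr class="yfnc_modtitle1">
          <td colspan="2">Period Ending</td>
          <th scope="col">27/09/2014</th>
          <th scope="col">28/06/2014</th>
        </tr>
        <tr>
          <td colspan="2"><strong>Total Revenue</strong></td>
          <td align="right"><strong>42,123,000</strong></td>
          <td align="right"><strong>37,432,000</strong></td>
        </tr>
      </table>
    </td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

# The chain copied from dev tools matches nothing, because tbody is not in the source.
with_tbody = soup.select(
    "table.yfnc_tabledata1 > tbody > tr > td > table > tbody > tr"
)

# The same chain without tbody matches the rows of the inner table.
without_tbody = soup.select("table.yfnc_tabledata1 > tr > td > table > tr")

second_row = without_tbody[1]
cells = [td.get_text(strip=True) for td in second_row.find_all("td")]

print(len(with_tbody))
print(cells)
```

A different parser may behave differently here, so it is worth checking the selector against the parser you actually use.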

ErikR
  • Oh, how did you find out it is a fake tag? – Kane Dec 06 '14 at 03:47
  • Damn, why would chrome dev tools do that? it has really screwed me up. – Kane Dec 06 '14 at 03:48
  • For uniformity Chrome creates a tbody element in the DOM even though one is not specified in the HTML. What you are seeing in Chrome dev tools is what was created in memory. This is just something you learn from experience. – ErikR Dec 06 '14 at 03:52
  • That's why I tend to always fetch using `curl` on the shell first, in order to see what really is returned by the server... – Martin C. Dec 06 '14 at 09:02
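
The check described in these comments can also be sketched in Python: look for a literal tbody tag in the raw text the server returned, rather than in the DOM shown by dev tools. The helper name has_literal_tbody and both sample strings are made up for illustration; in practice the raw text could come from something like requests.get(url).text.

```python
import re


def has_literal_tbody(raw_html: str) -> bool:
    """Return True if the raw HTML source contains a <tbody> start tag."""
    return re.search(r"<tbody\b", raw_html, flags=re.IGNORECASE) is not None


# Hypothetical samples: what the server sends vs. what a browser's DOM
# would serialize to after it has added the implied tbody element.
served = '<TABLE class="yfnc_tabledata1"><TR><TD>42,123,000</TD></TR></TABLE>'
rendered = (
    '<table class="yfnc_tabledata1"><tbody><tr><td>42,123,000</td></tr>'
    "</tbody></table>"
)

print(has_literal_tbody(served))
print(has_literal_tbody(rendered))
```

If the check is False for the served HTML, any tbody step in a selector chain copied from dev tools should be dropped.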

Let's be specific and practical.

The idea is to find the Total Revenue label and get the next cell's text using .next_sibling:

import re

table = soup.find("table", class_="yfnc_tabledata1")
# Locate the text node of the label (string= replaces the older text= argument)
total_revenue_label = table.find(string=re.compile(r'Total Revenue'))
print(total_revenue_label.parent.parent.next_sibling.get_text(strip=True))

Demo:

>>> import re
>>> import requests
>>> import bs4
>>> 
>>> page = requests.get("https://au.finance.yahoo.com/q/is?s=AAPL")
>>> soup = bs4.BeautifulSoup(page.content)
>>> 
>>> table = soup.find("table", class_="yfnc_tabledata1")
>>> total_revenue_label = table.find(text=re.compile(r'Total Revenue'))
>>> total_revenue_label.parent.parent.next_sibling.get_text(strip=True)
42,123,000
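
One caveat with .next_sibling is that it returns whatever node comes next, which can be a whitespace string rather than a tag when the source has line breaks between cells. A slightly more defensive sketch, run against a hypothetical fragment instead of the live page (the row layout is an assumption), uses find_parent() and find_next_sibling() with a tag name:

```python
import re

import bs4

# Hypothetical fragment with line breaks between cells, so that the node
# right after the label cell is a whitespace string, not a <td>.
HTML = """
<table class="yfnc_tabledata1">
  <tr>
    <td colspan="2"><strong>Total Revenue</strong></td>
    <td align="right"><strong>42,123,000</strong></td>
    <td align="right"><strong>37,432,000</strong></td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="yfnc_tabledata1")

# Find the label text, then climb to its enclosing cell by tag name
# instead of counting .parent hops.
label = table.find(string=re.compile(r"Total Revenue"))
label_cell = label.find_parent("td")

# In this fragment the raw next sibling is only whitespace.
raw_sibling = label_cell.next_sibling

# find_next_sibling("td") skips text nodes and returns the next <td> tag.
value_cell = label_cell.find_next_sibling("td")
latest = value_cell.get_text(strip=True)

print(repr(raw_sibling))
print(latest)
```

Naming the tag in both calls keeps the lookup working if the label gains or loses a wrapper such as strong.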
alecxe
  • Is there an easy way to skip that number and get the next one, i.e. the previous quarter's data (37,432,000)? I was able to do it with `print rev.parent.parent.next_sibling.next_sibling.get_text(strip=True)`, but that seems pretty ridiculous. – Kane Dec 06 '14 at 04:12
  • @Kane yup, use the [`find_next_siblings()`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-next-siblings-and-find-next-sibling) method. – alecxe Dec 06 '14 at 04:13
  • @Kane in other words, `result = [item.get_text(strip=True) for item in total_revenue_label.parent.parent.find_next_siblings()]`. – alecxe Dec 06 '14 at 04:14
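
As a rough sketch of where find_next_siblings() can lead, the values can be paired with the period headers so that each quarter is looked up by date rather than by position. The fragment below is hypothetical (two periods only, simplified cells), and revenue_by_period is an illustrative name:

```python
import bs4

# Hypothetical two-period fragment; the real table has more columns and rows.
HTML = """
<table class="yfnc_tabledata1">
  <tr>
    <td colspan="2">Period Ending</td>
    <th scope="col">27/09/2014</th>
    <th scope="col">28/06/2014</th>
  </tr>
  <tr>
    <td colspan="2"><strong>Total Revenue</strong></td>
    <td align="right">42,123,000</td>
    <td align="right">37,432,000</td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

# Period labels come from the header cells.
periods = [th.get_text(strip=True) for th in soup.find_all("th")]

# Every <td> after the label cell in the same row holds one period's value.
label_cell = soup.find(string="Total Revenue").find_parent("td")
values = [
    int(td.get_text(strip=True).replace(",", ""))
    for td in label_cell.find_next_siblings("td")
]

# Pair each period with its value, as printed in the table.
revenue_by_period = dict(zip(periods, values))
print(revenue_by_period["28/06/2014"])
```

The dictionary lookup replaces the chained .next_sibling.next_sibling calls from the comment above.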

To answer your general question:

I suggest the book "Mining the Social Web", second edition, especially chapter 5, "Mining Web Pages".

Source code for the book is available on GitHub.

Edmon

I think there are probably better ways of getting the data you want. It has been provided for free for a number of years by a number of institutions. For example, is the information you want in here somewhere?

http://www.afr.com/share_tables/

demented hedgehog
  • Thanks for that I've bookmarked it. It looks like it only covers ASX listed stocks unfortunately. – Kane Dec 06 '14 at 03:52
  • I believe there are similar sorts of data sources for other stock exchanges. (I dunno about NZ etc). – demented hedgehog Dec 06 '14 at 03:53
  • http://www.nasdaqomx.com/transactions/marketdata/datafeeds and http://eoddata.com/ etc.. (I dunno whether they're free or not or any good). ymmv. There's plenty. – demented hedgehog Dec 06 '14 at 03:55