I'm trying to get the data from a table with a specific ID which I know. For some reason, the code keeps giving me a None result.

From the HTML code I'm trying to parse:

<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
    <tr class="gridHeader" valign="top">
        <td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td>
        <td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td>
        <td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td>
        <td class="titleGridReg" align="center" valign="top">שער בסיס</td>
        <td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span></td>
        <td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
    </tr>
    <tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">

... And so on

My code:

html = br.response().read()
soup = BeautifulSoup(html)

table = soup.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")
rows = table.findAll(lambda tag: tag.name=='tr')

In [100]: print table
None
Uyghur Lives Matter
erantdo
  • Why don't you just use `find_all(tag, id="id_name")`? – aIKid Oct 25 '13 at 13:58
  • You're talking about the rows creation? Unfortunately the table itself is empty, so it doesn't matter. I need to get the "table" done right first. – erantdo Oct 25 '13 at 14:01
  • It's the same thing with this line `table = soup.find()` – aIKid Oct 25 '13 at 14:03
  • @aIKid `table = soup.find(tag, id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")` gives: `NameError: name 'tag' is not defined` – erantdo Oct 25 '13 at 14:05
  • Does that work? I've added an answer. – aIKid Oct 25 '13 at 14:06

2 Answers

From the documentation:

table = soup.find('table', id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")

And for the rows line:

rows = table.findAll('tr')
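
To illustrate the two calls above end to end, here is a minimal sketch written for Python 3 and bs4 (both assumptions; the question uses Python 2). The markup and the short id `DataGrid1` are made up for the example and stand in for the long id in the question. The target table is deliberately nested inside another table to show that nesting alone does not stop `find` from locating it.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup: "DataGrid1" stands in for the long id
# from the question, and the outer table mimics a table-inside-a-table layout.
html = """
<table id="outer"><tr><td>
  <table id="DataGrid1">
    <tr class="gridHeader"><td>Close</td><td>Date</td></tr>
    <tr><td>101.5</td><td>25/10/2013</td></tr>
  </table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
table = soup.find("table", id="DataGrid1")
if table is None:
    raise SystemExit("table not found: check that the downloaded page contains it")

# Collect the text of every cell, row by row.
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)

print(rows)
```

If `find` still returns None on the real page, the id is most likely absent from the HTML that was actually downloaded, so checking `table is None` before calling `findAll` avoids a confusing follow-up error.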

For the encoding problem, try decoding the response from UTF-8 and then re-encoding it:

html = br.response().read().decode('utf-8')
soup = BeautifulSoup(html.encode('utf-8'))
aIKid
  • @alKid thanks for helping, but it still returns None: `table = soup.find('table', id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1") print table None` Could it be because it's a table inside a table ? – erantdo Oct 25 '13 at 14:11
  • @aIKid But the problem is this: he's using UTF-8, so this might be a problem to work with. There's Hebrew in his HTML. – Games Brainiac Oct 25 '13 at 14:13
  • Please save the response from the mechanize browser `br`: `f = open('test.html', 'w'); html = br.response().read(); f.write(html)`, and check the saved page. Maybe you are getting an invalid response :) – Jasim Muhammed Oct 27 '13 at 10:05
  • How to find a table if I don't have `id`, but a `class`? – Hrvoje T Jan 17 '19 at 14:09

Improving upon aIKid's answer:

# coding=utf-8
from bs4 import BeautifulSoup

html = u"""
<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
                            <tr class="gridHeader" valign="top">
                                <td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td><td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td><td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td><td class="titleGridReg" align="center" valign="top">שער בסיס</td><td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span>
</td><td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
                            </tr><tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">
"""

soup = BeautifulSoup(html)
print soup.find_all("table",
                    id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")

Since you're working with UTF-8 data, you need to declare the literal as a unicode string, like so: u"""(...)""". For data read from the browser, all you need to do to work with unicode is this:

br.response().read().decode('utf-8')

The above gives you a unicode string, which you can later encode back into a UTF-8 byte string. For example, if the decoded string is stored in html, html.encode("utf-8") returns the UTF-8 bytes again. If you do this, you do not need to put the u in front of anything; you can treat everything as a regular string again.
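
A small sketch of the two conversions, written for Python 3 (an assumption; the thread uses Python 2, where the byte type is called `str` and the text type `unicode`). `decode` turns bytes into text, and `encode` turns text back into bytes. The variable `raw` is a stand-in for whatever `br.response().read()` returns.

```python
# Stand-in for br.response().read(): UTF-8 encoded bytes containing Hebrew.
raw = "שער נעילה".encode("utf-8")

# bytes -> text: decode with the encoding the page was served in.
text = raw.decode("utf-8")

# text -> bytes: encoding with the same codec restores the original bytes.
round_trip = text.encode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(round_trip == raw)

# Decoding the same bytes as ASCII fails, because Hebrew letters are encoded
# as bytes outside the 0-127 range (the first byte here is 0xd7).
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("ascii decode failed:", exc.reason)
```

The ASCII failure at the end is the same kind of error as the `UnicodeDecodeError` reported in the comments, which is what happens when bytes are converted to text without naming the encoding in Python 2.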

Games Brainiac
  • Try wrapping it in unicode, like so `html = unicode(br.response().read())` – Games Brainiac Oct 25 '13 at 14:27
  • It still doesn't work. I tried: `table_id = u"""ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1""" table = soup.findall(u"""table""", id=table_id)` And I get: `TypeError: 'NoneType' object is not callable` – erantdo Oct 25 '13 at 14:36
  • @erantdo have you tried using his recommendation? Changing this line: `html = unicode(br.response().read())` – aIKid Oct 25 '13 at 14:38
  • @erantdo Well, the ID does not have any unicode in it, so no problem. Did you make sure to put in a `# encoding=utf-8` at the very top of the file? – Games Brainiac Oct 25 '13 at 14:39
  • @alKid @Gamesbrainiac I added this on top but it's a comment (I'm using Canopy). also, changing to `html = unicode(br.response().read())` returns `UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 118: ordinal not in range(128)` – erantdo Oct 25 '13 at 14:42
  • @GamesBrainiac Thanks. Snippet #24499 – erantdo Oct 25 '13 at 14:45
  • @erantdo link please. – Games Brainiac Oct 25 '13 at 14:48
  • @erantdo I've updated my answer too. Tested it with a file and works fine. :) – aIKid Oct 25 '13 at 14:49
  • @alKid thanks, but now I'm getting: `TypeError: 'NoneType' object is not callable` for the line `table = soup.findall("table", id=table_id)` – erantdo Oct 25 '13 at 14:53
  • @erantdo I see now, you need to decode it first; you're using a pseudo browser called mechanize. – Games Brainiac Oct 25 '13 at 14:54
  • @GamesBrainiac @alKid Thanks, but I changed the code to https://dpaste.de/fWDa and now I'm getting `TypeError: 'NoneType' object is not callable` – erantdo Oct 25 '13 at 14:57
  • @erantdo Ask a separate question then, this convo is getting too long. – Games Brainiac Oct 25 '13 at 14:58
  • Can you do that, @Games ? – aIKid Oct 25 '13 at 14:59
  • @aIKid Not yet. But what he's asking about is mechanize, and that's a separate question altogether that needs its own topic. – Games Brainiac Oct 25 '13 at 15:01
  • @alKid Thanks a lot guys. New topic is at: http://stackoverflow.com/questions/19593340/python-beautiful-soup-parsing-a-utf-8-coded-table-using-mechanize – erantdo Oct 25 '13 at 15:03
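
A note on the `TypeError: 'NoneType' object is not callable` reported several times in the comments: the failing lines call `soup.findall(...)`, all lowercase, while the answers use `find_all` and `findAll`. As far as I can tell, Beautiful Soup treats an unknown attribute name as a search for a tag with that name, so `soup.findall` is a lookup for a `<findall>` tag, which returns None, and calling None raises that error. The sketch below assumes Python 3 and bs4, and uses made-up markup with a hypothetical id `t1`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "t1" stands in for the long id from the question.
soup = BeautifulSoup('<table id="t1"><tr><td>x</td></tr></table>', "html.parser")

# There is no method named "findall"; the attribute access is treated as a
# search for a <findall> tag, which does not exist in this document.
missing = soup.findall
print(missing)

# Calling the result therefore fails with a TypeError.
message = None
try:
    soup.findall("table", id="t1")
except TypeError as exc:
    message = str(exc)
print(message)

# The method names used in the answers work as expected.
tables = soup.find_all("table", id="t1")
print(len(tables))
```

This is separate from the original None result of `soup.find`, which points to the table not being present in the HTML that was parsed.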