I am following this question and this other one in order to parse a table from Wikipedia.

Specifically, I'd like to just get all the rows, and within each row, dump the contents of each column.

My code uses Python's xml.etree.ElementTree module on Mac OS X, but all I get is an empty list of rows.

import xml.etree.ElementTree

s = open("wikiactors20century.txt", "r").read()

# tree = xml.etree.ElementTree.fromstring(s)
# rows = tree.findall()
# headrow = rows[0]
# datarows = rows[1:]
#
# for num, h in enumerate(headrow):
#     data = ", ".join([row[num].text for row in datarows])
#     print "{0:<16}: {1}".format(h.text, data)

table = xml.etree.ElementTree.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print dict(zip(headers, values))

The input file has been pasted on PasteBin. Neither the xml.etree.ElementTree.fromstring version nor the xml.etree.ElementTree.XML version retrieves the row list. However, if I make a dummy table like this

s = "<table>  <tr><td>a</td><td>1</td></tr>  <tr><td>b</td><td>2</td></tr>  <tr><td>c</td><td>3</td></tr>  </table>"

then parsing works fine.

What am I doing wrong? Is there some cleaning I must apply before parsing the file?

senseiwa

1 Answer

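Your dummy table does not have the same structure as the Wikipedia table. In the Wikipedia table the rows are not direct children of the table element; they are nested inside thead, tbody, and tfoot: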

>>> list(table)
[<Element 'thead' at 0x7ff0fdb73f50>, <Element 'tbody' at 0x7ff0fdb78590>, <Element 'tfoot' at 0x7ff0fb995a90>]
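To make the difference concrete, here is a minimal sketch using a small stand-in table. The markup and the names `child_tags` and `records` are made up for illustration and are not taken from the actual Wikipedia page. It shows that iterating over the table yields the sections rather than the rows, while `iter("tr")` finds the rows at any depth.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Wikipedia markup: rows sit one level down,
# inside thead and tbody, instead of directly under <table>.
s = (
    "<table>"
    "<thead><tr><th>Actor</th><th>Born</th></tr></thead>"
    "<tbody><tr><td>a</td><td>1990</td></tr>"
    "<tr><td>b</td><td>1977</td></tr></tbody>"
    "</table>"
)
table = ET.XML(s)

# Direct children of the table are the sections, not the rows.
child_tags = [child.tag for child in table]

# iter("tr") walks the whole subtree, so it finds rows regardless of nesting.
rows = list(table.iter("tr"))
headers = [cell.text for cell in rows[0]]
records = [dict(zip(headers, [cell.text for cell in row])) for row in rows[1:]]

print(child_tags)
print(records)
```

This approach assumes the first tr found is the header row, which holds for the stand-in table but should be checked against the real file.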

You can get the header names from the first row of thead with:

>>> columns = list(k.text for k in table[0][0])

Then iterate over the rows of tbody to build the data table:

>>> import json
>>> data_table = list(dict(zip(columns, list(v.text for v in row))) for row in table[1])
>>> print(json.dumps(data_table, indent=2))
[
  {
    "L,S": "L", 
    "Cause of death": "~", 
    "null": "F", 
    "Noms": "1", 
    "Wins": "0", 
    "Age": "26", 
    "Actor": null, 
    "Born": "1990", 
    "Film": null, 
    "Last": "~", 
    "WoF": "~", 
    "Died": "~", 
    "First": "2001"
  }, 
  {
    "L,S": "1L,1S", 
    "Cause of death": "~", 
    "null": "M", 
    "Noms": "2", 
    "Wins": "0", 
    "Age": "39", 
    "Actor": null, 
    "Born": "1977", 
    "Film": null, 
    "Last": "~", 
    "WoF": "~", 
    "Died": "~", 
    "First": "2001"
  }, 

[...]

Note: There are some parsing issues with links and other inner tags, which is why some values come out as null. They can be solved with itertext or deeper parsing.
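As a rough sketch of the itertext approach, the example below uses a made-up row whose first cell wraps its text in a link. The row markup and the helper `cell_text` are hypothetical. A cell's `.text` only covers the text before its first child element, so it is `None` here, whereas `itertext()` collects the text from nested tags as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical row: the first cell wraps its text in a link.
row = ET.XML('<tr><td><a href="/wiki/X">Some Actor</a></td><td>1990</td></tr>')


def cell_text(cell):
    """Join all text inside the cell, including text inside nested tags."""
    return "".join(cell.itertext()).strip()


# .text stops at the first child element, so the linked cell gives None.
plain = [cell.text for cell in row]

# itertext() descends into the link and recovers the visible text.
full = [cell_text(cell) for cell in row]

print(plain)
print(full)
```

Swapping `v.text` for a helper like this in the comprehension above would likely fill in the null "Actor" and "Film" values, assuming those cells contain links in the real file.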

Cyrbil