I am following this question and this other one in order to parse a table from Wikipedia.
Specifically, I'd like to just get all the rows, and within each row, dump the contents of each column.
My code uses the xml library under Mac OS X, but all I get is an empty list of rows.
import xml.etree.ElementTree
# Read the saved Wikipedia table markup.
with open("wikiactors20century.txt", "r") as f:
    s = f.read()
# First attempt (commented out):
# tree = xml.etree.ElementTree.fromstring(s)
# rows = tree.findall("tr")
# headrow = rows[0]
# datarows = rows[1:]
#
# for num, h in enumerate(headrow):
#     data = ", ".join([row[num].text for row in datarows])
#     print("{0:<16}: {1}".format(h.text, data))
# Second attempt: iterate over the direct children of the root element.
table = xml.etree.ElementTree.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))
The input file has been pasted here in PasteBin. Neither the xml.etree.ElementTree.fromstring version nor the xml.etree.ElementTree.XML version retrieves the row list. However, if I make a dummy table like this:
s = "<table> <tr><td>a</td><td>1</td></tr> <tr><td>b</td><td>2</td></tr> <tr><td>c</td><td>3</td></tr> </table>"
then parsing works fine.
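For reference, here is a minimal self-contained version of the dummy-table case that does work for me. The parse_rows helper name is made up for illustration, and treating the first row as the header is only there to mirror my code above.

```python
import xml.etree.ElementTree as ET


def parse_rows(markup):
    """Return one list of cell texts per direct child row of the root element."""
    table = ET.XML(markup)
    # Iterating an Element yields only its direct children (here, the <tr> rows).
    return [[cell.text for cell in row] for row in table]


s = ("<table> <tr><td>a</td><td>1</td></tr> <tr><td>b</td><td>2</td></tr> "
     "<tr><td>c</td><td>3</td></tr> </table>")
rows = parse_rows(s)
headers, data = rows[0], rows[1:]
for values in data:
    # Pair each cell with the matching cell of the first row.
    print(dict(zip(headers, values)))
```

This only looks at direct children of the root, so it relies on the rows sitting immediately under the table element, as they do in the dummy string.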
What am I doing wrong? Is there some cleaning I must apply before parsing the file?
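One check I can think of is to compare the number of tr elements directly under the root with the number found at any depth. This is only a diagnostic sketch: the nested sample below (rows wrapped in a tbody, header cells as th) is my guess at what the real markup might look like, not the actual PasteBin content, and count_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def count_rows(markup):
    """Return (root tag, direct <tr> children, <tr> elements at any depth)."""
    root = ET.fromstring(markup)
    direct = len(root.findall("tr"))     # only immediate children of the root
    nested = len(list(root.iter("tr")))  # searches the whole subtree
    return root.tag, direct, nested


# Hypothetical sample: an assumption about the markup, not the real file.
sample = ("<table><tbody>"
          "<tr><th>Name</th><th>Born</th></tr>"
          "<tr><td>x</td><td>1900</td></tr>"
          "</tbody></table>")
print(count_rows(sample))
```

If the direct count is 0 while the nested count is not, the rows are not direct children of the root, which would be consistent with getting an empty row list from the real file but not from the dummy table.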