6

I'm looking for a good-quality HTML Microdata parser in Python. It doesn't have to be blazing fast, but I'd like it to support as much of the spec as possible, including itemref.

Here's what I've found so far:

Have you used any of these libraries? What were the pros and cons?

I'm also curious about parsing poorly formatted HTML documents. Have you found a Microdata parser that handles messy input or do you run the input through something like BeautifulSoup first?

Shawn Simister

1 Answer

4

What format do you want the Microdata parsed to?

https://github.com/RDFLib/pymicrodata will parse to RDF.

If you want JSON instead you should use https://github.com/edsu/microdata, which has recently gotten some attention and should be more conformant to the spec.

https://pypi.python.org/pypi/pelican-microdata/0.1 looks like a way to generate Microdata for a particular static site generator, so I don't think it will help with parsing.

I don't know how tolerant of poorly formatted HTML either of the above parsers is. If you know of some poorly formatted markup in the wild that uses Microdata, I'd be interested in seeing how well the Ruby parsers handle these cases.
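
To give a feel for what tolerant parsing involves, here is a rough sketch built only on the standard library's html.parser, which does not raise on unclosed or stray tags. It is not the API of either library above: MicrodataSketch and parse_microdata are hypothetical names made up for this example, and the sketch covers only a subset of the spec (no itemref, no itemid, no special handling of time or data elements). The sample markup is invented and deliberately leaves a span unclosed.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose property value comes from an attribute rather than from text.
VALUE_ATTRS = {"meta": "content", "a": "href", "area": "href", "link": "href",
               "img": "src", "audio": "src", "video": "src", "source": "src",
               "embed": "src", "iframe": "src", "track": "src", "object": "data"}


class MicrodataSketch(HTMLParser):
    """Hypothetical, simplified extractor: no itemref, itemid, time or data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []   # top-level items, in document order
        self._open = []   # one frame per currently open element

    def _current_item(self):
        # The nearest enclosing element that carries itemscope owns new properties.
        for frame in reversed(self._open):
            if frame["item"] is not None:
                return frame["item"]
        return None

    @staticmethod
    def _add(owner, names, value):
        if owner is None:  # itemprop outside any item is ignored
            return
        for name in names:
            owner["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        owner = self._current_item()
        names = (a.get("itemprop") or "").split()
        item = None
        if "itemscope" in a:
            item = {"type": a.get("itemtype"), "properties": {}}
            if not names or owner is None:
                self.items.append(item)
        frame = {"tag": tag, "item": item, "owner": owner,
                 "names": names, "text": None}
        if names:
            if item is not None:
                self._add(owner, names, item)          # nested item as value
            elif tag in VALUE_ATTRS:
                self._add(owner, names, a.get(VALUE_ATTRS[tag]) or "")
            else:
                frame["text"] = []                     # collect text until close
        if tag not in VOID_TAGS:
            self._open.append(frame)

    def handle_data(self, data):
        for frame in self._open:
            if frame["text"] is not None:
                frame["text"].append(data)

    def _finish(self, frame):
        if frame["text"] is not None:
            value = "".join(frame["text"]).strip()
            self._add(frame["owner"], frame["names"], value)

    def handle_endtag(self, tag):
        if not any(f["tag"] == tag for f in self._open):
            return  # stray end tag: ignore it
        while self._open:  # also closes any unclosed children
            frame = self._open.pop()
            self._finish(frame)
            if frame["tag"] == tag:
                break

    def close(self):
        super().close()
        while self._open:  # elements still open at end of input
            self._finish(self._open.pop())


def parse_microdata(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.items


# Invented sample; the addressLocality span is intentionally left unclosed.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <a itemprop="url" href="http://example.com/jane">home page</a>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Seattle
  </div>
</div>
"""

if __name__ == "__main__":
    for item in parse_microdata(SAMPLE):
        for name, values in item["properties"].items():
            print(item["type"], name, values)
```

Each item comes back as a plain dict with a "type" and a "properties" mapping from property name to a list of values, so nested items can be walked the same way as top-level ones. A real parser would still be the better choice for anything that relies on itemref.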

Jason R
  • Either RDF or JSON is acceptable output, as long as I can iterate over the property values of an item. Eventually I might be interested in getting the output as JSON-LD, but it's probably about the same amount of work to generate that from JSON or RDF. – Shawn Simister Apr 02 '13 at 17:19
  • I ended up using Ed Summers' parser. It has handled everything I've thrown at it so far. Thanks! – Shawn Simister Apr 04 '13 at 18:51