I'm looking to parse the Gutenberg catalog available here using Python. I'm experienced with web scraping and parsing HTML, but this format eludes me. I've tried lxml's etree, as well as the attempt below using RDFLib:

path = 'epub/10/pg%s.rdf'
g = rdflib.Graph()
g.parse(path)
s = g.serialize(format='nt')
print(g)

I'm looking for the various metadata values (title, author, Gutenberg URL, etc.). I'm including a sample file below.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:cc="http://web.resource.org/cc/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
  xmlns:dcam="http://purl.org/dc/dcam/"
>
  <cc:Work rdf:about="">
    <cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html"/>
    <rdfs:comment>Archives containing the RDF files for *all* our books can be downloaded at
            http://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog</rdfs:comment>
  </cc:Work>
  <pgterms:ebook rdf:about="ebooks/100">
    <dcterms:title>The Complete Works of William Shakespeare</dcterms:title>
    <pgterms:bookshelf>
      <rdf:Description rdf:nodeID="Ncc8361d84fc142969cf27b77ac8d0c24">
        <rdf:value>Plays</rdf:value>
        <dcam:memberOf rdf:resource="2009/pgterms/Bookshelf"/>
      </rdf:Description>
    </pgterms:bookshelf>
    <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1994-01-01</dcterms:issued>
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    <dcterms:rights>Copyrighted. Read the copyright notice inside this book for details.</dcterms:rights>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/files/100/100.txt">
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">5589917</dcterms:extent>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-08-29T12:08:52</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N19fd61f986a94cc18f5dce9ed07e8517">
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain; charset=us-ascii</rdf:value>
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:license rdf:resource="license"/>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.kindle.images">
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N0ee902d343e44cb5a8f639fa55fc6334">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/x-mobipocket-ebook</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">9509392</dcterms:extent>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:40.171080</dcterms:modified>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="N0e2195113aa34bf7abfe001edf6a03a2">
        <rdf:value>English drama -- Early modern and Elizabethan, 1500-1600</rdf:value>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:creator>
      <pgterms:agent rdf:about="2009/agents/65">
        <pgterms:name>Shakespeare, William</pgterms:name>
        <pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1564</pgterms:birthdate>
        <pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1616</pgterms:deathdate>
        <pgterms:alias>Shakspeare, William</pgterms:alias>
        <pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/William_Shakespeare"/>
        <pgterms:alias>Shakspere, William</pgterms:alias>
      </pgterms:agent>
    </dcterms:creator>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="Ncb26996951d44761901e30445fc8a9dc">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCC"/>
        <rdf:value>PR</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/files/100/100.zip">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2035857</dcterms:extent>
        <dcterms:format>
          <rdf:Description rdf:nodeID="Nb4f5881241fd42e9a0f8a07cb1462008">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/zip</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:format>
          <rdf:Description rdf:nodeID="Nc3c66052298f482488fb8f13215f92ba">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain; charset=us-ascii</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-08-29T12:09:20</dcterms:modified>
      </pgterms:file>
    </dcterms:hasFormat>
    <pgterms:downloads rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">4605</pgterms:downloads>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.epub.noimages">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2376083</dcterms:extent>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:13.998200</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N9dc27629e3164dba98c659dcaf47c7fe">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/epub+zip</rdf:value>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.html.noimages">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">6944416</dcterms:extent>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:00.715792</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N7140e760a0f14ae4ba4b027bd7f7f4f6">
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/html</rdf:value>
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
          </rdf:Description>
        </dcterms:format>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.kindle.noimages">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">9509383</dcterms:extent>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N34666f5ebdd8461ca1c6b8cfba5103e5">
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/x-mobipocket-ebook</rdf:value>
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
          </rdf:Description>
        </dcterms:format>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:19:07.134922</dcterms:modified>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.epub.images">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2376084</dcterms:extent>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N1e32eb8531504d378e05acb6440d24b0">
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/epub+zip</rdf:value>
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
          </rdf:Description>
        </dcterms:format>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:09.062427</dcterms:modified>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.rdf">
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-28T05:00:49.076168</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N1d915c961af44ab7ac9c71e7ec068bde">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/rdf+xml</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">11275</dcterms:extent>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:language>
      <rdf:Description rdf:nodeID="N5ff08142477c4bfeb3bac48c18ba23a4">
        <rdf:value rdf:datatype="http://purl.org/dc/terms/RFC4646">en</rdf:value>
      </rdf:Description>
    </dcterms:language>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.txt.utf-8">
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:17:42.102580</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N98845b3d16bd42d787e9d7cba42bf44b">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">5589889</dcterms:extent>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:type>
      <rdf:Description rdf:nodeID="N47bb369dd96248ffb1f412145cdb0713">
        <rdf:value>Text</rdf:value>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/DCMIType"/>
      </rdf:Description>
    </dcterms:type>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.html.images">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">6944416</dcterms:extent>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:17:55.634002</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="Nd1733441ad824cff97a5d9ad50f0307b">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/html</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:isFormatOf rdf:resource="ebooks/100"/>
      </pgterms:file>
    </dcterms:hasFormat>
  </pgterms:ebook>
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/William_Shakespeare">
    <dcterms:description>Wikipedia</dcterms:description>
  </rdf:Description>
</rdf:RDF>
T. Arboreus
  • Which element *exactly* do you need help to get from that sample XML? Hopefully, given an example of selecting one or two elements, you can figure a way to select the rest – har07 May 06 '16 at 07:48

2 Answers

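Can you just parse it with regular expressions? For example:

import re
xml = open(path, encoding="utf-8").read()  # path to the RDF file
match = re.search(r"<dcterms:title>([^<]*)", xml)
title = match.group(1) if match else None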

EDIT: If you want to do this with an XML parser, you'll need to declare the namespace (defined at the top of the XML file):

import xml.etree.ElementTree as et
tree = et.parse(path)
ns = {"dcterms": "http://purl.org/dc/terms/"}
# find() returns an Element (or None); .text holds the title string
title = tree.find(".//dcterms:title", ns).text
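The same namespace mapping extends to the other fields the question asks about. What follows is only a rough sketch using the standard library: `SAMPLE` is a trimmed-down stand-in for the catalog file, and `extract_metadata`, `NS`, `RDF_ABOUT` and `XML_BASE` are names made up for this illustration. It also assumes the exact XML nesting shown in the question, which a different serialization of the same RDF would not have.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Prefix -> namespace URI, copied from the top of the RDF file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}
# Attributes are looked up by their fully qualified {namespace}name.
RDF_ABOUT = "{%s}about" % NS["rdf"]
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Trimmed-down stand-in for the catalog file (assumption for illustration).
SAMPLE = """<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/100">
    <dcterms:title>The Complete Works of William Shakespeare</dcterms:title>
    <dcterms:creator>
      <pgterms:agent rdf:about="2009/agents/65">
        <pgterms:name>Shakespeare, William</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>"""


def extract_metadata(xml_text):
    """Return URL, title and author names of the first ebook element."""
    root = ET.fromstring(xml_text)
    ebook = root.find("pgterms:ebook", NS)
    # rdf:about is relative, so resolve it against the document's xml:base.
    url = urljoin(root.get(XML_BASE, ""), ebook.get(RDF_ABOUT, ""))
    names = ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
    return {
        "url": url,
        "title": ebook.findtext("dcterms:title", namespaces=NS),
        "authors": [name.text for name in names],
    }


print(extract_metadata(SAMPLE))
```

For a real file, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`.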
maxymoo
  • While this will work with some massaging, it's pretty inelegant for a (supposedly) structured data format. Is RDF really this inaccessible? – T. Arboreus May 06 '16 at 06:17
  • As long as it is XML, there must be an 'XML way' to access the information inside (though some structures can be harder to deal with than others). – har07 May 06 '16 at 07:48
  • A hacky solution but it lets me get on with my life. Thank you. – T. Arboreus May 06 '16 at 18:15
  • I've edited my answer to show you how to use namespaces to parse the XML. I think you're being a bit harsh with my answer though: if you're not actually using the XML structure as more than just tags (and in your case I don't think you are), I personally find 1 line of regex more elegant than 3 lines of XML parsing -- it's overkill for what you're trying to do. – maxymoo May 09 '16 at 00:05
  • While this works for this specific case it's important to realize that the same RDF data can be serialized in _different_ XML structures. If Gutenberg ever update that file there's no guarantee that a regex (or even an XML-based search) will continue to work. It may be overkill if you're just looking for a quick workaround but if you need something robust, an RDF-based approach (using e.g. RDFlib) is really the way to go. – Jeen Broekstra May 09 '16 at 21:06

I know you've already got your quick shortcut, but I thought I'd briefly illustrate the RDF-based approach as well. You're pretty close already: you've managed to create a Graph object and load the RDF file into it. The way forward is to then query that Graph object for the properties you're interested in.

As a simple example, to retrieve the title of the e-book with id http://www.gutenberg.org/ebooks/100, you'd do something like this (caveat: I'm no Python programmer so there may be errors):

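from rdflib import URIRef, Namespace

# g is the Graph that was loaded with g.parse(path) in the question
ebook = URIRef("http://www.gutenberg.org/ebooks/100")

# we create a Namespace for the relationship names, to make them easy to reuse;
# in the sample file the title is a dcterms property, not a pgterms one
dcterms = Namespace("http://purl.org/dc/terms/")

# print out the object value(s) of the 'title' relation for ebook 100.
# dcterms["title"] is used rather than dcterms.title because Namespace subclasses str
for title in g.objects(ebook, dcterms["title"]):
    print(title)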

Note that I'm probably missing some efficient shortcuts here - I don't know RDFLib very well and just concocted this example from looking at their documentation for a few minutes. It may well be possible to just grab that namespace directly from the graph you previously loaded instead of having to manually define them like this.

The general principle is this: RDF is a graph consisting of individual statements, with a subject, a predicate, and an object. You work with it by querying that graph. The above is a very simple query that just retrieves values for a single subject and a single relation, but of course you can do loops, paths, lists, etc.

Jeen Broekstra