I have an RDF/XML file that is formatted like so (truncated to only show the necessary data):

<rdf:RDF xml:base="http://www.gutenberg.org/">
    <pgterms:ebook rdf:about="ebooks/48666">
        <pgterms:downloads rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">34</pgterms:downloads>
        <dcterms:creator>
            <pgterms:agent rdf:about="2009/agents/36363">
                <pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1926</pgterms:deathdate>
                <pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/Edmund_Candler"/>
                <pgterms:alias>Chandler, Edmund</pgterms:alias>
                <pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1874</pgterms:birthdate>
                <pgterms:name>Candler, Edmund</pgterms:name>
            </pgterms:agent>
        </dcterms:creator>
        <dcterms:title>The Sepoy</dcterms:title>
        <dcterms:subject>
            <rdf:Description rdf:nodeID="Nd62b88adeb1347d9b99ba9d763e74269">
                <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
                <rdf:value>Soldiers -- India -- Conduct of life</rdf:value>
            </rdf:Description>
        </dcterms:subject>
    </pgterms:ebook>
</rdf:RDF>

I would like to retrieve certain properties from this file such as:

  • title: The Sepoy
  • creator - name: Candler, Edmund
  • downloads: 34
  • subject - value: Soldiers -- India -- Conduct of life

I have identified SPARQL as the most likely tool for this kind of job, but I have no experience with RDF and am quite confused by how this data is structured. How can I parse this file in Python to retrieve the desired information?

Luca Guarro
  • Use `rdflib`. Look at your data in Turtle or N-Triples format (both are close to SPARQL syntax) and work through a short RDF/SPARQL tutorial. Basic queries are really simple: you only write patterns that match your graph – UninformedUser May 15 '21 at 02:50
  • As an example, to get the title, do `select ?book ?title where {?book a pgterms:ebook . ?book dcterms:title ?title .}` (with the `pgterms:` and `dcterms:` prefixes declared) – UninformedUser May 15 '21 at 02:52
  • @UninformedUser thanks for the help. I was able to get all of them except the subject value, which is inside some sort of array: `dcterms:subject [dcam:memberOf dcterms:LCC ; rdf:value "U" ], [dcam:memberOf dcterms:LCSH ; rdf:value "Soldiers -- India -- Conduct of life" ];` and I don't know how to get just the "Soldiers -- India -- Conduct of life" value – Luca Guarro May 17 '21 at 03:57
  • hm, ok - my bad. What you see is the blank node notation of Turtle syntax. You can reuse the same notation in SPARQL: `select ?book ?title ?subjectName where {?book a pgterms:ebook . ?book dcterms:title ?title . ?book dcterms:subject [rdf:value ?subjectName] .}` – UninformedUser May 17 '21 at 09:56
  • and, as you can see, there are multiple subjects assigned to the book, thus, you'll get two rows aka bindings with `"U"` and `"Soldiers -- India -- Conduct of life"` – UninformedUser May 17 '21 at 09:58

0 Answers