
Linked data collections are usually given in RDF/XML, JSON-LD, or Turtle (TTL) format. Relatively large data dumps seem fairly difficult to process. What is a good way to convert an RDF/XML file to a TSV of linked-data triples?

I've tried OpenRefine, which should handle this, but a 10GB file (e.g. the person authority data from the German National Library) is too much to process on a laptop with decent processing power.

I'm looking for software recommendations or some code, e.g. in Python or R, to do the conversion. Thanks!

puslet88
    (Note -- it's **RDF/XML**, not *RDF(XML)*.) Also... Why not load the data into a proper RDF triple/quad-store? If you need CSV/TSV of query results (which seems much more likely than that you need CSV/TSV of the entire dataset), many SPARQL processors can deliver that. [Virtuoso](http://virtuoso.openlinksw.com) (from my employer), either Open Source or Enterprise, is one that can handle all of this, on pretty much any modern laptop/desktop. – TallTed Jun 25 '19 at 19:00

3 Answers


Try these:

Lobid GND API

http://lobid.org/gnd/api

Supports OpenRefine (see blog post) and a variety of other queries. The data is hosted as JSON-LD (see context) in an Elasticsearch cluster. The service offers a rich HTTP API.
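
If you only need a handful of records rather than the whole dump, you can hit the search endpoint directly. Here is a minimal Python sketch; the URL, the query parameters, and the member/gndIdentifier/preferredName field names reflect my reading of the API docs, so check them against the JSON you actually get back.

import csv
import sys

import requests

# Query the lobid-gnd search API (endpoint and parameters assumed from the docs).
resp = requests.get(
    "https://lobid.org/gnd/search",
    params={"q": "Goethe", "format": "json", "size": 10},
    timeout=30,
)
resp.raise_for_status()

# Write one TSV row per hit; the field names may differ in the actual response.
writer = csv.writer(sys.stdout, delimiter="\t")
for doc in resp.json().get("member", []):
    writer.writerow([doc.get("gndIdentifier", ""), doc.get("preferredName", "")])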

Use a Triple Store

Load the data into a triple store of your choice, e.g. rdf4j. Many triple stores provide some sort of CSV serialization of query results. Together with SPARQL this could be worth a try; a rough sketch follows below.
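
For illustration, a minimal Python sketch that runs a SPARQL SELECT against a store and writes the bindings as TSV. The repository URL is an assumption for a local rdf4j server; substitute whatever endpoint your store exposes.

import csv
import sys

from SPARQLWrapper import SPARQLWrapper, JSON

# The endpoint URL is an assumption for a local rdf4j server; adjust to your store.
endpoint = SPARQLWrapper("http://localhost:8080/rdf4j-server/repositories/gnd")
endpoint.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1000")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# Write one TSV row per result binding.
writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(["s", "p", "o"])
for row in results["results"]["bindings"]:
    writer.writerow([row[var]["value"] for var in ("s", "p", "o")])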

Catmandu

http://librecat.org/Catmandu/

A powerful Perl-based data toolkit that comes with a useful collection of ready-to-use transformation pipelines.

Metafacture

https://github.com/metafacture/metafacture-core/wiki

A Java toolkit for designing transformation pipelines.

jschnasse

You could use the ontology editor Protégé: there you can query the data with SPARQL according to your needs and save the results as a TSV file. You may, however, have to configure the software beforehand (e.g. raise its memory limits) to make amounts of data like this manageable.

Yahalnaut

Canonical N-Triples may already be what you are after, as it is essentially a space-separated, line-based format for RDF (you cannot naively split at every space, though, as you need to take care of literals; see below). Many files of the dataset you cited are available as N-Triples. If not, use a parsing tool like rapper for the conversion to N-Triples, e.g.:

rapper -i turtle -o ntriples rdf-file-in-turtle-format.ttl > rdf-file-in-ntriples-format.nt
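
If you prefer Python over the command line, rdflib can do the same conversion. Be aware, though, that rdflib builds the whole graph in memory, so this sketch (file names are placeholders) is only practical for dumps that fit in RAM; for a 10GB file, a streaming parser like rapper is the better choice.

from rdflib import Graph

# Parse RDF/XML and re-serialize as N-Triples. rdflib holds the entire
# graph in memory, so this only works for files that fit in RAM.
g = Graph()
g.parse("dump.rdf", format="xml")                # input file name is a placeholder
g.serialize(destination="dump.nt", format="nt")  # output file name is a placeholder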

Typically, N-Triples exporters do not exploit all the whitespace freedom that the specification allows and emit canonical N-Triples. Hence, given a line in a canonical N-Triples file such as:

<http://example.org/s> <http://example.org/p> "a literal" .

you can get CSV by replacing the first and the second space character of each line with a comma and removing everything after and including the last space character. As literals are the only kind of RDF term in which spaces are allowed, and as literals are only allowed in object position, this should work for canonical N-Triples.

You can get TSV by replacing those space characters with tabs. If you also do that for the last space character and keep the dot, you have a file that is both valid N-Triples and valid TSV. If you just treat those positions as split positions, you can work with canonical N-Triples files without any conversion to CSV/TSV.
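
Here is a small Python sketch of that splitting rule, processing the input line by line so that even very large canonical N-Triples files stream through without being loaded into memory. The script name is arbitrary; run it e.g. as python nt2tsv.py < dump.nt > dump.tsv

import sys

# Split each canonical N-Triples line at the first, second, and last space,
# as described above, and emit subject, predicate, object as TSV.
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        continue
    subj, pred, rest = line.split(" ", 2)  # first and second space
    obj = rest.rsplit(" ", 1)[0]           # drop the trailing " ."
    print(subj, pred, obj, sep="\t")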

Note that you may have to deal with commas/tabs inside the RDF terms (e.g. by escaping them), but that problem exists in any approach that turns RDF into CSV/TSV.

kaefer3000