I've been processing some older versions of medium- and large-sized knowledge graphs in N-Triples and Turtle format, namely the Wikidata, Freebase, and LinkedBrainz dumps listed below. They all seem to contain malformed triples.
Examples of errors while processing them with `serdi -l`:
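For reference, this is roughly how I'm running it on one of the dumps (a sketch: `-i` selects the input syntax as I understand serdi's options, and I discard stdout to see only the errors):

```sh
# Lax (-l) parsing of one Wikidata part; parse errors are reported on
# stderr, so the re-serialized output on stdout is discarded.
serdi -l -i turtle wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0 > /dev/null
```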
Wikidata 2015
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021322:54: invalid IRI character `|'
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021323:0: bad subject
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021543:0: invalid IRI character (escape %0A)
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863553:32: invalid IRI character `}'
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863554:34: expected prefixed name
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863555:20: bad verb
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863556:67: expected digit
...
Freebase 2012
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67541:51: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67543:57: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67570:52: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67571:51: missing ';' or '.'
...
LinkedBrainz 2017
error: linkedbrainz_201712_kb_files/place.nt:551:6: expected `]', not `/'
error: linkedbrainz_201712_kb_files/place.nt:551:6: bad verb
error: linkedbrainz_201712_kb_files/place.nt:551:6: bad subject
error: linkedbrainz_201712_kb_files/place.nt:553:277: line end in short string
error: linkedbrainz_201712_kb_files/place.nt:554:6: expected: ':', '<', or '_'
...
There are more examples like these. I have two main questions:
- Is there an explanation of why and/or how these files were generated with such errors? I'd expect them to have been produced by dumping a triple store or an engine such as Apache Jena, and therefore to be well-formed. Instead, it seems more likely that they were put together with some kind of custom script (or a pipeline of Unix tools, maybe?), hence the errors...
- Is there a way to fix these files? (Or, worst-case scenario, to ignore the malformed lines with something other than `serdi -l`. Extra points for a solution which also doesn't require me to implement a cleaning script from scratch.)
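For clarity, the `serdi -l` fallback I mean is roughly the following round-trip (a sketch: I believe `-i`/`-o` select the input/output syntaxes, and the output file names are made up):

```sh
# Round-trip through serdi in lax (-l) mode: malformed statements are
# reported on stderr and skipped, and everything that did parse is
# re-serialized as N-Triples on stdout.
serdi -l -i turtle -o ntriples wikidata-20150420-all-BETA.ttl.part_0 \
  > part_0.cleaned.nt 2> part_0.errors.log
```

This at best ignores the bad statements rather than recovering them, which is why I'm hoping for something better.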