
I've been processing some older versions of medium- and large-sized knowledge graphs in N-Triples and Turtle format, such as Wikidata (2015), Freebase (2012), and LinkedBrainz (2017).

They all seem to contain malformed triples. Here are some examples of the errors reported while processing them with serdi -l:

Wikidata 2015

error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021322:54: invalid IRI character `|'
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021323:0: bad subject
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021543:0: invalid IRI character (escape %0A)
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863553:32: invalid IRI character `}'
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863554:34: expected prefixed name
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863555:20: bad verb
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863556:67: expected digit
...

Freebase 2012

error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67541:51: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67543:57: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67570:52: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67571:51: missing ';' or '.'
...

LinkedBrainz 2017

error: linkedbrainz_201712_kb_files/place.nt:551:6: expected `]', not `/'
error: linkedbrainz_201712_kb_files/place.nt:551:6: bad verb
error: linkedbrainz_201712_kb_files/place.nt:551:6: bad subject
error: linkedbrainz_201712_kb_files/place.nt:553:277: line end in short string
error: linkedbrainz_201712_kb_files/place.nt:554:6: expected: ':', '<', or '_'
...

There are more examples like these. I have two main questions:

  1. Is there an explanation of why and/or how these files were generated with such errors? I would expect files like these to have been produced by dumping a triple store or by an engine such as Apache Jena, and therefore to be well formed. Instead, it seems more likely that they were put together with some kind of custom script (or maybe a pipeline of Unix tools), hence the errors.
  2. Is there a way to fix these files, or, worst case, to ignore the malformed lines (other than with serdi -l)? Extra points for a solution that doesn't require me to write a cleaning script from scratch. (A rough sketch of the kind of line filtering I have in mind follows this list.)
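
To make that concrete, here is a minimal sketch using rdflib (my own choice of library, not something the dumps come with): it keeps only the lines that rdflib can parse as a single N-Triples statement. It assumes plain N-Triples with one statement per line, so it doesn't help with the multi-line Turtle dumps, and rdflib may accept some IRI characters that serdi rejects. The file names are placeholders.

    # Sketch: keep only the N-Triples lines that rdflib can parse on their own.
    # Assumes one statement per line; does not work for Turtle with prefixes or
    # multi-line statements. Creating a throw-away Graph per line is slow, but
    # it keeps the example short.
    import sys
    from rdflib import Graph

    def keep_line(line: str) -> bool:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            return True  # keep blank lines and comments untouched
        try:
            Graph().parse(data=stripped, format="nt")
            return True
        except Exception:
            return False  # the parser rejected this line, so drop it

    with open("place.nt", encoding="utf-8", errors="replace") as src, \
         open("place.clean.nt", "w", encoding="utf-8") as dst:
        dropped = 0
        for line in src:
            if keep_line(line):
                dst.write(line)
            else:
                dropped += 1
        print(f"dropped {dropped} malformed lines", file=sys.stderr)
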
andrefs
  • 1) you'd have to ask the devs, I think. Dumping a triple store or using Jena - yeah, but that's not always the case. Using Jena means using Java, and not everybody uses standard tools or APIs. Look at the source code of e.g. LinkedBrainz: https://github.com/LinkedBrainz/linkedbrainz-d2rs-translators - they take a MusicBrainz SQL dump, map it via D2RQ and convert it to RDF. No Apache Jena is used, and the URIs are created manually via string concatenation. Why? Because the devs decided to do so, I'd say. – UninformedUser Sep 25 '19 at 06:27
  • 2) I'm not aware of any auto-fixing tool, because that would need some kind of magic, right? In the best case you find a tool that doesn't "fail fast", so you can at least extract the ill-formed triples first. Sure, you could run a URI escaper on each URI (a rough sketch of that idea follows these comments); that might solve some types of errors, but not all. I know of other kinds of errors, like broken literals, that make a parser fail. Apache Jena's RIOT tool sometimes only warns about invalid IRIs, but it also fails once an IRI contains serious errors. – UninformedUser Sep 25 '19 at 06:33
  • By the way, unfortunately this does not only affect older datasets: quite recently I had trouble with a recent DBpedia dump. IRIs like `http://dbpedia.org/resource/Mini__\"Mark_I\"__1` made the Jena parser fail instantly; I fixed those errors with some weird `sed` calls, which is neither a generic nor a satisfying solution. I talked to the devs, and it looks like the root causes in the extraction frameworks have been fixed. Moreover, a more thorough data-validation pipeline is now used before publishing the data. – UninformedUser Sep 25 '19 at 06:37
  • I doubt that 5-10 years ago all data providers did this before publishing datasets; it takes time and resources. So maybe you could also contribute to the LinkedBrainz source code and fix those issues before the data is serialized, instead of writing some "auto-fixing" parser, which won't work in all cases I think. – UninformedUser Sep 25 '19 at 06:39
  • And yes, especially these kinds of syntactic errors are really annoying, I totally understand you. Those errors should be avoidable ... we have plenty of other, semantic errors that we have to live with or work on ... – UninformedUser Sep 25 '19 at 06:43
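
To make the "URI escaper" idea from the comments a bit more concrete, here is an illustrative sketch (my own assumption of what such an escaper could look like, not an existing tool): it percent-encodes a handful of characters that serdi flags (space, |, {, }, ", \, ^, `) inside every <...> span of a line. It assumes everything between angle brackets is an IRI, so literals that happen to contain "<...>" would be mangled, and it cannot repair structural errors such as truncated statements.

    # Sketch of a per-URI escaper: percent-encode a few characters that are
    # illegal unescaped inside an IRI, wherever they occur between the angle
    # brackets of an N-Triples/Turtle line.
    # Usage: python fix_iris.py < dump.nt > dump.fixed.nt
    import re
    import sys

    BAD_CHARS = ' |{}"\\^`'  # characters to percent-encode (space, |, {, }, ", \, ^, `)

    def escape(iri: str) -> str:
        # Encode each offending character; everything else passes through unchanged.
        return "".join(f"%{ord(c):02X}" if c in BAD_CHARS else c for c in iri)

    IRI_RE = re.compile(r"<([^<>]*)>")

    def fix_iris(line: str) -> str:
        # Rewrite every <...> span on the line (blunt: literals containing <...> are hit too).
        return IRI_RE.sub(lambda m: "<" + escape(m.group(1)) + ">", line)

    if __name__ == "__main__":
        for line in sys.stdin:
            sys.stdout.write(fix_iris(line))
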

0 Answers