0

I have recently been exploring linked data and I keep running into one issue after another. To overcome the performance lag while accessing external endpoints, I wanted to store data dumps locally.

However, the datasets I come across mostly have issues. One frequent problem is URI quality (e.g. an error importing into Jena's TDB: Bad character in IRI (space): <http://bio2rdf.org/genecards:BCR/ABL[space]...>)

How do I deal with such issues? Is there a way to clean such data dumps, or even to remove the triples that have issues?

RDangol
  • 179
  • 9

1 Answer

0

When the URIs are bad, processing the input files with text tools is the way to start. N-Triples is the easiest format to work with because each triple sits on its own line. There may be later processing to do more.
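A minimal sketch of that text-processing approach, assuming the dump is in N-Triples (one triple per line). It drops any line containing an IRI with a space in it, which is the specific problem in the question; the example IRIs are hypothetical, and a space inside a quoted literal that also contains angle brackets is an edge case this simple regex does not handle.

```python
import re

# Matches any <...> IRI that contains a space ([^>] cannot cross the
# closing '>', so the match stays inside a single IRI).
BAD_IRI = re.compile(r"<[^>]* [^>]*>")

def clean_ntriples(lines):
    """Yield only the N-Triples lines with no space inside an IRI."""
    for line in lines:
        if not BAD_IRI.search(line):
            yield line

# Hypothetical sample: one good triple, one with an illegal IRI.
dump = [
    '<http://example.org/a> <http://example.org/p> "ok" .\n',
    '<http://example.org/bad id> <http://example.org/p> "dropped" .\n',
]
for line in clean_ntriples(dump):
    print(line, end="")
```

The same filter could be a one-line `grep -v` in a shell pipeline; a streaming generator like this just makes it easy to add further per-line checks later.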

In the case of [space], replacing it with %20 will create legal URIs, but they are different URIs. What effect this has on the data depends on the data and what you want to do with it. Another text-processing option is simply removing the bad triples; whether the data should instead be cleaned by removing all triples around a bad subject depends on the shape of the data.
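A sketch of the repair option under the same assumptions (N-Triples input, hypothetical IRIs): percent-encode spaces inside each `<...>` IRI to %20, leaving quoted literals untouched.

```python
import re

# Matches a single <...> IRI; [^>] keeps the match inside one IRI.
IRI = re.compile(r"<[^>]*>")

def encode_spaces(line):
    """Replace spaces inside each <...> IRI on the line with %20."""
    return IRI.sub(lambda m: m.group(0).replace(" ", "%20"), line)

line = '<http://example.org/bad id> <http://example.org/p> "x" .'
print(encode_spaces(line))
```

Note this only makes the IRIs syntactically legal; as stated above, the repaired IRIs are different IRIs, so links into other datasets that use the original spelling will no longer match.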

The other thing to do is report the problems back upstream so it can be fixed at the origin.

AndyS
  • 16,345
  • 17
  • 21