Bulk edit subject of triples in rdflib

Question

I create an rdflib graph by parsing records from a database using rdflib-jsonld. However, the subjects of the triples are missing a / in the URL (https:/w... instead of https://w...). To add it, I use the following code:

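    import pprint
    from rdflib import URIRef

    # graph1 is the parsed graph; graph2 is a second, initially empty graph
    for s, p, o in graph1:
        print('parsing to graph2. next step - run query on graph2')
        pprint.pprint((s, p, o))
        # Restore the missing slash in the subject URL
        s1 = str(s).replace('https:/w', 'https://w')
        graph2.add((URIRef(s1), p, o))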

This step takes a very long time (a couple of hours) to run because of the large number of triples in the graph. How can I reduce the time taken? Instead of looping through every element, how can I alter the subjects in bulk?

If you can run SPARQL queries, then you could use INSERT/DELETE (e.g., as described in a blog post, [SPARQL: Updating the URI of an owl:Class in place](http://semanticarts.com/blog/sparql-update-class-uri-in-place/)). There's an example in the answer to [SPARQL Update example for updating more than one triple in a single query](http://stackoverflow.com/questions/19502398/sparql-update-example-for-updating-more-than-one-triple-in-a-single-query), that shows "an update that replaces triples for a given subject". — Joshua Taylor, Apr 27 '16 at 11:51

score 2 · Accepted Answer · answered Apr 23 '16 at 23:37

First of all, to get meaningful timings, remove everything unrelated to the replacement itself, in particular both the ordinary print and the pretty print; you don't need them. If you need a progress indicator, write a short message (e.g. a single dot) to a log file every N steps.
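A minimal sketch of what that could look like, using only the standard library. The names `fix_subject`, `timed_fix` and `report_every` are made up for illustration, and plain strings stand in for the rdflib terms so that the timing covers nothing but the replacement:

```python
import logging
import time

log = logging.getLogger("subject_fix")  # hypothetical logger name


def fix_subject(subject, bad="https:/w", good="https://w"):
    """Return the subject string with the missing slash restored."""
    return subject.replace(bad, good)


def timed_fix(triples, report_every=100000):
    """Fix every subject, logging progress every `report_every` triples."""
    start = time.perf_counter()
    fixed = []
    for count, (s, p, o) in enumerate(triples, start=1):
        fixed.append((fix_subject(s), p, o))
        if count % report_every == 0:
            log.info("%d triples processed", count)
    elapsed = time.perf_counter() - start
    return fixed, elapsed


# Tiny stand-in data set; real input would be the triples of the graph.
sample = [("https:/www.example.org/a", "p", "o")] * 5
fixed, elapsed = timed_fix(sample, report_every=2)
```

The logger only writes somewhere once the caller configures it, for example with `logging.basicConfig(filename=..., level=logging.INFO)`, so the loop itself stays free of console output.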

Avoid excessive memory consumption. I don't know what your graph looks like internally, but it would be better to make the replacement in place, without creating a parallel graph structure. Check memory usage during the process: if the program runs out of free RAM, you're in trouble, and every process will slow to a crawl. If you can't modify the existing graph and you do run out of memory, then for measurement purposes simply skip creating the second graph, even though the replaced subjects are then discarded and the run is useless apart from its timing.

If nothing helps, take a step back. You could perform the replacement at a stage before the file(s) have been parsed, either with Python's re module or with a text tool dedicated to batch text processing, such as sed.
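A small sketch of the pre-parse approach with `re`, assuming the JSON-LD records are available as text before they reach the parser. The function name `repair_urls` and the sample record are made up. Unlike the loop in the question, this touches every matching URL in the text, not only the ones that end up as subjects.

```python
import re

# "https:/" immediately followed by "w": the same pattern the question replaces.
# Correct URLs ("https://w...") do not match, because "https:/" is followed by "/".
BROKEN_SCHEME = re.compile(r"https:/(?=w)")


def repair_urls(raw_text):
    """Insert the missing slash into every broken URL in the raw text."""
    return BROKEN_SCHEME.sub("https://", raw_text)


# Hypothetical record: one broken URL and one that is already correct.
raw = '{"@id": "https:/www.example.org/a", "url": "https://www.example.org/b"}'
repaired = repair_urls(raw)
```

The repaired text can then be handed to the parser as usual, so no triple has to be rewritten after parsing.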
