How to prevent triples from getting mixed up while uploading to Dydra programmatically?

Question

I am trying to upload some data to Dydra from a Sesame triplestore I have on my computer. The download from Sesame works fine, but the triples get mixed up during the upload: the s-p-o relationships change, so the object of one triple ends up as the object of another. Can someone please explain why this is happening and how it can be resolved? The code is below:

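from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph, URIRef
from bs4 import BeautifulSoup
import pprint
import requests

#Querying the triplestore to retrieve all results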
sesameSparqlEndpoint = 'http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name'
sparql = SPARQLWrapper(sesameSparqlEndpoint)
queryStringDownload = 'SELECT * WHERE {?s ?p ?o}'
dataGraph = Graph()

sparql.setQuery(queryStringDownload)
sparql.method = 'GET'
sparql.setReturnFormat(JSON)
output = sparql.query().convert()
print output

for i in range(len(output['results']['bindings'])):
    #The encoding is necessary to parse non-English characters
    output['results']['bindings'][i]['s']['value'].encode('utf-8')
    try:
        subject_extract = output['results']['bindings'][i]['s']['value']
        if 'http' in subject_extract:
            subject = "<" + subject_extract + ">"
            subject_url = URIRef(subject)
            print subject_url

        predicate_extract = output['results']['bindings'][i]['p']['value']
        if 'http' in predicate_extract:
            predicate = "<" + predicate_extract + ">"
            predicate_url = URIRef(predicate)
            print predicate_url

        objec_extract = output['results']['bindings'][i]['o']['value']
        if 'http' in objec_extract:
            objec = "<" + objec_extract + ">"
            objec_url = URIRef(objec)
            print objec_url
        else:
            objec = objec_extract
            objec_wip = '"' + objec + '"'
            objec_url = URIRef(objec_wip)

        # Loading the data on a graph       
        dataGraph.add((subject_url,predicate_url,objec_url))

    except UnicodeError as error: 
        print error

#Print all statements in dataGraph      
for stmt in dataGraph:
    pprint.pprint(stmt)

# Upload to Dydra
URL = 'http://dydra.com/login'
key = 'my_key'

with requests.Session() as s:
    resp = s.get(URL)
    soup = BeautifulSoup(resp.text,"html5lib")
    csrfToken = soup.find('meta',{'name':'csrf-token'}).get('content')
    # print csrf_token
    payload = {
    'account[login]':key,
    'account[password]':'',
    'csrfmiddlewaretoken':csrfToken,
    'next':'/'
    }
    # print payload

    p = s.post(URL,data=payload, headers=dict(Referer=URL))
    # print p.text

    r = s.get('http://dydra.com/username/rep_name/sparql')
    # print r.text

    dydraSparqlEndpoint = 'http://dydra.com/username/rep_name/sparql'
    for stmt in dataGraph:
        queryStringUpload = 'INSERT DATA {%s %s %s}' % stmt
        sparql = SPARQLWrapper(dydraSparqlEndpoint)
        sparql.setCredentials(key,key)
        sparql.setQuery(queryStringUpload)
        sparql.method = 'POST'
        sparql.query()

Wow. You are taking the long way around here. Why are you using a SELECT query to extract all triples (and jumping through all kinds of hoops to reconstruct the actual RDF triples from the query result), rather than a CONSTRUCT query (which gives you the result ready-made as RDF statements)? - Jeen Broekstra, Dec 22 '15 at 20:05
Well, this is embarrassing; I should have thought of this. Just doing this appears to do the trick:
   sesameSparqlEndpoint = 'http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name'
   sparql = SPARQLWrapper(sesameSparqlEndpoint)
   queryStringDownload = 'CONSTRUCT {?s ?p ?o} WHERE {?s ?p ?o}'
   dataGraph = Graph()
- kurious, Dec 22 '15 at 20:22
I'm having issues iterating over the CONSTRUCT query output. The follow-up question is at http://stackoverflow.com/questions/34425876/how-to-iterate-over-construct-output-from-rdflib. - kurious, Dec 22 '15 at 23:04
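
As for why the triples get mixed up: one likely cause, judging from the code in the question, is that subject_url and predicate_url are only assigned when the value contains 'http'. For a blank node or a non-HTTP IRI the variable keeps its value from the previous row, so the current object gets attached to an older subject or predicate. A literal that happens to contain 'http' is also turned into a URI. If you stay with SELECT, a safer conversion can read the type field that the SPARQL JSON results format supplies for every binding. The following is a minimal sketch of that idea; the helper names (escape_literal, term_to_ntriples, row_to_ntriple) are made up for illustration and are not part of any library.

```python
# Sketch: turn one row of SPARQL JSON results into an N-Triples line.
# All helper names here are hypothetical, not part of SPARQLWrapper or rdflib.

def escape_literal(text):
    """Escape the characters that would break a double-quoted literal."""
    return (text.replace('\\', '\\\\')
                .replace('"', '\\"')
                .replace('\n', '\\n')
                .replace('\r', '\\r')
                .replace('\t', '\\t'))


def term_to_ntriples(term):
    """Serialize one binding based on its declared type, not on its content."""
    kind, value = term['type'], term['value']
    if kind == 'uri':
        return '<%s>' % value
    if kind == 'bnode':
        return '_:%s' % value
    if kind in ('literal', 'typed-literal'):  # 'typed-literal' appears in older result formats
        quoted = '"%s"' % escape_literal(value)
        if 'xml:lang' in term:
            return '%s@%s' % (quoted, term['xml:lang'])
        if 'datatype' in term:
            return '%s^^<%s>' % (quoted, term['datatype'])
        return quoted
    raise ValueError('unexpected term type: %r' % kind)


def row_to_ntriple(row):
    """Build the triple from this row only, so nothing carries over from earlier rows."""
    return '%s %s %s .' % tuple(term_to_ntriples(row[k]) for k in ('s', 'p', 'o'))
```

With the variables from the question, this would be used as `lines = [row_to_ntriple(row) for row in output['results']['bindings']]`. Because every term is computed from the current row, a row can no longer inherit a subject or predicate from the one before it.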

Jeen Broekstra · Accepted Answer · 2015-12-22T20:28:33.630

A far simpler way to copy your data over (apart from using a CONSTRUCT query instead of a SELECT, like I mentioned in the comment) is simply to have Dydra itself directly access your Sesame endpoint, for example via a SERVICE-clause.

Execute the following on your Dydra database, and (after some time, depending on how large your Sesame database is), everything will be copied over:

   INSERT { ?s ?p ?o }
   WHERE { 
      SERVICE <http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name> 
      { ?s ?p ?o }
   }
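
If you would rather send that update from a script than paste it into a query form, the SPARQL 1.1 Protocol allows an update to be POSTed directly with the content type application/sparql-update. The sketch below only builds the request and does not send it. It assumes that the Dydra URL from the question accepts updates and it leaves out authentication; the function names are hypothetical.

```python
from urllib.request import Request


def build_copy_update(source_endpoint):
    """Return the INSERT ... SERVICE update that pulls everything from source_endpoint."""
    return ('INSERT { ?s ?p ?o }\n'
            'WHERE {\n'
            '  SERVICE <%s>\n'
            '  { ?s ?p ?o }\n'
            '}' % source_endpoint)


def build_update_request(target_endpoint, update):
    """Wrap a SPARQL update in a direct POST as described by the SPARQL 1.1 Protocol."""
    return Request(
        target_endpoint,
        data=update.encode('utf-8'),
        headers={'Content-Type': 'application/sparql-update; charset=utf-8'},
        method='POST',
    )


# Placeholder URLs taken from the question; the Dydra one is assumed to accept updates.
SESAME = 'http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name'
DYDRA = 'http://dydra.com/username/rep_name/sparql'

request = build_update_request(DYDRA, build_copy_update(SESAME))
# urllib.request.urlopen(request) would send it once credentials are added.
```

Note that the Sesame endpoint has to be reachable from Dydra's servers for the SERVICE call to succeed.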

If the above doesn't work on Dydra, you can alternatively just directly access the RDF statements from your Sesame store by using the URI http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements. Assuming Dydra has an upload-feature where you can provide the URL of an RDF document, you can simply provide it the above URI and it should be able to load it.
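
If Dydra cannot reach your Sesame server, the same /statements URL can be used to save a dump locally and upload the file instead. A minimal sketch using only the standard library follows. The function names and the output file name are hypothetical, and RDF/XML is chosen here simply because it is a format Sesame is known to serve.

```python
import shutil
from urllib.request import Request, urlopen


def build_statements_request(repository_url, mime_type='application/rdf+xml'):
    """Build a GET request for all statements; the Accept header selects the RDF format."""
    url = repository_url.rstrip('/') + '/statements'
    return Request(url, headers={'Accept': mime_type})


def download_statements(repository_url, target_path):
    """Stream the repository contents into a local file."""
    request = build_statements_request(repository_url)
    with urlopen(request) as response, open(target_path, 'wb') as out:
        shutil.copyfileobj(response, out)


# Example call (needs a reachable Sesame server, so it is left commented out):
# download_statements('http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name', 'dump.rdf')
```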

It seems one needs permission from Dydra to use the SERVICE clause. Otherwise, this is probably the most painless way to port the data. - kurious, Dec 22 '15 at 23:04

score 0 · Answer 2 · edited May 23 '17 at 12:30

The code above can work if the following changes are made:

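1. Use a CONSTRUCT query instead of SELECT. Details are in the follow-up question "How to iterate over CONSTRUCT output from rdflib?" (http://stackoverflow.com/questions/34425876/how-to-iterate-over-construct-output-from-rdflib).
2. Use key as the value for both account[login] and account[password].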

However, this is probably not the most efficient way. The main problem is that issuing an individual INSERT for every triple does not work well: Dydra doesn't record all statements this way (I got only about 30% of the triples inserted). By contrast, using the http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements method suggested by Jeen enabled me to port all the data successfully.
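
If you do have to go through INSERT, grouping many triples into one INSERT DATA request reduces the number of round trips considerably. The thread does not explain why only about 30% of the single-triple inserts arrived, so this is not a guaranteed fix, just a sketch of the batching idea. It assumes the triples are already serialized as N-Triples lines; the function name and the batch size of 1000 are arbitrary choices.

```python
def batched_insert_updates(triples, batch_size=1000):
    """Yield INSERT DATA updates, each holding at most batch_size triples.

    triples is an iterable of already serialized lines such as
    '<http://example.org/s> <http://example.org/p> "o" .'
    """
    batch = []
    for triple in triples:
        batch.append(triple)
        if len(batch) == batch_size:
            yield 'INSERT DATA {\n%s\n}' % '\n'.join(batch)
            batch = []
    if batch:  # remaining triples that did not fill a whole batch
        yield 'INSERT DATA {\n%s\n}' % '\n'.join(batch)
```

One caveat: blank node labels are scoped to a single update request, so a blank node that appears in two different batches ends up as two distinct nodes in the target store. If the data contains blank nodes, loading a complete dump keeps them intact.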
