
As outlined in

https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/storage/sparql.py

and

https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/tests/testSPARQL.py

I am trying to allow for a "round trip" operation between a Python list of dicts and Jena/SPARQL-based storage.
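To illustrate the "store" direction of the round trip, here is a minimal sketch of generating an INSERT DATA body from a list of dicts. The function name `triplify` and its parameters are hypothetical; the real code lives in sparql.py:

```python
# Minimal sketch of the list-of-dicts -> INSERT DATA direction of the round trip.
# triplify and its parameters are hypothetical names; the real code is in sparql.py.

def triplify(records, prefix="cr", entity="Event", pk="eventId"):
    """Build a SPARQL INSERT DATA body from a list of dicts."""
    lines = []
    for record in records:
        # derive a subject identifier from the primary key (alphanumeric chars only)
        subject = "%s:%s__%s" % (
            prefix, entity, "".join(c for c in str(record[pk]) if c.isalnum()))
        for key, value in record.items():
            if isinstance(value, int):
                obj = str(value)
            else:
                obj = '"%s"' % value  # naive quoting - exactly where escaping later breaks
            lines.append("  %s %s:%s_%s %s." % (subject, prefix, entity, key, obj))
    return "INSERT DATA {\n%s\n}" % "\n".join(lines)

records = [{"eventId": "10.2140/gtm.2000.3",
            "name": "Higher local fields", "year": 1999}]
print(triplify(records))
```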

The approach performs very well for my use case, but after trying it out for a while, more details surface that need to be addressed.

The Stack Overflow question "listOfDict to RDF conversion in python targeting Apache Jena Fuseki" addresses the initial issues, and the closed issues 2-5 at https://github.com/WolfgangFahl/DgraphAndWeaviateTest/issues?q=is%3Aissue+is%3Aclosed show some detail problems that have already been fixed.

Now I am working with some 180,000 records that I'd like to import from 6 different data sources, and each data source seems to have new exotic records that make the approach fail.

E.g. one batch of records gives me the following log:

read 45601 events in   0.6 s
storing 45601 events to sparql
  batch for         1 -      2000 of     45601 cr:Event in    0.6 s ->    0.6 s
  batch for      2001 -      4000 of     45601 cr:Event in    0.5 s ->    1.1 s
  batch for      4001 -      6000 of     45601 cr:Event in    0.5 s ->    1.6 s
  batch for      6001 -      8000 of     45601 cr:Event in    0.5 s ->    2.1 s
  batch for      8001 -     10000 of     45601 cr:Event in    0.5 s ->    2.6 s
  batch for     10001 -     12000 of     45601 cr:Event in    0.7 s ->    3.2 s
======================================================================
ERROR: testCrossref (tests.test_Crossref.TestCrossref)
test loading crossref data
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/wf/Library/Python/3.8/lib/python/site-packages/SPARQLWrapper/Wrapper.py", line 1073, in _query
    response = urlopener(request)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

SPARQLWrapper.SPARQLExceptions.QueryBadFormed: QueryBadFormed: a bad request has been sent to the endpoint, probably the sparql query is bad formed.

Response:
b'Error 400: Bad Request\n'

Since I don't get any details about what the problem is, I am working with a binary search. From the error above I only know that the problem is with a record whose batchIndex is between 12000 and 14000, so I set the limit to 14000 and the batchSize to 100 to get closer.

 batch for     13301 -     13400 of     14000 cr:Event in    0.0 s ->    4.3 s

is now the last successful batch. So I bisect: 13450 fails, 13425 fails, 13412 ok, 13418 ok, 13422 fails, 13420 ok, 13421 ok. So record 13422 is the culprit, and I switch on debug mode to see the INSERT DATA created for the record:
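The bisection above can be sketched as follows. `store_batch` stands in for the real "store records up to this limit" call (which raises on a bad record); here it is simulated so the search itself is visible:

```python
# Sketch of the binary search used to pin down the failing record.
# store_batch is a stand-in for the real store call, which raises
# QueryBadFormed when the bad record is included; here it is simulated.

BAD_INDEX = 13422  # the culprit this search converges on

def store_batch(limit):
    """Simulated store: fails as soon as the bad record is included."""
    if limit >= BAD_INDEX:
        raise Exception("HTTP Error 400: Bad Request")

def find_culprit(good, bad):
    """Binary search: good = last known working limit, bad = known failing limit."""
    while bad - good > 1:
        mid = (good + bad) // 2
        try:
            store_batch(mid)
            good = mid
        except Exception:
            bad = mid
    return bad  # first limit that fails -> index of the bad record

print(find_culprit(13400, 13450))  # → 13422
```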

  cr:Event__102140gtm20003 cr:Event_name "Higher local fields".
  cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany".
  cr:Event__102140gtm20003 cr:Event_source "crossref".
  cr:Event__102140gtm20003 cr:Event_eventId "10.2140/gtm.2000.3".
  cr:Event__102140gtm20003 cr:Event_title "Invitation to higher local fields".
  cr:Event__102140gtm20003 cr:Event_startDate "1999-08-29"^^<http://www.w3.org/2001/XMLSchema#date>.
  cr:Event__102140gtm20003 cr:Event_year 1999.
  cr:Event__102140gtm20003 cr:Event_month 9.
  cr:Event__102140gtm20003 cr:Event_endDate "1999-09-05"^^<http://www.w3.org/2001/XMLSchema#date>.

So the TeX-style umlaut encoding `\"u` in the location "Münster" is the culprit here: the unescaped double quote terminates the SPARQL string literal early. I will work around this issue. The real question is:

How can I get the Fuseki API via SPARQLWrapper to properly report a detailed error message?

e.g. with something like

error in line #: cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany". is not a valid triple
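For reference, the workaround for the escaping problem itself can be sketched as follows, escaping the characters the SPARQL grammar requires to be escaped inside a double-quoted literal. `escape_literal` is a hypothetical helper, not the actual code in sparql.py:

```python
# Possible workaround for the escaping problem: escape backslashes, quotes
# and line breaks before embedding a value in a double-quoted SPARQL literal.
# escape_literal is a hypothetical helper, not the actual code in sparql.py.

def escape_literal(value: str) -> str:
    """Escape a Python string for use inside a double-quoted SPARQL literal."""
    return (value.replace("\\", "\\\\")   # backslash first, so later escapes survive
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

location = 'M\\"unster, Germany'  # the raw TeX-encoded value that broke the INSERT
print('cr:Event__102140gtm20003 cr:Event_location "%s".' % escape_literal(location))
```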
  • `formatted_msg` in `SPARQLWrapperException`? – Stanislav Kralin Aug 25 '20 at 11:27
  • @StanislavKralin - indeed this gives some more info e.g. "formatted_msg str: QueryBadFormed: a bad request has been sent to the endpoint, probably the sparql query is bad formed. \n\nResponse:\nb'Error 400: Bad Request\\n' " but that does not really solve the problem. – Wolfgang Fahl Aug 25 '20 at 12:05
  • see also https://lists.apache.org/thread.html/r55586eec9b37b2441e0b97cc6c3adc8fe172e6cf9a494688ee0256bf%40%3Cusers.jena.apache.org%3E – Wolfgang Fahl Aug 25 '20 at 13:03
