
Using the AllegroGraph 4.6 Python API, I call connection.addTriple() to add a triple whose object is a literal containing a non-ASCII character (×):

conn.addTriple( ..., ..., '5 × 10**5' )

This doesn't work. I get the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position...

Here's the full traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 357, in addTriple
    self._convert_term_to_mini_term(obj), cxt)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 235, in _convert_term_to_mini_term
    return self._to_ntriples(term)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 367, in _to_ntriples
    else: return term.toNTriples();
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/model/literal.py", line 182, in toNTriples
    sb.append(strings.encode_ntriple_string(self.getLabel()))
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/util/strings.py", line 52, in encode_ntriple_string
    string = unicode(string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)

Instead I can add the triple like this:

conn.addTriple( ..., ..., u'5 × 10**5' )

That way I don't get an error.
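A minimal sketch of what seems to be going on, written for Python 3 where bytes and text are separate types (the traceback above is from Python 2, where a plain '...' literal is a byte string). It assumes the literal was typed as UTF-8, which matches the 0xc3 byte in the error; `raw`, `text`, `ascii_ok` and `failing_byte` are illustrative names only.

```python
# The literal as UTF-8 bytes: × is stored as the two bytes 0xC3 0x97.
raw = "5 \u00d7 10**5".encode("utf-8")

# Python 2's unicode(string) with no encoding argument decodes as ASCII,
# which is roughly this call; it fails on the first byte above 0x7F.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    failing_byte = raw[exc.start]

# Decoding with the right codec succeeds; a u'' literal in Python 2
# does the equivalent work when the source is read.
text = raw.decode("utf-8")
```

Passing text that is already decoded means the library's `unicode(string)` call has nothing left to decode, which would explain why the u'' version works.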

But if I use connection.addFile(filename, format=RDFFormat.NTRIPLES) to load an N-Triples file that contains some UTF-8 encoded characters, I get this error message when the file is saved with ANSI encoding from Notepad++:

400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x10046f9ea2> at line 12764 (last character was
#\×): nil
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
    commitEvery=self.add_commit_size)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
    nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
    if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing

I get this error message if the file is saved as UTF-8 encoding:

400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x100486e8b2> at line 1 (last character was
#\): Subjects must be resources (i.e., URIs or blank nodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
    commitEvery=self.add_commit_size)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
    nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
    if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing

However, if the file is set to ANSI encoding in Notepad++, I can paste in the × character, save, and then the file loads fine. If I change the file encoding to UTF-8 after I paste the character, the character changes to some strange xD7 character. And if the file is set to UTF-8 encoding when I paste the ×, then changing the encoding to ANSI turns the × into Ã—.

When the file was given to me, it had Ã— where the × should have been, and when I tried to load it in AllegroGraph I got the first 400 MALFORMED DATA error, which fails at the line where the character actually appears in the file (12764) rather than at the first line. I assume the second 400 MALFORMED DATA error, the one on line 1, has something to do with the header written by Notepad++ for UTF-8 encoded files. So apparently I have to save the file as ANSI if I want AllegroGraph not to hiccup immediately, but there has to be some way to tell AllegroGraph to read things like Ã— as UTF-8 characters.
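A hedged sketch for checking the header theory: a UTF-8 byte order mark is the three bytes EF BB BF at the very start of the file, so a parser would see them before the first `<` on line 1. The helper names `has_utf8_bom` and `strip_utf8_bom` and the example.org URIs are made up for illustration. This only deals with the header; it does nothing about non-ASCII characters further down the file.

```python
import codecs

def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulated file contents: a BOM followed by one N-Triples line.
sample = codecs.BOM_UTF8 + b'<http://example.org/s> <http://example.org/p> "5" .\n'
cleaned = strip_utf8_bom(sample)
```

For a real file, the bytes would come from `open(filename, 'rb').read()`; decoding with the `utf-8-sig` codec is another way to drop the mark while reading.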

In the file, the triple looks like:

<...some subject URI...> <...some predicate URI...> "5 × 10**5" .

John Thompson
  • What code are you using to add the file of ntriples? – deadly Jun 11 '12 at 19:56
  • @deadly `connection.addFile(filename, format=RDFFormat.NTRIPLES)` – John Thompson Jun 11 '12 at 19:58
  • Bizarre. I can understand why addTriple gives an error, but with addFile you're only giving it the filename and it should be going off and parsing the file happily. What's the full traceback? – deadly Jun 11 '12 at 20:29
  • @deadly I actually made a mistake in my original question. I don't get the same error message when I load a file. See the tracebacks I added. – John Thompson Jun 11 '12 at 21:21
  • See this: http://bit.ly/unipain – Daenyth Jun 12 '12 at 00:02
  • @Daenyth that was very informative. Thanks. – John Thompson Jun 12 '12 at 15:02
  • Can you try opening the files using UTF-8? There should be an option in Notepad++ to coerce it to UTF-8 in the format menu. – deadly Jun 12 '12 at 15:22
  • Yeah, I can view the file as UTF-8 in Notepad++ and save it as UTF-8 with or without a BOM. If I save it with a BOM, then the parser fails at line 1, so apparently it doesn't know what to do when it sees a BOM. If I save it without the BOM, then the parser fails at the `Ã—`. – John Thompson Jun 12 '12 at 18:43
  • I'm sorry. I just don't know where to go from here. The errors seem quite specific to AllegroGraph. You might have to go get some product specific support. – deadly Jun 14 '12 at 12:57
  • Thanks. I did email the support folks and they said that AllegroGraph can take unicode characters in nTriples using `\uXXXX` notation. Alternatively I can use RDFXML, which allows me to leave the unicode characters as they are. – John Thompson Jun 15 '12 at 14:58
  • Ah, I'm glad you've got it sorted. I don't really feel I deserve the accept, but I'll edit my answer and add in the extra details from our comment conversations. – deadly Jun 15 '12 at 16:52

2 Answers


Use the codecs module:

import codecs
f = codecs.open('file.txt','r','utf8')

This opens your file and decodes its contents as UTF-8.

Justin.Wood
  • the addFile method takes a filename string rather than a file object, but I suppose I can try editing the AllegroGraph Python API to see if this works. – John Thompson Jun 11 '12 at 21:39
  • I edited the API to do this, but I still get a 400 MALFORMED DATA error. – John Thompson Jun 11 '12 at 22:20
  • @JohnPeterThompsonGarcés I don't know the API that you are dealing with, but can't you open the file and then iterate over its contents? Something along the lines of `f = codecs.open('file.txt','r','utf8')` followed by `for line in f.readlines(): conn.addTriple( ..., ..., line )`. You could even wrap a unicode function around the line variable if necessary, e.g. `conn.addTriple( ..., ..., unicode(line,'utf8') )` – Justin.Wood Jun 12 '12 at 14:49

\xd7 is the Latin-1 encoding of ×.

Ã— is what you get if you mistakenly decode × as cp1252 (often Windows' default codec) when it has actually been encoded in UTF-8.

When you're given files that show Ã—, try changing the codec that's used to display them to UTF-8.
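The relationship between these byte sequences can be checked directly. This is a small Python 3 sketch using only the built-in codecs; the variable names are illustrative.

```python
times = "\u00d7"  # the multiplication sign ×

latin1_bytes = times.encode("latin-1")   # one byte: 0xD7
utf8_bytes = times.encode("utf-8")       # two bytes: 0xC3 0x97

# Reading UTF-8 bytes as cp1252 turns each byte into its own character.
mojibake = utf8_bytes.decode("cp1252")

# If nothing was lost on the way, reversing the two steps restores the text.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

The round trip in the last step only works when every byte survived the wrong decoding; a few byte values have no cp1252 character, so treat it as a diagnostic rather than a guaranteed repair.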


For an overview of Unicode in Python, see http://bit.ly/unipain (thanks to Daenyth).


As you found out from AllegroGraph support:

AllegroGraph can take unicode characters in nTriples using \uXXXX notation. Alternatively one can use RDFXML, which allows you to leave the unicode characters as they are.
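As a rough sketch of the `\uXXXX` route (Python 3, and not part of the AllegroGraph client): the hypothetical helper `escape_ntriples_literal` below rewrites the text of a literal so that the resulting line is pure ASCII, using `\uXXXX` for characters up to U+FFFF and `\UXXXXXXXX` above that, in the style of the classic N-Triples string escapes. The example.org URIs are placeholders.

```python
def escape_ntriples_literal(text):
    """Escape a literal's text so the result contains only ASCII."""
    # Characters that have short escapes of their own.
    simple = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    out = []
    for ch in text:
        code = ord(ch)
        if ch in simple:
            out.append(simple[ch])
        elif 0x20 <= code <= 0x7E:
            out.append(ch)                    # printable ASCII stays as-is
        elif code <= 0xFFFF:
            out.append("\\u%04X" % code)      # e.g. × becomes \u00D7
        else:
            out.append("\\U%08X" % code)      # characters beyond the BMP
    return "".join(out)

escaped = escape_ntriples_literal("5 \u00d7 10**5")
line = '<http://example.org/s> <http://example.org/p> "%s" .' % escaped
```

Apply the helper to the literal's text only, not to a whole line, since it would also escape the quotes that delimit the literal.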

deadly
  • There's still a problem with addFile() though. If the file is encoded in UTF-8, I get a `400 MALFORMED DATA` and the parser fails at `position 1`. It fails even if I do what Justin.Wood suggested and use `f = codecs.open(...,'r','utf8')`. It won't fail at `position 1` if I give it something with ANSI encoding, but if it includes `Ã—` (as it appears in Notepad++ under ANSI encoding) then it will fail at the position of the character. It won't fail if it's ANSI encoding and the Latin-1 encoding of `×` is pasted in, but that's not a good solution because I have files with multiple utf8 chars. – John Thompson Jun 12 '12 at 14:17