How to encode RDF N-Triples string literals?

Question

The specification for RDF N-Triples states that string literals must be encoded.

https://www.w3.org/TR/n-triples/#grammar-production-STRING_LITERAL_QUOTE

Does this "encoding" have a name I can look up to use it in my programming language? If not, what does it mean in practice?

score 4 · Accepted Answer · answered Nov 27 '16 at 22:21

The grammar productions that you need are right in the document that you linked to:

[9]     STRING_LITERAL_QUOTE ::= '"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
[141s]  BLANK_NODE_LABEL     ::= '_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')* PN_CHARS)?
[10]    UCHAR                ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
[153s]  ECHAR                ::= '\' [tbnrf"'\]

This means that a string literal begins and ends with a double quote ("). Inside of the double quotes, you can have:

any character except #x22 (double quote), #x5C (backslash), #xA (line feed), and #xD (carriage return), which must be written as escapes instead;
a Unicode character represented with a \u followed by four hex digits, or a \U followed by eight hex digits; or
an escape character, which is a \ followed by any of t, b, n, r, f, ", ', and \, representing tab, backspace, line feed, carriage return, form feed, double quote, single quote, and backslash respectively.
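As a rough illustration of the grammar above, here is a minimal Python sketch that turns a string into an N-Triples string literal. The function name `escape_ntriples_literal` and the `ascii_only` flag are made up for this example and are not part of any library. It escapes only the four characters the grammar forbids in raw form (#x22, #x5C, #xA, #xD) and, optionally, writes non-ASCII characters as UCHAR escapes.

```python
# Hypothetical helper, not taken from any library: a sketch of the
# STRING_LITERAL_QUOTE production from the N-Triples grammar.

# The four characters that may not appear unescaped inside the quotes.
_ECHAR = {
    '"': '\\"',    # #x22
    "\\": "\\\\",  # #x5C
    "\n": "\\n",   # #xA
    "\r": "\\r",   # #xD
}


def escape_ntriples_literal(value: str, ascii_only: bool = False) -> str:
    """Return `value` wrapped in double quotes with N-Triples escapes applied.

    With ascii_only=True (an assumption of this sketch, not a requirement
    of the grammar), every non-ASCII code point is written as \\uXXXX or
    \\UXXXXXXXX.
    """
    parts = []
    for ch in value:
        cp = ord(ch)
        if ch in _ECHAR:
            parts.append(_ECHAR[ch])
        elif ascii_only and cp > 0x7F:
            # Four hex digits for the BMP, eight for everything above it.
            parts.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        else:
            parts.append(ch)
    return '"' + "".join(parts) + '"'


print(escape_ntriples_literal('This "Literal" needs escaping!'))
# prints: "This \"Literal\" needs escaping!"
```

Other characters, including tab, are left as they are because the grammar allows them unescaped; the ECHAR forms such as \t remain optional alternatives.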

#x22 = ", #x5C = \, #xA = line feed, #xD = carriage return, so I guess those 4 need to be escaped — aveltens, Oct 15 '21 at 14:23

score 3 · Answer 2 · answered Nov 09 '19 at 22:55

In Python, you could use the n3() method of rdflib's Literal class, e.g.

# pip install rdflib

>>> from rdflib import Literal
>>> lit = Literal('This "Literal" needs escaping!')
>>> s = lit.n3()
>>> print(s)
"This \"Literal\" needs escaping!"

coderfi

score 2 · Answer 3 · edited Jun 20 '20 at 09:12

In addition to Josh's answer: it is almost always a good idea to normalize Unicode data to NFC. In Java, for example, you can use the following call:

java.text.Normalizer.normalize("rdf literal", java.text.Normalizer.Form.NFC);

For more information see: http://www.macchiato.com/unicode/nfc-faq

What is NFC?

For various reasons, Unicode sometimes has multiple representations of the same character. For example, each of the following sequences (the first two being single-character sequences) represent the same character:
U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ( Å ) ANGSTROM SIGN
U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE
These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. For more information on these, see the introduction of UAX #15: Unicode Normalization Forms. A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
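The equivalence described in the FAQ excerpt can be checked with Python's standard `unicodedata` module. This is only a small sketch: the helper names `to_nfc` and `is_nfc` are chosen here to mirror the toNFC(S) and isNFC(S) abbreviations and are not library functions.

```python
import unicodedata


def to_nfc(s: str) -> str:
    """Return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def is_nfc(s: str) -> bool:
    """Check whether s is already in NFC by comparing it to its normalized form."""
    return to_nfc(s) == s


# The three canonically equivalent sequences from the FAQ excerpt.
sequences = {
    "LATIN CAPITAL LETTER A WITH RING ABOVE": "\u00C5",
    "ANGSTROM SIGN": "\u212B",
    "A + COMBINING RING ABOVE": "A\u030A",
}

for name, s in sequences.items():
    code_points = " ".join(f"U+{ord(c):04X}" for c in to_nfc(s))
    print(f"{name}: is_nfc={is_nfc(s)}, NFC form = {code_points}")
```

All three inputs normalize to the single code point U+00C5, so normalizing a literal before escaping it means equal-looking strings also compare equal.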
