Should ASCII control characters be stripped from documents before sending to Vespa?

Question

I'm trying to store a document with a string field in Vespa. When I use the document-api HTTP endpoint, the document is rejected with a parsing error. I've validated that the JSON being sent is correct (other documents go through fine).

Here is the error message that I'm seeing:

PARSER_ERROR Error in document 'id:x:y:n=1:1FVzo2l7mMLticB0WMkBKIECMLzAg' - could not parse field 'content' of type 'string': The string field value contains illegal code point 0xB

I can see that there's a check for these sorts of characters (a vertical tab in my case) in com.yahoo.text.Text (allowedAsciiChars), but I don't see anywhere in the documentation that I should be stripping these characters before sending to Vespa. In fact, I see something like the opposite: Vespa will go out of its way to replace certain characters behind the scenes without rejecting them.

score 2 · Accepted Answer · edited Oct 07 '21 at 11:06

Please strip ASCII control characters from the documents before feeding.

I will update the documentation, although it seems the JSON spec says these control characters must be escaped, so they are implicitly not allowed in the feed.
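For illustration, here is a minimal client-side sketch of the stripping step in plain Java. It is not Vespa's own implementation: the class and method names (ControlCharSanitizer, stripControlChars) are made up for this example, and it assumes that tab, line feed and carriage return should be kept while every other code point below 0x20 is dropped. Vespa's actual check may cover more code points than this, so treat it as a starting point only.

```java
public class ControlCharSanitizer {

    /** Tab, line feed and carriage return are kept (an assumption of this sketch). */
    private static boolean isKeptWhitespace(int codePoint) {
        return codePoint == 0x09 || codePoint == 0x0A || codePoint == 0x0D;
    }

    /** Removes ASCII control characters below 0x20, except tab, LF and CR. */
    public static String stripControlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> cp >= 0x20 || isKeptWhitespace(cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // A vertical tab (0x0B), the code point from the error message above.
        String raw = "line one" + (char) 0x0B + "line two";
        System.out.println(stripControlChars(raw));
    }
}
```

Running the field value through such a method before building the feed JSON should avoid the "illegal code point 0xB" rejection. If the control character was acting as a separator, replacing it with a space instead of dropping it may be the better choice.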

answered Jan 05 '19 at 09:40

Kristian Aune

Does this apply only to feeding via the http endpoint or for the Java feeder as well? Thanks! – user3230650 Jan 05 '19 at 17:27
Pretty sure we were escaping the control characters when sending them but will double check. – user3230650 Jan 07 '19 at 19:16

score 1 · Answer 2 · answered Jan 07 '19 at 09:33

I see sort of the opposite situation where Vespa will go out of its way to replace certain chars behind the scenes

Where do you see this?

There is a Text.stripInvalidCharacters method, provided as a utility for Java clients that need to strip such characters from non-sanitized text.

Jon

I meant that in reference to linguistics processing like accent normalization, which of course is not really the same as control character handling. – user3230650 Jan 07 '19 at 19:14
