Microsoft Azure Text Analytics Cognitive Service Encoding Issue

Question

To use its Text Analytics service, Azure requires a JSON document that looks like this:

document = {
  "documents" :[
    {"id": "1", "language": "en", "text": "I had a wonderful experience! The rooms were wonderful and the staff was helpful."},
    {"id": "2", "language": "en", "text": "I had a terrible time at the hotel. The staff was rude and the food was awful."},
    {'id': '3', 'language': 'es', 'text': 'Los caminos que llevan hasta Monte Rainier son espectaculares y hermosos.'},  
    {'id': '4', 'language': 'es', 'text': 'La carretera estaba atascada. Había mucho tráfico el día de ayer.'}]}

The problem I have at the moment is that the last record (id 4) causes this error:

b'{"code":"BadRequest","message":"Invalid request","innerError":{"code":"InvalidRequestBodyFormat","message":"Request body format is wrong. Make sure the json request is serialized correctly and there are no null members."}}'

The JSON is formatted correctly: it comes straight from their site, and it runs perfectly fine without the last record. After more testing I found that the í and á are what trigger the error. To make sure, I also tried English words such as resumé and fiancé, and got the same error. That doesn't make sense, since Spanish is one of the supported languages for text analysis and the text's language is even defined as Spanish before it's processed.

So my question is: am I missing a step before passing my data to Azure? Am I supposed to convert the text, change the encoding, or remove those characters, or is this something that Azure's API should be able to handle?

EDIT: A little more background: I followed the instructions on their site to set it up with Python. It works perfectly except for the issue described above.

It's likely an encoding problem. Are you sending your data with a UTF-8 content type? — ADyson, Aug 08 '18 at 21:38
Thanks @ADyson! I originally thought that since Python 3 was encoded in UTF-8 that it would carry over to its string variables as well, but I guess not. — Burhan Nurdin, Aug 08 '18 at 22:43

score 0 · Answer 1 · answered Aug 08 '18 at 22:41

Figured it out thanks to @ADyson.

You must ensure the input is encoded as either UTF-8 or UTF-16 in order for it to run correctly.
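
For anyone wondering what that looks like in practice, here is a minimal sketch of one way to do it in Python. It is not necessarily the exact change made here. The helper name build_request_body is made up for illustration, the subscription key is a placeholder, and the header names are assumed to match the ones in Azure's quickstart.

```python
import json

def build_request_body(documents):
    """Serialize the payload to JSON and encode it explicitly as UTF-8 bytes.

    Hypothetical helper for illustration; not part of any Azure SDK.
    """
    # ensure_ascii=False keeps characters such as 'í' and 'á' unescaped,
    # so the explicit encode() call decides which bytes go on the wire.
    text = json.dumps(documents, ensure_ascii=False)
    return text.encode("utf-8")

documents = {"documents": [
    {"id": "4", "language": "es",
     "text": "La carretera estaba atascada. Había mucho tráfico el día de ayer."},
]}

body = build_request_body(documents)

# Assumed headers: declare the charset so it matches the encoded body.
headers = {
    "Content-Type": "application/json; charset=utf-8",
    "Ocp-Apim-Subscription-Key": "<your-subscription-key>",  # placeholder
}
```

Pass body (bytes, not str) to whatever HTTP client you use, so the client does not re-encode the string with a different default. Leaving json.dumps at its default ensure_ascii=True is another option: it escapes every non-ASCII character as \uXXXX, which keeps the body pure ASCII and is still valid JSON.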

Burhan Nurdin


What did you do to ensure the input is encoded as either UTF-8 or UTF-16? – Kirby Sep 02 '23 at 00:05
