
I am uploading JSON data to an Elasticsearch index using Python's http.client. The upload itself succeeds, but I have a character encoding issue: once inserted, special characters like é come out garbled (mojibake such as Ã©).

Here is the code:

import http.client
connection = http.client.HTTPConnection(elastic_address)
headers = {"Content-type": "application/json", "Accept": "text/plain"}
connection.request('PUT', url=endpoint, headers = headers, body=json_data.encode('utf-8'))

I have noticed that if I replace the special characters in the source JSON before sending it, for example é with \u00E9, it works fine. That might suggest Elasticsearch uses a different character encoding, but according to this link, ES uses UTF-8.
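To make the escaping workaround concrete, here is a minimal sketch of producing the \u00E9 form with the standard json module instead of replacing characters by hand. The document contents are hypothetical, and this only shows that the escaped body is pure ASCII, so it reads the same whichever charset the receiver assumes; it does not explain where the garbling happens.

```python
import json

# Hypothetical document; the real json_data is not shown in the question.
doc = {"name": "café"}

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(doc)

# ensure_ascii=False keeps é as a literal character in the JSON text.
raw = json.dumps(doc, ensure_ascii=False)

# Bodies as they would be passed to connection.request(..., body=...).
body_escaped = escaped.encode("utf-8")
body_raw = raw.encode("utf-8")

# Both forms decode back to the same document.
assert json.loads(body_escaped) == json.loads(body_raw) == doc
```

The escaped body contains only ASCII bytes, so a wrong charset guess on the receiving side cannot corrupt it, which would be consistent with the workaround behaving correctly.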

I've also looked through client.py in the http.client package, and it seems that the data is encoded as latin-1, see below:

def _encode(data, name='data'):
    """Call data.encode("latin-1") but show a better error message."""
    try:
        return data.encode("latin-1")
    except UnicodeEncodeError as err:
        raise UnicodeEncodeError(
            err.encoding,
            err.object,
            err.start,
            err.end,
            "%s (%.20r) is not valid Latin-1. Use %s.encode('utf-8') "
            "if you want to send it encoded in UTF-8." %
            (name.title(), data[err.start:err.end], name)) from None
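As far as I can tell, that helper only applies when the body is passed as a str; a bytes body is sent as-is. The sketch below does not call http.client at all. It just compares the two encodings for the same character, with a hypothetical helper `fits_latin1` to show when the Latin-1 path would fail.

```python
text = "é"

# What a str body would become if encoded as Latin-1: a single byte.
latin1_bytes = text.encode("latin-1")

# What the script in the question sends: two bytes in UTF-8.
utf8_bytes = text.encode("utf-8")


def fits_latin1(s: str) -> bool:
    """Hypothetical helper: True if s can be encoded as Latin-1."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(latin1_bytes, utf8_bytes)
print(fits_latin1("é"), fits_latin1("€"))
```

Since the script already calls `.encode('utf-8')`, the body is bytes, so the Latin-1 branch should not be the one that runs here.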

I'm not sure where the issue is: in the script, in the http.client package, or in the Elasticsearch index settings?

Any idea?

WillMonge
DanyDC
  • I have never used http.client, but if you use the [Requests library](http://docs.python-requests.org/en/master/) I believe it will automatically encode characters in the body – BigGerman May 11 '18 at 19:27
  • If you pass a string to `client.request`, it uses the default HTTP encoding - variously known as ISO/IEC 8859-1, latin1, cp1252 - to serialize it. If you pass `bytes`, no encoding is needed. Since you did the encoding yourself, you need to add `"charset":"UTF-8"` to the header to let the other side know what you did. – tdelaney May 11 '18 at 19:27
  • Since there was no charset given, the receiving side thinks it's latin1 and decodes to those funky characters. – tdelaney May 11 '18 at 19:28
  • @tdelaney Added the header encoding but no luck :/ – DanyDC May 12 '18 at 09:02
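The mismatch described in the comments can be reproduced without any network call. This is a minimal sketch, assuming the garbling comes from UTF-8 bytes being read as Latin-1 somewhere along the way; the thread does not establish where that happens. The `headers` dict shows the standard way to attach a charset parameter to Content-Type, which is an assumption about what the comment meant, and whether Elasticsearch takes it into account is not confirmed here.

```python
# Bytes the script sends for é.
sent = "é".encode("utf-8")

# What a reader sees if those bytes are interpreted as Latin-1.
misread = sent.decode("latin-1")

# Reversing the wrong decode recovers the original character.
repaired = misread.encode("latin-1").decode("utf-8")

# Assumed header form: charset is a parameter of Content-Type,
# not a separate header key.
headers = {
    "Content-Type": "application/json; charset=utf-8",
    "Accept": "application/json",
}

print(misread, repaired)
```

If the stored document is correct and only the viewer decodes it wrongly, the same round trip would apply on the reading side instead of the writing side.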

0 Answers