I have one source of data that I don't control, and it sends strings in different encodings, with no way for me to know the encoding in advance! I need to identify the encoding so that I can decode each string correctly and store it in a format that I understand and control, let's say UTF-8.

For example:

  • "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"

should read

  • "J'ai déjà un problème, après... je ne sais pas"

What I have tried:

> stringToTest="J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# there is no decode for strings directly, but one can try
> stringToTest.encode().decode()
"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# which does not help :)

By trial and error, I found that the encoding is 'iso-8859-1':

> stringToTest.encode('iso-8859-1').decode()
"J'ai déjà un problème, après... je ne sais pas"

What I want/need is to find the 'iso-8859-1' automatically!

I tried to use chardet!

> import chardet

> chardet.detect(stringToTest)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>

But... as it is a string, chardet does not accept it! And, I am ashamed to admit, I can't manage to convert the string into something that chardet accepts!

> test1=b"J'ai déjà un problème, après... je ne sais pas"
  File "<input>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

# Ok str and unicode are similar things... but who knows?!?!
> test1=u"J'ai déjà un problème, après... je ne sais pas"
> chardet.detect(test1)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>

# Nope
> bytes(stringToTest)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
TypeError: string argument without an encoding

Why not unidecode?!?

from unidecode import unidecode
unidecode(stringToTest)
'J\'ai dA(c)jA un problA"me, aprA"s... je ne sais pas'
snakecharmerb
  • _I have one source of data, that I don't control, and that sends strings with different encodings_ You receive the data as Python strings directly? – AMC Apr 02 '20 at 20:16
  • I am REALLY sorry for the delay, I kind of dropped out of circulation :(. I receive a JSON file from different sources all around the world, and each one sends it in its own encoding. It is like a REST service, if you wish. – Daniel Camara Apr 26 '20 at 06:28
  • _I receive a json file_ How? How are you reading/parsing it? – AMC Apr 27 '20 at 02:55
  • I receive it in a file, a dump if you wish. But each one of these files comes from a different source. I just open the file, for example: DAMN!!!!! You are RIGHT, I AM STUPID!!!!!!!!! I am really sorry!!! I am opening it with "with open(fullFileName, encoding="utf8") as json_data:". THAT may be why!!! I will run some more tests! – Daniel Camara Apr 30 '20 at 08:04

1 Answer

The string

"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"

is an example of mojibake: encoded text (bytes) that has been decoded with the wrong encoding. In this particular case, the text was originally encoded as UTF-8 and then decoded as ISO-8859-1 (latin-1); re-encoding as ISO-8859-1 recreates the UTF-8 bytes, and decoding from UTF-8 (the default in Python 3) produces the expected result.
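
The round trip described above can be reproduced with the standard library alone. This is only a sketch of how such a string might have come about; the variable names are illustrative, and it assumes the wrong decoding step used ISO-8859-1.

```python
# The sentence as the sender intended it.
original = "J'ai déjà un problème, après... je ne sais pas"

# The sender encodes the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# A receiver wrongly decodes those bytes as ISO-8859-1: this is the mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 maps code points 0-255 one-to-one onto byte values,
# so re-encoding recreates the UTF-8 bytes exactly.
recovered_bytes = mojibake.encode("iso-8859-1")

# Decoding with the correct codec gives the intended text back.
repaired = recovered_bytes.decode("utf-8")

print(repr(mojibake))
print(repaired)
```

The same one-to-one property does not hold for every codec (Windows-1252, for instance, leaves a few byte values undefined), so the repair is only this clean when ISO-8859-1 was the codec used by mistake.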

If you are receiving these mojibake strings from an external source, you can safely encode them using ISO-8859-1 to recreate the original bytes. The bytes - encoded text - may be passed to chardet for analysis.
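
As a rough sketch of that last step: `chardet.detect` takes bytes (or a bytearray) and returns a dict that includes `'encoding'` and `'confidence'` keys. The helper name `guess_encoding` is hypothetical, and the guarded import is only there so the snippet still runs where the third-party `chardet` package is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the rest of the sketch still runs, minus the detection


def guess_encoding(raw):
    """Return chardet's best guess for `raw` (bytes), or None if unavailable."""
    if chardet is None:
        return None
    # chardet.detect returns a dict with 'encoding' and 'confidence' keys;
    # 'encoding' can itself be None when chardet cannot decide.
    return chardet.detect(raw)["encoding"]


mojibake = "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"

# Recreate the original bytes from the wrongly decoded string.
raw = mojibake.encode("iso-8859-1")

# chardet only accepts bytes or bytearray, which is why passing a str fails.
print(guess_encoding(raw))

# UTF-8 decoding is strict, so a successful decode is good evidence by itself.
print(raw.decode("utf-8"))
```

The guess is statistical and can be wrong on short inputs, so it may be worth checking the confidence value before trusting it. If the data arrives as a file, opening it in binary mode (`open(path, "rb")`) gives bytes that can be passed to the detector directly, without going through a wrongly decoded string first.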

snakecharmerb
  • Thanks for your answer, and likewise, I am really sorry for the long delay in getting back to you. The thing is that in this particular case it is ISO-8859-1, and I got that by trial and error, but I have no guarantee that the next string will not arrive in another XXX encoding that I don't know of. I understand it works like this: stringToTest.encode('XXX').decode() == "a clean utf8 string". But how do I find "XXX" automatically? I don't know, I may be going the wrong way and the answer is clear to everyone but me! :( – Daniel Camara Apr 26 '20 at 06:36
  • @DanielCamara Can you provide some more context for this? How exactly do you receive the data? – AMC Apr 27 '20 at 02:52