0

I have a list of bytes objects like the one below, obtained by reading a text file in rb mode (r mode does not work for this file, probably because it mixes characters from several encodings):

new_list = [b'Vanessa Skarski\'s Account of Her Father\'s Death....', b'Hornslet wind-turbine collapse\r\nFrom Wikipedia' .....] etc.

The list has 271 items in total. I want the items to be normal strings, not bytes. I have looked into using new_list = [item.decode(encoding='utf-8') for item in new_list]

but it gives UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 643: invalid start byte. I simply want to get rid of the b' and get normal strings. Any ideas please?

EDIT: The solution mentioned in "Convert bytes to a string?" did not solve the issue, as I already mentioned in my initial post. My Python version is listed below, in case it has anything to do with the error.

3.5.2 (v3.5.2:4def2a2901a5, Jun 26 2016, 10:47:25) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
JChat
  • Possible duplicate of [Convert bytes to a string?](https://stackoverflow.com/questions/606191/convert-bytes-to-a-string) – norok2 Jul 19 '19 at 12:40
  • @norok2 as I already mention, I tried the various solutions using utf-8 decoding (syntax articulated in my question already) but nothing worked. So I guess this isn't a duplicate. I edited the question to mention this. Thanks – JChat Jul 19 '19 at 12:43
  • If you inspect the accepted answer more closely, it says: `utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.` The approach is correct, the encoding is not. [Check out](https://docs.python.org/3/library/codecs.html#standard-encodings) the one you might think it is. – norok2 Jul 19 '19 at 12:49
  • @norok2 I read my text file in the binary mode (rb). So can you tell what the encoding would be for that? Thanks – JChat Jul 19 '19 at 12:50
  • The whole idea of "binary" is that it doesn't _have_ an encoding - it's raw bytes. You need to know the encoding to turn it to a string. We could have tried to figure it out for you, but you "etc"-d the important bits. What's at (and around) position 643? – Amadan Jul 19 '19 at 12:56
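One way to answer the question about what sits at the failing position is to catch the error and look at the bytes around it. This is only a minimal sketch: `sample` is a made-up stand-in for one item of `new_list`, and `show_decode_failure` is a hypothetical helper name, not something from the question.

```python
# Hypothetical sample standing in for one item of new_list;
# the real data from the question is not available here.
sample = b'He said \x93hello\x94 to her'


def show_decode_failure(raw, context=10):
    """Return (offset, offending byte, nearby bytes), or None if raw is valid UTF-8."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded
        lo = max(err.start - context, 0)
        return err.start, raw[err.start:err.start + 1], raw[lo:err.start + context]
    return None


failure = show_decode_failure(sample)
print(failure)
```

Printing the nearby bytes usually makes it easier to guess the encoding, because the readable ASCII text around the bad byte shows what character was probably intended.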

2 Answers

2

The bytes objects you have are not encoded in UTF-8. The encoding depends on the actual content of your files, and nobody can tell you how to decode them properly except whoever created the files in the first place and knows which encoding was used.

However, popular choices, given the context, may be:

  • latin1 (will always decode, but may not be meaningful to you)
  • cp1252 (a popular choice on Windows systems)

hence, e.g.:

new_list = [item.decode(encoding='latin1') for item in new_list]
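As a hedged illustration: the byte 0x93 from the error message is not a valid UTF-8 start byte, but in cp1252 it is the left curly double quote (U+201C), which hints at Windows-1252 text. The sketch below tries a few candidate encodings in order. The helper name `decode_with_fallbacks` and the sample bytes are assumptions made for the example, and the candidate order is a guess, not a rule.

```python
def decode_with_fallbacks(raw, encodings=('utf-8', 'cp1252', 'latin1')):
    """Decode raw with the first encoding that does not raise.

    Returns (text, encoding_used). Because latin1 maps every byte to a
    character, keeping it last means the loop always succeeds, but the
    result may still be wrong and should be inspected.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')


# Made-up sample containing the 0x93 / 0x94 bytes (curly quotes in cp1252)
text, used = decode_with_fallbacks(b'\x93quoted\x94')
print(repr(text), used)
```

Applied to the list from the question, this would look like `new_list = [decode_with_fallbacks(item)[0] for item in new_list]`. Success here only means that no error was raised, so the decoded text still needs a visual check.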
norok2
  • As I said, `latin1` will **ALWAYS** work (i.e. it never raises an error), but it may not give you a meaningful result, e.g. `'24 °C'.encode('utf8').decode('latin1')` will give you `'24 Â°C'`. Hence, make sure to inspect your results. – norok2 Jul 19 '19 at 13:07
  • When I ran into a situation where I was not sure about the encoding, I looped through a list of all the encodings and applied each of them inside a `try`-`except` block. Then I looked at which of them worked best. Not sure if this is considered a good approach, but it worked for me. – Constantine Ketskalo Jul 19 '19 at 18:54
  • @ConstantineKetskalo given that some will **ALWAYS** work, you need to find better validation approaches than catching the error. – norok2 Jul 19 '19 at 18:57
  • @norok2 probably yes. One time I just picked manually what worked, when I only needed to process 1 or 2 files once. Another time, when I had a lot of files, I wrote some code that does the same but selects the first encoding that produces text containing a certain string I expected there. Somebody might have an even better idea; I would be glad to hear it. – Constantine Ketskalo Jul 19 '19 at 19:01
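The validation idea discussed in these comments, accepting an encoding only if the decoded text contains a string you expect, could be sketched roughly as follows. The function name `pick_encoding`, the candidate list, and the sample bytes are all assumptions made for illustration.

```python
def pick_encoding(raw, expected, candidates=('utf-8', 'cp1252', 'latin1')):
    """Return the first candidate whose decoded text contains `expected`.

    Checking for known content is a stronger test than merely catching
    UnicodeDecodeError, since some encodings (latin1) never raise.
    Returns None if no candidate produces the expected text.
    """
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        if expected in text:
            return enc
    return None


# Made-up sample: '24 °C' stored as cp1252 bytes (b'24 \xb0C')
raw = '24 \u00b0C'.encode('cp1252')
print(pick_encoding(raw, '\u00b0C'))
```

The check is only as good as the expected string. Pure-ASCII markers decode identically under all three candidates, so a string containing a non-ASCII character tells the encodings apart much better.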
-1

Just use 'utf8' instead of 'utf-8'. Worked for me using Python 3.7 on Windows 10.

new_list = [b'Vanessa Skarski\'s Account of Her Father\'s Death....', b'Hornslet wind-turbine collapse\r\nFrom Wikipedia']

for item in new_list:
    decoded_item = item.decode('utf8')
    print(item)
    print(type(item))
    print(decoded_item)
    print(type(decoded_item))
    print()

output:

b"Vanessa Skarski's Account of Her Father's Death...."
<class 'bytes'>
Vanessa Skarski's Account of Her Father's Death....
<class 'str'>

b'Hornslet wind-turbine collapse\r\nFrom Wikipedia'
<class 'bytes'>
Hornslet wind-turbine collapse
From Wikipedia
<class 'str'>
  • I get the same error, UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 643: invalid start byte, even when using utf8 instead of utf-8 – JChat Jul 19 '19 at 12:54
  • 'utf-8' and 'utf8' refer to exactly the same encoding in Python - see the list [here](https://docs.python.org/3/library/codecs.html#standard-encodings) – snakecharmerb Jul 19 '19 at 13:57
  • Oh, ok then. Thanks, snakecharmerb. I think I once had a situation where 'utf-8' was wrong and 'utf8' was right to use. But I might be confusing things, because it was quite a while ago. – Constantine Ketskalo Jul 19 '19 at 18:55
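The point that both spellings name the same codec can be checked with the standard `codecs.lookup` function. A small sketch, where the list of aliases is chosen just for illustration:

```python
import codecs

# Each alias is resolved by the codec registry to one canonical codec name
aliases = ('utf-8', 'utf8', 'UTF-8', 'utf_8')
names = {alias: codecs.lookup(alias).name for alias in aliases}
print(names)
```

Since every alias resolves to the same codec, switching between 'utf8' and 'utf-8' cannot change whether a given bytes object decodes.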