UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte

I have a pandas DataFrame that I read with latin-1 encoding. From one column, which contains URLs, I use re.findall to extract a hex code from each URL. After stripping the 0x prefix I get what looks like a valid hex string. However, when I call bytes.fromhex(hex).decode('utf-8'), I get an "invalid continuation byte" error.

import re
import pandas as pd
import codecs
import binascii

df = pd.read_csv(file, encoding='latin-1', low_memory=False)

urls = df['g_maps_claimed']


def hex_to_string(hex):
    if hex[:3] == ':0x':
        hex = hex[3:].lower()
        print("Corrected1:",hex)
    elif hex[:2] == '0x':
        hex = hex[2:].lower()
        print("Corrected2:",hex)
    print(len(hex))
    # hex = hex.encode('utf-8').decode('latin-1')
    # string_value = codecs.decode(hex, 'hex').decode('utf-8')
    ascii_data = binascii.unhexlify(hex).decode('utf-8') # Converts the hex digits to bytes, then decodes them as UTF-8 (this is where the error is raised)
    print(ascii_data) # Prints the decoded text
    # string_value = bytes.fromhex(hex).decode('utf-8') #<--ERROR!
    # print("String value:",string_value)
    # return string_value

for url in urls:
    try:
        hexadecimal_id = re.findall(':0x[A-Z0-9]*', url)[0]
    except (IndexError, TypeError) as error:
        # IndexError: no match in the URL; TypeError: the cell is not a string (e.g. NaN)
        print(error, url)
        hexadecimal_id = ''
    print("Hexadecimal_id:",hexadecimal_id)
    hex_to_string(hexadecimal_id)


# ascii_data = binascii.unhexlify(hex) #Takes one line from line and converts it to ASCII
# print ascii_data #Prints the ascii on screen


I've tried reading the file with both latin-1 and ISO-8859-1; both produce the same error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte.

Example of what I get:

Hexadecimal_id: :0xE91DF6E4F947252C
Corrected1: e91df6e4f947252c

The value is of type str.
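The failure can be reproduced without the CSV, using only the example value above. This is a minimal sketch; the variable names are illustrative. It checks the first two bytes against the UTF-8 rule that a lead byte of the form 1110xxxx must be followed by continuation bytes of the form 10xxxxxx.

```python
# Example value taken from the output shown in the question.
hex_digits = "e91df6e4f947252c"
raw = bytes.fromhex(hex_digits)

# In UTF-8, 0xE9 (binary 1110 1001) announces a three-byte sequence, so the
# next two bytes must each be continuation bytes in the range 0x80-0xBF.
lead, following = raw[0], raw[1]
is_three_byte_lead = (lead & 0xF0) == 0xE0   # top four bits are 1110
is_continuation = (following & 0xC0) == 0x80  # top two bits are 10

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print("lead byte starts a 3-byte sequence:", is_three_byte_lead)
print("second byte is a continuation byte:", is_continuation)
print("decode error:", error_message)
```

Here the second byte is 0x1D, which is not a continuation byte, so these bytes are not valid UTF-8 regardless of how the CSV itself was read.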

I tried looking over other answers, but didn't find anything that would work for me. Any help would be appreciated!

  • The problem with URLs is that it is impossible to know the encoding (mostly it doesn't matter, because it is just a label given by the page author). So the URL escapes probably do not represent UTF-8. Replace `'utf-8'` in `binascii.unhexlify(hex).decode('utf-8')` with `'latin-1'`. In practice, the best approach is to try UTF-8 first and fall back to latin-1 if that fails, so you cover both cases – Giacomo Catenazzi Nov 17 '22 at 15:58
  • Why would you expect data imported as latin1 to then decode as UTF-8? `E9 1D` is definitely an invalid UTF-8 byte sequence. This question would be less confusing if you provided the content of the CSV and made a [mcve]. – Mark Tolonen Nov 17 '22 at 21:58
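The fallback suggested in the first comment (try UTF-8, then latin-1) could be sketched roughly as follows. The helper name `decode_hex_text` is hypothetical and not part of the question's code; it assumes the input is a string of hex digits with any `0x` prefix already removed.

```python
def decode_hex_text(hex_digits: str) -> str:
    """Decode hex digits as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    raw = bytes.fromhex(hex_digits)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00-0xFF to a code point, so this never
        # raises, but the result is only meaningful if the bytes really are
        # Latin-1 text.
        return raw.decode("latin-1")


valid_utf8 = decode_hex_text("c3a9")             # UTF-8 encoding of "é"
fallback = decode_hex_text("e91df6e4f947252c")   # example value from the question
print(repr(valid_utf8), repr(fallback))
```

For the value from the question, the latin-1 fallback succeeds but the result contains the control character 0x1D, which is consistent with the second comment's point that these bytes may not be encoded text in the first place.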
