Trying to load a CSV file with Hebrew text but getting gibberish

Question

I'm trying to load a CSV file into my Jupyter Notebook. I managed to load the file, but some columns hold text in Hebrew, and that text is loaded as gibberish.

The code I used is the following:

import pandas as pd

cars = pd.read_csv (r'C:\Users\MyName\Folder\number_of_cars.csv',encoding='cp862',sep='|')

I tried a few different encodings that work with Hebrew, such as cp424 / cp856 / cp1255 / iso8859_8, but got this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x73 in position 2: character maps to <undefined>

The only encodings that worked were cp862 and latin-1 (not sure if latin-1 even works with Hebrew), but both return gibberish instead of Hebrew text.

Edit: also tried utf-8 and got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 309: invalid continuation byte

I'm not a programmer; my experience with Python is limited to analyzing data.

You can view the data set here: https://data.gov.il/dataset/private-and-commercial-vehicles/resource/053cea08-09bc-40ec-8f7a-156f0677aff3

yes and got error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 309: invalid continuation byte — Liad Traube, Aug 10 '23 at 13:56
Welcome to SO. It would be useful in this case to post a small sample of your text or similar text. Remember not to include personal information. — MyICQ, Aug 10 '23 at 14:58
It's more useful to post the raw bytes of the file than to post the file after it's been decoded with an incorrect codec. Can you post the hexadecimal representation of the first line of data? — Nick ODell, Aug 10 '23 at 15:37

score 0 · Answer 1 · answered Aug 10 '23 at 20:27

This particular file (the download version) is encoded in Windows-1255 (ANSI code page 1255), so your cp1255 should work. But! The file contains 3 errors that prevent it from being decoded correctly in that code page. Example: at 06161062, after 1KD, there is the byte 0x9F, which cp1255 does not define.
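If you want to see where such bytes sit, here is a minimal sketch that scans raw bytes and reports every offset that cp1255 cannot decode. The function name `find_undecodable` and the sample bytes are made up for illustration; for the real file you would pass in the bytes read with `open(path, "rb").read()`.

```python
def find_undecodable(data: bytes, encoding: str = "cp1255"):
    """Return a list of (offset, byte_value) for bytes the codec rejects."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first invalid byte.
            data[pos:].decode(encoding)
            break  # the rest of the data decoded cleanly
        except UnicodeDecodeError as exc:
            offset = pos + exc.start  # exc.start is relative to the slice
            bad.append((offset, data[offset]))
            pos = offset + 1  # continue scanning after the bad byte
    return bad


# Illustrative sample: four Hebrew letters, one invalid byte, then ASCII.
sample = "שלום".encode("cp1255") + b"\x9f" + b"abc"
for offset, value in find_undecodable(sample):
    print(f"offset {offset}: byte 0x{value:02X}")
```

Each retry slices the remaining data, which is fine for a handful of bad bytes but would be slow on a file with many of them.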

You can handle conversion errors in Python.

Useful information is available in read_csv documentation.

See a list of ways to handle encoding errors

The following slight change works here. Choose the error handler you see fit.

import pandas as pd

cars = pd.read_csv (r'hebrew__cars__download.csv',
           encoding='cp1255',
           sep='|',
           encoding_errors='backslashreplace')

# --- read some info
print("Shape: ", cars.shape)

#  Output for download:
#     Shape:  (3781124, 23)
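To help pick a handler, here is a small sketch of what the standard Python error handlers do with an invalid byte. It uses a made-up byte string (a Hebrew word plus the 0x9F byte mentioned above) rather than the real file; the same handler names can be passed to `encoding_errors` in `read_csv`.

```python
# Illustrative sample: a Hebrew word followed by a byte cp1255 does not define.
raw = "שלום".encode("cp1255") + b"\x9f"

# Decode the same bytes with three non-strict error handlers.
results = {
    handler: raw.decode("cp1255", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace")
}

for handler, text in results.items():
    print(f"{handler:>16}: {text!r}")

# The default handler, "strict", raises instead of returning text.
try:
    raw.decode("cp1255", errors="strict")
except UnicodeDecodeError as exc:
    print(f"{'strict':>16}: {exc}")
```

`ignore` silently drops the byte, `replace` substitutes U+FFFD, and `backslashreplace` keeps the original value visible as `\x9f`, which makes the affected rows easy to find afterwards.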
