
I'm trying to load a CSV file into my Jupyter Notebook. The file loads, but some columns contain Hebrew text, and that text comes out as gibberish.

The code I used is the following:

import pandas as pd

cars = pd.read_csv (r'C:\Users\MyName\Folder\number_of_cars.csv',encoding='cp862',sep='|')

I tried a few different encodings that work with Hebrew, such as cp424 / cp856 / cp1255 / iso8859_8, but got this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x73 in position 2: character maps to <undefined>

The only encodings that did not raise an error were cp862 and latin-1 (not sure if latin-1 even works with Hebrew), but both return gibberish instead of Hebrew text.

Edit: I also tried utf-8 and got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 309: invalid continuation byte

I'm not a programmer; my experience with Python is limited to analyzing data.

You can view the dataset here: https://data.gov.il/dataset/private-and-commercial-vehicles/resource/053cea08-09bc-40ec-8f7a-156f0677aff3

1 Answer


This particular file (download version) is encoded in ANSI code page 1255 (Windows-1255), so your cp1255 should work. But the file has 3 errors that prevent it from being decoded correctly in that code page. Example: at 06161062, after 1KD, there is a byte 0x9F.
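To see where such bytes sit, one option is to scan the raw bytes and record every position that cp1255 refuses to decode. This is only a minimal sketch, not necessarily how the three errors above were found; `find_undecodable` is a hypothetical helper name, and the sample bytes are made up for illustration.

```python
def find_undecodable(data, encoding="cp1255"):
    """Return (offset, byte value) for every byte that `encoding` cannot decode."""
    bad = []
    offset = 0
    while offset < len(data):
        try:
            data[offset:].decode(encoding)
            break  # everything from `offset` onward decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice that was just decoded
            position = offset + exc.start
            bad.append((position, data[position]))
            offset = position + 1  # resume just past the offending byte
    return bad


# Made-up sample: two Hebrew letters, an invalid byte, ASCII text, another invalid byte.
sample = "אב".encode("cp1255") + b"\x9f" + b"1KD" + b"\x9f"
for position, value in find_undecodable(sample):
    print(f"offset {position}: byte 0x{value:02X}")
```

For the real file you would pass in the result of `open(path, "rb").read()`, where `path` is wherever you saved the download. Note that this reads the whole file into memory and copies the remainder after each error, which is acceptable for a handful of errors but not for thousands.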

You can handle conversion errors in Python.

Useful information is available in the read_csv documentation.

See a list of ways to handle encoding errors
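As a rough illustration of how the standard Python handlers differ, the sketch below decodes a short made-up byte string ending in 0x9F, a byte that cp1255 leaves undefined. The handler names shown are Python's built-in ones; the sample text is an assumption chosen for the demo.

```python
# Made-up sample: the Hebrew word "שלום" in cp1255, followed by an invalid byte.
raw = "שלום".encode("cp1255") + b"\x9f"

# The default handler, "strict", raises on the first undecodable byte.
try:
    raw.decode("cp1255")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict raised:", exc.reason)

# The other handlers never raise; they differ in what replaces the bad byte.
results = {
    handler: raw.decode("cp1255", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace")
}

for handler, text in results.items():
    # ascii() escapes non-ASCII characters so the output is safe on any console.
    print(f"{handler:>16}: {ascii(text)}")
```

"ignore" silently drops the byte, "replace" substitutes U+FFFD, and "backslashreplace" keeps the original value visible as a literal `\x9f`, which makes the damaged rows easy to search for afterwards.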

The following slight change works here. Choose whichever error handler suits you best. Note that the encoding_errors parameter requires pandas 1.3 or later.

import pandas as pd

cars = pd.read_csv(r'hebrew__cars__download.csv',
                   encoding='cp1255',
                   sep='|',
                   encoding_errors='backslashreplace')

# --- read some info
print("Shape: ", cars.shape)

#  Output for download:
#     Shape:  (3781124, 23)
MyICQ