
I am trying to read the Expedia data from Kaggle, which comes as a 4GB CSV file. I first tried reading it with pd.read_csv('filename') and got a memory error. As a second approach, I tried reading only particular columns using the code:

pd.read_csv('train.csv', dtype={'date_time': str, 'user_location_country': np.int32, 'user_location_region': np.int32, 'user_location_city': np.int32, 'orig_destination_distance': np.float64, 'user_id': np.int32})

This again gives me a memory error, but the following modification of the same method:

train = pd.read_csv('train.csv', dtype={'user_id': np.int32, 'is_booking': bool, 'srch_destination_id': np.int32, 'hotel_cluster': np.int32}, usecols=['date_time', 'user_id', 'srch_ci', 'srch_co', 'srch_destination_id', 'is_booking', 'hotel_cluster'])

reads the data in about 5 minutes.

My problem is that I want to read more columns using either of these methods, but both fail with a memory error. I am using 8GB of RAM with 8GB of swap space, and reading only 7-8 of the 24 columns in the data should bring the data size down to around 800MB, so hardware usage should not be the issue. I have also tried reading in chunks, which I want to avoid because of the algorithms I am going to run on the data later.
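
For reference, a minimal sketch of the direction I am trying to go in, combining usecols with explicit dtypes for more columns; the column names are from the dataset, but the 32-bit integer widths are assumptions that would need checking against the actual value ranges:

import numpy as np
import pandas as pd

# Illustrative only: narrow integer widths are assumptions, not verified limits.
cols = ['date_time', 'user_location_country', 'user_location_region',
        'user_location_city', 'user_id', 'srch_ci', 'srch_co',
        'srch_destination_id', 'is_booking', 'hotel_cluster']
dtypes = {'user_location_country': np.int32,
          'user_location_region': np.int32,
          'user_location_city': np.int32,
          'user_id': np.int32,
          'srch_destination_id': np.int32,
          'is_booking': np.int8,
          'hotel_cluster': np.int32}

# Parsing date_time up front stores it as datetime64 (8 bytes per value)
# instead of Python string objects, which cuts memory further.
train = pd.read_csv('train.csv', usecols=cols, dtype=dtypes,
                    parse_dates=['date_time'])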

Samyak Upadhyay

1 Answer


Unfortunately, reading a CSV file requires more memory than its size on the disk; how much more depends mainly on the column dtypes, with string columns stored as Python objects being the most expensive.
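
As a rough way to see this for your own file, you can load a bounded sample and inspect its real in-memory footprint; memory_usage(deep=True) includes the overhead of object (string) columns, which is usually where the blow-up comes from:

import pandas as pd

# Read a bounded sample so this always fits in memory, then check
# how much RAM those rows actually occupy and which dtypes pandas inferred.
sample = pd.read_csv('train.csv', nrows=1000000)
print(sample.memory_usage(deep=True).sum() / 1e6, 'MB for 1,000,000 rows')
print(sample.dtypes)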

You can find an alternative way to process your file here.
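
One common pattern for this kind of alternative (a sketch, not necessarily what the linked page shows) is to stream the file with chunksize and keep only the rows or aggregates you need, so the full table never sits in memory at once:

import pandas as pd

usecols = ['date_time', 'user_id', 'srch_destination_id',
           'is_booking', 'hotel_cluster']

# Stream the file a million rows at a time; the is_booking filter is just
# an example of shrinking each chunk before the pieces are concatenated.
pieces = []
for chunk in pd.read_csv('train.csv', usecols=usecols, chunksize=1000000):
    pieces.append(chunk[chunk['is_booking'] == 1])

bookings = pd.concat(pieces, ignore_index=True)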