
I am trying to read the Expedia data from Kaggle, which comes as a 4GB CSV file. I first tried reading it with pd.read_csv('filename') and got a memory error. As a second approach, I tried reading only particular columns using the code:

pd.read_csv('train.csv', dtype={'date_time': str, 'user_location_country': np.int32, 'user_location_region': np.int32, 'user_location_city': np.int32, 'orig_destination_distance': np.float64, 'user_id': np.int32})

This again gives me a memory error, but the following modification of the same method:

train = pd.read_csv('train.csv', dtype={'user_id': np.int32, 'is_booking': bool, 'srch_destination_id': np.int32, 'hotel_cluster': np.int32}, usecols=['date_time', 'user_id', 'srch_ci', 'srch_co', 'srch_destination_id', 'is_booking', 'hotel_cluster'])

reads the data in about 5 minutes.

My problem is that I want to read more columns using either of these methods, but both fail with a memory error. I am using 8GB of RAM with 8GB of swap space, and reading only 7-8 of the 24 columns in the data should bring the data size down to around 800MB, so hardware usage should not be the issue. I have also tried reading in chunks, which I want to avoid because of the algorithms I am going to run on the data later.
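
For reference, a minimal sketch of the direction I am trying to go in, combining usecols with explicit dtypes for more columns; the column names are from the dataset, but the 32-bit integer widths are assumptions that would need checking against the actual value ranges:

import numpy as np
import pandas as pd

# Illustrative only: narrow integer widths are assumptions, not verified limits.
cols = ['date_time', 'user_location_country', 'user_location_region',
        'user_location_city', 'user_id', 'srch_ci', 'srch_co',
        'srch_destination_id', 'is_booking', 'hotel_cluster']
dtypes = {'user_location_country': np.int32,
          'user_location_region': np.int32,
          'user_location_city': np.int32,
          'user_id': np.int32,
          'srch_destination_id': np.int32,
          'is_booking': np.int8,
          'hotel_cluster': np.int32}

# Parsing date_time up front stores it as datetime64 (8 bytes per value)
# instead of Python string objects, which cuts memory further.
train = pd.read_csv('train.csv', usecols=cols, dtype=dtypes,
                    parse_dates=['date_time'])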

Samyak Upadhyay

1 Answer


Unfortunately, reading a CSV file requires more memory than its size on the disk; how much more depends mainly on the column dtypes, with string columns stored as Python objects being the most expensive.
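
As a rough way to see this for your own file, you can load a bounded sample and inspect its real in-memory footprint; memory_usage(deep=True) includes the overhead of object (string) columns, which is usually where the blow-up comes from:

import pandas as pd

# Read a bounded sample so this always fits in memory, then check
# how much RAM those rows actually occupy and which dtypes pandas inferred.
sample = pd.read_csv('train.csv', nrows=1000000)
print(sample.memory_usage(deep=True).sum() / 1e6, 'MB for 1,000,000 rows')
print(sample.dtypes)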

You can find an alternative way to process your file here.
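
One common pattern for this kind of alternative (a sketch, not necessarily what the linked page shows) is to stream the file with chunksize and keep only the rows or aggregates you need, so the full table never sits in memory at once:

import pandas as pd

usecols = ['date_time', 'user_id', 'srch_destination_id',
           'is_booking', 'hotel_cluster']

# Stream the file a million rows at a time; the is_booking filter is just
# an example of shrinking each chunk before the pieces are concatenated.
pieces = []
for chunk in pd.read_csv('train.csv', usecols=usecols, chunksize=1000000):
    pieces.append(chunk[chunk['is_booking'] == 1])

bookings = pd.concat(pieces, ignore_index=True)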