I am trying to read the Expedia data from Kaggle, which comes as a 4 GB CSV file. I first tried reading it with pd.read_csv('filename')
and got a MemoryError. As a second approach, I tried reading only particular columns with explicit dtypes:
pd.read_csv('train.csv', dtype={'date_time': str, 'user_location_country': np.int32, 'user_location_region': np.int32, 'user_location_city': np.int32, 'orig_destination_distance': np.float64, 'user_id': np.int32})
This again gives a MemoryError, but another variant of the same method, which also restricts the columns with usecols:
train = pd.read_csv('train.csv', dtype={'user_id': np.int32, 'is_booking': bool, 'srch_destination_id': np.int32, 'hotel_cluster': np.int32}, usecols=['date_time', 'user_id', 'srch_ci', 'srch_co', 'srch_destination_id', 'is_booking', 'hotel_cluster'])
reads the data in about 5 minutes.
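To check how much memory the subset that does load actually takes, I inspect it with pandas' own accounting (memory_usage='deep' includes the object/string columns):

train.info(memory_usage='deep')  # per-column dtypes plus the true total in-memory size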
My problem is that I want to read more columns, but both methods fail with a MemoryError. I am using 8 GB of RAM with 8 GB of swap space, and reading only 7-8 of the 24 columns should shrink the data to around 800 MB, so hardware should not be the limiting factor.
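Concretely, this is roughly the call I want to work. The column names come from the Kaggle train.csv; the dtypes are my guesses at the smallest safe ones:

import numpy as np
import pandas as pd

# Roughly the read I want: the earlier subset plus the location columns,
# with the smallest dtypes I think are safe for each.
cols = ['date_time', 'user_id', 'user_location_country',
        'user_location_region', 'user_location_city',
        'orig_destination_distance', 'srch_ci', 'srch_co',
        'srch_destination_id', 'is_booking', 'hotel_cluster']
train = pd.read_csv('train.csv', usecols=cols,
                    dtype={'user_id': np.int32,
                           'user_location_country': np.int32,
                           'user_location_region': np.int32,
                           'user_location_city': np.int32,
                           # float32 halves the cost of the distance column
                           'orig_destination_distance': np.float32,
                           'srch_destination_id': np.int32,
                           'is_booking': bool,
                           'hotel_cluster': np.int32})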
I also tried reading the file in chunks, but I don't want to do that, because the algorithms I am going to apply later need the whole dataset in memory at once.
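For reference, the chunked version I tried looked roughly like this (the chunksize is arbitrary). Processing chunk by chunk works, but stitching the pieces back together needs about as much memory as a direct read, which is why it doesn't solve my problem:

import pandas as pd

chunks = []
for chunk in pd.read_csv('train.csv', chunksize=1_000_000):
    chunks.append(chunk)  # each chunk is an ordinary DataFrame of up to 1e6 rows
train = pd.concat(chunks, ignore_index=True)  # needs roughly the same memory as a direct read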