So I have a couple of Excel files totaling 1.8 GB so far, and the data set keeps growing. All of the files have the same columns, and a file may share some overlapping rows with the other files. Currently I have to read all of the files into memory (which is slow, and soon will be impossible because of my PC's RAM limit). I am using the following two methods, but both are equally memory-inefficient and give essentially the same result:
# data_dict maps each Excel file to the DataFrame read from it with pd.read_excel

# Method 1: concatenate every DataFrame at once
all_data = pd.concat(data_dict.values(), ignore_index=True)

# Method 2: concatenate one file at a time, dropping duplicates after each merge
all_data = pd.DataFrame()
for df in data_dict.values():
    all_data = pd.concat([all_data, df]).drop_duplicates().reset_index(drop=True)
So I was wondering: is there a way to avoid reading all the data into memory for the comparison, and ideally to limit how much memory pandas uses? Speed is not a big concern for me, but memory is, and I want the code to stay usable as the data keeps growing. Any suggestions?
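One direction I was considering is a rough sketch like the one below: read one file at a time, keep only a set of per-row hashes in memory to detect duplicates across files, and append the new rows to a single on-disk file instead of holding everything in a DataFrame. The folder name, output path, and the choice of pd.util.hash_pandas_object for row hashing are just my own placeholders/assumptions, not something I have settled on:

import pandas as pd
from pathlib import Path

data_dir = Path("excel_files")   # placeholder: folder containing the .xlsx files
out_path = "combined.csv"        # deduplicated rows accumulate here on disk

seen_hashes = set()   # one 64-bit hash per row already written
first_file = True

for path in sorted(data_dir.glob("*.xlsx")):
    df = pd.read_excel(path)   # only one file is in memory at a time
    row_hashes = pd.util.hash_pandas_object(df, index=False)
    # keep rows not seen in earlier files and not repeated within this file
    keep = ~(row_hashes.isin(seen_hashes) | row_hashes.duplicated())
    seen_hashes.update(row_hashes[keep])
    df[keep].to_csv(out_path, mode="w" if first_file else "a",
                    header=first_file, index=False)
    first_file = False

I realize this trades exact row comparison for hash comparison (collisions are theoretically possible, though unlikely) and only keeps the hash set in memory rather than the rows themselves, so I am not sure whether this is a sensible approach or whether there is a better-established way to do it.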