I'm trying to read 10 CSV files into pandas DataFrames with the read_csv() function, but I keep getting the following error: "MemoryError: Unable to allocate 207. MiB for an array with shape (10, 2718969) and data type int64".
Eight of the CSV files are around 1-3 KB, one is 11,819 KB, and the last is 99,694 KB. The eight small files are essentially lookup tables, while the 99,694 KB file is the main file.
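Roughly what I'm running looks like this (file names below are placeholders for my actual paths):

```python
import pandas as pd

# Main fact table (~99,694 KB) plus the small lookup tables;
# "table1_main.csv" and "lookup_*.csv" stand in for my real file names
main_df = pd.read_csv("table1_main.csv")                        # ~2.7 million rows
lookups = [pd.read_csv(f"lookup_{i}.csv") for i in range(1, 9)]  # the 8 small files
```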
I also have to merge/join these files into one file based on a few conditions. For example, the 99,694 KB file (let's call it Table 1) has the following rows:
One of the smaller lookup files (Table 2) has this information:
I'm trying to merge the files on the SId column of Table 1 and the SId column of Table 2. I tried using MS Access to do this and got an "Overflow" error.
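This is roughly the join I'm after (apart from SId, the file names are made up for illustration):

```python
import pandas as pd

# Hypothetical file names; SId is the real join key in my data
table1 = pd.read_csv("table1_main.csv")   # main file, ~2.7 million rows
table2 = pd.read_csv("lookup_1.csv")      # one of the small lookup tables

# Left join so every row of the main table keeps its lookup attributes
merged = table1.merge(table2, on="SId", how="left")
```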
Is there any better way to do this?
I was able to use Dask to join the tables, but the problem is that the main file has more than 2 million rows. When I called df.head(1) to see just the first row of the final combined DataFrame, Dask threw a MemoryError. I also tried saving it as a CSV and again got a MemoryError.
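For reference, this is roughly what I tried with Dask (paths are simplified and I've only shown one lookup join):

```python
import dask.dataframe as dd

# Lazily read the main file and one of the small lookup files
table1 = dd.read_csv("table1_main.csv")
table2 = dd.read_csv("lookup_1.csv")

# Same left join as in pandas, but evaluated lazily by Dask
combined = table1.merge(table2, on="SId", how="left")

combined.head(1)                    # MemoryError here
combined.to_csv("combined-*.csv")   # and also here
```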
I'm trying to use this dataset for some EDA and, hopefully, classification, but I don't think I'll be able to do that with the full dataset.
In such cases, is it better to take a sample of the data for EDA and ML, or is there a better approach?
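If sampling is the way to go, something like this is what I have in mind, so that only the sample ever hits memory (the row count and fraction here are just for illustration):

```python
import random
import pandas as pd

# Randomly skip ~90% of data rows while reading the main file,
# keeping the header (row 0) and a ~10% sample of the data
n_rows = 2_718_969      # approximate row count of the main file
keep_frac = 0.1
skip = sorted(random.sample(range(1, n_rows + 1),
                            int(n_rows * (1 - keep_frac))))
sample = pd.read_csv("table1_main.csv", skiprows=skip)
```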