I'm trying to read 10 CSV files into pandas DataFrames with the read_csv() function, but I keep getting the following error: "MemoryError: Unable to allocate 207. MiB for an array with shape (10, 2718969) and data type int64".
Eight of the CSV files are around 1-3 KB, one is 11,819 KB, and the last is 99,694 KB. The eight small files are essentially lookup tables, while the 99,694 KB file is the main file.
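Roughly what I'm running looks like this (file names below are placeholders for my actual paths):

```python
import pandas as pd

# Main fact table (~99,694 KB) plus the small lookup tables;
# "table1_main.csv" and "lookup_*.csv" stand in for my real file names
main_df = pd.read_csv("table1_main.csv")                        # ~2.7 million rows
lookups = [pd.read_csv(f"lookup_{i}.csv") for i in range(1, 9)]  # the 8 small files
```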
I also have to merge/join these files into one file based on a few conditions. For example, the 99,694 KB file (let's call it Table 1) has the following rows:
One of the smaller lookup files (Table 2) has this information:
I'm trying to merge the files on the SId column of Table 1 and the SId column of Table 2. I tried using MS Access to do this and got an "Overflow" error.
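This is roughly the join I'm after (apart from SId, the file names are made up for illustration):

```python
import pandas as pd

# Hypothetical file names; SId is the real join key in my data
table1 = pd.read_csv("table1_main.csv")   # main file, ~2.7 million rows
table2 = pd.read_csv("lookup_1.csv")      # one of the small lookup tables

# Left join so every row of the main table keeps its lookup attributes
merged = table1.merge(table2, on="SId", how="left")
```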
Is there any better way to do this?
I was able to use Dask to join the tables, but the problem is that the main file has more than 2 million rows. When I called df.head(1) to see just the first row of the final combined DataFrame, Dask threw a MemoryError. I also tried saving it as a CSV and again got a MemoryError.
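For reference, this is roughly what I tried with Dask (paths are simplified and I've only shown one lookup join):

```python
import dask.dataframe as dd

# Lazily read the main file and one of the small lookup files
table1 = dd.read_csv("table1_main.csv")
table2 = dd.read_csv("lookup_1.csv")

# Same left join as in pandas, but evaluated lazily by Dask
combined = table1.merge(table2, on="SId", how="left")

combined.head(1)                    # MemoryError here
combined.to_csv("combined-*.csv")   # and also here
```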
I'm trying to use this dataset for some EDA and, hopefully, classification, but I don't think I'll be able to do that with the full dataset.
In such cases, is it better to take a sample of the data for EDA and ML, or is there a better approach?
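If sampling is the way to go, something like this is what I have in mind, so that only the sample ever hits memory (the row count and fraction here are just for illustration):

```python
import random
import pandas as pd

# Randomly skip ~90% of data rows while reading the main file,
# keeping the header (row 0) and a ~10% sample of the data
n_rows = 2_718_969      # approximate row count of the main file
keep_frac = 0.1
skip = sorted(random.sample(range(1, n_rows + 1),
                            int(n_rows * (1 - keep_frac))))
sample = pd.read_csv("table1_main.csv", skiprows=skip)
```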