
I have sensor data recorded over a timespan of one year. The data is stored in twelve chunks, with 1000 columns and ~1000000 rows each. I have worked out a script to concatenate these chunks into one large file, but about halfway through the execution I get a MemoryError. (I am running this on a machine with ~70 GB of usable RAM.)

import gc
from os import listdir
import pandas as pd

path = "/slices02/hdf/"
slices = listdir(path)
res = pd.DataFrame()

for sl in slices:
    temp = pd.read_hdf(path + sl)
    # res grows by the columns of every chunk read so far
    res = pd.concat([res, temp], sort=False, axis=1)
    del temp
    gc.collect()
res.fillna(method="ffill", inplace=True)
res.to_hdf(path + "sensor_data_cpl.hdf", "online", mode="w")

I have also tried to fiddle with HDFStore so that I do not have to load all the data into memory (see Merging two tables with millions of rows in Python), but I could not figure out how to make that work in my case.
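
For reference, the row-wise append pattern from that answer looks roughly like the sketch below. It is only a sketch: it assumes the chunks share the same columns and can be stacked row-wise (rather than joined along axis=1 as above), and that writing the result in table format is acceptable.

import os
import pandas as pd

path = "/slices02/hdf/"
slices = os.listdir(path)  # list the input chunks before the output file exists

# table-format HDF supports appending, so only one chunk is in memory at a time
with pd.HDFStore(path + "sensor_data_cpl.hdf", mode="w") as store:
    for sl in slices:
        temp = pd.read_hdf(path + sl)
        store.append("online", temp, format="table")
        del temp

The forward fill would still have to be handled separately, and the pattern does not apply if the chunks really need to be joined column-wise.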

wallisers
  • From what I've heard, pandas is not very strong at handling these amounts of data (not speaking from my own experience or knowledge, though). A quick search brought up this hint, for example: https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c – Yuri Feldman Jan 14 '20 at 08:22
  • I concur with Yuri. `pandas` isn't adequate for this amount of data. You might want to move toward solutions like `pyspark` using `parquet` format files. – eliasah Jan 14 '20 at 08:45

2 Answers


When you read a csv into a pandas DataFrame, the process can peak at up to twice the memory the final DataFrame needs (because of type guessing and all the automatic conversions pandas tries to provide).

Several methods to fight that:

  1. Use chunks. I see that your data is already in chunks, but maybe those are too big, so you can read each file in chunks using the chunksize parameter of pandas.read_hdf or pandas.read_csv (a sketch follows after this list).

  2. Provide dtypes to avoid type guessing and mixed types (e.g. a column of strings with null values will end up with a mixed type); this works together with the low_memory parameter of pandas.read_csv.
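
A rough sketch of both ideas combined is shown below. It uses pandas.read_csv for illustration, because pandas.read_hdf only supports chunksize for table-format files and has no dtype argument; the file name and column names are made up.

import pandas as pd

# hypothetical file and column names, purely for illustration
dtypes = {"sensor_a": "float32", "sensor_b": "float32", "status": "int8"}

reader = pd.read_csv(
    "slice_01.csv",
    dtype=dtypes,       # no type guessing, compact types from the start
    chunksize=100_000,  # yields DataFrames of 100k rows instead of the whole file
)

for chunk in reader:
    # process (or append) each piece separately so the full file never sits in RAM
    print(chunk.memory_usage(deep=True).sum())
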

If this is not sufficient, you'll have to turn to distributed technologies like pyspark, dask, modin, or even pandarallel.
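
For example, a minimal dask sketch might look like the following; the glob pattern and HDF key are placeholders, and as far as I know dask.dataframe.read_hdf needs the chunks to be stored in table format.

import dask.dataframe as dd

# glob pattern and key are placeholders for the real chunk files
df = dd.read_hdf("/slices02/hdf/slice_*.hdf", key="data")

df = df.ffill()  # evaluated lazily, partition by partition
df.to_hdf("/slices02/hdf/sensor_data_cpl.hdf", key="online")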

Luis Blanche
  • I wrote a function to convert the `dtypes` of the columns. The `dtype` of my original `DataFrames` was `float64`. The function loops over all columns. For each column it tries to convert it to integer format (`int8` to `int64` depending on the values). If that does not work, `dtype` is converted to `float32`. That way I could reduce the size of the file chunks by almost 60%. Thanks for the hint. – wallisers Jan 20 '20 at 07:36
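
A sketch of the kind of per-column conversion described in this comment (the exact rules here are an assumption, not the commenter's actual function):

import pandas as pd

def downcast_columns(df):
    """Shrink float64 columns: use the smallest integer type that holds the
    values exactly, otherwise fall back to float32."""
    for col in df.columns:
        as_int = pd.to_numeric(df[col], downcast="integer")
        if pd.api.types.is_integer_dtype(as_int):
            df[col] = as_int                     # whole numbers fit into int8..int64
        else:
            df[col] = df[col].astype("float32")  # lossy, but half the memory
    return df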

When you have this much data, avoid creating temporary dataframes, as they take up memory too. Try doing it in one pass:

import os
import pandas as pd

folder = "/slices02/hdf/"
files = [os.path.join(folder, file) for file in os.listdir(folder)]
res = pd.concat((pd.read_hdf(file) for file in files), sort=False)

See how this works for you.

NotAName
  • This is what I tried at first. It also gives me a `MemoryError`. Because of that I tried to read the files one by one and clean up memory with `gc` in between. – wallisers Jan 14 '20 at 09:40