I get the error below when I try to run pd.melt().
I checked this post and tried to modify the code accordingly, but I still got the error. (LINK)

Here is my original code:

melted = pd.melt(df, ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value').sort_values('ID')

After modifying:

pivot_list = []
chunk_size = 100000
for i in range(0, len(df), chunk_size):
    # Melt one chunk at a time to keep peak memory lower
    row_pivot = pd.melt(df.iloc[i:i+chunk_size], ['ID', 'Col2', 'Col3', 'Year'],
                        var_name='New_Var', value_name='Value')
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list).sort_values('ID')

But I still get the same error:
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/path/Current_Proj/Main_Dir/Python_Program.py", line 122, in My_Function
    melted = pd.concat(pivot_list).sort_values('ID')
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
    new_data = concatenate_managers(
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
    values = _concatenate_join_units(join_units, concat_axis, copy=copy)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
    to_concat = [
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
    ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 466, in get_reindexed_values
    values = algos.take_nd(values, indexer, axis=ax)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 108, in take_nd
    return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 149, in _take_nd_ndarray
    out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/Current_Proj/Main_Dir/Python_Program.py", line 222, in <module>
    result = pool.starmap(My_Function, zip(arg1, arg2, arg3))
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
numpy.core._exceptions.MemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object

I think the main issue comes from the melt() and concat() parts.
Any ideas on how to deal with this would be appreciated.

Peter Chen

1 Answer

Usually, a "MemoryError: unable to allocate" falls into the "user error" category: it means you have requested a reshape operation that is simply too large to fit into memory.

pd.melt is a memory-intensive operation. Not only does it create new arrays for all the elements in your data, it also reshapes your data into a less efficient format, duplicating the current id values many times over. The result, and the memory penalty, depend on the structure of your data and the number of value columns.
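
As a tiny illustration (the column names are made up), melting a frame with one id column and three value columns turns 9 data cells into a 9-row frame in which every id value appears once per value column:

import pandas as pd

# A small wide frame: 3 rows, 1 id column, 3 value columns (9 data cells)
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "A": [10, 20, 30],
    "B": [40, 50, 60],
    "C": [70, 80, 90],
})

melted = pd.melt(df, id_vars=["ID"], var_name="variable", value_name="value")
print(melted.shape)  # (9, 3): each ID is repeated once per value column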

Give the pandas docs on reshaping by melt a close read, and calculate whether you can afford to create an array of all the elements in your id_vars columns, repeated for every column specified by value_vars.

As an example, if your dataframe has 1M rows and 1000 columns, with all cells float32, it takes up approximately 4 GB in memory. If you then melt with 4 id_vars, you get 4 * 1M id cells, each repeated once per value column (996 times): 4 * 1e6 * 996, or roughly 4 Bn cells for the index. On top of that you get a column with 1e6 * 996 "variables" and another with the same number of "values". The exact footprint also depends on the length and dtype of your column names and cell values, but this simple example already comes to roughly a 23 GB array even if every value were a relatively compact float32.
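
You can run this back-of-envelope arithmetic before committing to the melt. Here's a rough sketch (estimate_melt_bytes is a made-up helper, and 4 bytes per cell is the optimistic float32 floor; object cells cost more):

def estimate_melt_bytes(n_rows, n_id_vars, n_value_vars, bytes_per_cell=4):
    """Rough lower bound on the melted frame's in-memory size."""
    melted_rows = n_rows * n_value_vars  # one row per (row, value column) pair
    melted_cols = n_id_vars + 2          # id_vars plus 'variable' and 'value'
    return melted_rows * melted_cols * bytes_per_cell

# The example above: 1M rows, 4 id_vars, 996 value columns
print(estimate_melt_bytes(1_000_000, 4, 996) / 1e9)  # ~23.9 GB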

Melt is a helpful convenience function for reshaping small dataframes. If your dataframe is anywhere near the size in this example, I'd mostly suggest you don't do this. If you really do need to reshape this way, you need to get serious about understanding the operation and chunking the data in a way that is tailored to your data's size. Write the data out iteratively rather than attempting to concatenate it all at the end, as in the sketch below. This isn't something that will work out of the box - expect some trial and error. You could also look into out-of-core computation tools: dask.dataframe has a port of melt which can leverage multiple cores and write to disk in parallel.
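
Here's a minimal sketch of the "write out iteratively" idea, adapted from the chunked code in your question (the output path and chunk size are arbitrary, and this drops the final sort, since sorting the full result in memory would hit the same wall):

import pandas as pd

chunk_size = 100_000
id_vars = ['ID', 'Col2', 'Col3', 'Year']
out_path = 'melted.csv'  # hypothetical output file; delete any stale copy first

# df is the original wide dataframe; append each melted chunk to disk
# instead of holding all the pieces for a final pd.concat
for i in range(0, len(df), chunk_size):
    chunk = pd.melt(df.iloc[i:i + chunk_size], id_vars=id_vars,
                    var_name='New_Var', value_name='Value')
    chunk.to_csv(out_path, mode='a', index=False, header=(i == 0))

If you still need the result ordered by ID, do that sort out of core afterwards (for example with dask.dataframe's sort_values or set_index) rather than in pandas.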

Michael Delgado