
I have a dataset of around 270 MB, and I use the following to write it to a Feather file:

df.reset_index().to_feather(feather_path)

This gives me an error:

  File "C:\apps\Python\lib\site-packages\pandas\util\_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "C:\apps\Python\lib\site-packages\pandas\core\frame.py", line 2519, in to_feather
    to_feather(self, path, **kwargs)
  File "C:\apps\Python\lib\site-packages\pandas\io\feather_format.py", line 87, in to_feather
    feather.write_feather(df, handles.handle, **kwargs)
  File "C:\apps\Python\lib\site-packages\pyarrow\feather.py", line 152, in write_feather
    table = Table.from_pandas(df, preserve_index=False)
  File "pyarrow\table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
  File "C:\apps\Python\lib\site-packages\pyarrow\pandas_compat.py", line 607, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "C:\apps\Python\lib\concurrent\futures\_base.py", line 438, in result
    return self.__get_result()
  File "C:\apps\Python\lib\concurrent\futures\_base.py", line 390, in __get_result
    raise self._exception
  File "C:\apps\Python\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\apps\Python\lib\site-packages\pyarrow\pandas_compat.py", line 575, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow\array.pxi", line 302, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow\error.pxi", line 114, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: realloc of size 3221225472 failed

Note: This works fine in PyCharm; there are no issues writing the Feather file. But when the Python program is called from a Windows batch file like:

call python "myprogram.py"

and the batch file is scheduled as a task in Task Scheduler, it fails with the above memory error.
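
In case it helps to narrow this down, here is a minimal, hypothetical snippet (not part of myprogram.py) that could be added at the top of the script to check whether Task Scheduler launches the same interpreter as PyCharm; a 32-bit process or a different python.exe is only an assumption, but either could explain why a ~3 GB allocation fails in one context and not the other:

import sys
import platform

# Hypothetical diagnostic: log which interpreter is running and whether it is
# 64-bit, since a 32-bit process cannot satisfy a ~3 GB realloc.
print("Executable:", sys.executable)
print("64-bit process:", sys.maxsize > 2**32)
print("Python version:", platform.python_version())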

The PyArrow version is 5.0.0, if that helps.

Any ideas, please?

SomeDude
  • Interesting. It appears to be trying to allocate ~3 GB for one column of the dataframe. Is there any chance you could put a breakpoint on `C:\apps\Python\lib\site-packages\pyarrow\pandas_compat.py`, line 575, and inspect `type_` and `col` (e.g. how long is the array, and what kind of values does it seem to have)? A sketch of doing this without a debugger follows these comments. – Pace Oct 08 '21 at 07:46
  • I had a similar problem happen in a VS Code Jupyter notebook. My dataframe was not very big, so I tried a Jupyter notebook in Chrome and the problem went away! – H.C.Chen Jan 03 '22 at 05:01
  • Same issue: it works in a Jupyter notebook, but in my PyCharm it doesn't. – Timbus Calin Feb 26 '22 at 14:08
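
Following up on the first comment's suggestion, here is a minimal sketch (the loop, names, and error handling are assumptions, not the asker's code) of how the offending column could be identified without attaching a debugger, by converting the columns one at a time in roughly the way pyarrow's convert_column does:

import pyarrow as pa

# Hypothetical sketch: convert each column to an Arrow array individually to see
# which one triggers the ~3 GB allocation reported in the traceback.
frame = df.reset_index()  # the same frame that to_feather() receives
for name in frame.columns:
    col = frame[name]
    try:
        arr = pa.array(col, from_pandas=True)  # roughly the call made in convert_column
        print(f"{name}: dtype={col.dtype}, rows={len(col)}, arrow bytes={arr.nbytes}")
    except pa.lib.ArrowMemoryError as exc:
        print(f"{name}: dtype={col.dtype}, rows={len(col)} -> FAILED: {exc}")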

0 Answers