
I have a 32 GB machine and the CSV file is 1 million rows by 4 columns (800 MB). When I run the code, Python only uses about 1 GB of my memory, but I still get a memory error:

MemoryError: Unable to allocate array with shape (23459822,) and data type int64

NOTE: the problem only occurs on Windows; on Ubuntu the exact same code runs without issue.
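
For what it's worth, the failing allocation by itself should be tiny relative to 32 GB: 23,459,822 int64 values is only about 180 MB. A standalone check like the following (shape and dtype copied straight from the error message) runs without error for me:

    import numpy as np

    # same shape and dtype as in the error message
    arr = np.empty((23459822,), dtype=np.int64)
    print(arr.nbytes / 1024**2)  # ~179 MB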

The code in question:

elif light in entry:

    # `light`, `entry`, and `path` are defined earlier in the script
    df = pandas.read_csv('maps_android_light_raw_20190909.csv')

    # write each device's rows to its own CSV file
    for i, g in df.groupby('device_id'):
        output_file2 = path + f'{i}/LIGHT/'

        if not os.path.exists(output_file2):
            os.makedirs(output_file2)

        g.to_csv(output_file2 + f'{i}.csv', index=False)

    # free the DataFrame after the loop (inside the loop, `del df`
    # would raise NameError on the second iteration)
    del df

The full traceback:

Traceback (most recent call last):
  File "light.py", line 49, in <module>
    main()
  File "light.py", line 33, in main
    for i,g in df2:
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 164, in get_iterator
    for key, (i, group) in zip(keys, splitter):
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 899, in __iter__
    sdata = self._get_sorted_data()
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 918, in _get_sorted_data
    return self.data.take(self.sort_idx, axis=self.axis)
  File "pandas/_libs/properties.pyx", line 34, in pandas._libs.properties.CachedProperty.__get__
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 896, in sort_idx
    return get_group_index_sorter(self.labels, self.ngroups)
  File "C:\Python37\lib\site-packages\pandas\core\sorting.py", line 349, in get_group_index_sorter
    sorter, _ = algos.groupsort_indexer(ensure_int64(group_index), ngroups)
  File "pandas/_libs/algos.pyx", line 173, in pandas._libs.algos.groupsort_indexer
MemoryError: Unable to allocate array with shape (23459822,) and data type int64
  • This is a duplicate of a question that was asked 6 years ago. Please first check the answers/suggestions here: [Memory error when using pandas read_csv](https://stackoverflow.com/questions/17557074/memory-error-when-using-pandas-read-csv) – Peter Nov 24 '19 at 21:44 (a chunked-read sketch along these lines follows the comments)
  • @Peter, the answers in that thread are fairly out of date. The top rated answer is about using a 32-bit version of Windows. – James Nov 24 '19 at 21:56
  • 1
    Geordie: any chance you are using a 32-bit install of Python? Are you able run this line of code: `df = pandas.DataFrame({'a': pandas.np.random.randint(0,1000, size=23459822, dtype=pd.np.int64)})`? – James Nov 24 '19 at 21:56
  • @James, yeah it appears that code runs fine, no errors. – Geordie Wicks Nov 24 '19 at 22:04
  • Can you post the full traceback? – James Nov 24 '19 at 22:09
  • @james, added traceback – Geordie Wicks Nov 24 '19 at 22:14
  • If you have a small cluster of computers available, you can check out [dask](https://dask.org/), which offers a pandas-like DataFrame that can be distributed over a cluster of machines to handle larger datasets. – miku Nov 24 '19 at 22:17 (a minimal dask sketch also follows the comments)
  • @james, just tried to run it on an Ubuntu VM, and it works fine. It must be some kind of operating system thing, but I have no idea what it could be. – Geordie Wicks Nov 24 '19 at 23:26
  • Have you tried `low_memory=False`? `df = pandas.read_csv('maps_android_light_raw_20190909.csv', low_memory=False)` – kha Nov 24 '19 at 23:37
  • @khant, yup, that was one of the first things I tried; it made no difference. For now I am just going to do the analysis on an Ubuntu VM. It's one of those weird issues. – Geordie Wicks Nov 25 '19 at 00:05
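
Following Peter's chunked-read suggestion above, a workaround sketch might look like the following. This is only a sketch: the `chunksize` value is arbitrary, `path` is the same variable as in the question's code, and appending with `mode='a'` allows for a device's rows being split across chunks (the `seen` set keeps the header from being written more than once per device):

    import os
    import pandas

    seen = set()  # device ids that already have a header row written

    # read the CSV in pieces instead of all at once
    for chunk in pandas.read_csv('maps_android_light_raw_20190909.csv',
                                 chunksize=500_000):
        for i, g in chunk.groupby('device_id'):
            output_dir = path + f'{i}/LIGHT/'
            os.makedirs(output_dir, exist_ok=True)
            # append each piece, writing the header only once per device
            g.to_csv(output_dir + f'{i}.csv', mode='a', index=False,
                     header=i not in seen)
            seen.add(i)

Whether this sidesteps the Windows-specific allocation failure is untested; it just keeps the peak memory footprint to roughly one chunk at a time.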
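
Along the lines of miku's dask suggestion, a minimal single-machine sketch might look like this (it rescans the file once per device, so it trades speed for memory; `path` again as in the question's code):

    import os
    import dask.dataframe as dd

    ddf = dd.read_csv('maps_android_light_raw_20190909.csv')

    # materialize the distinct device ids, then filter and write each one
    for i in ddf['device_id'].unique().compute():
        output_dir = path + f'{i}/LIGHT/'
        os.makedirs(output_dir, exist_ok=True)
        ddf[ddf['device_id'] == i].compute().to_csv(
            output_dir + f'{i}.csv', index=False)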

0 Answers