
I have a 32 GB machine and the CSV file is 1 million rows by 4 columns (800 MB). When I run the code, Python only uses about 1 GB of my memory, but I still get a memory error:

MemoryError: Unable to allocate array with shape (23459822,) and data type int64

NOTE: the problem only occurs on Windows; on Ubuntu the exact same code runs without issue.
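
For what it's worth, the failing allocation by itself should be tiny relative to 32 GB: 23,459,822 int64 values is only about 180 MB. A standalone check like the following (shape and dtype copied straight from the error message) runs without error for me:

    import numpy as np

    # same shape and dtype as in the error message
    arr = np.empty((23459822,), dtype=np.int64)
    print(arr.nbytes / 1024**2)  # ~179 MB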

The code in question:

elif light in entry:

    # `light`, `entry`, and `path` are defined earlier in the script
    df = pandas.read_csv('maps_android_light_raw_20190909.csv')

    # write each device's rows to its own CSV file
    for i, g in df.groupby('device_id'):
        output_file2 = path + f'{i}/LIGHT/'

        if not os.path.exists(output_file2):
            os.makedirs(output_file2)

        g.to_csv(output_file2 + f'{i}.csv', index=False)

    # free the DataFrame after the loop (inside the loop, `del df`
    # would raise NameError on the second iteration)
    del df

The full traceback:

Traceback (most recent call last):
  File "light.py", line 49, in <module>
    main()
  File "light.py", line 33, in main
    for i,g in df2:
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 164, in get_iterator
    for key, (i, group) in zip(keys, splitter):
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 899, in __iter__
    sdata = self._get_sorted_data()
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 918, in _get_sorted_data
    return self.data.take(self.sort_idx, axis=self.axis)
  File "pandas/_libs/properties.pyx", line 34, in pandas._libs.properties.CachedProperty.__get__
  File "C:\Python37\lib\site-packages\pandas\core\groupby\ops.py", line 896, in sort_idx
    return get_group_index_sorter(self.labels, self.ngroups)
  File "C:\Python37\lib\site-packages\pandas\core\sorting.py", line 349, in get_group_index_sorter
    sorter, _ = algos.groupsort_indexer(ensure_int64(group_index), ngroups)
  File "pandas/_libs/algos.pyx", line 173, in pandas._libs.algos.groupsort_indexer
MemoryError: Unable to allocate array with shape (23459822,) and data type int64
  • This is a duplicate of a question that was asked 6 years ago. Please first check the answers/suggestions here: [Memory error when using pandas read_csv](https://stackoverflow.com/questions/17557074/memory-error-when-using-pandas-read-csv) – Peter Nov 24 '19 at 21:44 (a chunked-read sketch along these lines follows the comments)
  • @Peter, the answers in that thread are fairly out of date. The top rated answer is about using a 32-bit version of Windows. – James Nov 24 '19 at 21:56
  • 1
    Geordie: any chance you are using a 32-bit install of Python? Are you able run this line of code: `df = pandas.DataFrame({'a': pandas.np.random.randint(0,1000, size=23459822, dtype=pd.np.int64)})`? – James Nov 24 '19 at 21:56
  • @James, yeah it appears that code runs fine, no errors. – Geordie Wicks Nov 24 '19 at 22:04
  • Can you post the full traceback? – James Nov 24 '19 at 22:09
  • @james, added traceback – Geordie Wicks Nov 24 '19 at 22:14
  • If you have a small cluster of computers available, you can check out [dask](https://dask.org/), which offers a pandas-like DataFrame that can be distributed over a cluster of machines to handle larger datasets. – miku Nov 24 '19 at 22:17 (a minimal dask sketch also follows the comments)
  • @james, just tried to run it on an Ubuntu VM, and it works fine. It must be some kind of operating system thing, but I have no idea what it could be. – Geordie Wicks Nov 24 '19 at 23:26
  • Have you tried `low_memory=False`? `df = pandas.read_csv('maps_android_light_raw_20190909.csv', low_memory=False)` – kha Nov 24 '19 at 23:37
  • @khant, yup, that was one of the first things I tried; it made no difference. For now I am just going to do the analysis on an Ubuntu VM. It's one of those weird issues. – Geordie Wicks Nov 25 '19 at 00:05
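
Following Peter's chunked-read suggestion above, a workaround sketch might look like the following. This is only a sketch: the `chunksize` value is arbitrary, `path` is the same variable as in the question's code, and appending with `mode='a'` allows for a device's rows being split across chunks (the `seen` set keeps the header from being written more than once per device):

    import os
    import pandas

    seen = set()  # device ids that already have a header row written

    # read the CSV in pieces instead of all at once
    for chunk in pandas.read_csv('maps_android_light_raw_20190909.csv',
                                 chunksize=500_000):
        for i, g in chunk.groupby('device_id'):
            output_dir = path + f'{i}/LIGHT/'
            os.makedirs(output_dir, exist_ok=True)
            # append each piece, writing the header only once per device
            g.to_csv(output_dir + f'{i}.csv', mode='a', index=False,
                     header=i not in seen)
            seen.add(i)

Whether this sidesteps the Windows-specific allocation failure is untested; it just keeps the peak memory footprint to roughly one chunk at a time.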
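
Along the lines of miku's dask suggestion, a minimal single-machine sketch might look like this (it rescans the file once per device, so it trades speed for memory; `path` again as in the question's code):

    import os
    import dask.dataframe as dd

    ddf = dd.read_csv('maps_android_light_raw_20190909.csv')

    # materialize the distinct device ids, then filter and write each one
    for i in ddf['device_id'].unique().compute():
        output_dir = path + f'{i}/LIGHT/'
        os.makedirs(output_dir, exist_ok=True)
        ddf[ddf['device_id'] == i].compute().to_csv(
            output_dir + f'{i}.csv', index=False)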

0 Answers