
I have a dataframe split and stored across more than 5000 files. I use ParquetDataset(fnames).read() to load all the files. After upgrading pyarrow from 0.13.0 to the latest version, 1.0.1, it started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. My machine has 256 GB of memory, far more than enough to load the data, which needs less than 10 GB. You can use the code below to reproduce the issue on your side.
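Before the reproduction script, for completeness, this is roughly how I check which pyarrow version and memory allocator a session is using (the `backend_name` property is my assumption for newer releases; it is not available on 0.13.0):

    # quick environment check before running the reproduction below
    import pyarrow as pa

    print(pa.__version__)                    # 1.0.1 on the failing setup, 0.13.0 on the working one
    pool = pa.default_memory_pool()
    print(pool.backend_name)                 # e.g. 'jemalloc' or 'system'; newer pyarrow only
    print(pool.bytes_allocated())            # bytes currently held by Arrow's allocator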

    # create a big dataframe
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': np.arange(50000000)})
    df['F1'] = np.random.randn(50000000) * 100
    df['F2'] = np.random.randn(50000000) * 100
    df['F3'] = np.random.randn(50000000) * 100
    df['F4'] = np.random.randn(50000000) * 100
    df['F5'] = np.random.randn(50000000) * 100
    df['F6'] = np.random.randn(50000000) * 100
    df['F7'] = np.random.randn(50000000) * 100
    df['F8'] = np.random.randn(50000000) * 100
    df['F9'] = 'ABCDEFGH'
    df['F10'] = 'ABCDEFGH'
    df['F11'] = 'ABCDEFGH'
    df['F12'] = 'ABCDEFGH01234'
    df['F13'] = 'ABCDEFGH01234'
    df['F14'] = 'ABCDEFGH01234'
    df['F15'] = 'ABCDEFGH01234567'
    df['F16'] = 'ABCDEFGH01234567'
    df['F17'] = 'ABCDEFGH01234567'

    # split and save data to 5000 files
    for i in range(5000):
        df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)

    # use a fresh session to read data

    # the code below reads the data successfully with pandas
    import pandas as pd
    df = []
    for i in range(5000):
        df.append(pd.read_parquet(f'{i}.parquet'))

    df = pd.concat(df)
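
To back the "less than 10 GB" estimate above, this is roughly how I size the loaded data (a sketch only: `memory_usage(deep=True)` counts each string object even when they are deduplicated, so it overestimates, and converting to an Arrow table temporarily doubles memory use):

    # rough size check after the pandas read above
    import pyarrow as pa

    print(df.memory_usage(deep=True).sum() / 1e9)    # GB, pandas view (overcounts shared strings)
    print(pa.Table.from_pandas(df).nbytes / 1e9)     # GB, Arrow buffer view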


    # the code below crashes with a memory error in pyarrow 1.0/1.0.1 (it works fine with 0.13.0)
    # tried use_legacy_dataset=False as well, same issue
    import pyarrow.parquet as pq

    fnames = []
    for i in range(5000):
        fnames.append(f'{i}.parquet')

    print(len(fnames))  # sanity check: should be 5000

    df = pq.ParquetDataset(fnames).read(use_threads=False)
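
As a stopgap, the sketch below is what I am experimenting with instead of ParquetDataset: read each file with pq.read_table and concatenate the tables myself. Forcing the plain system allocator instead of jemalloc is just an assumption on my side, not something I have confirmed to help:

    # workaround sketch: read per file and concatenate, bypassing ParquetDataset
    import pyarrow as pa
    import pyarrow.parquet as pq

    pa.set_memory_pool(pa.system_memory_pool())   # assumption: allocator choice may matter

    tables = [pq.read_table(f'{i}.parquet') for i in range(5000)]
    df = pa.concat_tables(tables).to_pandas()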
ashish
  • Could you open a bug report about this? See https://issues.apache.org/jira/projects/ARROW/issues/ (the red "Create" button at the top). Also, do you get the same error message with `use_legacy_dataset` set to True or False? – joris Sep 08 '20 at 11:49
  • I don't think I have permission/option to open JIRA on the link you mentioned, can you please do it for me? The error message with use_legacy_dataset=False gives "terminate called after throwing an instance of 'std::bad_alloc'" – ashish Sep 09 '20 at 12:31
  • Anybody can open a JIRA, you normally don't need any special permissions. You only might need to create an account (which will be needed anyway to further answer / comment on the JIRA issue). I think the easiest would be if you could open a JIRA, as there will be several follow-up questions to try to diagnose the issue (like: does reading a single file with `pq.read_table()` work? What if you comment out the last few string columns? ..) (I can't test this, because I don't have enough memory to try with such a large dataset. With 10x smaller data, the example code runs fine for me) – joris Sep 10 '20 at 09:10
  • Thanks, I have created a JIRA (https://issues.apache.org/jira/browse/ARROW-9974). pq.read_table works for a single file. – ashish Sep 11 '20 at 10:30

0 Answers