While parsing a larger file, I need to write to a large number of parquet files successively in a loop. However, the memory consumed by this task increases with each iteration, whereas I would expect it to remain constant (since nothing should be accumulating in memory). This makes it tricky to scale.

I've added a minimal reproducible example below, which creates 10,000 parquet files and appends to them in a loop.

import resource
import random
import string

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    """Return a random string, used for both file names and cell values."""
    return ''.join(random.choice(chars) for _ in range(size))


schema = pa.schema([pa.field('test', pa.string())])

# Raise the open-file limit so all 10,000 writers can stay open at once.
resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
number_files = 10000
number_rows_increment = 1000
number_iterations = 100

# Open one ParquetWriter per output file and keep them all open.
writers = [pq.ParquetWriter('test_' + id_generator() + '.parquet', schema)
           for _ in range(number_files)]

for i in range(number_iterations):
    for writer in writers:
        # Build a small table and append it to the file as a new row group.
        table_to_write = pa.Table.from_pandas(
            pd.DataFrame({'test': [id_generator() for _ in range(number_rows_increment)]}),
            preserve_index=False,
            schema=schema,
            nthreads=1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)

for writer in writers:
    writer.close()
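
One way to observe the growth per iteration (a sketch, assuming Linux, where ru_maxrss is reported in kilobytes) is to replace the print(i) above with a call to a small helper like this hypothetical print_rss:

import resource

def print_rss(label):
    # Peak resident set size of this process so far; kilobytes on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f'{label}: peak RSS {peak_kb / 1024:.1f} MiB')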

Would anyone have any idea what causes this leak and how to prevent it?

Abel Riboulot

2 Answers

We aren't sure what is wrong, but some other users have reported as-yet-undiagnosed memory leaks. I added your example to one of the tracking JIRA issues: https://issues.apache.org/jira/browse/ARROW-3324

Wes McKinney

Update in 2022:

I've spent several days on a memory leak issue in pyarrow. Please see here for a better understanding. I'll paste the key points below. Basically, the maintainers are saying it is not a memory leak in the library; rather, it is expected allocator behavior.

Pyarrow uses jemalloc, a custom memory allocator which does its best to hold onto memory allocated from the OS (since requesting memory from the OS is expensive). Unfortunately, this makes it difficult to track line-by-line memory usage with tools like memory_profiler. There are a couple of options:

  1. You can use the library function pyarrow.total_allocated_bytes to track allocation instead of using memory_profiler (see the sketch after the snippet below).
  2. You can also put the following lines at the top of your script; this will configure jemalloc to release memory immediately instead of holding onto it (likely with some performance cost). However, I tried it and it did not work in my case.
import pyarrow as pa
pa.jemalloc_set_decay_ms(0)
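
As a sketch of option 1: pyarrow.total_allocated_bytes reflects what Arrow itself considers allocated, so it drops back to baseline when a table is freed even if the process RSS does not (the column name and sizes below are purely illustrative):

import pyarrow as pa

before = pa.total_allocated_bytes()
table = pa.table({'test': ['x' * 6] * 1_000_000})  # a few MiB of strings
print(f'while alive: {(pa.total_allocated_bytes() - before) / 2**20:.1f} MiB allocated by Arrow')
del table
# Arrow's own counter returns to its baseline even though jemalloc may
# still be holding the freed pages, keeping the process RSS elevated.
print(f'after del:   {(pa.total_allocated_bytes() - before) / 2**20:.1f} MiB allocated by Arrow')

If this counter stays flat across iterations while RSS keeps growing, the growth is allocator retention rather than a leak in your own code.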

The behavior you are seeing is pretty typical for jemalloc. For further reading, you can also see these other issues for more discussion and examples of jemalloc behavior:

https://issues.apache.org/jira/browse/ARROW-6910

https://issues.apache.org/jira/browse/ARROW-7305
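
A related knob, not mentioned above: pyarrow also lets you swap out jemalloc for the plain system allocator, which can help confirm whether the retention you see is jemalloc-specific (a sketch; the system pool is usually slower):

import pyarrow as pa

# Route all subsequent Arrow allocations through malloc/free instead of
# jemalloc; RSS should then track frees more closely.
pa.set_memory_pool(pa.system_memory_pool())
print(pa.default_memory_pool().backend_name)  # prints 'system'

The same switch can be made without code changes by setting the environment variable ARROW_DEFAULT_MEMORY_POOL=system before the process starts.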

user3503711