
I have a large .parquet dataset split into ~256k chunks (20GB). Recently I repacked it into 514 chunks (28GB) to reduce the number of files.

What I really need is to load data based on a field that contains int32 values in the range 0 to 99,999,999 (around 200k distinct values).

I've tried the example from Writing large amounts of data, but pyarrow 5 doesn't allow writing that many partitions and raises pyarrow.lib.ArrowInvalid: Fragment would be written into 203094 partitions. This exceeds the maximum of 1024

Is it somehow possible to repartition the dataset based on that field so that each chunk contains a range of values? E.g. partition 1 (0-99999), partition 2 (100000-199999), ...

1 Answer

The max_partitions option is configurable (pyarrow >= 4.0.0). You might start to run into ARROW-12321, because pyarrow opens a file descriptor for each partition and won't close it until it has received all of its data. You could then bump the maximum number of file descriptors on your system to work around that.
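
For example, a rough sketch (the paths and the `key` column name here are placeholders for your actual data):

import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder paths and column name -- adjust to your dataset.
source = ds.dataset('/path/to/source', format='parquet')
part = ds.partitioning(pa.schema([('key', pa.int32())]), flavor='hive')

# max_partitions defaults to 1024; raise it to allow ~200k partitions.
# Each open partition holds a file descriptor (ARROW-12321), so you may
# also need to raise the OS file descriptor limit (e.g. `ulimit -n`).
ds.write_dataset(source, '/path/to/dest', format='parquet',
                 partitioning=part, max_partitions=250_000)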

Your idea about grouping the partition column is a good one too. That should reduce the number of files you have (making things easier to manage) and may even improve performance (even though each file will have more data). Unfortunately, this isn't quite ready to be easily implemented. Arrow's projection mechanism is what you want but pyarrow's dataset expressions aren't fully hooked up to pyarrow compute functions (ARROW-12060).

There is a slightly more verbose, but more flexible approach available. You can scan the batches in python, apply whatever transformation you want, and then expose that as an iterator of batches which the dataset writer will accept:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pa.Table.from_pydict({'x': range(20), 'y': [1] * 20})
pq.write_table(table, '/tmp/foo.parquet')
part = pa.dataset.partitioning(pa.schema([("partition_key", pa.int64())]), flavor='hive')
dataset = pa.dataset.dataset('/tmp/foo.parquet')

scanner = dataset.scanner()
scanner_iter = scanner.scan_batches()

# Arrow doesn't have modulo / integer division yet but we can
# approximate it with masking (ARROW-12755).
# There will be 2^3 items per group.  Adjust items_per_group_exponent
# to your liking for more items per file.
items_per_group_exponent = 3
items_per_group_mask = (2 ** items_per_group_exponent) - 1
mask = ((2 ** 63) - 1) ^ items_per_group_mask
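# e.g. with items_per_group_exponent = 3 the key for x = 0..7 is 0,
# for x = 8..15 it is 8, and so on (the low 3 bits are masked off).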
def projector():
    for tagged_batch in scanner_iter:
        next_batch = tagged_batch.record_batch
        # Compute the partition key and append it as an extra column.
        partition_key_arr = pc.bit_wise_and(next_batch.column('x'), mask)
        all_arrays = [*next_batch.columns, partition_key_arr]
        all_names = [*next_batch.schema.names, 'partition_key']
        batch_with_part = pa.RecordBatch.from_arrays(all_arrays, names=all_names)
        print(f'Yielding {batch_with_part}')
        yield batch_with_part

full_schema = dataset.schema.append(pa.field('partition_key', pa.int64()))
ds.write_dataset(projector(), '/tmp/new_dataset', schema=full_schema, format='parquet', partitioning=part)
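
To load only one group back afterwards, you can filter on the new column (a quick sketch against the toy dataset written above):

# Read back only the rows whose partition_key is 0 (here: x in 0..7).
new_dataset = ds.dataset('/tmp/new_dataset', format='parquet', partitioning=part)
print(new_dataset.to_table(filter=ds.field('partition_key') == 0))
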
  • Thanks for the valuable example! It took 3.5 hours until `StopIteration`. Then after 15 minutes all the files were closed, but the process kept doing something for another half an hour. – Winand Aug 10 '21 at 14:45
  • Side note: I ran the code under Windows and the process takes a lot of memory (https://imgur.com/a/Sj6Fasc). I used `items_per_group_exponent=17`. System: i5-9600K, 16 GB, Samsung 860 EVO. – Winand Aug 10 '21 at 14:53
  • Hmm, 3.5 hours seems unreasonably long when talking about ~20GB of data. What sort of disk are you reading from and writing to? How many files do you end up creating? – Pace Aug 10 '21 at 18:18
  • Ah, is that 50GB of memory usage? On a 16GB server? If you are swapping then that may explain the performance. – Pace Aug 10 '21 at 18:21
  • Yes, a lot of swapping. Does pyarrow keep all the data in memory? Also, is it possible to repartition and apply snappy compression at the same time, or is it easier to compress each .parquet file afterwards? – Winand Aug 11 '21 at 06:57
  • Pyarrow shouldn't keep all that data in memory. It has back pressure to slow itself down, but that doesn't seem to be working. ARROW-12321 means you will end up with at least one batch per file, so if you are writing many files it can add up. I don't honestly know how disk caching works on Windows, but that could play a role. If the write method never blocks then Arrow will just keep on reading, but I assume the write method blocks at some point. If you have a reasonably short reproduction feel free to file a JIRA and it can be investigated further. – Pace Aug 11 '21 at 17:55
  • 1
    I did some investigation into this today and pyarrow is using way too much memory for dataset writing. I've added some details to ARROW-13590 and opened ARROW-13611 to cover these issues so don't worry about creating a reproduction / JIRA. Thank you very much for the question, it was a very productive investigation :) – Pace Aug 12 '21 at 03:41
  • A year has passed :) Arrow 8.0. I've tried using `partition_key_arr = pc.divide(next_batch.column('x').cast('int32'), items_per_group)` instead of `bit_wise_and` and I don't see any issues. Has something changed? Is it OK to use `divide`? – Winand Jul 28 '22 at 06:34
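
For reference, a minimal sketch of the `divide`-based variant from the last comment (assuming a recent pyarrow, where `pc.divide` on integer inputs performs integer division; `items_per_group` is a value you choose, and the other names are from the answer's example):

items_per_group = 100_000  # with the question's field: 0..99999 -> 0, 100000..199999 -> 1, ...

def projector_div():
    for tagged_batch in dataset.scanner().scan_batches():
        batch = tagged_batch.record_batch
        # Keep the key as int64 so it matches 'partition_key' in full_schema.
        partition_key_arr = pc.divide(batch.column('x'), items_per_group)
        yield pa.RecordBatch.from_arrays(
            [*batch.columns, partition_key_arr],
            names=[*batch.schema.names, 'partition_key'])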