
I have a partitioned dataset stored on an internal S3 cloud. I am reading the dataset with pyarrow.dataset:

import pyarrow.dataset as ds
my_dataset = ds.dataset(ds_name, format="parquet", filesystem=s3file, partitioning="hive")
fragments = list(my_dataset.get_fragments())
required_fragment = fragments.pop()

The metadata from the required fragment shows the following:

required_fragment.metadata
<pyarrow._parquet.FileMetaData object at 0x00000291798EDF48>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 22
  num_rows: 949650
  num_row_groups: 29
  format_version: 1.0
  serialized_size: 68750

Converting this fragment to a table, however, takes a long time:

%timeit required_fragment.to_table()
6min 29s ± 1min 15s per loop (mean ± std. dev. of 7 runs, 1 loop each)

The size of the table itself is only about 272 MB:

required_fragment.to_table().nbytes
272850898

Any ideas how I can speed up converting the ds.Fragment to a table?
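For reference, one lever that should reduce the amount of data pulled from S3 is reading only a subset of columns. A minimal sketch (the column names here are hypothetical, just to show the call):

# Read only the columns that are actually needed -- fewer bytes fetched from S3.
# "col_a" / "col_b" are placeholder names, not real columns of my dataset.
subset = required_fragment.to_table(columns=["col_a", "col_b"])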

Updates

So instead of pyarrow.dataset, I tried using pyarrow.parquet. The only part of my code that changed is:

import pyarrow.parquet as pq
my_dataset = pq.ParquetDataset(ds_name, filesystem=s3file, partitioning="hive", use_legacy_dataset=False)
fragments = my_dataset.fragments
required_fragment = fragments.pop()

and when I tried the code again, the performance was much better:

%timeit required_fragment.to_table()
12.4 s ± 1.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

While I am happy with the better performance, it is still confusing: with use_legacy_dataset=False, pq.ParquetDataset is supposed to use the same new dataset implementation under the hood, so I would expect both approaches to behave similarly.
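One difference I came across that may matter on a high-latency connection: as far as I can tell, the newer pq.ParquetDataset path pre-buffers Parquet column chunks by default (coalescing many small reads into fewer, larger ranged requests), while ds.dataset leaves pre-buffering off. A sketch of enabling it explicitly on the ds.dataset path (this is my assumption about the cause, not something I have verified):

import pyarrow.dataset as ds

# Assumption: with pre_buffer=True the scanner fetches column chunks using
# fewer, larger S3 range requests instead of many small ones.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)
my_dataset = ds.dataset(
    ds_name, format=parquet_format, filesystem=s3file, partitioning="hive"
)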

PC information:
  • Installed RAM: 21.0 GB
  • Software: Windows 10 Enterprise
  • Internet speed: 10 Mbps / 156 Mbps (download / upload)
  • S3 location: Asia

Femi King
    Is your machine an ec2 container in the same region as your S3 bucket? Or is this an internet transfer? Have you measured the transfer speed? 1 minute for 272MB seems quite slow. – Pace Nov 03 '22 at 19:49
  • Thanks Pace, I've updated my post; it looks like the performance difference comes down to using pyarrow.dataset vs pyarrow.parquet.ParquetDataset. This is an internet transfer, but it doesn't look as though the transfer speed is the main issue. – Femi King Nov 03 '22 at 20:08
  • I think to have a real benchmark you'd have to copy the data locally. And even then it's not 100% fair. With S3 and the network there are a lot of opportunities for caching and optimisation. Also, to be fair in your comparison, you'd have to benchmark the whole pipeline (including the creation of the dataset and `fragments.pop`). It's possible that one implementation caches some fragments ahead of time. – 0x26res Nov 04 '22 at 09:06

0 Answers