What I'm really trying to do is this (in Python):

import pyarrow.parquet as pq

# Note the 'columns' predicate...
table = pq.read_table('gs://my_bucket/my_blob.parquet', columns=['a', 'b', 'c'])

First, I don't think that gs:// is supported in PyArrow as of v3.0.0, so I have to modify the code to use the fsspec interface: https://arrow.apache.org/docs/python/filesystems.html

import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-google-project')
with fs.open('my_bucket/my_blob.parquet', 'rb') as file:
    table = pq.read_table(file.read(), columns=['a', 'b', 'c'])

Does this achieve predicate pushdown? I doubt it, because I'm already reading the whole file with file.read(). Is there a better way to get there?

user5406764
  • Have you tried `table = pq.read_table(file, columns=['a', 'b', 'c'])` (without the `read`)? `read_table` accepts file-like objects as an argument. – 0x26res Apr 22 '21 at 07:40

1 Answer

Does this work?

import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-google-project')
table = pq.read_table('gs://my_bucket/my_blob.parquet', columns=['a', 'b', 'c'], filesystem=fs)
Pace
  • @user5406764 could you please tell us if this works for you? – vi calderon Apr 30 '21 at 14:42
  • Yes, this did indeed work. I tested the load time with columns set to a single known column versus not specifying columns at all. The load time was significantly faster with the single column. – user5406764 May 01 '21 at 15:55