
I am pretty new to Python and databases. I am trying to write my dataset in the Feather format. The dataset is large and segmented, so I want to store my data in chunks and retrieve only certain chunks when I need them. Would this be possible? I appreciate any advice. Thank you!

I was looking through the API for feather in pyarrow and found that the write function lets you specify chunksize, but I haven't been able to find out how to read the chunks back or query specific chunks.

MeganNN

1 Answer


Yes, this is possible. It's a little hard to see because pyarrow calls its format 'feather' in some places, 'arrow' in others, and 'ipc' in still others. The simpler method is pyarrow.feather.read_table, which you've found, but to access individual chunks you'll want to use pyarrow.ipc.open_file instead. That returns a reader object that can load one record batch at a time.

import pyarrow.ipc

reader = pyarrow.ipc.open_file('my_file.feather')
first_batch = reader.get_batch(0)   # batches are zero-indexed
fourth_batch = reader.get_batch(3)

Note also that feather.read_table() doesn't necessarily load the whole table into memory. With memory_map=True (the default for this option has differed between pyarrow versions) and an uncompressed file, it uses memory mapping, which makes the data fast to work with while leaving it on disk. That means that for non-enormous datasets (up to a gigabyte or two) it's often fine to load the whole thing at once.

Ben Schmidt
  • Thank you, I will try this out! Right now I am using the to_batches(self, max_chunksize=None) method of a Pyarrow Table after I read in the table and then get different batches. – MeganNN Apr 18 '23 at 16:43