
How can I write a pandas dataframe to disk in .arrow format? I'd like to be able to read the arrow file into Arquero as demonstrated here.

RobinL

3 Answers


Since Feather is the Arrow IPC format, you can probably just use write_feather. See http://arrow.apache.org/docs/python/feather.html

Neal Richardson
  • Interesting - I will try this. I knew they were very similar! – RobinL Nov 02 '20 at 16:15
  • Thank you. I can confirm that `feather.write_feather(table, 'file.feather', compression='uncompressed')` works with Arquero, as does saving to `.arrow` using `pa.ipc.new_file`. The uncompressed Feather file is about 10% larger on disk than the `.arrow` file. A compressed Feather file cannot be read using the same methodology: https://observablehq.com/d/298f76ea5f91b5fe (taken from https://observablehq.com/@uwdata/arquero-and-apache-arrow). I've uploaded the files here: https://github.com/RobinL/arrow_test – RobinL Nov 02 '20 at 17:54
  • 2
    Correct, the JS implementation of Arrow hasn't added support for compressed feather/arrow files, so you'll need to write them uncompressed. – Neal Richardson Nov 03 '20 at 17:49
  • Cheers guys - couldn't find anything in the arrow docs, notebooks or mailing lists on this, is there anywhere? I'm assuming this is related to the lack of browser support for javascript compression/decompression. – nite Dec 15 '20 at 21:54

You can do this as follows:

import pandas
import pyarrow

df = pandas.read_parquet('your_file.parquet')

table = pyarrow.Table.from_pandas(df, preserve_index=False)

sink = "myfile.arrow"

# Note: new_file creates a RecordBatchFileWriter for the Arrow IPC file format
with pyarrow.ipc.new_file(sink, table.schema) as writer:
    writer.write_table(table)
buhtz
RobinL

Pandas can write a DataFrame directly to the binary Feather format (it uses pyarrow under the hood):

import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_feather('my_data.arrow')

Additional keyword arguments are passed through to pyarrow.feather.write_feather(); this includes the compression, compression_level, chunksize and version keywords.

ns15