8

Is it possible to convert a Pandas dataframe to and from an ORC file? I can write the df to a Parquet file, but the library doesn't seem to have ORC support. Is there an available solution in Python? If not, what would be the best strategy? One option could be converting the Parquet file to ORC with an external tool, but I have no clue where to find one.

Asclepius
alcor
  • Are you using Hive or Spark (or both)? What you are trying to do is much easier, and less error-prone, if you have one of those. In particular, I strongly suggest you use Hive to manage your ORC files. You can connect to it from Python using the pyodbc or pyhive packages. – Habardeen Dec 04 '19 at 11:29
  • @alcor I have just finished the ORC adapter in C++ and Python so it is possible to write ORC files now if you use my fork: https://github.com/mathyingzhou/arrow. – Ying Zhou Jan 10 '21 at 14:59

3 Answers

7

This answer is tested with pyarrow==4.0.1 and pandas==1.2.5.

To write, it first creates a pyarrow table using pyarrow.Table.from_pandas, then writes the ORC file using pyarrow.orc.write_table. To read, it uses pd.read_orc directly.

Read orc

import pandas as pd
import pyarrow.orc  # This prevents: AttributeError: module 'pyarrow' has no attribute 'orc'

df = pd.read_orc('/tmp/your_df.orc')

Write orc

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

# Here prepare your pandas df.

table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '/tmp/your_df.orc')

As of pandas==1.3.0, there is no DataFrame.to_orc writer yet.

Asclepius
  • Do you have any idea whether it is possible to set a compression type when writing an ORC file with the solution you describe? – Dominik Dec 07 '21 at 18:02
4

To add to the answer above, pandas v1.5.0 natively supports writing ORC files through DataFrame.to_orc. I'll update this with more documentation when it's released.

my_df.to_orc('myfile.orc')
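If the same code has to run on pandas versions both before and after 1.5.0, one option is to check for the method and fall back to the pyarrow route from the answer above. This is only a minimal sketch: `write_orc` is a hypothetical helper name (it is not part of pandas or pyarrow), and the fallback branch assumes pyarrow is installed with ORC support.

```python
def write_orc(df, path):
    """Write df to an ORC file and return which writer was used.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    # pandas >= 1.5.0 exposes DataFrame.to_orc, so prefer it when present.
    if hasattr(df, "to_orc"):
        df.to_orc(path)
        return "pandas"

    # Older pandas: go through a pyarrow Table instead.
    # Imported lazily so pyarrow is only needed on this branch.
    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.Table.from_pandas(df, preserve_index=False)
    orc.write_table(table, path)
    return "pyarrow"
```

Calling `write_orc(my_df, 'myfile.orc')` then works the same way on either version. Checking for the attribute avoids parsing version strings by hand.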

Gabe
0

I recently used pyarrow, which has ORC support, although I've seen a few reported issues where the pyarrow.orc module fails to load.

pip install pyarrow

To use it:

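import pandas as pd
import pyarrow.orc as orc

filename = '/tmp/your_df.orc'  # Placeholder path; point this at your ORC file.

# ORC is a binary format, so the file must be opened in binary mode.
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()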
PHY6