I have several hundred parquet files created with PyArrow. In some of those files, however, one field/column has a slightly different name (we'll call it Orange) than the original column (call it Sporange), because those files were generated with a variant of the query. Otherwise the files are identical (all the other fields, and all the data). In a database world, I'd do an ALTER TABLE and rename the column. However, I don't know how to do that with parquet/PyArrow.

Is there a way to rename the column in the file, rather than having to regenerate or duplicate the file?

Alternatively, can I read it (read_table or ParquetFile, I assume), change the column in the object (unsure how to do that), and write it out?

I see "rename_columns", but I'm unsure how it works; when I tried calling it by itself, I got "rename_columns is not defined".

rename_columns(self, names) Create new table with columns renamed to provided names.

Many thanks!

mbourgon

1 Answer

I suspect you are using a version of pyarrow that doesn't support rename_columns. Can you run pa.__version__ to check?

Otherwise, what you want to do is straightforward. In the example below, I rename column b to c:

import pyarrow as pa
import pyarrow.parquet as pq

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())

table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([
        pa.field('a', col_a.type),
        pa.field('b', col_b.type),
    ])
)

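# Write the table out, then read it back as if it were an existing file.
pq.write_table(table, '/tmp/original')
original = pq.read_table('/tmp/original')
# rename_columns is a Table method: pass the full list of names, in column
# order. It returns a new table and leaves the original unchanged.
renamed = original.rename_columns(['a', 'c'])
pq.write_table(renamed, '/tmp/renamed')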
0x26res
  • Ah! I had the right version, I just didn't try invoking(?) it as part of the variable. Cool, many thanks for the code! – mbourgon Aug 11 '20 at 13:43