
I have Parquet files with some data in them. I want to add more data to them frequently, every day. I want to do this without having to load the file into memory, concatenate, and write it all out again; instead, I want to append directly to the end of the table in the file.

I tried using pq.ParquetWriter() and its write_table() method too, but it overwrites the previous data.

Keyhan
    You can't trivially append to a Parquet file like a CSV, because internally it stores metadata and statistics about groups of rows, so appending a row at a time would at best leave you with a terribly structured file. A common alternative is to write a bunch of new, similarly-structured files, which a query engine like Acero can then easily read together (and you can then compact into fewer files if you like). – alistaire Apr 27 '23 at 15:50
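
To illustrate the pattern from the comment above, here is a minimal sketch using pyarrow's dataset API (the directory, file name, and columns are placeholders, not part of the original question):

import os

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Each day's data goes into its own file inside one directory.
os.makedirs('data_dir', exist_ok=True)
daily_table = pa.table({'column1': [1, 2, 3], 'column2': ['a', 'b', 'c']})
pq.write_table(daily_table, 'data_dir/part-2023-04-27.parquet')

# Read every file in the directory back as one logical table.
dataset = ds.dataset('data_dir', format='parquet')
combined = dataset.to_table()

# Optional compaction: rewrite the accumulated small files as fewer, larger ones.
ds.write_dataset(combined, 'data_dir_compacted', format='parquet')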

1 Answer


The solution is to read the existing data, append the new data to it, and write everything back to the file.

Example code using pandas, assuming the data fits in memory (if not, you can use Dask; see below):

import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

# Read the existing Parquet file
existing_df = pd.read_parquet('existing_file.parquet')

# Create a new DataFrame with new data (alternatively, read from another source)
new_data = {'column1': [value1, value2, ...],
            'column2': [value1, value2, ...],
            ...}

new_df = pd.DataFrame(new_data)

# Concatenate the existing DataFrame with the new DataFrame
updated_df = pd.concat([existing_df, new_df], ignore_index=True)

# Write the updated DataFrame to the same Parquet file
table = pa.Table.from_pandas(updated_df)
pq.write_table(table, 'existing_file.parquet', compression='snappy', use_dictionary=True)

There are two options when the data does not fit in memory:

  • Either use Dask, as follows:
import dask.dataframe as dd
import pandas as pd

# Read the existing Parquet file
existing_ddf = dd.read_parquet('existing_file.parquet')

# Create a new Dask DataFrame with new data (alternatively, read from another source)
new_data = {'column1': [value1, value2, ...],
            'column2': [value1, value2, ...],
            ...}

new_ddf = dd.from_pandas(pd.DataFrame(new_data), npartitions=1)

# Concatenate the existing Dask DataFrame with the new Dask DataFrame
updated_ddf = dd.concat([existing_ddf, new_ddf], ignore_index=True)

# Write the updated Dask DataFrame out (here to a new file)
updated_ddf.to_parquet('updated_file.parquet', compression='snappy')

  • Or write the new data directly to a new Parquet file without reading the old one, which is the better practice. Write it into the same directory as the existing file; then, when you read, read all the files together, since Parquet readers support reading multiple files at once. A sketch of this is shown below.
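
For example, a minimal sketch of that approach with pandas (the directory name and the date-based file name are illustrative placeholders, not from the original post):

import os

import pandas as pd

# Write today's batch as a new file inside the dataset directory,
# without touching the files from previous days.
os.makedirs('my_dataset', exist_ok=True)
new_df = pd.DataFrame({'column1': [1, 2], 'column2': ['x', 'y']})
new_df.to_parquet('my_dataset/data_2023-04-27.parquet', compression='snappy')

# Later, read the whole directory back as a single DataFrame.
all_data = pd.read_parquet('my_dataset')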

Eyad Sibai