The solution is to read the existing data, append the new rows, and write the result back to the file.
Example code, assuming you are using pandas and the data fits in memory (if it doesn't, you can use Dask, shown further below):
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
# Read the existing Parquet file
existing_df = pd.read_parquet('existing_file.parquet')
# Create a new DataFrame with new data (alternatively, read from another source)
new_data = {'column1': [value1, value2],  # replace the placeholders with your own columns and values
            'column2': [value1, value2]}
new_df = pd.DataFrame(new_data)
# Concatenate the existing DataFrame with the new DataFrame
updated_df = pd.concat([existing_df, new_df], ignore_index=True)
# Write the updated DataFrame to the same Parquet file
table = pa.Table.from_pandas(updated_df)
pq.write_table(table, 'existing_file.parquet', compression='snappy', use_dictionary=True)
There are two solutions for data that does not fit in memory:
- Either use Dask, as follows:
import dask.dataframe as dd
# Read the existing Parquet file
existing_ddf = dd.read_parquet('existing_file.parquet')
# Create a new Dask DataFrame with new data (alternatively, read from another source)
new_data = {'column1': [value1, value2],  # replace the placeholders with your own columns and values
            'column2': [value1, value2]}
new_ddf = dd.from_pandas(pd.DataFrame(new_data), npartitions=1)
# Concatenate the existing Dask DataFrame with the new Dask DataFrame
updated_ddf = dd.concat([existing_ddf, new_ddf], ignore_index=True)
# Write the updated Dask DataFrame to the same Parquet file (or a new file)
updated_ddf.to_parquet('updated_file.parquet', compression='snappy')
- Or write the new data directly to its own Parquet file, without reading the old one. This is generally the better practice; just make sure the new file goes into the same directory as the existing one. When you read the data back, point the reader at that directory so both files are read together, since Parquet readers support reading multiple files as a single dataset.
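A minimal sketch of that second approach (the directory name 'data_dir' and the file name 'new_part.parquet' are just examples, and it assumes both files share the same schema):
import pandas as pd
# Placeholder data; replace with your real new rows (columns must match the existing file's schema)
new_df = pd.DataFrame({'column1': [1, 2], 'column2': [3, 4]})
# Write only the new rows as a separate file inside the same directory as the existing file
new_df.to_parquet('data_dir/new_part.parquet', compression='snappy')
# Reading the directory loads all Parquet files in it as one DataFrame
combined_df = pd.read_parquet('data_dir/')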