
I'm trying to restore some historic backup files that were saved in Parquet format. I want to read them once and write the data into a PostgreSQL database.

I know the backup files were saved using Spark, but I'm under a strict restriction: I can't install Spark on the DB machine, nor read the parquet files with Spark on a remote machine and write them to the database using spark_df.write.jdbc. Everything has to happen on the DB machine, without Spark or Hadoop, using only Postgres and Bash scripting.

My file structure looks something like this:

foo/
    foo/part-00000-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00001-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00002-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    ..
    ..

I expect to read the data and schema from each parquet folder such as foo, create a table using that schema, and write the data into that table, using only Bash and the Postgres CLI.

Javad Bahoosh
  • You can try the Parquet Foreign Data Wrapper https://github.com/adjust/parquet_fdw. You'll have to download the files from HDFS first. – Remus Rusanu Nov 10 '19 at 08:07
  • @RemusRusanu It's quite interesting, thank you! I'm going to test it, but the commit history shows it is still under heavy development. I'm looking for a solution based on processing the files using bash. – Javad Bahoosh Nov 10 '19 at 08:51

2 Answers


You can use Spark to convert the parquet files to CSV format, then move the CSV files to the DB machine and import them with any tool.

# Step 1 (on a machine with Spark): convert the parquet files to CSV.
spark.read.parquet("...").write.csv("...")

# Step 2 (on the DB machine): load the CSV into Postgres.
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('mypath.csv')
df.columns = [c.lower() for c in df.columns]  # Postgres doesn't like capitals or spaces

engine = create_engine('postgresql://username:password@localhost:5432/dbname')
df.to_sql("my_table_name", engine)
Moein Hosseini
    Thanks for your answer! Eventually I decided to convert the parquet files to CSV using Spark on another machine, ship the CSV files to the DB machine, and populate the tables using the SQL `COPY foo FROM '/path/to/csv/foo' WITH (FORMAT CSV)` statement. – Javad Bahoosh Nov 10 '19 at 14:54
  • This is one of the best answers I've seen to the question "easiest way to ingest csv files into Postgres using python" – Joey Baruch Jun 16 '21 at 01:13
    Alternatively, you can even skip the whole reading into Spark/writing to CSV step by just using `pyarrow.parquet` and reading directly into pandas with the `ParquetDataset` function - that could save an entire write and read of the data. – bsplosion Jul 15 '21 at 20:57
    Why not use `pd.read_parquet` here instead of `spark.read.parquet` ? – baxx Mar 12 '23 at 23:56
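The COPY-based workflow described in the comments above can be sketched as a small Bash loop. `build_copy_cmd` and `load_dir` are hypothetical helper names; paths and table names are placeholders, the target table is assumed to already exist, and connection details are assumed to come from the usual PG* environment variables:

```shell
#!/usr/bin/env bash
# Sketch: bulk-load Spark-generated CSV part-files into an existing
# Postgres table with psql's \copy meta-command.
set -euo pipefail

# Build the \copy meta-command for one table/file pair.
build_copy_cmd() {
    local table="$1" file="$2"
    printf "\\\\copy %s FROM '%s' WITH (FORMAT CSV)" "$table" "$file"
}

# Load every part-file in a directory into the given table.
load_dir() {
    local dir="$1" table="$2"
    local f
    for f in "$dir"/part-*.csv; do
        psql -c "$(build_copy_cmd "$table" "$f")"
    done
}

# Usage (requires a reachable database):
#   load_dir /path/to/csv/foo foo
```

Using `\copy` rather than server-side `COPY` means psql streams the file itself, so the script works even when run as a non-superuser.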

I made a library to convert from parquet to Postgres’ binary format: https://github.com/adriangb/pgpq

LoveToCode