I am trying to write a parquet file which contains one date column having logical type in parquet as DATE and physical type as INT32. I am writing the parquet file using pandas and using fastparquet as the engine since I need to stream the data from database and append to the same parquet file. Here is my code
import os
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text
sql = "SELECT TO_VARCHAR(CURRENT_DATE, 'YYYYMMDD') AS "REPORT_DATE" FROM DUMMY;"
def create_stream_enabled_connection(username, password, host, port):
conn_str = f"mydb://{username}:{password}@{host}:{port}"
engine = create_engine(conn_str, connect_args={'encrypt': 'True', 'sslValidateCertificate':'False'})
connection = engine.connect().execution_options(stream_results=True)
return connection
connection = create_stream_enabled_connection(username, password, host, port)
ROWS_IN_CHUNK = 500000
# Stream chunks from database
for dataframe_chunk in pd.read_sql(text(sql), connection, chunksize=ROWS_IN_CHUNK):
if os.stat(local_path).st_size == 0: # If file is empty
# write parquet file
dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet')
else:
# write parquet file
dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet', append=True)
Problem:
I am unable to get the logical type to be DATE and physical type to be INT32 in the output parquet from pandas to_parquet
function using fastparquet
as engine.
A few things that I have tried:
- If I take the REPORT_DATE coming from database as string and put it to parquet, then the logical type is STRING. Doesn't work for me.
- If I take the REPORT_DATE coming from database as string and convert it to datetime using the following code, then the dtype is
datetime64[ns]
and logical type in parquet isTimestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
and physical type isINT64
. Doesn't work for me.
dataframe_chunk['report_date'] = pd.to_datetime(dataframe_chunk['report_date'], format='%Y%m%d')
I need the parquet logical type to be DATE
and physical type to be INT32
since this parquet will be loaded directly to bigquery and the REPORT_DATE column will go to a DATE
type column in bigquery. See the bigquery documentation here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions
- If I try to store STRING or DATETIME column in parquet and load it in bigquery, bigquery fails by saying that it expected the column with another type.
Any help would be much appreciated.