
I am trying to write a Parquet file that contains one date column whose Parquet logical type is DATE and whose physical type is INT32. I am writing the file with pandas, using fastparquet as the engine, because I need to stream the data from the database and append to the same Parquet file. Here is my code:

import os
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text

sql = "SELECT TO_VARCHAR(CURRENT_DATE, 'YYYYMMDD') AS "REPORT_DATE" FROM DUMMY;"

def create_stream_enabled_connection(username, password, host, port):
    conn_str = f"mydb://{username}:{password}@{host}:{port}"
    engine = create_engine(conn_str, connect_args={'encrypt': 'True', 'sslValidateCertificate': 'False'})
    connection = engine.connect().execution_options(stream_results=True)
    return connection

connection = create_stream_enabled_connection(username, password, host, port)

ROWS_IN_CHUNK = 500000
# Stream chunks from database
for dataframe_chunk in pd.read_sql(text(sql), connection, chunksize=ROWS_IN_CHUNK):
    if not os.path.exists(local_path) or os.stat(local_path).st_size == 0:
        # First chunk: create the parquet file
        dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet')
    else:
        # Subsequent chunks: append to the existing parquet file
        dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet', append=True)

Problem:

I am unable to get the logical type to be DATE and the physical type to be INT32 in the Parquet file written by pandas' to_parquet function with fastparquet as the engine.

A few things that I have tried:

  • If I take the REPORT_DATE coming from the database as a string and write it to Parquet, the logical type is STRING. That doesn't work for me.
  • If I take the REPORT_DATE coming from the database as a string and convert it to datetime with the code below, the dtype is datetime64[ns], the logical type in Parquet is Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false), and the physical type is INT64 (a sketch for inspecting the written schema follows after this list). That doesn't work for me either.

dataframe_chunk['report_date'] = pd.to_datetime(dataframe_chunk['report_date'], format='%Y%m%d')

I need the Parquet logical type to be DATE and the physical type to be INT32 because this Parquet file will be loaded directly into BigQuery, where the REPORT_DATE column goes into a DATE column. See the BigQuery documentation here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions

  • If I try to store the column as STRING or as a datetime in Parquet and load it into BigQuery, BigQuery fails, saying that it expected the column to be of a different type.
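
For reference, a quick way to check which logical/physical types actually end up in the file (just a sketch; it assumes pyarrow is available purely for inspecting the footer, not for writing):

import pyarrow.parquet as pq

# Arrow-level view of the columns (e.g. report_date: timestamp[ns])
print(pq.read_schema(local_path))

# Parquet-level view showing the physical type (e.g. INT64) and the
# logical/converted type of each column
print(pq.ParquetFile(local_path).schema)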

Any help would be much appreciated.

Behroz Sikander

1 Answer


Answered in https://github.com/dask/fastparquet/issues/880#issuecomment-1697436417

The short story is: fastparquet cannot do this right now. I don't know how you can persuade BigQuery to accept the current style of output, but I would think there must be a way, since the 64-bit format is valid Parquet.
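
For example, one possible way (only a sketch, not tested; the bucket, project and table names below are made up, and it assumes the google-cloud-bigquery client and that BigQuery accepts the timestamp column as written) is to load the INT64 TIMESTAMP output into a staging table as-is and convert the column to DATE on the BigQuery side:

from google.cloud import bigquery

client = bigquery.Client()

# Load the fastparquet output unchanged; REPORT_DATE arrives as TIMESTAMP
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_uri(
    "gs://my-bucket/report.parquet",          # hypothetical GCS path
    "my_project.my_dataset.report_staging",   # hypothetical staging table
    job_config=job_config,
).result()

# Rewrite the final table, casting the TIMESTAMP column to DATE
client.query("""
    CREATE OR REPLACE TABLE `my_project.my_dataset.report` AS
    SELECT * REPLACE (DATE(REPORT_DATE) AS REPORT_DATE)
    FROM `my_project.my_dataset.report_staging`
""").result()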

mdurant