
I am trying to write a Parquet file that contains one date column whose Parquet logical type is DATE and whose physical type is INT32. I am writing the file with pandas, using fastparquet as the engine, because I need to stream the data from the database and append to the same Parquet file. Here is my code:

import os
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text

sql = "SELECT TO_VARCHAR(CURRENT_DATE, 'YYYYMMDD') AS "REPORT_DATE" FROM DUMMY;"

def create_stream_enabled_connection(username, password, host, port):
    conn_str = f"mydb://{username}:{password}@{host}:{port}"
    engine = create_engine(conn_str, connect_args={'encrypt': 'True', 'sslValidateCertificate': 'False'})
    connection = engine.connect().execution_options(stream_results=True)
    return connection

connection = create_stream_enabled_connection(username, password, host, port)

ROWS_IN_CHUNK = 500000
# Stream chunks from database
for dataframe_chunk in pd.read_sql(text(sql), connection, chunksize=ROWS_IN_CHUNK):
    if not os.path.exists(local_path) or os.stat(local_path).st_size == 0:
        # First chunk: create the parquet file
        dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet')
    else:
        # Subsequent chunks: append to the existing parquet file
        dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet', append=True)

Problem:

I am unable to get the logical type to be DATE and the physical type to be INT32 in the Parquet file written by pandas' to_parquet function with fastparquet as the engine.

A few things that I have tried:

  • If I take the REPORT_DATE coming from the database as a string and write it to Parquet, the logical type is STRING. That doesn't work for me.
  • If I take the REPORT_DATE coming from the database as a string and convert it to datetime with the code below, the dtype is datetime64[ns], the logical type in Parquet is Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false), and the physical type is INT64 (a sketch for inspecting the written schema follows after this list). That doesn't work for me either.

dataframe_chunk['report_date'] = pd.to_datetime(dataframe_chunk['report_date'], format='%Y%m%d')

I need the Parquet logical type to be DATE and the physical type to be INT32 because this Parquet file will be loaded directly into BigQuery, where the REPORT_DATE column goes into a DATE column. See the BigQuery documentation here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions

  • If I try to store the column as STRING or as a datetime in Parquet and load it into BigQuery, BigQuery fails, saying that it expected the column to be of a different type.
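
For reference, a quick way to check which logical/physical types actually end up in the file (just a sketch; it assumes pyarrow is available purely for inspecting the footer, not for writing):

import pyarrow.parquet as pq

# Arrow-level view of the columns (e.g. report_date: timestamp[ns])
print(pq.read_schema(local_path))

# Parquet-level view showing the physical type (e.g. INT64) and the
# logical/converted type of each column
print(pq.ParquetFile(local_path).schema)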

Any help would be much appreciated.

Behroz Sikander

1 Answer


Answered in https://github.com/dask/fastparquet/issues/880#issuecomment-1697436417

The short story is: fastparquet cannot do this right now. I don't know how you can persuade BigQuery to accept the current style of output, but I would think there must be a way, since the 64-bit format is valid Parquet.
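
For example, one possible way (only a sketch, not tested; the bucket, project and table names below are made up, and it assumes the google-cloud-bigquery client and that BigQuery accepts the timestamp column as written) is to load the INT64 TIMESTAMP output into a staging table as-is and convert the column to DATE on the BigQuery side:

from google.cloud import bigquery

client = bigquery.Client()

# Load the fastparquet output unchanged; REPORT_DATE arrives as TIMESTAMP
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_uri(
    "gs://my-bucket/report.parquet",          # hypothetical GCS path
    "my_project.my_dataset.report_staging",   # hypothetical staging table
    job_config=job_config,
).result()

# Rewrite the final table, casting the TIMESTAMP column to DATE
client.query("""
    CREATE OR REPLACE TABLE `my_project.my_dataset.report` AS
    SELECT * REPLACE (DATE(REPORT_DATE) AS REPORT_DATE)
    FROM `my_project.my_dataset.report_staging`
""").result()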

mdurant