I am creating parquet files in AWS Lambda using fastparquet (a smaller library than pyarrow, and easier to work with in Lambdas). My parquet has int32, string and timestamp columns. I am getting strange errors: the date and integer fields are driving me mad, while the text columns work fine.
I tried the pyarrow engine, but the timestamp does not work when I read the parquet using Hive. You can see error #1 circled on the right-hand side.
I tried the fastparquet engine to create the parquet with times='int96', which I found here. With the times parameter the timestamp worked, but the int-type field does not. You can see error #2 on the right-hand side. I double-checked, triple-checked that i_quality is present in the dataframe. I extracted the dataframe using the parq tool and verified the column names; they match the Hive schema.
When I use the fastparquet engine WITHOUT times='int96', then neither the date_time nor the integer field works. The errors are circled on the right-hand side: I get error #2 when I query without the date_time column in the select statement, and error #3 when I query with the date_time column but without the int column in the select statement.
I am clueless. Any help is appreciated.
My Hive table schema is given below:
CREATE EXTERNAL TABLE db_name.table_name (
  tag string,
  date_time timestamp,
  val string,
  quality int,
  i_quality int,
  i_source string,
  i_time timestamp)
PARTITIONED BY (
  cust string,
  ast string,
  date_stamp string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://MY_S3_BUCKET/'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transient_lastDdlTime'='1591269825')
Edit 1:
I tried this simple code:

import pandas as pd

data = {
    "ast": ["John Doe", "Jane Doe"],
    "quality": [30, 25],
    "tag": ["New York", "Los Angeles"],
    # i_time must be a datetime column for times='int96' to apply
    "i_time": pd.to_datetime(["2023-06-30 00:00:34", "2023-06-30 00:00:34"]),
}
df = pd.DataFrame(data)
df.to_parquet('/tmp/upload_me.parquet', engine='fastparquet',
              compression='snappy', times='int96', has_nulls=True)
The integer column in the parquet file created by the fastparquet engine is also not loaded by the above Hive table. Very strange.
Edit 2: Our use case was the same as this post. The table structure was also the same: a mix of string, int and timestamp columns, which matched ours. Unfortunately, the fastparquet solution mentioned there didn't work for us. If it had, our Lambda package would have been far smaller than it is now, because pyarrow is very large.
After a day of research I found that the fastparquet-generated int column was not being recognized by Hive, which was strange. However, the pyarrow-generated int and timestamp columns were recognized by Hive.
THIS WORKED: recognized both the timestamp and int32 columns
df.to_parquet(<FILE_NAME>, engine='pyarrow', compression='snappy', use_deprecated_int96_timestamps=True)
BUT THIS DIDN'T WORK: recognized the timestamp but not the int32 column
df.to_parquet(<FILE_NAME>, engine='fastparquet', compression='snappy', times='int96')