I am creating an external table on top of ORC files stored in HDFS. The data files contain a process_timestamp column holding a timestamp.
After creating the table, I ran MSCK REPAIR TABLE <TABLE-NAME>. However, when I query the table, the timestamp column returns NULL values instead of the actual timestamps.
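For context, the table was created and repaired roughly like this, run through spark.sql with Hive support enabled. The table name my_table, the column list, the load_date partition column, and the HDFS path are all placeholders, not the real names:

# A minimal sketch of the DDL described above; every name here is a
# placeholder for the real table, columns, partition, and location.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("create_table") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE my_table (
        name STRING,
        id INT,
        process_timestamp TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///data/my_table'
""")

# Register the partition directories already present under the location
spark.sql("MSCK REPAIR TABLE my_table")

This is how the data is produced: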
from datetime import date
from pyspark.sql.functions import lit, unix_timestamp
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("test_conn") \
    .getOrCreate()

df = spark.createDataFrame([('Alice', 1)], ['name', 'id'])

# Build a timestamp string for today, e.g. '2019-09-09 00:00:00'
today = date.today()
timestamp = today.strftime("%Y-%m-%d %H:%M:%S")

# Parse the string and cast it to a proper timestamp column
df = df.withColumn('process_timestamp',
                   unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast('timestamp'))
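The DataFrame is then written out as ORC under the table's location. The path and partition value below are placeholders for the real ones:

# Write the ORC files under a partition directory of the external table;
# path and partition value are placeholders.
df.write.mode('append').orc('hdfs:///data/my_table/load_date=2019-09-09')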
In the ORC file, the written process_timestamp value looks like this: '2019-09-09 00:00:00'.
The schema looks good too:
--process_timestamp: timestamp (nullable = true)
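The value and schema above come from reading the files back directly, e.g.:

# Read the ORC files directly (bypassing the Hive table) to inspect them;
# the path is a placeholder.
check = spark.read.orc('hdfs:///data/my_table')
check.printSchema()  # process_timestamp: timestamp (nullable = true)
check.select('process_timestamp').show(1, truncate=False)  # 2019-09-09 00:00:00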
But when I query the table, NULL values are returned: process_timestamp - NULL.
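For example (the table name is a placeholder):

# Querying the same column through the Hive table returns NULL
spark.sql("SELECT process_timestamp FROM my_table LIMIT 5").show()
# every row comes back as NULL instead of the written timestamp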
I tried setting the SerDe property as below, but it made no difference:
ALTER TABLE <table_name> SET SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss");
Please help me fix this issue. How can I get the actual timestamp values back when querying the external table? Any help is appreciated.