
I am creating an external table on top of ORC files stored in HDFS. The data files have a process_timestamp column containing a timestamp.

After creating the table, I ran MSCK REPAIR TABLE <TABLE-NAME>. However, when querying the table, the timestamp column returns NULL values instead of the actual timestamps.

from datetime import date
from pyspark.sql.functions import lit, unix_timestamp
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("test_conn") \
        .getOrCreate()

df = spark.createDataFrame([('Alice', 1)])

# Format today's date as 'yyyy-MM-dd HH:mm:ss', e.g. '2019-09-09 00:00:00'
today = date.today()
timestamp = today.strftime("%Y-%m-%d %H:%M:%S")

# Parse the string and cast it to a proper timestamp column
df = df.withColumn('process_timestamp',
                   unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast('timestamp'))

In the ORC file, process_timestamp looks like this: '2019-09-09 00:00:00'.

The schema also looks correct:

|-- process_timestamp: timestamp (nullable = true)

But when querying, NULL values are returned: process_timestamp - NULL.

I tried setting the SerDe property as below, but it did not help.

ALTER TABLE <table_name> SET SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss"); 

Please help me fix this issue. How can I get the actual timestamp values back when querying the external table? Any help will be appreciated.

You can check the ORC files with the ORC Java tools: https://orc.apache.org/docs/java-tools.html That will help you identify whether the problem is in the data or in the metadata. You can also define the field as a string and see what comes back; the timestamp type is a bit more sensitive. – Istvan Sep 10 '19 at 21:24
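Reading the files back with Spark is another quick way to see whether the data itself is intact before blaming the table definition. A minimal sketch of that check, where the HDFS path is just a placeholder and not the actual table location:

# Hypothetical path; point it at the directory the ORC files were written to
check_df = spark.read.orc('/user/hive/warehouse/my_table')
check_df.printSchema()
check_df.select('process_timestamp').show(5, truncate=False)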

1 Answer


I fixed this issue by saving the files directly to the HDFS location as ORC and creating the external table on top of that data. The problem was in how the files were being saved to the HDFS location.

# Write as ORC, partitioned by process_date, so the directory layout matches the external table's partitions
df.coalesce(1).write.format('orc').mode('append').partitionBy('process_date').save(path)

In the statement above I added partitionBy(), ran MSCK REPAIR TABLE on the external table again, and it worked like a charm.
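For completeness, here is a rough sketch of a matching external table DDL and the repair step, run through spark.sql. It assumes the SparkSession was built with enableHiveSupport(); the table name, column list, and location are placeholders rather than the exact ones used here:

# Hypothetical table definition matching the partitioned ORC layout written above
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
        name STRING,
        id BIGINT,
        process_timestamp TIMESTAMP
    )
    PARTITIONED BY (process_date STRING)
    STORED AS ORC
    LOCATION '/user/hive/warehouse/external/my_table'
""")

# Register the partition directories created by partitionBy('process_date')
spark.sql("MSCK REPAIR TABLE my_table")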