Semantics is determined based on the metadata. We'll need some imports:
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
example data:
val path = "/tmp/ts"
Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
.withColumn("ts", $"ts".cast("timestamp"))
.write.mode("overwrite").parquet(path)
and Hadoop configuration:
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
Now we can access Spark metadata:
ParquetFileReader
.readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
.get(0)
.getParquetMetadata
.getFileMetaData
.getKeyValueMetaData
.get("org.apache.spark.sql.parquet.row.metadata")
and the result is:
String = {"type":"struct","fields: [
{"name":"id","type":"integer","nullable":false,"metadata":{}},
{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
Equivalent information can be stored in the Metastore as well.
According to the official documentation this is used to achieve compatibility with Hive and Impala:
Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
and can be controlled using spark.sql.parquet.int96AsTimestamp
property.