Spark's int96 time type

Question

When you create a timestamp column in Spark and save it to Parquet, you get a 12-byte integer column type (int96). I gather the data is split into 6 bytes for the Julian day and 6 bytes for nanoseconds within the day.

This does not conform to any Parquet logical type. The schema in the Parquet file therefore gives no indication that the column is anything but an integer.

My question is, how does Spark know to load such a column as a timestamp as opposed to a big integer?

Actually it's 8 + 4 bytes, not 6 + 6. There is a pull request to document this type, see https://github.com/apache/parquet-format/pull/49. — Zoltan, Mar 06 '17 at 17:32

zero323 · Accepted Answer · 2017-03-24T01:57:06.230

The semantics are determined from metadata that Spark writes into the Parquet file footer. We'll need some imports:

import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

Example data:

val path = "/tmp/ts"

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").parquet(path)

and Hadoop configuration:

val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)

Now we can access Spark metadata:

ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)
  .getParquetMetadata
  .getFileMetaData
  .getKeyValueMetaData
  .get("org.apache.spark.sql.parquet.row.metadata")

and the result is:

String = {"type":"struct","fields":[
  {"name":"id","type":"integer","nullable":false,"metadata":{}},
  {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}

Equivalent information can be stored in the Metastore as well.

According to the official documentation this is used to achieve compatibility with Hive and Impala:

Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

and can be controlled using the spark.sql.parquet.int96AsTimestamp property.

How can I write an INT96 timestamp type? (using `org.apache.parquet.hadoop.ParquetWriter`) -- I posted a SO question for this, but am lost and hope you can assist. https://stackoverflow.com/questions/54657496/how-to-write-timestamp-logical-type-int96-to-parquet-using-parquetwriter — James Wierzba, Feb 12 '19 at 19:40
@JamesWierzba TBH I've never given the exact procedure much thought. Though if you follow the [relevant code from Spark's Parquet writer](https://github.com/apache/spark/blob/d66a4e82eceb89a274edeb22c2fb4384bed5078b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L171-L178) you should get all the required details. — zero323, Feb 12 '19 at 22:46
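
To make the 8 + 4 byte layout from the comment on the question concrete, here is a minimal sketch of packing and unpacking a single INT96 value with only the JVM standard library. It assumes the commonly used layout: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian, with the instant treated as UTC. The object `Int96Sketch` and its `encode`/`decode` methods are hypothetical names for illustration, not part of Spark or Parquet, and this is not a substitute for the linked Spark writer code.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.time.Instant

// Hypothetical helper for illustration only; not a Spark or Parquet API.
object Int96Sketch {
  // Julian day number of the Unix epoch (1970-01-01).
  val JulianDayOfEpoch: Long = 2440588L
  val SecondsPerDay: Long = 86400L
  val NanosPerSecond: Long = 1000000000L

  // Pack an instant (assumed UTC) into 12 bytes:
  // 8 bytes nanoseconds-of-day, then 4 bytes Julian day, little-endian.
  def encode(instant: Instant): Array[Byte] = {
    val epochDay = Math.floorDiv(instant.getEpochSecond, SecondsPerDay)
    val secondOfDay = Math.floorMod(instant.getEpochSecond, SecondsPerDay)
    val nanosOfDay = secondOfDay * NanosPerSecond + instant.getNano
    ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
      .putLong(nanosOfDay)
      .putInt((epochDay + JulianDayOfEpoch).toInt)
      .array()
  }

  // Unpack 12 bytes written in the layout above back into an instant.
  def decode(bytes: Array[Byte]): Instant = {
    require(bytes.length == 12, "INT96 values are exactly 12 bytes")
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanosOfDay = buf.getLong()
    val julianDay = buf.getInt()
    val epochSeconds = (julianDay - JulianDayOfEpoch) * SecondsPerDay
    Instant.ofEpochSecond(epochSeconds, nanosOfDay)
  }
}
```

For example, `Int96Sketch.decode(Int96Sketch.encode(Instant.parse("2017-03-06T10:00:00Z")))` gives back the same instant. Whether a given writer normalizes to UTC before encoding varies between systems, so check that against the producer of your files.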
