TL;DR
I'd like to convert an int96 value such as ACIE4NxJAAAKhSUA into a readable timestamp such as 2020-03-02 14:34:22 (or whatever the value actually represents). I mostly use Python, so I'm looking to build a function that does this conversion. If there's also a function that can do the reverse -- even better.
Background
I'm using parquet-tools to convert a raw parquet file (with snappy compression) to raw JSON via this command:
C:\Research> java -jar parquet-tools-1.8.2.jar cat --json original-file.snappy.parquet > parquet-output.json
Inside the JSON output, the timestamps come through as values like this:
{... "_id":"101836","timestamp":"ACIE4NxJAAAKhSUA"}
I've determined that the timestamp value of "ACIE4NxJAAAKhSUA" is really int96 (this is also confirmed by the schema of the parquet file):
message spark_schema {
...(stuff)...
optional binary _id (UTF8);
optional int96 timestamp;
}
I believe this is also known as an Impala timestamp (at least that's what I've gathered).
Further Issue Research
I've been searching everywhere for a function or any information on how to "read" the int96 value (ideally in Python -- I'd like to stay in that language since I'm most familiar with it) and output the timestamp, but I've found nothing ready-made.
Here are a few related articles I've already looked into:
- ParquetWriter research in SO here
- Casting int96 via golang in SO here (NOTE: this has a function I could explore, but I'm not sure how to dig into it -- my rough Python attempt at the same idea is below)
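From what I've gathered so far, the int96 value is 12 raw bytes (the base64 string above does decode to 12 bytes): the first 8 appear to be nanoseconds within the day and the last 4 a Julian day number, both little-endian. Here's the rough decode/encode sketch I've started on under that assumption -- it's untested, so corrections are welcome:

import base64
import struct
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2440588      # Julian day number of 1970-01-01
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def int96_to_datetime(b64_value):
    """Decode a base64-encoded int96 timestamp into a UTC datetime."""
    raw = base64.b64decode(b64_value)                  # expect exactly 12 bytes
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    # Python datetimes only carry microsecond precision, so nanoseconds are truncated.
    return UNIX_EPOCH + timedelta(days=julian_day - JULIAN_UNIX_EPOCH,
                                  microseconds=nanos_of_day // 1000)

def datetime_to_int96(dt):
    """Reverse direction: encode a timezone-aware UTC datetime back to base64 int96."""
    delta = dt - UNIX_EPOCH
    julian_day = delta.days + JULIAN_UNIX_EPOCH
    nanos_of_day = delta.seconds * 10**9 + delta.microseconds * 1000
    return base64.b64encode(struct.pack("<qi", nanos_of_day, julian_day)).decode()

print(int96_to_datetime("ACIE4NxJAAAKhSUA"))

Does that match how int96/Impala timestamps are actually laid out, or am I off on the byte order?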
Regarding the deprecated int96 timestamp
Please don't ask me to stop using an old/deprecated timestamp format within a parquet file; I'm well aware of that from the research I've done so far. I'm a recipient of the file/data -- I can't change the format used at creation.
If there's another way to control the initial JSON output so it delivers a non-int96 value, I'd be interested in that as well.
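For instance, would reading the parquet file directly with pandas/pyarrow be a reasonable workaround? As far as I understand, pyarrow converts int96 columns into regular datetime values on read, so something like this (same file name as above) might sidestep parquet-tools entirely:

import pandas as pd

# Read the snappy-compressed parquet directly; pyarrow should decode the
# int96 timestamp column into a pandas datetime64 column.
df = pd.read_parquet("original-file.snappy.parquet", engine="pyarrow")

# Write JSON with ISO-8601 timestamps instead of the raw int96 bytes.
df.to_json("parquet-output.json", orient="records", date_format="iso")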
Thanks so much for your help, SO community!