TL;DR
I'd like to convert an int96 value such as ACIE4NxJAAAKhSUA into a readable timestamp such as 2020-03-02 14:34:22 (or whatever the value actually represents). I mostly use Python, so I'm looking to build a function that does this conversion. If there's also a function that can do the reverse -- even better.
Background
I'm using parquet-tools to convert a raw parquet file (with snappy compression) to raw JSON via this command:
C:\Research> java -jar parquet-tools-1.8.2.jar cat --json original-file.snappy.parquet > parquet-output.json
Inside the JSON output, the timestamps come through as values like this:
{... "_id":"101836","timestamp":"ACIE4NxJAAAKhSUA"}
I've determined that the timestamp value of "ACIE4NxJAAAKhSUA" is really int96 (this is also confirmed by the schema of the parquet file):
message spark_schema {
...(stuff)...
optional binary _id (UTF8);
optional int96 timestamp;
}
I believe this is also known as an Impala timestamp (at least that's what I've gathered).
Further Issue Research
I've been searching everywhere for a function or any information on how to "read" the int96 value (ideally in Python -- I'd like to stay in that language since I'm most familiar with it) and output the timestamp, but I've found nothing ready-made.
Here are a few related articles I've already looked into:
- ParquetWriter research in SO here
- Casting int96 via golang in SO here (NOTE: this has a function I could explore, but I'm not sure how to dig into it -- my rough Python attempt at the same idea is below)
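From what I've gathered so far, the int96 value is 12 raw bytes (the base64 string above does decode to 12 bytes): the first 8 appear to be nanoseconds within the day and the last 4 a Julian day number, both little-endian. Here's the rough decode/encode sketch I've started on under that assumption -- it's untested, so corrections are welcome:

import base64
import struct
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2440588      # Julian day number of 1970-01-01
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def int96_to_datetime(b64_value):
    """Decode a base64-encoded int96 timestamp into a UTC datetime."""
    raw = base64.b64decode(b64_value)                  # expect exactly 12 bytes
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    # Python datetimes only carry microsecond precision, so nanoseconds are truncated.
    return UNIX_EPOCH + timedelta(days=julian_day - JULIAN_UNIX_EPOCH,
                                  microseconds=nanos_of_day // 1000)

def datetime_to_int96(dt):
    """Reverse direction: encode a timezone-aware UTC datetime back to base64 int96."""
    delta = dt - UNIX_EPOCH
    julian_day = delta.days + JULIAN_UNIX_EPOCH
    nanos_of_day = delta.seconds * 10**9 + delta.microseconds * 1000
    return base64.b64encode(struct.pack("<qi", nanos_of_day, julian_day)).decode()

print(int96_to_datetime("ACIE4NxJAAAKhSUA"))

Does that match how int96/Impala timestamps are actually laid out, or am I off on the byte order?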
Regarding the deprecated int96 timestamp
Please don't ask me to stop using an old/deprecated timestamp format within a parquet file; I'm well aware of that from the research I've done so far. I'm a recipient of the file/data -- I can't change the format used at creation.
If there's another way to control the initial JSON output so it delivers a non-int96 value, I'd be interested in that as well.
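For instance, would reading the parquet file directly with pandas/pyarrow be a reasonable workaround? As far as I understand, pyarrow converts int96 columns into regular datetime values on read, so something like this (same file name as above) might sidestep parquet-tools entirely:

import pandas as pd

# Read the snappy-compressed parquet directly; pyarrow should decode the
# int96 timestamp column into a pandas datetime64 column.
df = pd.read_parquet("original-file.snappy.parquet", engine="pyarrow")

# Write JSON with ISO-8601 timestamps instead of the raw int96 bytes.
df.to_json("parquet-output.json", orient="records", date_format="iso")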
Thanks so much for your help, SO community!