0

When reading a parquet file (using Scala) I read the timestamp field back as:

Int96Value{Binary{12 constant bytes, [0, 44, 84, 119, 54, 49, 0, 0, -62, -127, 37, 0]}}

How can I convert it to a date string?

Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
user3821387
  • 95
  • 2
  • 7
  • Please be mindful with the tags you put on your question. There is no point to you attracting [tag:java] experts to a question about [tag:scala]. – Joe C Dec 09 '18 at 07:39
  • What is the library you are using to read the parquet file? – Mahmoud Hanafy Dec 09 '18 at 08:18
  • 1
    @JoeC I tagged it as Java as I am fine with a Java solution as well. Though your point is very valid and I should have clarified this in the question. – user3821387 Dec 09 '18 at 16:18
  • @MahmoudHanafy I am using parquet-avro version 1.9.0 to read the file. – user3821387 Dec 09 '18 at 19:12
  • Similar question for Go: https://stackoverflow.com/questions/53103762/cast-int96-timestamp-from-parquet-to-golang/53104516 – Zoltan Dec 14 '18 at 14:17

2 Answers2

3

I did some research for you. The Int96 format is quite specific a seems to be deprecated.

Here is a discussion about converting Int96 to Date.

Based on this, I created following piece of code:

  def main(args: Array[String]): Unit = {
    import java.util.Date
    import org.apache.parquet.example.data.simple.{Int96Value, NanoTime}
    import org.apache.parquet.io.api.Binary

    val int96Value = new Int96Value(Binary.fromConstantByteArray(Array(0, 44, 84, 119, 54, 49, 0, 0, -62, -127, 37, 0)))
    val nanoTime = NanoTime.fromInt96(int96Value)
    val nanosecondsSinceUnixEpoch = (nanoTime.getJulianDay - 2440588) * (86400 * 1000 * 1000 * 1000) + nanoTime.getTimeOfDayNanos
    val date = new Date(nanosecondsSinceUnixEpoch / (1000 * 1000))
    println(date)
  }

However, it prints Sun Sep 27 17:05:55 CEST 2093. I am not sure, if this is a date, that you expected.

Edit: using Instance as suggested:

val nanosInSecond = 1000 * 1000 * 1000;
val instant = Instant.ofEpochSecond(nanosecondsSinceUnixEpoch / nanosInSecond, nanosecondsSinceUnixEpoch % nanosInSecond)
println(instant) // prints 2093-09-27T15:05:55.933865216Z
ygor
  • 1,726
  • 1
  • 11
  • 23
  • 1
    You should prefer to create a modern [`Instant`](https://docs.oracle.com/javase/10/docs/api/java/time/Instant.html) over am outdated `Date`. The latter class has design problems and [the modern date and time API](https://docs.oracle.com/javase/tutorial/datetime/) is so much nicer to work with. And has nanosecond precision, BTW. – Ole V.V. Dec 10 '18 at 09:42
2

java.time supports Julian days.

Credits to ygor for doing the research and finding out how to interpret the 12 bytes of your array.

    byte[] int96Bytes = { 0, 44, 84, 119, 54, 49, 0, 0, -62, -127, 37, 0 };

    // Find Julian day
    int julianDay = 0;
    int index = int96Bytes.length;
    while (index > 8) {
        index--;
        julianDay <<= 8;
        julianDay += int96Bytes[index] & 0xFF;
    }

    // Find nanos since midday (since Julian days start at midday)
    long nanos = 0;
    // Continue from the index we got to
    while (index > 0) {
        index--;
        nanos <<= 8;
        nanos += int96Bytes[index] & 0xFF;
    }

    LocalDateTime timestamp = LocalDate.MIN
            .with(JulianFields.JULIAN_DAY, julianDay)
            .atTime(LocalTime.NOON)
            .plusNanos(nanos);
    System.out.println("Timestamp: " + timestamp);

This prints:

Timestamp: 2017-10-24T03:01:50

I’m not happy about converting your byte array to an int and a long by hand, but I don’t know Parquet will enough to use the conversions that are probably available there. Use them if you can.

It doesn’t matter which LocalDate we use as starting point since we are changing it to the right Julian day anyway, so I picked LocalDate.MIN just to pick one.

The way I read the documentation, Julian days are always in the local time zone, that is, no time zone is understood, and they always start at midday (not midnight).

Link: Documentation of JulianFields in java.time

Ole V.V.
  • 81,772
  • 15
  • 137
  • 161
  • 1
    This is what I ended up using. https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTimestampUtils.java#L45-L62 – user3821387 Dec 14 '18 at 22:54