
I have a tool that uses an org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to parquet data files.

Currently, it only handles int32, double, and string columns.

I need to support the parquet timestamp logical type (annotated as int96), and I am lost on how to do that because I can't find a precise specification online.

It appears this timestamp encoding (int96) is rare and not well supported. I've found very little specification detail online. This GitHub README states that:

Timestamps saved as an int96 are made up of the nanoseconds in the day (first 8 byte) and the Julian day (last 4 bytes).

Specifically:

  1. Which parquet Type do I use for the column in the MessageType schema? I assume I should use the primitive type PrimitiveTypeName.INT96, but I'm not sure whether there is a way to specify a logical type on top of it.
  2. How do I write the data? That is, in what format do I write the timestamp to the group? For an INT96 timestamp, I assume I must write some binary type?

Here is a simplified version of my code that demonstrates what I am trying to do. Take a look at the "TODO" comments; these are the two points in the code that correspond to the questions above.

List<Type> fields = new ArrayList<>();
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.INT32, "int32_col", null));
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.DOUBLE, "double_col", null));
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "string_col", OriginalType.UTF8));

// TODO: 
//   Specify the TIMESTAMP type. 
//   How? INT96 primitive type? Is there a logical timestamp type I can use w/ MessageType schema?
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.INT96, "timestamp_col", null)); 

MessageType schema = new MessageType("input", fields);

// initialize writer
Configuration configuration = new Configuration();
configuration.setQuietMode(true);
GroupWriteSupport.setSchema(schema, configuration);
ParquetWriter<Group> writer = new ParquetWriter<Group>(
  new Path("output.parquet"),
  new GroupWriteSupport(),
  CompressionCodecName.SNAPPY,
  ParquetWriter.DEFAULT_BLOCK_SIZE,
  ParquetWriter.DEFAULT_PAGE_SIZE,
  1048576, // dictionary page size
  true,
  false,
  ParquetProperties.WriterVersion.PARQUET_1_0,
  configuration
);

// write CSV data
CSVParser parser = CSVParser.parse(new File(csv), StandardCharsets.UTF_8, CSVFormat.TDF.withQuote(null));
SimpleGroupFactory groupFactory = new SimpleGroupFactory(schema);
int rowNum = 0;
for (CSVRecord csvRecord : parser) {
  rowNum++;
  Group group = groupFactory.newGroup();
  int colIndex = 0;
  for (String record : csvRecord) {
    if (record == null || record.isEmpty() || record.equals("NULL")) {
      colIndex++;
      continue;
    }
    record = record.trim();

    switch (colIndex) {
      case 0: // int32
        group.add(colIndex, Integer.parseInt(record));
        break;
      case 1: // double
        group.add(colIndex, Double.parseDouble(record));
        break;
      case 2: // string
        group.add(colIndex, record);
        break;
      case 3: // timestamp
        // TODO: convert CSV string value to TIMESTAMP type (how?)
        throw new NotImplementedException();
    }
    colIndex++;
  }
  writer.write(group);
}
writer.close();
James Wierzba

  • FYI, it looks like `INT96` support is deprecated in Parquet from what I read in [this issue ticket](https://issues.apache.org/jira/browse/PARQUET-323). – Basil Bourque Feb 12 '19 at 19:55
  • @BasilBourque Yeah, I saw that. Unfortunately the consumer of the parquet files is enforcing this 96 bit timestamp encoding, so I need to figure out how to write this type. – James Wierzba Feb 12 '19 at 20:00
  • I do not know anything about Parquet or Hadoop, so I cannot post an Answer. But some tips that might help: Java primitives are limited to 64 bits for numbers, so use the `BigInteger` class to manage a 96-bit number. The `Instant` class, and other *java.time* classes, have nanosecond resolution. But they work internally by tracking a pair of numbers: a number of whole seconds since epoch 1970-01-01T00:00:00Z plus a number of nanoseconds for the fractional second. So you will have to do a bit of math to feed your total elapsed nanos into a pair of numbers. See `Instant.ofEpochSecond` & `.plusNanos`. – Basil Bourque Feb 12 '19 at 20:00

3 Answers

  1. INT96 timestamps use the INT96 physical type without any logical type, so don't annotate them with anything.
  2. If you are interested in the structure of an INT96 timestamp, take a look here. If you would like to see sample code that converts to and from this format, take a look at this file from Hive.
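To make both points concrete, here is a minimal sketch of building the 12-byte value using only the JDK (the class and helper names are mine; the parquet calls are shown as comments against the question's schema):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDateTime;
import java.time.temporal.JulianFields;

public class Int96Sketch {

    // Pack a LocalDateTime into the 12-byte INT96 layout:
    // bytes 0-7 = nanoseconds since midnight, bytes 8-11 = Julian day,
    // both little-endian.
    static byte[] toInt96(LocalDateTime ts) {
        long nanosOfDay = ts.toLocalTime().toNanoOfDay();
        int julianDay = (int) JulianFields.JULIAN_DAY.getFrom(ts.toLocalDate());
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    public static void main(String[] args) {
        byte[] int96 = toInt96(LocalDateTime.of(2019, 2, 13, 13, 35, 5));
        System.out.println(int96.length); // 12

        // Point 1: declare the column as a bare INT96 primitive, no logical type:
        //   fields.add(new PrimitiveType(Type.Repetition.OPTIONAL,
        //       PrimitiveTypeName.INT96, "timestamp_col"));
        // Point 2: write the 12 bytes as a Binary:
        //   group.add("timestamp_col", Binary.fromConstantByteArray(int96));
    }
}
```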
Zoltan
  • This is great, I will test this code and see if it works – James Wierzba Feb 13 '19 at 16:05
  • I also found this: https://github.com/apache/spark/blob/d66a4e82eceb89a274edeb22c2fb4384bed5078b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L171-L178 -- scala code for encoding int96 timestamp – James Wierzba Feb 13 '19 at 16:07
  • No one should not be writing timestamps using INT96. There are documented and supported timestamp types. Use those instead. – blue Feb 13 '19 at 16:59
  • @blue The OP is aware of that. He wrote: "Unfortunately the consumer of the parquet files is enforcing this 96 bit timestamp encoding, so I need to figure out how to write this type." – Zoltan Feb 13 '19 at 17:20
  • I ran into another similar issue, but for writing `null` values. Any chance you could take a look at my new question? https://stackoverflow.com/questions/55247724/how-can-i-write-null-value-to-parquet-using-org-apache-parquet-hadoop-parquetwri – James Wierzba Mar 19 '19 at 19:07
  • I had to convert from a base64 encoded version of this INT96 timestamp to an ISO8601 formatted string. I ended up writing this with the help of this answer. Thanks @Zoltan - https://gist.github.com/marklap/133f1dbd51113de460475321b467aa70 – marklap Aug 09 '21 at 21:19

I figured it out, using this code from Spark SQL as a reference.

The INT96 binary encoding is split into two parts: the first 8 bytes are the nanoseconds within the day, and the last 4 bytes are the Julian day.

String value = "2019-02-13 13:35:05";

// Parse the CSV timestamp (local time, no time-zone math needed)
DateTimeFormatter format = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
LocalDateTime dt = LocalDateTime.parse(value, format);

// Calculate the Julian day and the nanoseconds within the day
int julianDays = (int) JulianFields.JULIAN_DAY.getFrom(dt.toLocalDate());
long nanos = dt.toLocalTime().toNanoOfDay();

// Write the INT96 timestamp: 8 bytes of nanos-of-day, then 4 bytes of
// Julian day, both little-endian
byte[] timestampBuffer = new byte[12];
ByteBuffer buf = ByteBuffer.wrap(timestampBuffer);
buf.order(ByteOrder.LITTLE_ENDIAN).putLong(nanos).putInt(julianDays);

// This is the properly encoded INT96 timestamp
Binary tsValue = Binary.fromConstantByteArray(timestampBuffer);
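To sanity-check the encoding, the 12 bytes can be decoded back with plain JDK classes (the decoding helper below is my own, not part of any parquet API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.temporal.JulianFields;

public class Int96Decode {

    // Reverse the encoding: nanos-of-day from the first 8 bytes,
    // Julian day from the last 4, both little-endian.
    static LocalDateTime fromInt96(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        int julianDay = buf.getInt();
        LocalDate date = LocalDate.EPOCH.with(JulianFields.JULIAN_DAY, julianDay);
        return LocalDateTime.of(date, LocalTime.ofNanoOfDay(nanosOfDay));
    }

    public static void main(String[] args) {
        // Encode 2019-02-13 13:35:05 by hand, then round-trip it
        byte[] bytes = ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(48_905_000_000_000L) // 13:35:05 as nanos of day
                .putInt(2_458_528)            // Julian day of 2019-02-13
                .array();
        System.out.println(fromInt96(bytes)); // 2019-02-13T13:35:05
    }
}
```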

James Wierzba

For those using AvroParquetWriter who want to write the INT96 physical type, you can use

final Configuration conf = new Configuration();
conf.setStrings(WRITE_FIXED_AS_INT96, "field_name");

and pass this configuration when building the AvroParquetWriter. Your Avro schema has to declare field_name as a fixed type of size 12, similar to:

"type":[
        "null",
        {
               "type":"fixed",
               "name":"INT96",
               "doc":"INT96 represented as byte[12]",
               "size":12
        }
]

Full example:

final String avroSchemaString = "{\n" +
        "   \"type\":\"record\",\n" +
        "   \"name\":\"userInfo\",\n" +
        "   \"namespace\":\"my.example\",\n" +
        "   \"fields\":[\n" +
        "      {\n" +
        "         \"name\":\"date_of_birth\",\n" +
        "         \"type\":[\n" +
        "            \"null\",\n" +
        "            {\n" +
        "               \"type\":\"fixed\",\n" +
        "               \"name\":\"INT96\",\n" +
        "               \"doc\":\"INT96 represented as byte[12]\",\n" +
        "               \"size\":12\n" +
        "            }\n" +
        "         ]\n" +
        "      }\n" +
        "   ]\n" +
        "}";
System.out.println("AvroSchema: " + avroSchemaString);

final Schema avroSchema = new Schema.Parser().parse(avroSchemaString);
System.out.println("Parsed AvroSchema: " + avroSchema);

final Path outputPath = new Path("/tmp/temp.parquet");
final Configuration conf = new Configuration();
// Comment this line and it will write as FIXED_LEN_BYTE_ARRAY of size 12
conf.setStrings(WRITE_FIXED_AS_INT96, "date_of_birth");

final ParquetWriter<GenericData.Record> parquetWriter =
        AvroParquetWriter.<GenericData.Record>builder(outputPath)
        .withSchema(avroSchema)
        .withConf(conf)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
        .build();
final GenericData.Record record = new GenericData.Record(avroSchema);

// Convert LocalDate to NanoTime or LocalDateTime to NanoTime
final LocalDate dateToday = LocalDate.now();
final NanoTime nanoTime = new NanoTime((int)JulianFields.JULIAN_DAY.getFrom(dateToday), 0L);
byte[] timestampBuffer = nanoTime.toBinary().getBytes();

// Should be 12
System.out.println(timestampBuffer.length);

GenericData.Fixed fixed = new GenericData.Fixed(avroSchema.getFields().get(0).schema(), timestampBuffer);
record.put("date_of_birth", fixed);
parquetWriter.write(record);

// Close the writer to flush records
parquetWriter.close();
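The LocalDateTime case mentioned in the comment above works the same way: the (julianDay, nanosOfDay) pair that NanoTime packs into the byte[12] can be computed with the JDK alone (the helper name here is mine):

```java
import java.time.LocalDateTime;
import java.time.temporal.JulianFields;

public class NanoTimeParts {

    // Split a LocalDateTime into the two numbers a NanoTime is built from:
    // the Julian day of the date part, and the nanoseconds elapsed in that day.
    static long[] toJulianParts(LocalDateTime ts) {
        long julianDay = JulianFields.JULIAN_DAY.getFrom(ts.toLocalDate());
        long nanosOfDay = ts.toLocalTime().toNanoOfDay();
        return new long[] { julianDay, nanosOfDay };
    }

    public static void main(String[] args) {
        long[] parts = toJulianParts(LocalDateTime.of(2019, 2, 13, 13, 35, 5));
        System.out.println(parts[0]); // 2458528
        System.out.println(parts[1]); // 48905000000000
        // Then: new NanoTime((int) parts[0], parts[1]).toBinary() is the fixed value.
    }
}
```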

This only works with version 1.12.3 of parquet-avro. The Maven GAV:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.12.3</version>
</dependency>
Shubham Dhingra