
I need to serialize a uint64 into an Avro field.

However, the docs only list signed integers:

The set of primitive type names is:

null: no value
boolean: a binary value
int: 32-bit signed integer
long: 64-bit signed integer
float: single precision (32-bit) IEEE 754 floating-point number
double: double precision (64-bit) IEEE 754 floating-point number
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence

What is the "canonical" way to serialize a uint64 in Avro? As bytes?

{
  "name": "payload",
  "type": "record",
  "fields": [
    {
      "name": "my_uint64",
      "type": "bytes"
    }
  ]
}

Edit:

Or should the data be encoded as a long and then cast back on the consumer side?

{
  "name": "payload",
  "type": "record",
  "fields": [
    {
      "name": "my_uint64",
      "type": "long"
    }
  ]
}
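To make the "encode as long, cast on the consumer side" idea concrete, here is a minimal Python sketch. The helper names `uint64_to_long` and `long_to_uint64` are made up for illustration and are not part of any Avro library; the sketch only reinterprets the same 64 bits as signed or unsigned, and assumes producer and consumer both know the field is really a uint64.

```python
import struct

UINT64_MAX = 2**64 - 1


def uint64_to_long(value: int) -> int:
    """Producer side: reinterpret a uint64 as the signed 64-bit value with the same bits."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("value does not fit in a uint64")
    # Pack as unsigned 64-bit, unpack the same 8 bytes as signed 64-bit.
    return struct.unpack("<q", struct.pack("<Q", value))[0]


def long_to_uint64(value: int) -> int:
    """Consumer side: recover the original uint64 from the signed long."""
    if not -2**63 <= value < 2**63:
        raise ValueError("value does not fit in a signed 64-bit long")
    # Masking with 2**64 - 1 maps negative values back to the upper uint64 range.
    return value & UINT64_MAX


# Values above 2**63 - 1 travel as negative longs and are restored on read.
wire_value = uint64_to_long(UINT64_MAX)
restored = long_to_uint64(wire_value)
```

Values up to 2**63 - 1 pass through unchanged; only larger values show up as negative numbers for a consumer that does not apply the cast.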

My problem with both approaches is that the receiver has to know that some bytes/longs are in reality uint64. Where do I store this information so that the consumer can rely on the schema?

My tendency is toward using bytes with a magic byte in front that indicates a uint64 within.

Has anyone had similar issues and come to a conclusion?

code-gorilla

1 Answer

The Avro mailing list recommends using the fixed type (https://avro.apache.org/docs/1.10.2/spec.html#Fixed) to support unsigned integers.

{
  "name": "payload",
  "type": "record",
  "fields": [
    {
      "name": "my_uint64",
      "type": {
        "name": "myFixed",
        "type": "fixed",
        "size": 8
      }
    }
  ]
}

See https://www.mail-archive.com/user@avro.apache.org/msg01731.html.
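As a rough sketch of how the 8 bytes of such a fixed field could be produced and read back in Python: Avro's fixed type only defines the size, so the byte order is something producer and consumer have to agree on. Big-endian is an assumption made here, and `encode_uint64` / `decode_uint64` are hypothetical helper names, not Avro API calls.

```python
FIXED_SIZE = 8  # must match "size" in the fixed schema


def encode_uint64(value: int) -> bytes:
    """Turn a uint64 into the 8 raw bytes stored in the fixed field.

    int.to_bytes raises OverflowError for negative values or values
    that need more than 8 bytes.
    """
    return value.to_bytes(FIXED_SIZE, byteorder="big", signed=False)


def decode_uint64(raw: bytes) -> int:
    """Turn the 8 raw bytes of the fixed field back into a uint64."""
    if len(raw) != FIXED_SIZE:
        raise ValueError(f"expected {FIXED_SIZE} bytes, got {len(raw)}")
    return int.from_bytes(raw, byteorder="big", signed=False)


# Round trip for the largest uint64.
raw = encode_uint64(2**64 - 1)
value = decode_uint64(raw)
```

The resulting bytes object is what would be placed in the `my_uint64` field of the record.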

Xartrick
  • Thanks for providing the reference (mailing list). I agree that this is the way to do it. Because the size is fixed, there is no size overhead when serializing the data (unlike bytes, which also serializes the number of bytes). The only downside I still see is that there is no flag for signedness. If, for example, both uint128 and int128 were defined, the consumer would need to know which kind of fixed integer is transmitted, e.g. by using the field name, a complement within the bytes or some external information source. – code-gorilla Jan 26 '23 at 11:47
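
One possible way to carry the signedness in the schema itself: the Avro spec permits attributes it does not define as metadata, as long as they do not affect the serialized format. The sketch below assumes a custom attribute named `signed` on the fixed type. That attribute name, the big-endian byte order, and the helper `decode_fixed_int` are all made up for illustration; whether and how a given Avro library exposes custom attributes is not covered here, so the schema is read with plain `json`.

```python
import json

# Schema from the answer, plus a hypothetical custom "signed" attribute.
SCHEMA_JSON = """
{
  "name": "payload",
  "type": "record",
  "fields": [
    {
      "name": "my_uint64",
      "type": {
        "name": "myFixed",
        "type": "fixed",
        "size": 8,
        "signed": false
      }
    }
  ]
}
"""


def decode_fixed_int(fixed_schema: dict, raw: bytes) -> int:
    """Decode a fixed-size integer, using the hypothetical "signed" attribute."""
    if len(raw) != fixed_schema["size"]:
        raise ValueError("payload length does not match the fixed size")
    # Treat the field as unsigned when the attribute is absent (an assumption).
    signed = fixed_schema.get("signed", False)
    return int.from_bytes(raw, byteorder="big", signed=signed)


schema = json.loads(SCHEMA_JSON)
fixed_schema = schema["fields"][0]["type"]
value = decode_fixed_int(fixed_schema, b"\xff" * 8)
```

With `"signed": true` the same eight `0xff` bytes would decode as -1 instead, so the consumer can tell the two apart from the schema alone.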