
I have a Kafka cluster running and I want to store L2-orderbook snapshots into a topic. The snapshots contain dictionaries of {key: value} pairs whose keys are of type float, as in the following example:

{
    'exchange': 'ex1',
    'symbol': 'sym1',
    'book': {
        'bid': {
            100.0: 20.0,
            101.0: 21.3,
            102.0: 34.6,
            ...,
        },
        'ask': {
            100.0: 20.0,
            101.0: 21.3,
            102.0: 34.6,
            ...,
        }
    },
    'timestamp': 1642524222.1160505
}

My schema proposal below is not working and I'm pretty sure it is because the keys in the 'bid' and 'ask' dictionaries are not of type string.

{
    "namespace": "confluent.io.examples.serialization.avro",
    "name": "L2_Book",
    "type": "record",
    "fields": [
        {"name": "exchange", "type": "string"},
        {"name": "symbol", "type": "string"},
        {"name": "book", "type": "record", "fields": {
            "name": "bid", "type": "record", "fields": {
                {"name": "price", "type": "float"},
                {"name": "volume", "type": "float"}
            },
            "name": "ask", "type": "record", "fields": {
                {"name": "price", "type": "float"},
                {"name": "volume", "type": "float"}
            }
        },
        {"name": "timestamp", "type": "float"}
    ]
}

The serializer fails with:

KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="no value and no default for bids"}

What would be a proper avro-schema here?

OneCricketeer
CarloP

2 Answers


First, you have a typo: fields needs to be an array in the schema definition.

However, your bid (and ask) objects are not records. They are a map<float, float>. In other words, they do not have literal price and volume keys.

Avro has Map types, but the keys are "assumed to be strings".

You are welcome to try

{"name": "bid", "type": "map", "values": "float"}

Otherwise, you need to reformat your data payloads, for example as a list of objects

'bid': [
     {'price': 100.0, 'volume': 20.0},
     ...,
],

Along with

{"name": "bid", "type": "array", "items": {
  "type": "record",
  "name": "BidItem",
  "fields": [
    {"name": "price", "type": "float"},
    {"name": "volume", "type": "float"}
  ]
}}
OneCricketeer
  • I have now tested both proposals (using type "map" and using type "record" but with reformatted payload). In both cases I get the following message: `fastavro._schema_common.UnknownType: confluent.io.examples.serialization.avro.record` – CarloP Jan 19 '22 at 08:38
  • I don't see how you'd get that exception with a map. I'm not familiar with the Python library, but that exception makes it sound like the `name` (rather than the `type`) of one of your records is literally `record` and it's not using a different `namespace`. Did you also change the fields on the book record to be an array? – OneCricketeer Jan 19 '22 at 13:49
  • This is the schema that I tried: ` { "namespace": "confluent.io.examples.serialization.avro", "name": "L2_Book", "type": "record", "fields": [ {"name": "exchange", "type": "string"}, {"name": "symbol", "type": "string"}, {"name": "book", "type": "record", "fields": [ {"name": "bid", "type": "map", "values": "float"}, {"name": "ask", "type": "map", "values": "float"} ]}, {"name": "timestamp", "type": "float"}, {"name": "receipt_timestamp", "type": "float"} ] } ` – CarloP Jan 19 '22 at 14:45
  • Btw I'm using the `AvroSerializer` from `confluent_kafka.schema_registry.avro` – CarloP Jan 19 '22 at 14:56
  • Sure. Like I said, I'm not familiar with the fastavro library. You can install that module directly in Python and try using its parse methods to figure out the error since it doesn't seem to be related to Kafka https://fastavro.readthedocs.io/en/latest/schema.html#fastavro._schema_py.parse_schema You can report issues for it here https://github.com/fastavro/fastavro/issues – OneCricketeer Jan 19 '22 at 15:03
  • Thanks very much OneCricketeer, glad to have your support on this issue! – CarloP Jan 19 '22 at 15:18
  • I think you can bypass the Kafka serializer, too. Pre-serialize the data following the examples down in this section https://github.com/confluentinc/confluent-kafka-python#usage – OneCricketeer Jan 19 '22 at 15:38
  • I guess I'm making progress. At least I could get rid of the serialization error by considering a proper implementation of the complex types. Unfortunately, now I'm getting the following error: `KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="must be string on field bid on field book"}` – CarloP Jan 19 '22 at 20:01
  • This is my current schema for the book: `{"name": "book", "type": { "name": "bids_asks", "type": "record", "fields": [ {"name": "bid", "type": { "type": "map", "values": "float" } }, {"name": "ask", "type": { "type": "map", "values": "float" } } ]} }` – CarloP Jan 19 '22 at 20:02
  • "must be string" sounds like an error on your map keys not being strings. – OneCricketeer Jan 19 '22 at 20:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241227/discussion-between-carlop-and-onecricketeer). – CarloP Jan 19 '22 at 20:15

I have finally figured out two working solutions. In both cases I need to convert the original data.

The main lessons for me have been:

  1. Avro maps need keys of type string
  2. Avro complex types (e.g. maps and records) need to be nested properly as an object under the field's "type" key:
{"name": "bid", "type": {
      "type": "array", "items": {
          ...
Special thanks to OneCricketeer for pointing me in the right direction! :-)

1) bids and asks as a map with the key being of type string

data example

{
    'exchange': 'ex1',
    'symbol': 'sym1',
    'book': {
        'bid': {
            "100.0": 20.0,
            "101.0": 21.3,
            "102.0": 34.6,
            ...,
        },
        'ask': {
            "100.0": 20.0,
            "101.0": 21.3,
            "102.0": 34.6,
            ...,
        }
    },
    'timestamp': 1642524222.1160505
}

schema

{
    "namespace": "confluent.io.examples.serialization.avro",
    "name": "L2_Book",
    "type": "record",
    "fields": [
        {"name": "exchange", "type": "string"},
        {"name": "symbol", "type": "string"},
        {"name": "book", "type": {
            "name": "book",
            "type": "record",
            "fields": [
                {"name": "bid", "type": {
                    "type": "map", "values": "float"
                    }
                }, 
                {"name": "ask", "type": {
                    "type": "map", "values": "float"
                    }
                }
            ]}
        },
        {"name": "timestamp", "type": "float"}
    ]
}
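Converting the float keys to strings for this variant is a small dict comprehension; a minimal sketch (the helper name is illustrative):

```python
# Sketch: Avro map keys must be strings, so the float price levels
# are stringified before serialization.
def stringify_keys(side):
    """Convert {100.0: 20.0, ...} into {"100.0": 20.0, ...}."""
    return {str(price): volume for price, volume in side.items()}

bid = {100.0: 20.0, 101.0: 21.3}
print(stringify_keys(bid))
# -> {'100.0': 20.0, '101.0': 21.3}
```

A consumer then has to convert the keys back with float(key); note that string keys also lose numeric ordering.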

2) bids and asks as an array of records

data example

{
    'exchange': 'ex1',
    'symbol': 'sym1',
    'book': {
        'bid': [
            {"price": 100.0, "volume": 20.0},
            {"price": 101.0, "volume": 21.3},
            {"price": 102.0, "volume": 34.6},
            ...,
        ],
        'ask': [
            {"price": 100.0, "volume": 20.0},
            {"price": 101.0, "volume": 21.3},
            {"price": 102.0, "volume": 34.6},
            ...,
        ]
    },
    'timestamp': 1642524222.1160505
}

schema

{
    "namespace": "confluent.io.examples.serialization.avro",
    "name": "L2_Book",
    "type": "record",
    "fields": [
        {"name": "exchange", "type": "string"},
        {"name": "symbol", "type": "string"},
        {"name": "book", "type": {
            "name": "book",
            "type": "record", 
            "fields": [
                {"name": "bid", "type": {
                    "type": "array", "items": {
                        "name": "bid",
                        "type": "record",
                        "fields": [
                            {"name": "price", "type": "float"},
                            {"name": "volume", "type": "float"}
                        ]
                    }
                }},
                {"name": "ask", "type": {
                    "type": "array", "items": {
                        "name": "ask",
                        "type": "record",
                        "fields": [
                            {"name": "price", "type": "float"},
                            {"name": "volume", "type": "float"}
                        ]
                    }
                }}
            ]}},
        {"name": "timestamp", "type": "float"}
    ]
}
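Applying this second layout to a whole snapshot can be sketched as follows (the function name is illustrative, not from any library):

```python
# Sketch: convert a full snapshot to the array-of-records layout,
# transforming both book sides and leaving the other fields as-is.
def to_avro_payload(snapshot):
    """Turn each {price: volume} side into a list of price/volume records."""
    book = {
        side: [{"price": price, "volume": volume} for price, volume in levels.items()]
        for side, levels in snapshot["book"].items()
    }
    return {**snapshot, "book": book}

snapshot = {
    "exchange": "ex1",
    "symbol": "sym1",
    "book": {"bid": {100.0: 20.0, 101.0: 21.3}, "ask": {102.0: 34.6}},
    "timestamp": 1642524222.1160505,
}
print(to_avro_payload(snapshot)["book"]["bid"])
# -> [{'price': 100.0, 'volume': 20.0}, {'price': 101.0, 'volume': 21.3}]
```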
CarloP