I'd like to write Avro records with Spark 2.2.0 using a schema that has a namespace and contains nested records:

{
    "type": "record",
    "name": "userInfo",
    "namespace": "my.example",
    "fields": [
        {
            "name": "username",
            "type": "string"
        },
        {
            "name": "address",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "address",
                    "fields": [
                        {
                            "name": "street",
                            "type": [
                                "null",
                                "string"
                            ],
                            "default": null
                        },
                        {
                            "name": "box",
                            "type": [
                                "null",
                                {
                                    "type": "record",
                                    "name": "box",
                                    "fields": [
                                        {
                                            "name": "id",
                                            "type": "string"
                                        }
                                    ]
                                }
                            ],
                            "default": null
                        }
                    ]
                }
            ],
            "default": null
        }
    ]
}
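For reference, the Avro specification says that a named type without its own namespace inherits the namespace of the nearest enclosing named type, so both nested records in this schema should resolve to names directly under my.example. The following is a minimal Python sketch using only the standard library; `full_names` is a hypothetical helper (not part of any Avro library), and it only handles records and unions, which is all this schema uses.

```python
import json

# The schema from the question, written compactly.
SCHEMA_JSON = """
{"type": "record", "name": "userInfo", "namespace": "my.example", "fields": [
  {"name": "username", "type": "string"},
  {"name": "address", "default": null, "type": ["null",
    {"type": "record", "name": "address", "fields": [
      {"name": "street", "type": ["null", "string"], "default": null},
      {"name": "box", "default": null, "type": ["null",
        {"type": "record", "name": "box", "fields": [
          {"name": "id", "type": "string"}
        ]}
      ]}
    ]}
  ]}
]}
"""


def full_names(schema, enclosing_ns=None):
    """List the full names of all record types in a schema.

    Hypothetical helper: applies the namespace-inheritance rule to
    records and unions only (arrays and maps are not handled).
    """
    names = []
    if isinstance(schema, list):
        # Union: visit every branch with the same enclosing namespace.
        for branch in schema:
            names += full_names(branch, enclosing_ns)
    elif isinstance(schema, dict) and schema.get("type") == "record":
        name = schema["name"]
        if "." in name:
            # A dotted name is already a full name.
            ns, full = name.rsplit(".", 1)[0], name
        else:
            # Otherwise use the explicit namespace, or inherit the enclosing one.
            ns = schema.get("namespace", enclosing_ns)
            full = f"{ns}.{name}" if ns else name
        names.append(full)
        for field in schema["fields"]:
            names += full_names(field["type"], ns)
    return names


names = full_names(json.loads(SCHEMA_JSON))
print(names)
```

Under these assumptions the nested records come out as my.example.address and my.example.box, which are the names a reader using this schema would look for.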

I need to write out records like:

{
    "username": "tom taylor",
    "address": {
        "my.example.address": {
            "street": {
                "string": "unknown"
            },
            "box": {
                "my.example.box": {
                    "id": "id1"
                }
            }
        }
    }
}
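The wrapper keys in this record follow Avro's JSON encoding, where a non-null union value is wrapped in an object keyed by the branch's type name, and for a record that key is its full name including the namespace. As a rough illustration, the Python sketch below (standard library only) derives the record above from plain nested data. `to_avro_json` and `branch_name` are hypothetical helpers that only cover the constructs used here: records, strings, and two-branch unions with null.

```python
import json

# The schema from the question, written compactly.
SCHEMA = json.loads("""
{"type": "record", "name": "userInfo", "namespace": "my.example", "fields": [
  {"name": "username", "type": "string"},
  {"name": "address", "default": null, "type": ["null",
    {"type": "record", "name": "address", "fields": [
      {"name": "street", "type": ["null", "string"], "default": null},
      {"name": "box", "default": null, "type": ["null",
        {"type": "record", "name": "box", "fields": [
          {"name": "id", "type": "string"}
        ]}
      ]}
    ]}
  ]}
]}
""")


def branch_name(schema, ns):
    """Key used to wrap a union branch: primitive name or record full name."""
    if isinstance(schema, str):
        return schema
    ns = schema.get("namespace", ns)  # inherit the enclosing namespace
    return f"{ns}.{schema['name']}" if ns else schema["name"]


def to_avro_json(value, schema, ns=None):
    """Hypothetical encoder for the subset of Avro used in this schema."""
    if isinstance(schema, list):
        if value is None:
            return None
        # Assumes a ["null", X] union: pick the single non-null branch.
        branch = next(b for b in schema if b != "null")
        return {branch_name(branch, ns): to_avro_json(value, branch, ns)}
    if isinstance(schema, dict) and schema["type"] == "record":
        ns = schema.get("namespace", ns)
        return {f["name"]: to_avro_json(value.get(f["name"]), f["type"], ns)
                for f in schema["fields"]}
    return value  # primitives are written as-is


plain = {"username": "tom taylor",
         "address": {"street": "unknown", "box": {"id": "id1"}}}
encoded = to_avro_json(plain, SCHEMA)
print(json.dumps(encoded, indent=4))
```

Because the wrapper key is the record's full name, any change to a nested record's namespace changes the encoded output as well.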

However, when I read some Avro GenericRecords with spark-avro (4.0.0), apply a conversion (e.g. adding a namespace), and then try to write out the result:

df.foreach {
    ...
    .write
    .option("recordName", "userInfo")
    .option("recordNamespace", "my.example")
    ...
}

then in the resulting GenericRecord the namespace of each nested record contains the "full path" to that element from its parents, i.e. instead of my.example.box I get my.example.address.box. When I try to read this record back with the schema above, there is of course a mismatch.
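To make the mismatch concrete, here is a small Python sketch that contrasts the two naming rules. It is only a model of the behaviour described above, not spark-avro's actual code: `inherited_names` and `path_based_names` are hypothetical functions, and the assumption is that the writer uses each parent record's full name as the namespace of its children.

```python
# Nested record fields only; leaf fields such as "street" and "id" are omitted.
NESTED = {"address": {"box": {}}}


def inherited_names(children, namespace):
    """Schema's rule: nested records keep the enclosing namespace unchanged."""
    names = []
    for name, sub in children.items():
        names.append(f"{namespace}.{name}")
        names += inherited_names(sub, namespace)
    return names


def path_based_names(children, namespace):
    """Assumed model of the observed output: a child record's namespace
    is the full name of its parent record."""
    names = []
    for name, sub in children.items():
        full = f"{namespace}.{name}"
        names.append(full)
        names += path_based_names(sub, full)
    return names


expected = inherited_names(NESTED, "my.example")
observed = path_based_names(NESTED, "my.example")

# Pairs of (name the reader schema expects, name found in the written record).
mismatches = [(e, o) for e, o in zip(expected, observed) if e != o]
print(mismatches)
```

In this model the first level (my.example.address) still agrees, and the names only diverge from the second level of nesting onwards, which matches the box record in the example.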

What is the right way to define the namespace for the writer?

Bruckwald