I'd like to write Avro records with Spark 2.2.0 using a schema that has a namespace and some nested records:
{
  "type": "record",
  "name": "userInfo",
  "namespace": "my.example",
  "fields": [
    {
      "name": "username",
      "type": "string"
    },
    {
      "name": "address",
      "type": [
        "null",
        {
          "type": "record",
          "name": "address",
          "fields": [
            {
              "name": "street",
              "type": [
                "null",
                "string"
              ],
              "default": null
            },
            {
              "name": "box",
              "type": [
                "null",
                {
                  "type": "record",
                  "name": "box",
                  "fields": [
                    {
                      "name": "id",
                      "type": "string"
                    }
                  ]
                }
              ],
              "default": null
            }
          ]
        }
      ],
      "default": null
    }
  ]
}
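For reference, the nested records in this schema declare no namespace of their own, so per the Avro specification they inherit the namespace of the enclosing record. Here is a minimal sketch to check the names Avro itself resolves, assuming the Avro Java library is on the classpath; `NestedNames` and `recordBranch` are hypothetical names used only for illustration, and the JSON is the same schema written compactly:

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

object NestedNames {
  // Same schema as above, written compactly.
  val schemaJson: String =
    """{
      |  "type": "record", "name": "userInfo", "namespace": "my.example",
      |  "fields": [
      |    {"name": "username", "type": "string"},
      |    {"name": "address", "type": ["null", {
      |      "type": "record", "name": "address",
      |      "fields": [
      |        {"name": "street", "type": ["null", "string"], "default": null},
      |        {"name": "box", "type": ["null", {
      |          "type": "record", "name": "box",
      |          "fields": [{"name": "id", "type": "string"}]
      |        }], "default": null}
      |      ]
      |    }], "default": null}
      |  ]
      |}""".stripMargin

  // Hypothetical helper: pick the record branch of a ["null", record] union field.
  def recordBranch(parent: Schema, field: String): Schema =
    parent.getField(field).schema().getTypes.asScala
      .find(_.getType == Schema.Type.RECORD).get

  def main(args: Array[String]): Unit = {
    val root    = new Schema.Parser().parse(schemaJson)
    val address = recordBranch(root, "address")
    val box     = recordBranch(address, "box")
    println(address.getFullName) // my.example.address
    println(box.getFullName)     // my.example.box
  }
}
```

These full names are the ones I expect to see as union branch keys in the written records.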
I need to write out records like this:
{
  "username": "tom taylor",
  "address": {
    "my.example.address": {
      "street": {
        "string": "unknown"
      },
      "box": {
        "my.example.box": {
          "id": "id1"
        }
      }
    }
  }
}
However, when I read some Avro GenericRecords with spark-avro (4.0.0), apply a conversion (e.g. adding a namespace), and write out the result:
df.foreach {
  ...
    .write
    .option("recordName", "userInfo")
    .option("recordNamespace", "my.example")
  ...
}
then the namespace of each nested record in the resulting GenericRecord contains the "full path" to that element from its parents. That is, instead of my.example.box I get my.example.address.box. When I try to read this record back with the schema above, the record names of course no longer match.
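To make the mismatch concrete, here is a small dependency-free sketch of the two naming rules as I understand them. It is not spark-avro's actual implementation: `avroSpecName` and `observedName` are hypothetical helpers, and `avroSpecName` assumes each nested record is named after its field, as in my schema.

```scala
object NamespaceMismatch {
  // Name I expect: a nested record without its own namespace inherits the
  // namespace of the enclosing record, so the parent field path is irrelevant.
  def avroSpecName(rootNamespace: String, path: Seq[String]): String =
    s"$rootNamespace.${path.last}"

  // Name I actually get: every parent field on the path is appended
  // to the namespace before the record name.
  def observedName(rootNamespace: String, path: Seq[String]): String =
    (rootNamespace +: path).mkString(".")

  def main(args: Array[String]): Unit = {
    val path = Seq("address", "box")
    println(avroSpecName("my.example", path)) // my.example.box
    println(observedName("my.example", path)) // my.example.address.box
  }
}
```

For the top-level nested record (path `Seq("address")`) both rules give my.example.address, so the difference only shows up from the second nesting level on.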
What is the right way to define the namespace for the writer?