We are trying to create Avro records with Confluent Schema Registry and publish those records to a Kafka cluster. To attach the schema ID (magic bytes) to each record, we need to use:
to_avro(Column data, Column subject, String schemaRegistryAddress)
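For context, a minimal sketch of how we call it on Databricks; the registry address, broker, topic, and df (an existing DataFrame with a value column) are placeholders:

import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{col, lit}

// Placeholder values; substitute your own registry, brokers, and topic
val schemaRegistryAddress = "https://schema-registry:8081"

df.select(to_avro(col("value"), lit("my-topic-value"), schemaRegistryAddress).as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "my-topic")
  .save()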
To automate this, we need to build the project in a CI pipeline and configure Databricks Jobs to use the resulting jar.
The problem we are facing: in notebooks we can find the method with 3 parameters, but the same library, when used in our build (downloaded from https://mvnrepository.com/artifact/org.apache.spark/spark-avro_2.12/3.1.2), has only 2 overloads of to_avro.
Does Databricks have some other Maven repository for its shaded jars?
NOTEBOOK output
import org.apache.spark.sql.avro.functions
println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/databricks/jars/----workspace_spark_3_1--vendor--avro--avro_2.12_deploy_shaded.jar
// List every to_avro overload visible on this classpath
functions
  .getClass()
  .getMethods()
  .filter(_.getName == "to_avro")
  .foreach(f => println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))
LOCAL output
import org.apache.spark.sql.avro.functions
println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/<home-dir-path>/.gradle/caches/modules-2/files-2.1/org.apache.spark/spark-avro_2.12/3.1.2/1160ae134351328a0ed6a062183faf9a0d5b46ea/spark-avro_2.12-3.1.2.jar
// Same reflection against the locally downloaded jar
functions
  .getClass()
  .getMethods()
  .filter(_.getName == "to_avro")
  .foreach(f => println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))
Versions
Databricks => 9.1 LTS
Apache Spark => 3.1.2
Scala => 2.12
Update from Databricks support
Unfortunately we do not have a shareable jar supporting these functionalities outside the DBR. There was a feature request to include this in DBConnect; however, it was not implemented as there were not enough upvotes for the feature.
Since your use case is to automate creation of the jar file and then submit it as a Job in Databricks, we should be able to create a jar stub (dbr-avro-dummy.jar) with a dummy implementation of the to_avro() function with three parameters, and use this jar as a dependency to fool the compiler of your actual jar (for the Job).
This avoids the compilation error while building the jar; at run time, since the Job runs in the Databricks environment, it will pick up the actual avro jar from the DBR.
You may build the dummy jar stub using the package code below (you will need the Maven/sbt Spark/Scala dependency for the Column class):
package org.apache.spark.sql

package object avro {
  // Dummy stub of the Databricks-only 3-parameter overload. It is never
  // executed; at run time the DBR's shaded avro jar provides the real method.
  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String): Column = {
    new Column("dummy")
  }
}
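To wire this into the build, one option (a sketch, not verified against the DBR) is to publish the stub locally and mark it provided so it never ships in the Job jar; the stub coordinates below are hypothetical:

// build.sbt (sketch): "com.example" % "dbr-avro-dummy" % "0.1" is a made-up
// coordinate for the locally published stub jar. Provided keeps it and Spark
// out of the assembly, so the Databricks runtime supplies the real classes.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"      % "3.1.2" % Provided,
  "org.apache.spark" %% "spark-avro"     % "3.1.2" % Provided,
  "com.example"      %  "dbr-avro-dummy" % "0.1"   % Provided
)

With the stub on the compile classpath, application code calls the package-level org.apache.spark.sql.avro.to_avro(data, subject, schemaRegistryAddress) and links against the DBR implementation at run time.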