
We are trying to create Avro records with the Confluent Schema Registry and publish those records to a Kafka cluster.

To attach the schema ID (the magic bytes) to each record, we need to use:
to_avro(Column data, Column subject, String schemaRegistryAddress)
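
For context, a minimal usage sketch (notebook-style, assuming this Databricks-only overload; the subject names, registry address, and Kafka settings are placeholders) looks like this. It only runs on a Databricks cluster, since the open-source spark-avro lacks this overload:

// Databricks notebook sketch; `spark` is predefined there.
import spark.implicits._
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.avro.functions.to_avro

// Placeholder values for illustration only.
val schemaRegistryAddress = "https://schema-registry.example.com:8081"

val df = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")

df.select(
    // The registry-aware overload prepends the magic byte + schema id
    // to each serialized payload.
    to_avro(col("id").cast("string"), lit("orders-key"), schemaRegistryAddress).as("key"),
    to_avro(struct(col("id"), col("name")), lit("orders-value"), schemaRegistryAddress).as("value")
  )
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "orders")
  .save()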

To automate this, we need to build the project in a CI pipeline and configure Databricks jobs to use that jar.

The problem we are facing: in notebooks we can find a to_avro method with three parameters, but the same library in our local build, downloaded from https://mvnrepository.com/artifact/org.apache.spark/spark-avro_2.12/3.1.2, only has two overloads of to_avro.

Does Databricks have some other Maven repository for its shaded jars?

NOTEBOOK output

import org.apache.spark.sql.avro.functions

println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/databricks/jars/----workspace_spark_3_1--vendor--avro--avro_2.12_deploy_shaded.jar

functions
  .getClass()
  .getMethods()
  .filter(p=>p.getName.equals("to_avro"))
  .foreach(f=>println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))

LOCAL output

import org.apache.spark.sql.avro.functions

println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/<home-dir-path>/.gradle/caches/modules-2/files-2.1/org.apache.spark/spark-avro_2.12/3.1.2/1160ae134351328a0ed6a062183faf9a0d5b46ea/spark-avro_2.12-3.1.2.jar

functions
  .getClass()
  .getMethods()
  .filter(p=>p.getName.equals("to_avro"))
  .foreach(f=>println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))

Versions

Databricks => 9.1 LTS
Apache Spark => 3.1.2
Scala => 2.12


Update from Databricks support

Unfortunately, we do not have a shareable jar supporting these functionalities in the DBR. There was a feature request to include this in Databricks Connect; however, it was not implemented as it did not receive enough upvotes.

Since your use case is to automate creation of the jar file and then submit it as a job in Databricks, we should be able to create a jar stub (dbr-avro-dummy.jar) with a dummy implementation of the to_avro() function with three parameters and use this jar as a dependency to fool the compiler of your actual jar (for the job).

This avoids the compilation error while building the jar; at run time, since the job runs in the Databricks environment, it will pick up the actual Avro jar from the DBR.

You may build the dummy jar stub using the package code below (you will need the Maven/sbt Spark/Scala dependency for the Column class):

package org.apache.spark.sql

import java.net.URL

package object avro {
  // Dummy three-parameter to_avro matching the signature seen on the DBR;
  // the body is never executed, it only satisfies the compiler.
  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String): Column = {
    new Column("dummy")
  }

  def from_avro(data: Column, key: String, schemaRegistryURL: URL): Column = {
    new Column("dummy")
  }
}
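
To keep the stub (and Spark itself) out of the final artifact, one option, sketched below for sbt with hypothetical coordinates for the locally published stub (a Gradle build would use compileOnly similarly), is the Provided scope, which keeps dependencies on the compile classpath but out of the assembled job jar:

// build.sbt sketch; "com.example" % "dbr-avro-dummy" % "0.1" is a hypothetical
// coordinate for the stub, published locally (e.g. with sbt publishLocal).
// Provided-scope dependencies are compiled against but left out of the fat jar
// (sbt-assembly excludes them by default), so at run time the job picks up the
// real to_avro from the DBR's shaded avro jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"      % "3.1.2" % Provided,
  "com.example"      %  "dbr-avro-dummy" % "0.1"   % Provided
)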
Snigdhajyoti

1 Answer


No, these jars aren't published to any public repository. You may check whether databricks-connect provides these jars (you can get their location with databricks-connect get-jar-dir), but I doubt it.

Another approach is to mock it: for example, create a small library that declares a function with the specific signature, use it for compilation only, and don't include it in the resulting jar.
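
For illustration, such a compile-only declaration could simply mirror the object and three-parameter signature seen in the notebook reflection output above. This is a sketch, not the actual DBR code, and its body is never executed because the class is left out of the job jar:

package org.apache.spark.sql.avro

import org.apache.spark.sql.Column

// Compile-time stand-in for the shaded functions object that exists on the
// Databricks runtime; exclude this class from the built jar so the cluster's
// real implementation is used at run time.
object functions {
  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String): Column =
    throw new UnsupportedOperationException("compile-only stub")
}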

Alex Ott
  • Thanks for the answer. If I can't compile the program, how am I going to create a jar? What's the industry standard for this? My use case: I have validated my code changes in notebooks, and now I want to create a jar and a Databricks job. – Snigdhajyoti Feb 10 '22 at 20:04
  • Theoretically we can mock that just for compilation (not for testing). – Alex Ott Feb 11 '22 at 08:42
  • Even with `databricks-connect get-jar-dir` it didn't work; it didn't come with the `spark-avro` jar. But mocking it and excluding it from the jar helped. – Snigdhajyoti Feb 14 '22 at 14:32
  • If you could write up the mocking approach as the solution, I can accept the answer. – Snigdhajyoti Feb 14 '22 at 14:33
  • Can you elaborate on the mock example? I'm not really able to import the package object into the Scala file that needs it. – Andrii Black Mar 06 '22 at 06:50
  • @AndriiBlack It's not actually a mock; rather, you create a class/object with the same package and name, containing the methods that are available on Databricks, and use that in your code. Compilation won't throw any issues because you have that method locally. Then you exclude that class when you build your jar, and at run time it will be found because Databricks has that class/object. – Snigdhajyoti May 24 '22 at 14:41