
I have an Avro schema file and I need to create a table in Databricks through PySpark. I don't need to load the data; I just want to create the table. The easy way is to load the JSON string and take the "name" and "type" from the fields array, then generate the CREATE TABLE query. I want to know if there is a programmatic way to do that with any API. Sample schema:

{
  "type" : "record",
  "name" : "kylosample",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "registration_dttm",
    "type" : "string",
    "doc" : "Type inferred from '2016-02-03T07:55:29Z'"
  }, {
    "name" : "id",
    "type" : "long",
    "doc" : "Type inferred from '1'"
  }, {
    "name" : "first_name",
    "type" : "string",
    "doc" : "Type inferred from 'Amanda'"
  }, {
    "name" : "last_name",
    "type" : "string",
    "doc" : "Type inferred from 'Jordan'"
  }, {
    "name" : "email",
    "type" : "string",
    "doc" : "Type inferred from 'ajordan0@com.com'"
  }, {
    "name" : "gender",
    "type" : "string",
    "doc" : "Type inferred from 'Female'"
  }, {
    "name" : "ip_address",
    "type" : "string",
    "doc" : "Type inferred from '1.197.201.2'"
  }, {
    "name" : "cc",
    "type" : [ "null", "long" ],
    "doc" : "Type inferred from '6759521864920116'",
    "default" : null
  }, {
    "name" : "country",
    "type" : "string",
    "doc" : "Type inferred from 'Indonesia'"
  }, {
    "name" : "birthdate",
    "type" : "string",
    "doc" : "Type inferred from '3/8/1971'"
  }, {
    "name" : "salary",
    "type" : [ "null", "double" ],
    "doc" : "Type inferred from '49756.53'",
    "default" : null
  }, {
    "name" : "title",
    "type" : "string",
    "doc" : "Type inferred from 'Internal Auditor'"
  }, {
    "name" : "comments",
    "type" : "string",
    "doc" : "Type inferred from '1E+02'"
  } ]
}
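The "easy way" described in the question can be scripted directly. Below is a minimal sketch, assuming only Avro primitive types and nullable unions like `["null", "long"]`; the helper names and the type map are my own, not a standard API:

```python
import json

# Partial mapping from Avro primitive types to Spark SQL types.
# Extend as needed for bytes, fixed, decimal logical types, etc.
AVRO_TO_SQL = {
    "string": "STRING",
    "long": "BIGINT",
    "int": "INT",
    "double": "DOUBLE",
    "float": "FLOAT",
    "boolean": "BOOLEAN",
}

def avro_field_to_sql(avro_type):
    """Resolve a field type, unwrapping nullable unions like ["null", "long"]."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        return AVRO_TO_SQL[non_null[0]]
    return AVRO_TO_SQL[avro_type]

def create_table_ddl(schema_json, table_name):
    """Generate a CREATE TABLE statement from an Avro schema JSON string."""
    schema = json.loads(schema_json)
    cols = ",\n  ".join(
        f"{f['name']} {avro_field_to_sql(f['type'])}" for f in schema["fields"]
    )
    return f"CREATE TABLE {table_name} (\n  {cols}\n)"
```

In a Databricks notebook you could then run `spark.sql(create_table_ddl(schema_json, "db.table_name"))`. This hand-rolled mapping only covers the flat schema shown above; nested records, arrays, and maps would need more work.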
Anirban Nag 'tintinmj'

1 Answer


This does not appear to be available via the Python API yet. Since you only want to create a table and not load any data, one approach I have used in the past is to create an external table via Spark SQL that points to your exported .avsc file. Example:

spark.sql("""
CREATE EXTERNAL TABLE db.table_name
STORED AS AVRO
LOCATION 'PATH/WHERE/DATA/WILL/BE/STORED'
TBLPROPERTIES('avro.schema.url'='PATH/TO/SCHEMA.avsc')
""")
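If you would rather embed the schema inline than reference a file path, the Hive Avro SerDe also accepts an `avro.schema.literal` table property in place of `avro.schema.url`. A small Python sketch that builds such a statement from the .avsc text (the helper name `external_avro_table_ddl` is my own):

```python
import json

def external_avro_table_ddl(table_name, location, avsc_text):
    """Build a CREATE EXTERNAL TABLE statement that embeds the Avro schema
    inline via the avro.schema.literal table property."""
    # Normalize/minify the schema JSON, then escape single quotes so the
    # embedded schema survives inside the SQL string literal.
    literal = json.dumps(json.loads(avsc_text)).replace("'", "\\'")
    return (
        f"CREATE EXTERNAL TABLE {table_name}\n"
        "STORED AS AVRO\n"
        f"LOCATION '{location}'\n"
        f"TBLPROPERTIES ('avro.schema.literal'='{literal}')"
    )
```

This avoids having to keep the .avsc file at a stable, cluster-accessible path, at the cost of baking the schema into the table definition.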

The native Scala API in Spark 2.4 now has an .avsc reader available. Since you are using Databricks, you can switch the language of a notebook cell with magic commands such as %scala, %python, or %sql. Scala example:

import java.io.File
import org.apache.avro.Schema

val schema = new Schema.Parser().parse(new File("user.avsc"))

spark
  .read
  .format("avro")
  .option("avroSchema", schema.toString)
  .load("/tmp/episodes.avro")
  .show()

Reference docs for Spark 2.4 Avro integration:

https://spark.apache.org/docs/latest/sql-data-sources-avro.html#configuration

https://databricks.com/blog/2018/11/30/apache-avro-as-a-built-in-data-source-in-apache-spark-2-4.html

thePurplePython