
I'm working on a project where every day I need to deal with tons of AVRO files. To extract the data from AVRO I use Spark SQL. To achieve this, I first need to call printSchema and then select the fields to see the data. I want to automate this process. Given any input AVRO, I want to write a script that automatically generates the Spark SQL query (taking the structs and arrays in the avsc file into account). I'm okay with writing the script in Java or Python.

-- Sample input AVRO (printSchema output)

root
|-- identifier: struct (nullable = true)
|    |-- domain: string (nullable = true)
|    |-- id: string (nullable = true)
|    |-- version: long (nullable = true)
|-- alternativeIdentifiers: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- identifier: struct (nullable = true)
|    |    |    |-- domain: string (nullable = true)
|    |    |    |-- id: string (nullable = true)

-- Output I'm expecting

SELECT identifier.domain, identifier.id, identifier.version
Teja
    add one sample input & your expected output?? – Srinivas Jul 12 '20 at 16:30
  • simple `select * from table` doesn't work? – Alex Ott Jul 14 '20 at 17:06
  • It works, but not as I wanted. Basically `select * from` displays the flattened fields in column format, and structs and arrays in array format like [col a, col b, col c]. My motive is to generate the query automatically, or to fetch the field names along with their field types and parent fields – Teja Jul 14 '20 at 18:41

1 Answer


You can use something like this to generate a list of columns based on the schema:

  import org.apache.spark.sql.types.{StructField, StructType}

  // Recursively collects the fully qualified names of all leaf fields,
  // descending into nested structs (e.g. identifier.domain)
  def getStructFieldName(f: StructField, baseName: String = ""): Seq[String] = {
    // prefix for nested fields: "parent." or empty at the top level
    val bname = if (baseName.isEmpty) "" else baseName + "."
    f.dataType match {
      case StructType(s) =>
        // struct: recurse into each child field, prefixed with this field's name
        s.flatMap(x => getStructFieldName(x, bname + f.name))
      case _ => Seq(bname + f.name) // leaf field: return its qualified name
    }
  }
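
Since the question also asks about arrays: Spark lets you extract a struct field through an array with the same dot notation (the result is then an array of values), so one possible extension, just a sketch with a hypothetical name `getFieldNames`, is to also descend into `ArrayType` elements:

  import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

  // Sketch: like getStructFieldName, but also descends into arrays of
  // structs. Selecting such a path yields arrays of the leaf values,
  // so exploding the array first is left to the caller if needed.
  def getFieldNames(f: StructField, baseName: String = ""): Seq[String] = {
    val bname = if (baseName.isEmpty) "" else baseName + "."
    f.dataType match {
      case StructType(fields) =>
        fields.flatMap(x => getFieldNames(x, bname + f.name))
      case ArrayType(StructType(fields), _) =>
        fields.flatMap(x => getFieldNames(x, bname + f.name))
      case _ => Seq(bname + f.name)
    }
  }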

Either version can then be used on a real dataframe, like this:

  val data = spark.read.json("some_data.json")
  val cols = data.schema.flatMap(x => getStructFieldName(x))
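
Since the question is about Avro rather than JSON, the same works with the Avro reader, assuming the spark-avro module is on the classpath (`org.apache.spark:spark-avro` for Spark 2.4+; the file name below is just a placeholder):

  // read an Avro file instead; the schema handling is identical
  val data = spark.read.format("avro").load("some_data.avro")
  val cols = data.schema.flatMap(x => getStructFieldName(x))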

As a result, we get a sequence of strings that we can use either to do a select:

  import org.apache.spark.sql.functions.col
  data.select(cols.map(col): _*)
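
(The `: _*` expands the sequence of `Column` objects into varargs for `select`; the result is a dataframe with one flat column per leaf field.)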

or we can generate a comma-separated list to use in `spark.sql`:

  spark.sql(s"select ${cols.mkString(", ")} from table")
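
Note that this assumes a table or view named `table` is already registered; if you only have the dataframe, register it first:

  // make the dataframe queryable under the name used in the SQL string
  data.createOrReplaceTempView("table")
  spark.sql(s"select ${cols.mkString(", ")} from table")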
Alex Ott
  • Hi Alex, thanks for the reply. I'm sorry, it might be a small question, but I'm slightly confused: where do I need to pass my avro schema file in your code? – Teja Jul 14 '20 at 18:44
  • @Teja the part that is data.schema.flatMap – Alex Ott Jul 14 '20 at 19:03