
I'm working on a project where every day I need to deal with tons of AVRO files. To extract the data from AVRO I use Spark SQL. To achieve this, I first need to call printSchema and then select the fields to see the data. I want to automate this process. Given any input AVRO, I want to write a script that automatically generates the Spark SQL query (taking the structs and arrays in the avsc file into account). I'm okay with writing the script in Java or Python.

-- Sample input AVRO (printSchema output)

root
|-- identifier: struct (nullable = true)
|    |-- domain: string (nullable = true)
|    |-- id: string (nullable = true)
|    |-- version: long (nullable = true)
|-- alternativeIdentifiers: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- identifier: struct (nullable = true)
|    |    |    |-- domain: string (nullable = true)
|    |    |    |-- id: string (nullable = true)

-- Output I'm expecting

SELECT identifier.domain, identifier.id, identifier.version
Teja
    add one sample input & your expected output?? – Srinivas Jul 12 '20 at 16:30
  • simple `select * from table` doesn't work? – Alex Ott Jul 14 '20 at 17:06
  • It works, but not as I wanted. Basically `select * from` displays the flattened fields in column format, and structs and arrays in array format like [col a, col b, col c]. My motive is to generate the query automatically, or to fetch the field names along with their field types and parent fields – Teja Jul 14 '20 at 18:41

1 Answer


You can use something like this to generate a list of columns based on the schema:

  import org.apache.spark.sql.types.{StructField, StructType}

  // Recursively collects the fully qualified names of all leaf fields,
  // descending into nested structs (e.g. identifier.domain)
  def getStructFieldName(f: StructField, baseName: String = ""): Seq[String] = {
    // prefix for nested fields: "parent." or empty at the top level
    val bname = if (baseName.isEmpty) "" else baseName + "."
    f.dataType match {
      case StructType(s) =>
        // struct: recurse into each child field, prefixed with this field's name
        s.flatMap(x => getStructFieldName(x, bname + f.name))
      case _ => Seq(bname + f.name) // leaf field: return its qualified name
    }
  }
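
Since the question also asks about arrays: Spark lets you extract a struct field through an array with the same dot notation (the result is then an array of values), so one possible extension, just a sketch with a hypothetical name `getFieldNames`, is to also descend into `ArrayType` elements:

  import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

  // Sketch: like getStructFieldName, but also descends into arrays of
  // structs. Selecting such a path yields arrays of the leaf values,
  // so exploding the array first is left to the caller if needed.
  def getFieldNames(f: StructField, baseName: String = ""): Seq[String] = {
    val bname = if (baseName.isEmpty) "" else baseName + "."
    f.dataType match {
      case StructType(fields) =>
        fields.flatMap(x => getFieldNames(x, bname + f.name))
      case ArrayType(StructType(fields), _) =>
        fields.flatMap(x => getFieldNames(x, bname + f.name))
      case _ => Seq(bname + f.name)
    }
  }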

Either version can then be used on a real dataframe, like this:

  val data = spark.read.json("some_data.json")
  val cols = data.schema.flatMap(x => getStructFieldName(x))
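
Since the question is about Avro rather than JSON, the same works with the Avro reader, assuming the spark-avro module is on the classpath (`org.apache.spark:spark-avro` for Spark 2.4+; the file name below is just a placeholder):

  // read an Avro file instead; the schema handling is identical
  val data = spark.read.format("avro").load("some_data.avro")
  val cols = data.schema.flatMap(x => getStructFieldName(x))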

As a result, we get a sequence of strings that we can use either to do a select:

  import org.apache.spark.sql.functions.col
  data.select(cols.map(col): _*)
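
(The `: _*` expands the sequence of `Column` objects into varargs for `select`; the result is a dataframe with one flat column per leaf field.)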

or we can generate a comma-separated list to use in `spark.sql`:

  spark.sql(s"select ${cols.mkString(", ")} from table")
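
Note that this assumes a table or view named `table` is already registered; if you only have the dataframe, register it first:

  // make the dataframe queryable under the name used in the SQL string
  data.createOrReplaceTempView("table")
  spark.sql(s"select ${cols.mkString(", ")} from table")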
Alex Ott
  • Hi Alex, thanks for the reply. I'm sorry, it might be a small question, but I'm slightly confused: where do I need to pass my avro schema file in your code? – Teja Jul 14 '20 at 18:44
  • @Teja the part that is data.schema.flatMap – Alex Ott Jul 14 '20 at 19:03