
The following is my DataFrame schema:

root
 |-- name: string (nullable = true)
 |-- addresses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- street: string (nullable = true)
 |    |    |-- city: string (nullable = true)

I want the output to contain each person's name and cities. My Spark streaming app below currently outputs name and addresses instead. I'd appreciate your help. Thanks.

object PersonConsumer {
  import org.apache.spark.sql.SparkSession
  import com.example.protos.demo._

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder
      .master("local")
      .appName("spark session example")
      .getOrCreate()

    import spark.implicits._

    // Read raw Kafka records from the "person" topic
    val ds1 = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "person")
      .load()

    // Decode each record's value bytes into a Person message, then keep name and addresses
    val ds2 = ds1.map(row => row.getAs[Array[Byte]]("value"))
      .map(Person.parseFrom(_))
      .select($"name", $"addresses")

    ds2.printSchema()

    val query = ds2.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
shylas

2 Answers


You can simply select name and city into their own DataFrame and work with that. Select both columns as follows:

ds1.select("name","addresses.element.city")
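
One detail worth checking here: in Spark's printed schema, `element` is only a label for the array entries, not a field name, so a nested field of an array of structs is normally addressed as `addresses.city`, which yields an array of strings. The sketch below assumes a DataFrame that already has the schema from the question. The case classes `PersonRow` and `AddressRow`, the object `CitiesSketch`, and the helper `withCities` are hypothetical names made up for illustration, not part of the original app.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

// Hypothetical stand-ins that mirror the schema in the question.
case class AddressRow(street: String, city: String)
case class PersonRow(name: String, addresses: Seq[AddressRow])

object CitiesSketch {
  // Select name plus all cities joined into one comma-separated string.
  // col("addresses.city") extracts the city field from every array element,
  // giving an array<string> column that concat_ws can join.
  def withCities(people: DataFrame): DataFrame =
    people.select(col("name"), concat_ws(",", col("addresses.city")).as("cities"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("cities sketch")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory sample instead of the Kafka stream from the question.
    val people = Seq(
      PersonRow("Ann", Seq(AddressRow("1 Main St", "Oslo"), AddressRow("2 Side St", "Bergen")))
    ).toDF()

    withCities(people).printSchema()
    withCities(people).show(false)

    spark.stop()
  }
}
```

Whether the same selection works on the streaming Dataset in the question depends on how the `Person` messages are encoded, so treat this as a starting point rather than a drop-in fix.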
Sandeep Purohit

Thanks, Sandeep. select("name","addresses.element.city") gives me an error because addresses is a Seq[Address], and I want all the cities in the output.

In the end, I wrote the following function to get all the cities:

    // Join the city of every address with commas; a missing city becomes an empty string.
    // An empty Seq yields "", so no explicit size check is needed.
    def getCities(addresses: Seq[Address]): String =
      addresses.map(_.city.getOrElse("")).mkString(",")
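
To see how the function behaves on edge cases, here is a small self-contained sketch. It assumes `city` is an `Option[String]`, as the `getOrElse` call above suggests. The case class `Address` below is a hypothetical stand-in for the generated protobuf class, and `getCitiesSkippingMissing` is an illustrative variant, not something from the original app.

```scala
// Hypothetical stand-in for the generated Address message (fields assumed optional).
case class Address(street: Option[String], city: Option[String])

object CityJoin {
  // Same behavior as the function above: a missing city becomes an empty string.
  def getCities(addresses: Seq[Address]): String =
    addresses.map(_.city.getOrElse("")).mkString(",")

  // Variant that drops missing cities instead of leaving empty slots.
  def getCitiesSkippingMissing(addresses: Seq[Address]): String =
    addresses.flatMap(_.city).mkString(",")

  def main(args: Array[String]): Unit = {
    val addresses = Seq(
      Address(Some("1 Main St"), Some("Oslo")),
      Address(Some("2 Side St"), None),
      Address(None, Some("Bergen"))
    )
    println(getCities(addresses))                // Oslo,,Bergen
    println(getCitiesSkippingMissing(addresses)) // Oslo,Bergen
    println(getCities(Seq.empty))                // prints an empty line
  }
}
```

Which variant fits depends on whether downstream consumers need the positions of missing cities preserved.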
shylas