How to iterate through dataframe without converting to dataset in spark?

Question

I have a DataFrame that I want to iterate over, but I don't want to convert it to a Dataset. We have to convert Spark Scala code to PySpark, and PySpark does not support Datasets.

I have tried the following code, which works by converting the DataFrame to a Dataset.

Data in file:

abc,a
mno,b
pqr,a
xyz,b

val a = sc.textFile("<path>")

//creating dataframe with column AA,BB

val b = a.map(x => x.split(",")).map(x =>(x(0).toString,x(1).toString)).toDF("AA","BB") 

b.registerTempTable("test")

case class T(AA:String, BB: String)

//creating dataset from dataframe

val d = b.as[T].collect       

d.foreach{ x=>
    var m = spark.sql(s"select * from test where BB = '${x.BB}'")
    m.show()
}

Without converting to a Dataset, it fails, i.e. with:

val d = b.collect

d.foreach{ x=>
    var m = spark.sql(s"select * from test where BB = '${x.BB}'")
    m.show()
}

it gives the error: error: value BB is not a member of org.apache.spark.sql.Row

The example below is in Scala, but try using the filter method instead. You will not need to convert away from a DataFrame. The idea is to collect the BB column into an Array and then loop over that array, filtering the DataFrame on each element in turn. `val bArray = b.selectExpr("BB").rdd.map(x => x.mkString).collect; var iterator = 1; var m = b; while (iterator <= bArray.length) { m = b.filter($"BB".isin(bArray(iterator - 1))); m.collect; iterator = iterator + 1 }` - afeldman, Mar 28 '19 at 20:24

Sarath Subramanian · Accepted Answer · 2019-04-01T16:05:14.160

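You cannot access a column as x.BB on the collected rows of a DataFrame, as in the code above, because each row is a generic Row object. Use the DataFrame's rdd.collect to loop over the rows and read the value from each Row.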

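import spark.implicits._
val df = Seq(("abc","a"), ("mno","b"), ("pqr","a"), ("xyz","b")).toDF("AA", "BB")
// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
df.createOrReplaceTempView("test")
df.rdd.collect.foreach(x => {
     // join the Row's values with commas, split again, and take index 1 (column BB)
     val BBvalue = x.mkString(",").split(",")(1)
     val m = spark.sql(s"select * from test where BB = '$BBvalue'")
     m.show()
})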

Inside the loop, I used mkString to convert each Row to a comma-separated string, then split that string on commas and used the column's index to access the value. For example, the code above uses (1), which is the zero-based index of BB, the second column. Note that this approach breaks if a column value itself contains a comma.
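As an alternative to the mkString and split approach, a Row can also be read by column name with getAs, which avoids the string round trip. The following is only a minimal sketch, not part of the original answer: it assumes Spark 2.x or later, and it builds a local SparkSession purely for illustration (in spark-shell the `spark` value already exists, so those lines can be skipped). The name `bbValues` is a hypothetical helper introduced here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("iterate-rows-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(("abc", "a"), ("mno", "b"), ("pqr", "a"), ("xyz", "b")).toDF("AA", "BB")
df.createOrReplaceTempView("test")

// collect() returns Array[Row]; a Row has no BB member,
// so the value is looked up by column name instead.
val rows = df.collect()

// Hypothetical helper value: the BB value of every collected row.
val bbValues = rows.map(row => row.getAs[String]("BB"))

// Same loop as in the answer, but without mkString/split.
bbValues.foreach { bbValue =>
  spark.sql(s"select * from test where BB = '$bbValue'").show()
}
```

Like the original loop, this runs one query per row, so repeated BB values produce repeated output; calling `bbValues.distinct` before the loop would avoid that. The same by-name lookup should carry over to PySpark, where collected rows can be indexed as `row["BB"]`.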

Please let me know if you have any questions.
