
Hi, I am trying to iterate over a PySpark DataFrame without using spark_df.collect(). I have tried the foreach and map methods:

```
df.foreach(lambda x: print(x))
```

and

```
def func1(x):
    firstName = x.firstname
    lastName = x.lastName
    name = firstName + "," + lastName
    gender = x.gender.lower()
    salary = x.salary * 2
    return (name, gender, salary)

rdd2 = df.rdd.map(lambda x: func1(x))
```

Is there any other way to iterate over the DataFrame?
thebluephantom
  • see [`toLocalIterator()`](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toLocalIterator.html#pyspark-sql-dataframe-tolocaliterator) – samkart Aug 22 '22 at 05:21
  • py4j.security.Py4JSecurityException: Method public static java.lang.Object[] org.apache.spark.api.python.PythonRDD.toLocalIteratorAndServe(org.apache.spark.rdd.RDD,boolean) is not whitelisted on class class org.apache.spark.api.python.PythonRDD got this error – Jeevan Kande Aug 22 '22 at 08:20
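
For reference, a minimal sketch of the `toLocalIterator()` approach suggested above (column names assumed from the question); it streams rows to the driver one partition at a time rather than all at once like `collect()`:

```
# Iterate rows on the driver without materializing the whole DataFrame
# at once; each partition is fetched as the iterator reaches it.
for row in df.toLocalIterator():
    print(row["firstname"], row["lastName"])
```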

2 Answers


First of all, Spark is not made for this kind of operation, like printing each record; it is built for distributed processing. Tune your process to work in a distributed fashion, for example in terms of joins - that will unleash the power of Spark.
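
For instance, the transformation in `func1` from the question can be written with built-in column expressions, so Spark can optimize it and run it in parallel. A minimal sketch, assuming the column names from the question:

```
from pyspark.sql import functions as F

# Same logic as func1, expressed as native column expressions that
# Spark executes in a distributed fashion (no per-row Python loop).
result = df.select(
    F.concat_ws(",", F.col("firstname"), F.col("lastName")).alias("name"),
    F.lower(F.col("gender")).alias("gender"),
    (F.col("salary") * 2).alias("salary"),
)
```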

If you want to process each record, UDFs (User-Defined Functions) are a good way to do that; a UDF is applied once to each record.
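
A minimal UDF sketch, again assuming the question's column names:

```
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# The UDF is invoked once per record; here it builds the combined name.
@F.udf(returnType=StringType())
def make_name(first, last):
    return first + "," + last

df_named = df.withColumn("name", make_name(F.col("firstname"), F.col("lastName")))
```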

s510

We can convert the DataFrame to pandas and iterate over its rows (note that, like `collect()`, `toPandas()` brings the full dataset to the driver):

```
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    print(row['itm_mtl_no'], row['itm_src_sys_cd'])
```