
Hi, I am trying to iterate over a PySpark DataFrame without using spark_df.collect(). I have tried the foreach and map methods:

```
df.foreach(lambda x: print(x))
```

and

```
def func1(x):
    firstName = x.firstname
    lastName = x.lastName
    name = firstName + "," + lastName
    gender = x.gender.lower()
    salary = x.salary * 2
    return (name, gender, salary)

rdd2 = df.rdd.map(lambda x: func1(x))
```

Is there any other way to iterate over the DataFrame?
thebluephantom
  • see [`toLocalIterator()`](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toLocalIterator.html#pyspark-sql-dataframe-tolocaliterator) – samkart Aug 22 '22 at 05:21
  • py4j.security.Py4JSecurityException: Method public static java.lang.Object[] org.apache.spark.api.python.PythonRDD.toLocalIteratorAndServe(org.apache.spark.rdd.RDD,boolean) is not whitelisted on class class org.apache.spark.api.python.PythonRDD got this error – Jeevan Kande Aug 22 '22 at 08:20
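
For reference, a minimal sketch of the `toLocalIterator()` approach suggested above (column names assumed from the question); it streams rows to the driver one partition at a time rather than all at once like `collect()`:

```
# Iterate rows on the driver without materializing the whole DataFrame
# at once; each partition is fetched as the iterator reaches it.
for row in df.toLocalIterator():
    print(row["firstname"], row["lastName"])
```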

2 Answers


First of all, Spark is not made for this kind of operation, like printing each record; it is built for distributed processing. Tune your process to work in a distributed fashion, for example in terms of joins - that will unleash the power of Spark.
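
For instance, the transformation in `func1` from the question can be written with built-in column expressions, so Spark can optimize it and run it in parallel. A minimal sketch, assuming the column names from the question:

```
from pyspark.sql import functions as F

# Same logic as func1, expressed as native column expressions that
# Spark executes in a distributed fashion (no per-row Python loop).
result = df.select(
    F.concat_ws(",", F.col("firstname"), F.col("lastName")).alias("name"),
    F.lower(F.col("gender")).alias("gender"),
    (F.col("salary") * 2).alias("salary"),
)
```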

If you want to process each record, UDFs (User-Defined Functions) are a good way to do that; a UDF is applied once to each record.
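
A minimal UDF sketch, again assuming the question's column names:

```
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# The UDF is invoked once per record; here it builds the combined name.
@F.udf(returnType=StringType())
def make_name(first, last):
    return first + "," + last

df_named = df.withColumn("name", make_name(F.col("firstname"), F.col("lastName")))
```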

s510

We can convert the DataFrame to pandas and iterate over its rows (note that, like `collect()`, `toPandas()` brings the full dataset to the driver):

```
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    print(row['itm_mtl_no'], row['itm_src_sys_cd'])
```