I am not sure whether the question itself is well posed. The solutions I have found for standard SQL either do not work in Hive SQL or rely on recursion, which is not allowed. So I would like to solve the problem in PySpark and need a solution, or at least ideas on how to tackle it.

I have an original table which looks like this:

+--------+----------+
|customer|nr_tickets|
+--------+----------+
|       A|         3|
|       B|         1|
|       C|         2|
+--------+----------+

This is how I want the table:

+--------+
|customer|
+--------+
|       A|
|       A|
|       A|
|       B|
|       C|
|       C|
+--------+
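
For reference, the input DataFrame can be reproduced like this (a minimal sketch, assuming an existing SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse the existing session if there is one
df = spark.createDataFrame(
    [("A", 3), ("B", 1), ("C", 2)],
    ["customer", "nr_tickets"],
)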

Do you have any suggestions?

Thank you very much in advance!

– xmatadorx

3 Answers

1

For Spark 2.4+, use array_repeat with explode.

from pyspark.sql import functions as F

# array_repeat builds an array with `customer` repeated nr_tickets times;
# explode then turns every array element into its own row
df.selectExpr("explode(array_repeat(customer, cast(nr_tickets as int))) as customer").show()

#+--------+
#|customer|
#+--------+
#|       A|
#|       A|
#|       A|
#|       B|
#|       C|
#|       C|
#+--------+
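
If array_repeat is not available (Spark versions before 2.4, as the comments below suggest for the asker's cluster), a minimal fallback sketch is a Python UDF that builds the repeated array explicitly:

from pyspark.sql import functions as F, types as T

# UDF returning the customer value repeated nr_tickets times as an array
repeat_customer = F.udf(lambda c, n: [c] * int(n), T.ArrayType(T.StringType()))

df.select(F.explode(repeat_customer("customer", "nr_tickets")).alias("customer")).show()

This is slower than the built-in array_repeat because every row passes through Python, but it only relies on APIs available in older Spark 2.x releases.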
– murtihash
  • Hi, thanks for your solution. I've tried it out, but it says that array_repeat is not known: "Undefined function: 'array_repeat'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'." Unfortunately, checking my Spark version doesn't work either, but it should be above 2. – xmatadorx Sep 22 '20 at 07:12
  • Your Spark version is less than 2.4; this approach will not work for you. – murtihash Sep 22 '20 at 12:56
  • Stupid question, but how do I find out my Spark version? sc.version didn't work. – xmatadorx Sep 23 '20 at 12:19
0

You can make a new dataframe by iterating over the rows (groups).

First, make a list of Rows containing the customer (Row(customer=a["customer"])), repeated nr_tickets times for that customer using range(int(a["nr_tickets"])):

df_list = df_list + [Row(customer=a["customer"]) for _ in range(int(a["nr_tickets"]))]

You can store and append these in a list and later make a dataframe from it.

df = spark.createDataFrame(df_list)

Overall,

from pyspark.sql import Row

df_list = []
for a in df.select(["customer", "nr_tickets"]).collect():
    # append one Row per ticket for this customer
    df_list = df_list + [Row(customer=a["customer"]) for _ in range(int(a["nr_tickets"]))]

df = spark.createDataFrame(df_list)
df.show()

You can also do it with a list comprehension:

from pyspark.sql import Row
from functools import reduce  # Python 3: reduce lives in functools

# one sub-list of repeated Rows per customer
df_list = [
    [Row(customer=a["customer"])] * int(a["nr_tickets"])
    for a in df.select(["customer", "nr_tickets"]).collect()
]

# flatten the list of lists and build the dataframe
df = spark.createDataFrame(reduce(lambda x, y: x + y, df_list))
df.show()

Produces

+--------+
|customer|
+--------+
|       A|
|       A|
|       A|
|       B|
|       C|
|       C|
+--------+
– A.B
  • Hi, thanks for your help. The list generated by the "Overall" version of your code is indeed correct. But the conversion to a df somehow doesn't work for me and throws this error when I df.show(): – xmatadorx Sep 22 '20 at 08:25
  • Py4JJavaError: An error occurred while calling o153.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 12, sdeb-hdpdn-q3014a.sys.schwarz, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/hadoop/disk10/hadoop/yarn/local/usercache/bnem2103/appcache/application_1598371445148_44504/container_e247_1598371445148_44504_01_000003/pyspark.zip/pyspark/worker.py", line 125, in main ("%d.%d" % sys.version_info[:2], version)) – xmatadorx Sep 22 '20 at 08:25
  • Exception: Python in worker has different version 3.6 than that in driver 3.8, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. – xmatadorx Sep 22 '20 at 08:25
  • I get the same /similar error when using the "list comprehension" version. – xmatadorx Sep 22 '20 at 08:29
  • where are you running spark? – A.B Sep 22 '20 at 08:30
  • Please refer to this question for python version issue with spark driver https://stackoverflow.com/questions/38936150/spark-exception-python-in-worker-has-different-version-3-4-than-that-in-driver – A.B Sep 22 '20 at 08:39
  • I am running pyspark in Jupyter Notebook. The session itself runs on a Hadoop/Hive Cluster. Sorry I am not that into the technical details, version numbers etc. Just user =) – xmatadorx Sep 22 '20 at 08:45
  • I suggest following this answer: create an env with conda and assign it to the driver, like mentioned in the answer here https://stackoverflow.com/a/38936300/3680831 – A.B Sep 22 '20 at 08:52
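
For reference, the error means the executors and the driver run different Python minor versions. A minimal sketch of pinning both to the same interpreter (the path is a placeholder) looks like this:

import os

# Both values must point at the same Python minor version and must be set
# before the SparkSession / SparkContext is created (placeholder path).
os.environ["PYSPARK_PYTHON"] = "/path/to/python3.6"         # used by the executors
os.environ["PYSPARK_DRIVER_PYTHON"] = "/path/to/python3.6"  # used by the driver

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()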
0

In the meantime I have also found a solution myself:

for i in range(1, max_nr_of_tickets):
    # keep rows that still have tickets left, add a fresh copy of the original
    # rows, then decrement the remaining ticket count
    table = table.filter(F.col('nr_tickets') >= 1).union(test)
    table = table.withColumn('nr_tickets', F.col('nr_tickets') - 1)

Explanation: the DFs "table" and "test" are identical at the beginning, and "max_nr_of_tickets" is simply the highest "nr_tickets". It works. I am only struggling with the format of the max number:

max_nr_of_tickets = df.select(F.max('nr_tickets')).collect()

I cannot use the result in the for loop's range because it is a list, so for now I enter the highest number manually. Any ideas on how I could get max_nr_of_tickets into the right format so that the loop's range will accept it?
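
For reference, collect() returns a list of Row objects, so a minimal sketch for extracting the value as a plain integer is to index into the first Row and its first column:

max_nr_of_tickets = int(df.select(F.max('nr_tickets')).collect()[0][0])
# or, equivalently:
max_nr_of_tickets = int(df.agg(F.max('nr_tickets')).first()[0])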

Thanks

– xmatadorx