
I've been stuck on this problem for quite a while and am probably making it bigger than it really is. I will try to simplify it.

I'm using PySpark and DataFrame functions throughout my code.

I already have a df like this:

+--+-----+---------+
|id|col1 |col2     |
+--+-----+---------+
|1 |Hello|Repeat   |
|2 |Word |Repeat   |
|3 |Aux  |No repeat|
|4 |Test |Repeat   |
+--+-----+---------+
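
For reference, this sample df can be reproduced like this (assuming an already-created SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'Hello', 'Repeat'),
     (2, 'Word', 'Repeat'),
     (3, 'Aux', 'No repeat'),
     (4, 'Test', 'Repeat')],
    ['id', 'col1', 'col2'])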

What I want to achieve is to repeat a row three times whenever col2 is 'Repeat', appending an increasing counter (1, 2, 3) to col1's value in a new column, while rows with 'No repeat' simply keep col1's value in that new column:

+--+-----+---------+------+
|id|col1 |col2     |col3  |
+--+-----+---------+------+
|1 |Hello|Repeat   |Hello1|
|1 |Hello|Repeat   |Hello2|
|1 |Hello|Repeat   |Hello3|
|2 |Word |Repeat   |Word1 |
|2 |Word |Repeat   |Word2 |
|2 |Word |Repeat   |Word3 |
|3 |Aux  |No repeat|Aux   |
|4 |Test |Repeat   |Test1 |
|4 |Test |Repeat   |Test2 |
|4 |Test |Repeat   |Test3 |
+--+-----+---------+------+

My first approach was to use the withColumn operator to create a new column with a udf's help:

my_func = udf(lambda words: (words + str(i + 1 for i in range(3))), StringType())
df = df\
    .withColumn('col3', when(col('col2') == 'No Repeat', col('col1'))
                            .otherwise(my_func(col('col1'))))

But when I evaluate this with df.show(10, False) it throws an error. My guess is that withColumn produces exactly one value per existing row, so I just can't create more rows with it in that way.

So I decided to go for another approach, also without success, using rdd.flatMap:

test = df.rdd.flatMap(lambda row: (row if (row.col2== 'No Repeat') else (row.col1 + str(i+1) for i in range(3))))
print(test.collect())

But here I lose the df schema, and I cannot emit the full row in the else branch; it only gives me the col1 words plus their counter.
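
My best guess is that flatMap would have to return full Row objects and the result would then have to be rebuilt into a DataFrame, something like the rough sketch below, but that feels clunky and I'm not sure it's the proper DataFrame way to do it:

from pyspark.sql import Row

# Rough idea: emit one or more full Row objects per input row,
# then rebuild a DataFrame from the resulting RDD
def expand(row):
    if row.col2 == 'No repeat':
        return [Row(id=row.id, col1=row.col1, col2=row.col2, col3=row.col1)]
    return [Row(id=row.id, col1=row.col1, col2=row.col2, col3=row.col1 + str(i + 1))
            for i in range(3)]

expanded = spark.createDataFrame(df.rdd.flatMap(expand)) \
                .select('id', 'col1', 'col2', 'col3')
expanded.show()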

Do you know any proper way to solve this?

In the end, my problem is that I haven't found a proper way to create more rows based on column values, because I'm quite new to this world. Also, the answers I found don't seem to fit this problem.

Any help will be appreciated.

Als

1 Answer


One way is to use a condition to assign an array, then explode it:

import pyspark.sql.functions as F

# Assign an array of suffixes ('1','2','3') for 'Repeat' rows and a single empty
# string otherwise, explode that array into one row per element, then build col3
(df.withColumn("test", F.when(df['col2'] == 'Repeat',
       F.array([F.lit(str(i)) for i in range(1, 4)])).otherwise(F.array(F.lit(''))))
  .withColumn("col3", F.explode(F.col("test"))).drop("test")
  .withColumn("col3", F.concat(F.col("col1"), F.col("col3")))).show()

A neater version of the same, as suggested by @MohammadMurtazaHashmi, would look like:

# Build the concatenated values directly inside the array, then explode;
# the alias keeps the exploded column named col3 as in the output below
(df.withColumn("test", F.when(df['col2'] == 'Repeat',
     F.array([F.concat(F.col("col1"), F.lit(str(i))) for i in range(1, 4)]))
    .otherwise(F.array(F.col("col1"))))
    .select("id", "col1", "col2", F.explode("test").alias("col3"))).show()

+---+-----+---------+------+
| id| col1|     col2|  col3|
+---+-----+---------+------+
|  1|Hello|   Repeat|Hello1|
|  1|Hello|   Repeat|Hello2|
|  1|Hello|   Repeat|Hello3|
|  2| Word|   Repeat| Word1|
|  2| Word|   Repeat| Word2|
|  2| Word|   Repeat| Word3|
|  3|  Aux|No repeat|   Aux|
|  4| Test|   Repeat| Test1|
|  4| Test|   Repeat| Test2|
|  4| Test|   Repeat| Test3|
+---+-----+---------+------+
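
If you would rather keep a udf, as in your first attempt, the same pattern can still work as long as the udf returns an array that is then exploded. A rough sketch:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Rough sketch: the udf returns the repeated strings as an array,
# which explode then turns into separate rows
make_repeats = F.udf(lambda word: [word + str(i + 1) for i in range(3)],
                     ArrayType(StringType()))

(df.withColumn("col3", F.when(F.col("col2") == 'No repeat', F.array(F.col("col1")))
                        .otherwise(make_repeats(F.col("col1"))))
   .withColumn("col3", F.explode("col3"))).show()

The built-in array/concat/explode version above avoids the udf serialization overhead, so it is generally the better choice.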
anky
  • Thank you so much for your answer @anky. In the end I'm following a very similar approach to yours. I was avoiding the explode function for performance reasons, but for now explode should be fine. – Als Apr 16 '20 at 14:50
  • @anky you could also concat like this `F.array([F.concat(F.col("col1"),F.lit(str(i))) for i in range(1,4)]))` – murtihash Apr 16 '20 at 17:57
  • @MohammadMurtazaHashmi that is indeed neater. I suspected such methods should exist, as I am familiar with such operations in pandas but didn't know how to do them in PySpark. Thanks for enlightening me :) – anky Apr 16 '20 at 18:09