I'm stuck with this problem quite a while and probably making it bigger than really it is. I will try to simplify it.
I'm using pyspark and data frame functions along my code.
I already have a df as:
+--+-----+---------+
|id|col1 |col2 |
+--+-----+---------+
|1 |Hello|Repeat |
|2 |Word |Repeat |
|3 |Aux |No repeat|
|4 |Test |Repeat |
+--+-----+---------+
What I want to achieve is to repeat the df's rows when col2 is 'Repeat' increasing col1's values in value+1.
+--+-----+---------+------+
|id|col1 |col2 |col3 |
+--+-----+---------+------+
|1 |Hello|Repeat |Hello1|
|1 |Hello|Repeat |Hello2|
|1 |Hello|Repeat |Hello3|
|2 |Word |Repeat |Word1 |
|2 |Word |Repeat |Word2 |
|2 |Word |Repeat |Word3 |
|3 |Aux |No repeat|Aux |
|4 |Test |Repeat |Test1 |
|4 |Test |Repeat |Test2 |
|4 |Test |Repeat |Test3 |
+--+-----+---------+------+
My first approach was to use withColumn operator to create a new column with udf's help:
my_func = udf(lambda words: (words + str(i + 1 for i in range(3))), StringType())
df = df\
.withColumn('col3', when(col('col2') == 'No Repeat', col('col1'))
.otherwise(my_func(col('col1'))))
But when I evaluate this in a df.show(10,False) it's throw me an error. My guessing is because I just can't create more rows with withColumn function in that way.
So I decide to go for another approach with no success also. Using a rdd.flatMap:
test = df.rdd.flatMap(lambda row: (row if (row.col2== 'No Repeat') else (row.col1 + str(i+1) for i in range(3))))
print(test.collect())
But here I'm losing the df schema and I can not throw out the full row on the else condition, it only throw me the col1 words plus it's iterator.
Do you know any proper way to solve this?
At the end my problem is that I do not get a properly way to create more rows based on column values because I'm quite new in this world. Also answers that I found seems not to fit this problem.
All help will be appreciate.