
I need to duplicate (or triplicate) every row of my DataFrame.

I couldn't find anything about it; I just know that I need to use explode.

Example:

ID - Name
1     John
2     Maria
3     Charles

Output:

ID - Name
1     John
1     John
2     Maria
2     Maria
3     Charles
3     Charles

Thanks

thalesthales
  • Why don't you [union](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.union) the dataframe with itself? – cronoik May 04 '20 at 21:54
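As a plain-Python sketch (no Spark required) of what the suggested union-with-itself approach does, with a list of tuples standing in for DataFrame rows:

```python
# Plain-Python analogy: DataFrame.union keeps duplicate rows,
# so unioning a DataFrame with itself doubles every row.
rows = [(1, "John"), (2, "Maria"), (3, "Charles")]

doubled = rows + rows  # analogous to df.union(df)

print(sorted(doubled))
```

Note that unlike a SQL set union, `DataFrame.union` does not deduplicate, which is why this works for the duplicate case.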

1 Answer


You could use array_repeat with explode (Spark 2.4+).

For duplicate:

from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.array_repeat("Name",2)))

For triplicate:

df.withColumn("Name", F.explode(F.array_repeat("Name",3)))
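To see what the two steps do, here is a plain-Python sketch of the same pipeline (no Spark needed): `array_repeat(Name, n)` turns each value into a list of n copies, and `explode` flattens each list back into separate rows.

```python
names = ["John", "Maria", "Charles"]
n = 3  # triplicate

# Step 1: array_repeat(Name, 3) -> each value becomes a 3-element list
repeated = [[name] * n for name in names]

# Step 2: explode -> one output row per list element
exploded = [name for lst in repeated for name in lst]

print(exploded)
```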

For Spark < 2.4:

#duplicate
df.withColumn("Name", F.explode(F.array(*['Name'] * 2)))

#triplicate
df.withColumn("Name", F.explode(F.array(*['Name'] * 3)))

UPDATE:

To replicate each row a number of times given by another column, Support, you could use this (Spark 2.4+).

df.show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#+---+-------+-------+

from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(Support))"""))).show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#+---+-------+-------+
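The per-row behaviour above can be sketched in plain Python (no Spark needed), with Support driving how many copies of each row are emitted:

```python
rows = [(1, "John", 2), (2, "Maria", 4), (3, "Charles", 6)]

# array_repeat(Name, int(Support)) + explode: each row is emitted
# Support times, with ID and Support carried along unchanged.
result = [(i, name, s) for (i, name, s) in rows for _ in range(s)]

print(len(result))  # 2 + 4 + 6 rows in total
```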

For Spark 1.5+, use repeat, concat, substring, split and explode:

from pyspark.sql import functions as F
df.withColumn("Name", F.expr("""repeat(concat(Name,','),Support)"""))\
  .withColumn("Name", F.explode(F.expr("""split(substring(Name,1,length(Name)-1),',')"""))).show()
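The Spark 1.5 chain can be traced with plain Python string operations (no Spark needed): `repeat(concat(Name, ','), Support)` builds `"John,John,"`, `substring` drops the trailing comma, and `split` recovers one entry per copy for explode to turn into rows.

```python
name, support = "John", 2

# repeat(concat(Name, ',')) -> "John,John,"
joined = (name + ",") * support

# substring(..., 1, length - 1) drops the trailing comma -> "John,John"
trimmed = joined[:-1]

# split(..., ',') -> one entry per copy; explode then makes each a row
parts = trimmed.split(",")

print(parts)
```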
murtihash
  • Hey @Mohammad, do you know if it's possible to multiply the number of rows given a condition, for example there is a support column with number 2,4,6 and I'd like to explode accordingly to these numbers – thalesthales May 05 '20 at 19:06
  • What's your Spark version? And the Support column with 2, 4, 6 means replicate 2 times, 4 times, 6 times, right? – murtihash May 05 '20 at 19:08
  • Meaning that I can't fix this parameter to 2, 4, 6; it should be something that reads the column, like `df.withColumn("Name", F.explode(F.array_repeat("Name", F.col('parameter'))))` – thalesthales May 05 '20 at 19:32
  • @thalesthales did you check my update? The only way to keep it dynamic like that is to use an expression and pass an int value of parameter, like `df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(parameter))""")))` – murtihash May 05 '20 at 19:36
  • Hi @murtihash - thank you for sharing this neat solution. What if I'd like to simply duplicate, triplicate, 4x, etc. ALL of the columns in a given dataframe? Seeing your 'UPDATE' example, I can think of adding another column with all of its values set to 2, 3 or 4 to duplicate/triplicate/quadruple all rows. But I'm wondering if there's a more elegant way (without having to add that new column). Alternatively, I can have a `for` loop for X amount of time and just do something like `df_old = df.union(df_old)`, but that's also kind of not that clean. Thank you in advance for your answer! – user1330974 Jul 01 '23 at 01:17