
I need to duplicate (or triplicate) every row of my DataFrame.

I couldn't find anything about it; I just know that I need to use explode.

Example:

ID - Name
1     John
2     Maria
3     Charles

Output:

ID - Name
1     John
1     John
2     Maria
2     Maria
3     Charles
3     Charles

Thanks

thalesthales
  • Why don't you [union](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.union) the dataframe with itself? – cronoik May 04 '20 at 21:54
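As a plain-Python sketch (no Spark required) of what the suggested union-with-itself approach does, with a list of tuples standing in for DataFrame rows:

```python
# Plain-Python analogy: DataFrame.union keeps duplicate rows,
# so unioning a DataFrame with itself doubles every row.
rows = [(1, "John"), (2, "Maria"), (3, "Charles")]

doubled = rows + rows  # analogous to df.union(df)

print(sorted(doubled))
```

Note that unlike a SQL set union, `DataFrame.union` does not deduplicate, which is why this works for the duplicate case.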

1 Answer


You could use array_repeat with explode (Spark 2.4+).

For duplicate:

from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.array_repeat("Name",2)))

For triplicate:

df.withColumn("Name", F.explode(F.array_repeat("Name",3)))
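To see what the two steps do, here is a plain-Python sketch of the same pipeline (no Spark needed): `array_repeat(Name, n)` turns each value into a list of n copies, and `explode` flattens each list back into separate rows.

```python
names = ["John", "Maria", "Charles"]
n = 3  # triplicate

# Step 1: array_repeat(Name, 3) -> each value becomes a 3-element list
repeated = [[name] * n for name in names]

# Step 2: explode -> one output row per list element
exploded = [name for lst in repeated for name in lst]

print(exploded)
```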

For Spark < 2.4:

#duplicate
df.withColumn("Name", F.explode(F.array(*['Name'] * 2)))

#triplicate
df.withColumn("Name", F.explode(F.array(*['Name'] * 3)))

UPDATE:

To replicate each row a number of times given by another column, Support, you could use this (Spark 2.4+).

df.show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#+---+-------+-------+

from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(Support))"""))).show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#+---+-------+-------+
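The per-row behaviour above can be sketched in plain Python (no Spark needed), with Support driving how many copies of each row are emitted:

```python
rows = [(1, "John", 2), (2, "Maria", 4), (3, "Charles", 6)]

# array_repeat(Name, int(Support)) + explode: each row is emitted
# Support times, with ID and Support carried along unchanged.
result = [(i, name, s) for (i, name, s) in rows for _ in range(s)]

print(len(result))  # 2 + 4 + 6 rows in total
```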

For Spark 1.5+, use repeat, concat, substring, split and explode:

from pyspark.sql import functions as F
df.withColumn("Name", F.expr("""repeat(concat(Name,','),Support)"""))\
  .withColumn("Name", F.explode(F.expr("""split(substring(Name,1,length(Name)-1),',')"""))).show()
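The Spark 1.5 chain can be traced with plain Python string operations (no Spark needed): `repeat(concat(Name, ','), Support)` builds `"John,John,"`, `substring` drops the trailing comma, and `split` recovers one entry per copy for explode to turn into rows.

```python
name, support = "John", 2

# repeat(concat(Name, ',')) -> "John,John,"
joined = (name + ",") * support

# substring(..., 1, length - 1) drops the trailing comma -> "John,John"
trimmed = joined[:-1]

# split(..., ',') -> one entry per copy; explode then makes each a row
parts = trimmed.split(",")

print(parts)
```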
murtihash
  • Hey @Mohammad, do you know if it's possible to multiply the number of rows given a condition, for example there is a support column with number 2,4,6 and I'd like to explode accordingly to these numbers – thalesthales May 05 '20 at 19:06
  • What's your Spark version? And the Support column with 2, 4, 6 means replicate 2 times, 4 times, 6 times, right? – murtihash May 05 '20 at 19:08
  • Meaning that I can't fix this parameter to 2, 4, 6; it should be something that reads the column, like `df.withColumn("Name", F.explode(F.array_repeat("Name", F.col('parameter'))))` – thalesthales May 05 '20 at 19:32
  • @thalesthales did you check my update? The only way to keep it dynamic like that is to use an expression and pass an int value of parameter, like `df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(parameter))""")))` – murtihash May 05 '20 at 19:36
  • Hi @murtihash - thank you for sharing this neat solution. What if I'd like to simply duplicate, triplicate, 4x, etc. ALL of the columns in a given dataframe? Seeing your 'UPDATE' example, I can think of adding another column with all of its values set to 2, 3 or 4 to duplicate/triplicate/quadruple all rows. But I'm wondering if there's a more elegant way (without having to add that new column). Alternatively, I can have a `for` loop for X amount of time and just do something like `df_old = df.union(df_old)`, but that's also kind of not that clean. Thank you in advance for your answer! – user1330974 Jul 01 '23 at 01:17