
I have a dataframe in below format

Col-1  Col-2
a      d1
a      d2
x      d3
a      d4
f      d5
a      d6
a      d7

I want to merge the values in Col-2 by looking at consecutive duplicates in Col-1. We can see that `a` has two runs of consecutive duplicates, so it should merge d1+d2 and d6+d7 separately. The datatype of these columns is string; d1+d2 means concatenating string d1 with string d2.

The final output should be as shown below

Col-1  Col-2
a      d1+d2
x      d3
a      d4
f      d5
a      d6+d7
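For reference, this is exactly the behaviour of `itertools.groupby` in plain Python, which groups only *consecutive* equal keys — a quick sketch of the target transformation (not Spark code), using `+` as the joining character shown in the expected output:

```python
from itertools import groupby

rows = [("a", "d1"), ("a", "d2"), ("x", "d3"),
        ("a", "d4"), ("f", "d5"), ("a", "d6"), ("a", "d7")]

# groupby merges runs of adjacent rows sharing the same Col-1 key,
# so values in Col-2 are joined whenever Col-1 repeats consecutively
merged = [(key, "+".join(v for _, v in grp))
          for key, grp in groupby(rows, key=lambda r: r[0])]
print(merged)
# [('a', 'd1+d2'), ('x', 'd3'), ('a', 'd4'), ('f', 'd5'), ('a', 'd6+d7')]
```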
vish
  • What are the datatypes of your rows? Are they strings? Numeric? What does the `+` mean in `d1+d2`? – pault May 14 '18 at 20:59
  • The datatypes are string. + means simple concatenation; d1+d2 is concatenating string d1 with d2 – vish May 14 '18 at 21:12
  • You will probably need to add another column beforehand that indicates whether the previous row is a duplicate. From what I understand, it is unlikely that you will be able to preserve the ordering in an RDD after you perform some operation. See [here](https://stackoverflow.com/a/26047920/1586200). – Autonomous May 14 '18 at 21:30

1 Answer


You will need a column that defines the order of your DataFrame. If one does not already exist, you can create one using pyspark.sql.functions.monotonically_increasing_id.

import pyspark.sql.functions as f
df = df.withColumn("id", f.monotonically_increasing_id())

Next, you can use the technique described in this post to create segments for each set of consecutive duplicates:

from pyspark.sql import Window

globalWindow = Window.orderBy("id")
# running window from the first row up to and including the current row
upToThisRowWindow = globalWindow.rowsBetween(Window.unboundedPreceding, 0)

df = df.withColumn(
    "segment",
    f.sum(
        f.when(
            # flag a new segment whenever Col-1 differs from the previous row
            # (lag is null on the first row, so the condition falls through to 0)
            f.lag("Col-1", 1).over(globalWindow) != f.col("Col-1"),
            1
        ).otherwise(0)
    ).over(upToThisRowWindow) + 1
)
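The window logic above is a running sum of change flags. This pure-Python sketch (not Spark code) mirrors it, so you can check the segment ids it would assign on the example data:

```python
from itertools import accumulate

col1 = ["a", "a", "x", "a", "f", "a", "a"]

# 1 when the value differs from the previous row, else 0;
# the first row has no predecessor, so its flag is 0 (matching lag()'s null)
flags = [0] + [int(cur != prev) for prev, cur in zip(col1, col1[1:])]

# running sum of the flags, plus 1, gives each row's segment id
segments = [s + 1 for s in accumulate(flags)]
print(segments)  # [1, 1, 2, 3, 4, 5, 5]
```

Rows 1–2 and rows 6–7 land in the same segments (1 and 5), which is what lets the later groupBy merge them.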

Now you can group by the segment (keeping Col-1 in the grouping so it survives the aggregation) and use pyspark.sql.functions.collect_list to collect the values, then pyspark.sql.functions.concat_ws() to join them into a single string:

df = df.groupBy("segment", "Col-1")\
    .agg(f.concat_ws("+", f.collect_list("Col-2")).alias("Col-2"))\
    .drop("segment")
pault
  • How does it group the consecutive duplicates? – vish May 14 '18 at 22:00
  • @vish we use the lag function to check if the current row is equal to the previous. For each new value in "Col-1" a segment will be created. – pault May 14 '18 at 22:40
  • Is there something wrong with this answer? Please explain the downvote. (I can assure you that I'm not into revenge downvotes.) It would be helpful if I could understand what is wrong/unclear so that I can fix/improve it. – pault May 14 '18 at 22:42