
I want to split my PySpark dataframe into groups with a monotonically increasing trend and keep only the groups whose size is greater than 10.

Here is the part of the code I tried:

from pyspark.sql import functions as F, Window

df = df1.withColumn(
    "FLAG_INCREASE",
    F.when(
        F.col("x")
        > F.lag("x").over(Window.partitionBy("x1").orderBy("time")),
        1,
    ).otherwise(0),
)

I don't know how to group by consecutive ones in PySpark... If anyone has a better solution for this, please share.

The same thing can be done in pandas like this:

df = df1.groupby((df1['x'].diff() < 0).cumsum())

How can I convert this code to PySpark?

Example dataframe:

    x
0   1
1   2
2   2
3   2
4   3
5   3
6   4
7   5
8   4
9   3
10  2
11  1
12  2
13  3
14  4
15  5
16  5
17  6

Expected output:

group1:

   x
0  1
1  2
2  2
3  2
4  3
5  3
6  4
7  5

group2:

   x
0  1
1  2
2  3
3  4
4  5
5  5
6  6
– waqar ali

1 Answer


I'll map out all the steps (and keep all columns in the output) to replicate (df1['x'].diff() < 0).cumsum() which is easy to calculate using a lag.

However, it is important that your data has an ID column that records the order of the dataframe, because unlike pandas, Spark does not retain the dataframe's ordering (due to its distributed nature). For this example, I've assumed that your data has an ID column named idx, which is the index you printed in your example input.
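If you need to build such an input for testing, here's a minimal sketch (assuming an active SparkSession named spark; the values and the idx column are taken from the example input in the question):

# a sketch: build the example input with an explicit idx (order) column
# assumes an active SparkSession named `spark`
vals = [1, 2, 2, 2, 3, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 5, 6]
data_sdf = spark.createDataFrame(
    [(i, v) for i, v in enumerate(vals)],
    ['idx', 'val']
)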

# input data
data_sdf.show(5)

# +---+---+
# |idx|val|
# +---+---+
# |  0|  1|
# |  1|  2|
# |  2|  2|
# |  3|  2|
# |  4|  3|
# +---+---+
# only showing top 5 rows

# calculating the group column
# imports matching the `func` and `wd` aliases used below
import sys
from pyspark.sql import functions as func, Window as wd

data_sdf. \
    withColumn('diff_with_prevval',
               func.col('val') - func.lag('val').over(wd.orderBy('idx'))
               ). \
    withColumn('diff_lt_0',
               func.coalesce((func.col('diff_with_prevval') < 0).cast('int'), func.lit(0))
               ). \
    withColumn('diff_lt_0_cumsum',
               func.sum('diff_lt_0').over(wd.orderBy('idx').rowsBetween(-sys.maxsize, 0))
               ). \
    show()

# +---+---+-----------------+---------+----------------+
# |idx|val|diff_with_prevval|diff_lt_0|diff_lt_0_cumsum|
# +---+---+-----------------+---------+----------------+
# |  0|  1|             null|        0|               0|
# |  1|  2|                1|        0|               0|
# |  2|  2|                0|        0|               0|
# |  3|  2|                0|        0|               0|
# |  4|  3|                1|        0|               0|
# |  5|  3|                0|        0|               0|
# |  6|  4|                1|        0|               0|
# |  7|  5|                1|        0|               0|
# |  8|  4|               -1|        1|               1|
# |  9|  3|               -1|        1|               2|
# | 10|  2|               -1|        1|               3|
# | 11|  1|               -1|        1|               4|
# | 12|  2|                1|        0|               4|
# | 13|  3|                1|        0|               4|
# | 14|  4|                1|        0|               4|
# | 15|  5|                1|        0|               4|
# | 16|  5|                0|        0|               4|
# | 17|  6|                1|        0|               4|
# +---+---+-----------------+---------+----------------+

You can now use the diff_lt_0_cumsum column in your groupBy() for further processing, e.g. to drop the small groups as sketched below.
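To also apply the size filter from the question (keep only groups with more than 10 rows), one option is a count over the group column. Here, grp_sdf is a hypothetical name for the dataframe produced by the chain above with diff_lt_0_cumsum kept:

# a sketch: keep only the runs having more than 10 rows
# `grp_sdf` is assumed to hold the result of the chain above (incl. diff_lt_0_cumsum)
result_sdf = grp_sdf. \
    withColumn('grp_size', func.count('*').over(wd.partitionBy('diff_lt_0_cumsum'))). \
    filter(func.col('grp_size') > 10)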

– samkart
  • +1, but it is [recommended](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.rowsBetween.html) to use `wd.unboundedPreceding` rather than sys.maxsize directly – viggnah Aug 11 '22 at 06:09
  • @samkart what does rowsBetween(-sys.maxsize, 0) mean? – waqar ali Aug 11 '22 at 09:58
  • @waqarali - rows between unbounded preceding and current row – samkart Aug 11 '22 at 10:17
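Per the comment above, the same cumulative window can also be written with the documented named bounds instead of sys.maxsize; a sketch of the equivalent window spec:

# equivalent window spec using the named bounds
cumsum_win = wd.orderBy('idx').rowsBetween(wd.unboundedPreceding, wd.currentRow)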