
I have a PySpark DataFrame with 3 fields, as shown below (Unique Idn, visit time, new visit). I want to create a new field (visit seq, highlighted in yellow) that increments every time a new visit occurs (flagged as 1) for each unique idn.

[screenshot of the sample DataFrame with the desired visit seq column highlighted in yellow]

So every time there is a new visit for a given unique identifier, the counter should increment; it is essentially a visit number. For example, for Unique Idn 11, a visit starts at 1/11 6:24 and ends at 1/11 6:26, and all of its rows (from 1/11 6:24 to 1/11 6:26) are tagged as visit 2. An illustration of the desired output is shown below.
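Since the screenshot is not available, here is a hypothetical illustration of the desired visit seq column. Only the 6:24/6:26 times and the "visit 2" label come from the description above; the other rows and values are made up for illustration:

Unique Idn | visit time | new visit | visit seq
11         | 1/11 6:20  | 1         | 1
11         | 1/11 6:22  | 0         | 1
11         | 1/11 6:24  | 1         | 2
11         | 1/11 6:26  | 0         | 2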

user1403789
  • Does this answer your question? [Python Spark Cumulative Sum by Group Using DataFrame](https://stackoverflow.com/questions/45946349/python-spark-cumulative-sum-by-group-using-dataframe) – Nick ODell May 23 '23 at 18:57
  • To expand on that a little: It seems like you want to partition by `unique idn`, then order by `visit time`, then take a cumulative sum of `new visit`. – Nick ODell May 23 '23 at 18:59
  • So every time there is a new visit per unique identifier, it should increment the counter; it's a kind of visit number. For example, for Unique Idn 11, a visit starts at 1/11 6:24 and ends at 1/11 6:26 and is tagged as visit 2 from 1/11 6:24 to 1/11 6:26. – user1403789 May 23 '23 at 19:43

1 Answer


Ranking in PySpark can be achieved with window functions. A plain sequence number could be obtained with row_number, but here the visit counter is a cumulative sum of the new visit flag over a window partitioned by unique idn and ordered by visit time.

Snippet:

from pyspark.sql import Window as W
import pyspark.sql.functions as F

# Running total of the "new visit" flag per "unique idn", ordered by "visit time"
window_spec = (
    W.partitionBy("unique idn")
    .orderBy(F.col("visit time").asc())
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)

df_visit_seq = df.withColumn("visit_seq", F.sum(F.col("new visit")).over(window_spec))
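For reference, a minimal, self-contained sketch of how this could be verified end to end. The SparkSession setup and sample rows are assumptions added here (they are not part of the original answer) and only mirror the example described in the question:

from pyspark.sql import SparkSession, Window as W
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: two visits for idn 11, one visit for idn 12
data = [
    (11, "2023-01-11 06:20", 1),  # first visit starts
    (11, "2023-01-11 06:22", 0),
    (11, "2023-01-11 06:24", 1),  # second visit starts
    (11, "2023-01-11 06:26", 0),
    (12, "2023-01-11 06:24", 1),  # counter restarts for a different idn
]
df = spark.createDataFrame(data, ["unique idn", "visit time", "new visit"])

window_spec = (
    W.partitionBy("unique idn")
    .orderBy(F.col("visit time").asc())
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)

df.withColumn("visit_seq", F.sum("new visit").over(window_spec)) \
  .orderBy("unique idn", "visit time") \
  .show()
# unique idn 11 gets visit_seq 1, 1, 2, 2; unique idn 12 gets 1

Because the window is partitioned by unique idn, the cumulative sum restarts for each identifier, which is exactly the per-identifier visit number asked for in the question.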
Jeremy Caney