
I should start off by saying that I am quite new to both Python and PySpark; most of my experience is in MS SQL, C#, VB.NET, etc.

I have a dataframe to which I want to add a 'group_number' field. I need this number to increment based on a datetime field, and reset based on a value field. So I would expect output such as:

+-----+----------------+-------------+
|value|datetime        |group_number |
+-----+----------------+-------------+
|00001|2020-01-01 00:00|1            |
|00001|2020-01-01 02:10|2            |
|00001|2020-01-01 05:14|3            |
|00002|2020-01-01 00:03|1            |
|00002|2020-01-01 02:04|2            |
|00003|2020-01-01 03:03|1            |
+-----+----------------+-------------+

The datetime values are kind of irrelevant, in that they can start and end at different points and increment by different amounts within each group; I just need a number (1 to x) that orders the rows within each 'value' chronologically.
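In case it helps to reproduce this, the input (before the new column is added) can be built from the sample rows above; spark is assumed to be an existing SparkSession:

    # sample input rows matching the table above
    data = [
        ("00001", "2020-01-01 00:00"),
        ("00001", "2020-01-01 02:10"),
        ("00001", "2020-01-01 05:14"),
        ("00002", "2020-01-01 00:03"),
        ("00002", "2020-01-01 02:04"),
        ("00003", "2020-01-01 03:03"),
    ]
    df = spark.createDataFrame(data, ["value", "datetime"])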

I have written a UDF to try to do this, but I don't think it orders the rows properly; I just end up with mostly '1' values and the occasional '2'.

The UDF definition is:

    def createGroupID(value):
        global iterationCount
        global currentValue

        # same 'value' as the previous call: keep counting up
        if value == currentValue:
            iterationCount = iterationCount + 1
            return iterationCount

        # new 'value': reset the counter and remember the value
        iterationCount = 1
        currentValue = value
        return iterationCount

The two global variables are initialised in the main application, and the UDF is called like this:

    createCountNumber = udf(createGroupID, StringType())
    newdf = df.withColumn("group_number", createCountNumber('value'))
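For completeness, the setup in the main application looks roughly like this (reconstructed here, so the exact initial values are assumptions):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # globals shared with the UDF above
    iterationCount = 0
    currentValue = None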

If anyone can help me with this I'd be really grateful! Thanks a lot.

2 Answers

Thanks to Anil for pointing me in the right direction. I found my way to the full solution at Groupby cumcount in PySpark (https://stackoverflow.com/questions/55617034/groupby-cumcount-in-pyspark).

I needed to add the following:

    w = Window.partitionBy("value")
    df = df.withColumn("count", count("*").over(w))\
        .withColumn("group_number", row_number().over(w.orderBy("datetime")))

And now I have exactly what I needed!

Oh, I also needed a couple of imports so I could use everything in the above code block (Window lives in pyspark.sql rather than pyspark.sql.functions):

    from pyspark.sql import Window
    from pyspark.sql.functions import col, size, lit, udf, concat, row_number, count, when
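A trimmed-down version also works, since the ordering can go straight into the window definition; the extra count column is only needed if you actually want the group sizes. A quick sketch, assuming df has the value and datetime columns from the question:

    from pyspark.sql import Window
    from pyspark.sql.functions import row_number

    # number the rows within each 'value', ordered chronologically
    w = Window.partitionBy("value").orderBy("datetime")
    newdf = df.withColumn("group_number", row_number().over(w))
    newdf.orderBy("value", "datetime").show(truncate=False)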
  • I want to do a similar thing. I want to increment group_value when value is a specific number like '123' and increment otherwise. – user812142 May 11 '22 at 00:20

Read the data into a pandas DataFrame. Then you can sort by the datetime, group by the value and do a cumulative count:

    import pandas as pd

    df = pd.read_excel(r'file_path')
    # sort chronologically, then number the rows within each 'value'
    df = df.sort_values('datetime')
    df['seq'] = df.groupby('value').cumcount() + 1
    df
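If you want to test this without an Excel file, the same sample rows from the question can be built directly; just a sketch:

    import pandas as pd

    # the sample rows from the question
    df = pd.DataFrame({
        "value": ["00001", "00001", "00001", "00002", "00002", "00003"],
        "datetime": ["2020-01-01 00:00", "2020-01-01 02:10", "2020-01-01 05:14",
                     "2020-01-01 00:03", "2020-01-01 02:04", "2020-01-01 03:03"],
    })
    # sort chronologically, number within each 'value'; assignment aligns by index
    df["seq"] = df.sort_values("datetime").groupby("value").cumcount() + 1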
– Anil Kumar
  • Hi Anil - thanks for responding so quickly to my problem. I plan to migrate this over to AWS EMR at some point and I am not sure whether pandas is okay to use on that. Also, it is likely that my dataset could sometimes be really big. Are either of these likely to cause me a problem? (in the meantime, I am trying to test your suggestion out, thanks) – Simon Rathbone Jan 23 '20 at 15:44
  • For your requirement, the link below may be useful. Please check: https://stackoverflow.com/questions/41890485/aggregate-function-count-usage-with-groupby-in-spark – Anil Kumar Jan 23 '20 at 15:50
  • Hi Anil - thanks for the link. It is helpful but the one step I am missing is to increment the count in order of datetime rather than getting just a count of them.... As a side note, I managed to get the Pandas option coded up but I immediately hit a Java heap space error....not too surprising I guess with such a large dataset. If nothing else that reduces the options I have :) – Simon Rathbone Jan 27 '20 at 10:01
  • I just found this page: https://stackoverflow.com/questions/55617034/groupby-cumcount-in-pyspark which I *think* may be able to help me. I am going to wade through this to see if it works and will post back once I get somewhere. Thanks again Anil - your suggestions look to have pointed me in the right direction. Really appreciate it. – Simon Rathbone Jan 27 '20 at 10:05