I should start by saying that I am quite new to both Python and PySpark; most of my experience is in MS SQL, C#, VB.NET, and so on.
I have a DataFrame to which I want to add a 'group_number' field. The number should increment based on a datetime field and reset based on a value field, so I would expect output such as:
+-----+----------------+-------------+
|value|datetime |group_number |
+-----+----------------+-------------+
|00001|2020-01-01 00:00|1 |
|00001|2020-01-01 02:10|2 |
|00001|2020-01-01 05:14|3 |
|00002|2020-01-01 00:03|1 |
|00002|2020-01-01 02:04|2 |
|00003|2020-01-01 03:03|1 |
+-----+----------------+-------------+
The actual datetime values are irrelevant: they can start and end at different points and increment by different amounts within each group. I just need a number (1 to x) that orders the rows within each 'value' group chronologically. (Coming from MS SQL, this is what I would get from ROW_NUMBER() OVER (PARTITION BY value ORDER BY datetime).)
I have written a UDF to try to do this, but I don't think it orders the rows properly; I just end up with mostly '1' values and the occasional '2'.
The udf definition is:
def createGroupID(value):
    global iterationCount
    global currentValue
    if value == currentValue:
        iterationCount = iterationCount + 1
        return iterationCount
    iterationCount = 1
    currentValue = value
    return iterationCount
The two global variables are initialised in the main application, and the UDF is registered and called as:
createCountNumber = udf(createGroupID, StringType())
newdf = df.withColumn("group_number", createCountNumber('value'))
If anyone can help me with this I'd be really grateful! Thanks a lot.