
Consider the following data:

EventDate,Value
1.1.2019,11
1.2.2019,5
1.3.2019,6
1.4.2019,-15
1.5.2019,-20
1.6.2019,-30
1.7.2019,12
1.8.2019,20

I want to create groups of rows whose values fall within these thresholds:

 1. > 10
 2. <=10 >=-10
 3. < -10

The result should contain the start and end date of each run of values in a certain state:

1.1.2019, 1.1.2019, [11]
1.2.2019, 1.3.2019, [5, 6]
1.4.2019, 1.6.2019, [-15, -20, -30]
1.7.2019, 1.8.2019, [12, 20]

I believe the answer lies in window functions, but I am fairly new to Databricks and I can't understand how to use them (yet).

Here is a working (Python) solution based on looping through the dataframe as a list; however, I would prefer a solution that works directly on the dataframe for performance.

STATETHRESHOLDCHARGE = 10

# Collect the rows, sorted by date, to the driver and iterate over them as plain dicts
rows = [{"eventDateTime": x["EventDate"], "value": x["Value"]}
        for x in dataframe.sort(dataframe.EventDate).rdd.collect()]

cycles = []
previous = None
for row in rows:
  # classify the value into one of the three states
  currentState = 'charge'
  if row["value"] < STATETHRESHOLDCHARGE and row["value"] > (STATETHRESHOLDCHARGE * -1):
    currentState = 'idle'
  if row["value"] <= (STATETHRESHOLDCHARGE * -1):
    currentState = 'discharge'

  # start a new cycle on a state change, otherwise extend the current one
  if previous is None or previous["state"] != currentState:
    previous = {"start": row["eventDateTime"], "end": row["eventDateTime"],
                "values": [row["value"]], "timestamps": [row["eventDateTime"]],
                "state": currentState}
    cycles.append(previous)
  else:
    previous["end"] = row["eventDateTime"]
    previous["values"].append(row["value"])
    previous["timestamps"].append(row["eventDateTime"])

display(cycles)
ruffen
  • Can you post what you have tried so far and how you have stored the data, or at least provide an easy-to-reproduce setup with the dataset above that someone can just paste into a notebook and run? – ChrisWue Aug 21 '19 at 22:20

2 Answers


Assuming you have the above data in a `df` data frame, let's take this piece by piece.

from pyspark.sql.functions import col, last, lag, udf, when, collect_list
from pyspark.sql.types import StringType
value = 'value'
date = 'EventDate'
valueBag = 'valueBag'

def bagTransform(v):
  if v > 10:
    return 'charging'
  elif v < -10:
    return 'discharging'
  else:
    return 'idle'

bagTransformUDF = udf(bagTransform, StringType())  

withBaggedValue = df.withColumn(valueBag, bagTransformUDF(col(value)))
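
As an aside (not part of the original answer), the same bagging could likely be expressed with the built-in `when`/`otherwise` column functions instead of a Python UDF, which avoids the UDF serialization overhead. A minimal sketch, reusing the names defined above:

from pyspark.sql.functions import col, when

# same classification as bagTransform, expressed with built-in column expressions
withBaggedValue = df.withColumn(
    valueBag,
    when(col(value) > 10, 'charging')
    .when(col(value) < -10, 'discharging')
    .otherwise('idle'))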

So first we bagged the values into the ranges you declared; now we can use lag to move a window over the previous value:

from pyspark.sql import Window
windowSpec = Window.orderBy(date)
prevValueBag = 'prevValueBag'
bagBeginning = 'bagBeginning'

withLag = (withBaggedValue
  .withColumn(prevValueBag, lag(withBaggedValue[valueBag]).over(windowSpec)))

Now the fun part starts: we detect the change points and temporarily assign them the current event date, or null otherwise:

withInitialBeginnings = withLag.withColumn(bagBeginning,
    when((col(prevValueBag) != col(valueBag)) | col(prevValueBag).isNull(), col(date))
    .otherwise(None))

and fill them in using the last found value:

withFilledBeginnings = (withInitialBeginnings.withColumn(bagBeginning, 
                 last(col(bagBeginning), ignorenulls=True)
                 .over(windowSpec)))
display(withFilledBeginnings)

results table

With that set we can simply aggregate over the starting point:

aggregate = withFilledBeginnings.groupby(col(bagBeginning)).agg(collect_list(value))

display(aggregate)

aggregated results

If you also need the end date, you can do similar preprocessing using pyspark.sql.functions.lead, which works like lag but in the forward direction.
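
A minimal sketch of that forward-direction step, reusing the column names defined above (`nextValueBag` and `bagEnd` are new, illustrative names): it marks the last row of each bag with lead, fills the gap backwards with first over a forward-looking window, and then aggregates start, end and values together.

from pyspark.sql import Window
from pyspark.sql.functions import col, lead, first, when, collect_list

nextValueBag = 'nextValueBag'   # illustrative column names
bagEnd = 'bagEnd'

windowSpec = Window.orderBy(date)
# window from the current row to the end of the frame (forward direction)
forwardWindow = Window.orderBy(date).rowsBetween(0, Window.unboundedFollowing)

withEnds = (withFilledBeginnings
    # peek at the bag of the next row
    .withColumn(nextValueBag, lead(col(valueBag)).over(windowSpec))
    # the last row of a bag is where the next row's bag differs (or there is no next row)
    .withColumn(bagEnd, when((col(nextValueBag) != col(valueBag)) | col(nextValueBag).isNull(),
                             col(date)).otherwise(None))
    # propagate that end date backwards over the remaining rows of the same bag
    .withColumn(bagEnd, first(col(bagEnd), ignorenulls=True).over(forwardWindow)))

result = (withEnds
    .groupby(col(bagBeginning), col(bagEnd))
    .agg(collect_list(value).alias('values'))
    .orderBy(col(bagBeginning)))

display(result)

Note that, like the code above, this orders by the raw date strings, which happens to work for the sample data; a real dataset would want a proper date or timestamp column.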

Daniel
  • This does wonders, pretty big performance improvement as well. I also have a threshold for cycles to be counted and I can filter out cycles that are outside this threshold, but now I am stuck with cycles that are next to each other (if there is a short idle cycle between two charge cycles I want to merge these). Is there an easy way to merge rows based on whether they are equal to the previous one? – ruffen Sep 02 '19 at 13:09
  • I'm not sure if I understand you correctly. Could you provide sample data and the expected result? Generally, you can do a lot with data from lead / last and further aggregation or filtering. – Daniel Sep 02 '19 at 17:10
  • I will try and share a result as soon as possible, but basically the logic is: if end - start < threshold, remove the bin. This opens up the possibility for two bins to have an equal state: if you have charge, idle, charge and remove the idle state, you end up with charge, charge. Then I need to merge those two. I managed to solve it using your existing logic, but you might have a nicer solution. – ruffen Sep 02 '19 at 19:42

I created a csv file for testing my sample code with pandas in Python; the content of the test.csv file is as below.

A,B
1.1.2019,11
1.2.2019,5
1.3.2019,6
1.4.2019,-15
1.5.2019,-20
1.6.2019,-30
1.7.2019,12
1.8.2019,20

Because there is a limitation that a pandas.IntervalIndex cannot be constructed from pandas.Interval objects with different closed values (right, left, both, neither), I converted your defined threshold groups to the equivalent right-closed groups, as below.

    Your defined groups        |   The equivalent groups
 1. > 10       : (10,inf]      | >10       :  (10, inf]
 2. <=10 >=-10 : [-10,10]      | <=10 >-11 :  (-11,10]
 3. <-10       : (-inf,-10)    | <=-11     :  (-inf, -11]

Here is my sample code.

import pandas as pd
import numpy as np

df = pd.read_csv('test.csv')
# print(df)

# Right-closed bin edges equivalent to the three groups (the values are integers)
bins = np.array([-np.inf, -11, 10, np.inf])
"""
# Can not construct an IntervalIndex from Intervals with different closed sides
>>> left = pd.Interval(left=-np.Inf, right=-10, closed='neither')
>>> center = pd.Interval(left=-10, right=10, closed='both')
>>> right = pd.Interval(left=10, right=np.Inf, closed='neither')
>>> pd.IntervalIndex([left, center, right])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "\lib\site-packages\pandas\core\indexes\interval.py", line 151, in __new__
    verify_integrity=verify_integrity)
  File "\lib\site-packages\pandas\core\arrays\interval.py", line 157, in __new__
    data, validate_closed=closed is None)
  File "pandas\_libs\interval.pyx", line 483, in pandas._libs.interval.intervals_to_interval_bounds
ValueError: intervals must all be closed on the same side
"""
# print(bins)

# Assign each value to one of the three bins and group the rows by that bin
labels = ['left', 'center', 'right']
by = pd.cut(df['B'], bins, labels=labels)
# print(by)
groupby = df.groupby(by)

# For each bin, report the earliest date, the latest date and the list of values
result = pd.DataFrame(
    [(groupby['A'].min()[label],
      groupby['A'].max()[label],
      df['B'].take(groupby.indices.get(label)).get_values())
     for label in labels],
    index=['<-10', '>=-10 <=10', '>10'],
    columns=['min_A', 'max_A', 'Bs'])
print(result)

The result is as below.

               min_A     max_A               Bs
<-10        1.4.2019  1.6.2019  [-15, -20, -30]
>=-10 <=10  1.2.2019  1.3.2019           [5, 6]
>10         1.1.2019  1.8.2019     [11, 12, 20]


Peter Pan
  • It's close, but the problem with this solution is that I want the bin to be "closed" once the values have entered a new state. If < -10 is 'negative', -10 to 10 is 'neutral' and > 10 is 'positive', I need to be able to get a list of bins like negative, neutral, negative, neutral, positive. That means in the bins list there needs to be a >10 group first, then the <-10, then >=-10 <=10 and then a >10 group at the end, if that makes sense. I have solved it now by iterating a list, so I will post this solution as soon as I can, but I would prefer a dataframe solution if possible. – ruffen Aug 22 '19 at 10:25
  • @ruffen The `result` is a pandas dataframe. If you would prefer a pyspark dataframe, you just need to convert it via `spark_df = spark.createDataFrame(result)`. – Peter Pan Aug 22 '19 at 10:33