Suppose I have a dataframe with the following schema:
Application Name
Application ID
Application Version
Mean of Metric 1
Mean of Metric 2
Mean of Metric 3
Mean of Metric 4
... (20 Metric Columns)
Standard Deviation of Metric 1
Standard Deviation of Metric 2
Standard Deviation of Metric 3
Standard Deviation of Metric 4
... (20 Metric Columns)
The Application Version can be Version 1 or Version 2. For a given Application Name and Application ID, I need to verify whether the Version 2 mean of a particular metric falls within:
[Mean of Version 1 - Standard Deviation of Version 1, Mean of Version 1 + Standard Deviation of Version 1]
and add a result column to the Version 2 row with the value True or False. What would be the best way to do so?
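To make the expected result concrete, here is a minimal sketch of the check in plain Python, not Spark. The row layout and the column names (name, id, version, mean_metric_1, std_metric_1) are made up for illustration and only stand in for the real schema. Returning None when no Version 1 row exists is also just an assumption on my part.

```python
# Plain-Python sketch of the desired check (not Spark code).
# All column names here are hypothetical stand-ins for the real schema.

def flag_version2(rows, metrics):
    """Return the Version 2 rows, each with an added '<metric>_within_v1' column."""
    # Index the Version 1 rows by (Application Name, Application ID).
    v1 = {(r["name"], r["id"]): r for r in rows if r["version"] == "Version 1"}
    flagged = []
    for r in rows:
        if r["version"] != "Version 2":
            continue
        base = v1.get((r["name"], r["id"]))
        out = dict(r)
        for m in metrics:
            if base is None:
                # Assumption: no Version 1 baseline means no verdict.
                out[f"{m}_within_v1"] = None
            else:
                low = base[f"mean_{m}"] - base[f"std_{m}"]
                high = base[f"mean_{m}"] + base[f"std_{m}"]
                out[f"{m}_within_v1"] = low <= r[f"mean_{m}"] <= high
        flagged.append(out)
    return flagged


rows = [
    {"name": "app_a", "id": 1, "version": "Version 1", "mean_metric_1": 10.0, "std_metric_1": 2.0},
    {"name": "app_a", "id": 1, "version": "Version 2", "mean_metric_1": 11.5, "std_metric_1": 1.0},
    {"name": "app_b", "id": 2, "version": "Version 1", "mean_metric_1": 5.0, "std_metric_1": 0.5},
    {"name": "app_b", "id": 2, "version": "Version 2", "mean_metric_1": 6.0, "std_metric_1": 0.5},
]
result = flag_version2(rows, ["metric_1"])
```

Here the Version 2 row of app_a is flagged True (11.5 lies in [8.0, 12.0]) and the Version 2 row of app_b is flagged False (6.0 lies outside [4.5, 5.5]). I am looking for the dataframe equivalent of this, applied to all 20 metric columns.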
I have looked at GroupBy, GroupByKey, ReduceByKey, and CombineByKey, but have not been able to come up with a way to do this.
I tried to use this as a reference, but it seems I cannot use any of the available aggregation functions and would have to write a custom function here. I also need to do this for multiple columns.