Suppose I have a DataFrame with the following schema:

Application Name
Application ID
Application Version
Mean of Metric 1
Mean of Metric 2
Mean of Metric 3
Mean of Metric 4
... (20 Metric Columns)
Standard Deviation of Metric 1
Standard Deviation of Metric 2
Standard Deviation of Metric 3
Standard Deviation of Metric 4
... (20 Metric Columns)

The Application Version is either Version 1 or Version 2. For each Application Name and Application ID, I need to check whether the Version 2 mean of each metric falls within the interval:

[Mean of Version 1 - Standard Deviation of Version 1, Mean of Version 1 + Standard Deviation of Version 1]

and add a result column to the Version 2 row with the value True or False. What would be the best way to do this?
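
To make the intended check concrete, here is a minimal plain-Python sketch of the logic on a toy list of rows. It is not Spark code. The field names (`name`, `id`, `version`, `mean_metric_1`, `std_metric_1`, and so on) and the helper `flag_version2_rows` are hypothetical stand-ins for the schema above, and the sketch assumes at most one Version 1 row per Application Name and Application ID.

```python
# Plain-Python sketch of the check (not Spark code).
# All field names and the helper name are hypothetical stand-ins.

def flag_version2_rows(rows, metrics):
    """Return the Version 2 rows, each with a "<metric>_within_v1" flag per metric."""
    # Index the Version 1 rows by (Application Name, Application ID).
    v1_by_key = {
        (r["name"], r["id"]): r for r in rows if r["version"] == "Version 1"
    }
    flagged = []
    for r in rows:
        if r["version"] != "Version 2":
            continue
        base = v1_by_key.get((r["name"], r["id"]))
        out = dict(r)
        for m in metrics:
            if base is None:
                # No Version 1 baseline for this application: leave the flag unset.
                out[f"{m}_within_v1"] = None
                continue
            low = base[f"mean_{m}"] - base[f"std_{m}"]
            high = base[f"mean_{m}"] + base[f"std_{m}"]
            out[f"{m}_within_v1"] = low <= r[f"mean_{m}"] <= high
        flagged.append(out)
    return flagged


rows = [
    {"name": "app_a", "id": 1, "version": "Version 1",
     "mean_metric_1": 10.0, "std_metric_1": 2.0,
     "mean_metric_2": 5.0, "std_metric_2": 0.5},
    {"name": "app_a", "id": 1, "version": "Version 2",
     "mean_metric_1": 11.0, "std_metric_1": 1.0,
     "mean_metric_2": 6.0, "std_metric_2": 0.5},
]

result = flag_version2_rows(rows, ["metric_1", "metric_2"])
# metric_1: 11.0 lies in [8.0, 12.0]; metric_2: 6.0 lies outside [4.5, 5.5].
print(result[0]["metric_1_within_v1"], result[0]["metric_2_within_v1"])
```

In Spark I would presumably need something equivalent, such as pairing each Version 2 row with its Version 1 row on Application Name and Application ID, but I have not found how to express that.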

I have looked at GroupBy, GroupByKey, ReduceByKey, and CombineByKey, but I have not been able to come up with a way to do this.

I tried to use this as a reference, but it seems I cannot use any of the available aggregation functions and would have to use a custom function here. I also need to do this for multiple columns.

  • To validate my understanding: you want to check whether the mean of Version 2 is within the mean ± stdev of Version 1 for the same Application Name and ID, is that correct? – Islam Elbanna Jun 09 '23 at 10:50
