
All,

Having reviewed StackOverflow and the wider internet, I am still struggling to efficiently calculate Percentiles using LINQ.

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. The example below attempts to convert a list of values to an array in which each (unique) value is paired with its associated percentile. The min() and max() of the list are necessarily the 0% and 100% marks of the returned array's percentiles.

Using LINQPad, the code below generates the required output, a VP[]:

(screenshot: VP[] output with Value/Percentile pairs 1/0, 2/0.5, 3/1)

This can be interpreted as:

- At 0%, the minimum value is 1
- At 100%, the maximum value is 3
- At 50% between the minimum and maximum, the value is 2

void Main()
{
    var list = new List<double> {1,2,3};
    double denominator = list.Count - 1;   
    var answer = list.Select(x => new VP
        {
            Value = x,
            Percentile = list.Count(y => x > y) / denominator
        })
        //.GroupBy(grp => grp.Value) --> commented out until attempted duplicate solution 
        .ToArray();
    answer.Dump();
}

public struct VP
{
    public double Value;
    public double Percentile;
}

However, this returns an incorrect VP[] when the "list" contains duplicate entries (e.g. 1, 2, 2, 3):

(screenshot: VP[] with a duplicated Value = 2, each entry showing Percentile = 0.333)

My attempts to group by unique values in the list (by adding ".GroupBy(grp => grp.Value)") have failed to yield the desired result (Value = 2 and Percentile = 0.666):


All suggestions are welcome, including whether this is an efficient approach given the repeated iteration in "list.Count(y => x > y)".

As always, thanks,
Shannon

shansen
  • I am not clear as to what you are exactly trying to calculate, maybe I am a bit rusty on my math... could you maybe tell me exactly what you mean by proportion/percentile and how it related to the denominator. Thanks – TheCatWhisperer Dec 29 '14 at 02:54
  • Why is list.Count() being used to calculate the denominator? It's not really clear what you are trying to accomplish. – Rufus L Dec 29 '14 at 03:06
  • @TheCatWhisperer for each element in the original list (i.e. 1,2,3) I require the value and percentile (e.g. value = 2 marks the 50% point of the distribution. I am assembling a probability density function of sorts. Where the VP[] can be quickly referenced to determine 50% of elements are less than or equal to "2". – shansen Dec 29 '14 at 03:16
  • How are the percentiles distributed? Normally, Student, ect? – TheCatWhisperer Dec 29 '14 at 03:21
  • It would be really helpful if you re-worded your question and included a definition of what the list represents, and what Value and Proportion are. – Rufus L Dec 29 '14 at 03:22
  • Try calling Distinct() before your call to Select (e.g., list.Distinct().Select(...)). – ventaur Dec 29 '14 at 03:27
  • @RufusL, the minimum and maximum number in the list are the 0% and 100% marks in the Percentile distribution. The "denominator = list.Count() - 1" allows the percentile of the maximum number (i.e. "3") to be returned as 100% – shansen Dec 29 '14 at 03:32
  • Any reason not to just use `list.Max()` to get the largest number? Is the list sorted? As you mentioned before, the list can have multiple identical entries. Should the list be truncated to have only unique values first (i.e. `list = list.Distinct().ToList();`)? Will that affect your output if you're using a smaller list? Try to restate the question with all the constraints (or lack of) so you can get a better answer. – Rufus L Dec 29 '14 at 03:36
  • @TheCatWhisperer, the percentiles would most closely represent a logistic distribution. However, I am looking to use the empirical distribution of percentiles rather than approximate with a statistical distribution. – shansen Dec 29 '14 at 03:37
  • I am not sure what you want. You just want to remove the duplicated rows? However, this is not an efficient approach if you got more data. For example, 10000 rows of data will incur 10000^2 iteration. – tia Dec 29 '14 at 03:38
  • @ventaur, using .Distinct().Select(...) on the list (1,2,2,3) returns a VP[] which contains an element VP (value = 2, proportion = 0.333). This could be interpreted as 33% of elements are less than or equal to "2". Which is incorrect: 67% of elements are less than or equal to "2" for this calculation. – shansen Dec 29 '14 at 03:49
  • @tia, I am attempting to return a VP[] where the min and max values of the original list are associated with the percentiles 0% and 100% respectively. **Unique** values between the max and min values are returned with an associated percentile allowing for potential duplicates. – shansen Dec 29 '14 at 04:20
  • In the future, it would be much better if you edited your original question with more clarification, rather than (or in addition to) answering them in the comments! – Rufus L Dec 29 '14 at 05:48

3 Answers


I'm not sure I understand the requirements of this question. When I ran the accepted answer's code I got this result:

(screenshot: original result)

But if I change the input to this:

var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };

...I then get this result:

(screenshot: updated result)

With the line "The min() and max() of the list are necessarily the 0% and 100% of the returned array percentiles." it seems to me the OP is asking for the values to run from 0 to 1, but the updated result goes beyond 1.

It also seems wrong to me that the first value should be 0%, as I'm not sure what that means in the context of the data.

After reading the linked Wikipedia page it seems that the OP is actually trying to do the reverse calculation to computing the percentile value. In fact the article says that the percentile for 0 is undefined. That makes sense because a percentile of 0 would be the empty set of values - and what is the maximum value of an empty set?

The OP seems to be computing the percentile from the values. So, in that sense, and knowing that 0 is undefined, it seems that the most appropriate value to compute is the percentage of values that are equal to or below each distinct value in the set.

Now, if I use Microsoft's Reactive Framework team's Interactive Extensions (NuGet "Ix-Main"), then I can run this code:

var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };

var result =
    dataSet
        .GroupBy(x => x)
        .Scan(
            new VP()
            {
                Value = double.MinValue, Proportion = 0.0
            },
            (a, x) =>
                new VP()
                {
                    Value = x.Key,
                    Proportion = a.Proportion + (double)x.Count() / dataSet.Count
                });

I get this result:

(screenshot: result)

This tells me that approximately 44% of the values are 1; that approximately 67% of the values are 1 or 2; and 100% of the values are either 1, 2, or 3.

This seems to me to be the most logical computation for the requirements.
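For readers without the Ix-Main dependency, the same running-sum (Scan) logic can be sketched with plain LINQ and a loop. This is a rough equivalent of the answer's code (my addition, not part of the original answer), with an explicit OrderBy added on the assumption that the groups should be processed in ascending value order:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct VP
{
    public double Value;
    public double Proportion;
}

public static class Program
{
    public static void Main()
    {
        var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };

        // Group duplicates, then accumulate each group's share of the total
        // count to build the cumulative proportion (what Scan does above).
        double running = 0.0;
        var result = new List<VP>();
        foreach (var g in dataSet.GroupBy(x => x).OrderBy(g => g.Key))
        {
            running += (double)g.Count() / dataSet.Count;
            result.Add(new VP { Value = g.Key, Proportion = running });
        }

        foreach (var vp in result)
            Console.WriteLine($"{vp.Value}: {vp.Proportion:F3}");
        // 1: 0.444, 2: 0.667, 3: 1.000
    }
}
```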

Enigmativity
  • 113,464
  • 11
  • 89
  • 172
0
void Main()
{
    var list = new List<double> {1,2,3};
    double denominator = list.Count - 1;   
    var answer = list.OrderBy(x => x).Select(x => new VP
        {
            Value = x,
            // Note: IndexOf returns the first occurrence of x, so duplicate
            // values all receive the same (lowest) proportion.
            Proportion = list.IndexOf(x) / denominator
        })
        .ToArray();
    answer.Dump();
}

public struct VP
{
    public double Value;
    public double Proportion;
}
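As the comment thread notes, IndexOf returns the first occurrence, so duplicates share one proportion. A sketch (my addition, not part of the original answer) keeps the index-based idea but collapses duplicates by taking the highest sorted index per distinct value, which yields the 0.667 the OP asked for on (1, 2, 2, 3):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct VP
{
    public double Value;
    public double Proportion;
}

public static class Program
{
    public static void Main()
    {
        var list = new List<double> { 1, 2, 2, 3 };
        double denominator = list.Count - 1;

        // Pair each sorted value with its positional index, then keep the
        // highest index per distinct value so duplicates collapse to one entry.
        var answer = list.OrderBy(x => x)
            .Select((x, i) => new { Value = x, Index = i })
            .GroupBy(p => p.Value)
            .Select(g => new VP
            {
                Value = g.Key,
                Proportion = g.Max(p => p.Index) / denominator
            })
            .ToArray();

        foreach (var vp in answer)
            Console.WriteLine($"{vp.Value}: {vp.Proportion:F3}");
        // 1: 0.000, 2: 0.667, 3: 1.000
    }
}
```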
Rob
TheCatWhisperer
  • thanks for the reply. It makes sense your approach is more efficient. However, where duplicate entries exist (e.g. 1,2,2,3) the issue persists where the value "2" is duplicated in the returned VP[] each with the percentile of 33%, instead of a unique "2" with a percentile of 67% – shansen Dec 29 '14 at 04:13
  • I do not see how it is possible for 2 to come up twice when there are only 3 items in the list. Perhaps it is a bug in linqpad? – TheCatWhisperer Dec 29 '14 at 22:40

This is how I did it. I changed a few of the variable names to make the context clearer.

var dataSet = new List<double> { 1, 2, 3, 2 };
double denominator = dataSet.Count - 1;
var uniqueValues = dataSet.Distinct();

// For each entry, the fraction of entries strictly below it.
// Duplicates each contribute their own fraction here.
var vp = dataSet.Select(value => new VP
{
    Value = value,
    Proportion = dataSet.Count(datum => value > datum) / denominator
});

// Collapse duplicates: each unique value's percentile is the sum of the
// proportions of all of its occurrences (so 2 gets 1/3 + 1/3 = 2/3).
var answer = uniqueValues.Select(u => new VP{
    Value = u,
    Proportion = vp.Where(v => v.Value == u).Select(x => x.Proportion).Sum()
});
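A minimal self-contained harness (my addition, wrapping the answer's code in a Main) to check that (1, 2, 3, 2) produces Value = 2 with Proportion ≈ 0.667. It also materializes the inner query with ToArray(), since otherwise the deferred `vp` pipeline is re-run on every `Where` lookup:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct VP
{
    public double Value;
    public double Proportion;
}

public static class Program
{
    public static void Main()
    {
        var dataSet = new List<double> { 1, 2, 3, 2 };
        double denominator = dataSet.Count - 1;

        // ToArray() materializes the per-entry proportions once, so the
        // Where below does not re-run the whole Select pipeline per lookup.
        var vp = dataSet.Select(value => new VP
        {
            Value = value,
            Proportion = dataSet.Count(datum => value > datum) / denominator
        }).ToArray();

        var answer = dataSet.Distinct().Select(u => new VP
        {
            Value = u,
            Proportion = vp.Where(v => v.Value == u).Sum(x => x.Proportion)
        }).ToArray();

        foreach (var a in answer)
            Console.WriteLine($"{a.Value}: {a.Proportion:F3}");
        // 1: 0.000, 2: 0.667, 3: 1.000
    }
}
```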
  • Ed, thank you this is exactly what I was after. Apologies for my lack of clarity in defining the question. I will improve. – shansen Dec 29 '14 at 04:23
  • It appears to be highly inefficient. `Count(datum => value > datum)` will iterate over the entire collection over and over again. – MarcinJuraszek Dec 29 '14 at 04:35