
I have a dataframe that looks like the one below.

The weight column essentially represents the frequency of each item, so that for each location the weights sum to 1.

Please keep in mind that this is a simplified dataset; in reality there are more than 100 columns like `value`.

d = {'location': ['a', 'a', 'b', 'b'],'item': ['x', 'y', 's', 'v'], 'value': [1, 5, 3, 7], 'weight': [0.9, 0.1, 0.8, 0.2]}
df = pd.DataFrame(data=d)
df
  location item  value  weight
0        a    x      1     0.9
1        a    y      5     0.1
2        b    s      3     0.8
3        b    v      7     0.2

I currently have code which computes the grouped median, standard deviation, skew and quantiles for the unweighted data:

df = df[['location','value']]

df1 = df.groupby('location').agg(['median','skew','std']).reset_index()

df2 = df.groupby('location').quantile([0.1, 0.9, 0.25, 0.75, 0.5]).unstack(level=1).reset_index()

dfs = df1.merge(df2, how = 'left', on = 'location')

And the result is the following:

  location   value
             median skew      std  0.1  0.9 0.25 0.75  0.5
0      a         3  NaN  2.828427  1.4  4.6  2.0  4.0  3.0
1      b         5  NaN  2.828427  3.4  6.6  4.0  6.0  5.0

I would like to produce exactly the same result data frame as the one above, but with weighted statistics computed using the weight column. How can I go about doing this?

One more important consideration to note: there are often cases where value is null but still has a weight associated with it.

Mustard Tiger
  • Did my answer work as expected? Or is there anything you think is wrong with duplicating the observation? – Troll Nov 01 '21 at 04:14
  • @Mark, from what I understand yes, this is the right way to apply weights, just out of curiosity can you provide any references to indicate this is in fact the best practice/proper method? – Mustard Tiger Nov 03 '21 at 04:44
  • Please give me some time for research to give the best possible answer. – Troll Nov 03 '21 at 14:49
  • Could you add more information like what type of weights they are (frequency weights, probability weights, or merely importance weights)? You can refer to [this page](https://stats.idre.ucla.edu/other/mult-pkg/faq/what-types-of-weights-do-sas-stata-and-spss-support/) for more information on each type of weights. Depending on your answer, my approach could be decided whether it be valid or inaccurate. – Troll Nov 09 '21 at 15:57
  • Well, yes! If these are frequency weights then my approach is correct. Only a few adjustments are required. But I would like to know more about these null values. What would be the mean of `values = [3, nan]` with the corresponding weights `[0.1, 0.9]`? Is it 3 as if `nan` was ignored? – Troll Nov 10 '21 at 01:08
  • However, please don't divide the frequencies by the sum to make them sum to 1. Keep them as integers otherwise some statistics like `std`, `var`, and other related statistics would be inaccurate. For more information, you can have a look at [this post](https://stats.stackexchange.com/a/551573/335126) to learn more. – Troll Nov 10 '21 at 01:12
  • Yes, it would be like the example you gave: the null values are ignored. But would that mean that the weights need to be re-factored/adjusted? So then [3, nan] with [0.1, 0.9] would become [3] with [1]. How would the weights be adjusted to deal with nan using the method you provided below? Just to mention again, keep in mind my actual table has hundreds of value columns, so would each value column have to have its own weight column that reflects the presence of nan? – Mustard Tiger Nov 10 '21 at 01:37
  • @Mark, with regards to "However, please don't divide the frequencies by the sum to make them sum to 1": I don't understand what you mean here. – Mustard Tiger Nov 10 '21 at 01:40
  • I mean frequencies should be counting numbers. It looks like the original frequency was divided by some kind of value and became floats instead of integers. – Troll Nov 10 '21 at 01:43
  • Anyway, I have found a way to deal with relative frequencies. I just need someone to confirm that it is correct. – Troll Nov 10 '21 at 14:22
  • I have updated my answer. Regarding the null values, they are automatically ignored. – Troll Nov 12 '21 at 15:36

2 Answers


Because the weights are frequency weights, the most accurate method is to duplicate the observations according to the weights.

Adjusting the weights

Normally, frequencies are whole numbers. However, the frequencies here merely show how frequently an item appears relative to the other items of the same group. In this case, you can multiply all the weights by a value that makes the weights integers and use that value consistently throughout the dataset.

Here is a function that scales the weights to the smallest possible set of whole numbers, which keeps memory usage to a minimum when the observations are duplicated.

import numpy as np

def adjust(weights):
    # Scale by the power of 10 that clears every decimal place.
    base = 10 ** max(len(str(w).split(".")[1]) for w in weights)
    # Divide by the GCD of the scaled weights so the integers stay as small as possible.
    scalar = base / np.gcd.reduce(np.round(weights * base).astype(int))
    weights = weights * scalar

    return weights

In short, the function multiplies the weights by a power of ten large enough to remove the decimals, then divides by their greatest common divisor so that the resulting whole-number weights are as small as possible.

df = pd.DataFrame({
    "location": ["a", "a", "b", "b"],
    "values": [1, 5, 3, 7],
    "weights": [0.9, 0.1, 0.8, 0.2]
})

df.loc[:, "weights"] = adjust(df["weights"])

Here are the weights after the adjustment.

>>> df
  location  values  weights
0        a       1      9.0
1        a       5      1.0
2        b       3      8.0
3        b       7      2.0

Duplicating the observations

After adjusting the weights, you need to duplicate the observations according to their weights.

# Repeat each row label as many times as its (integer) weight, then drop the weight column.
df = df.loc[df.index.repeat(df["weights"].astype(int))] \
    .reset_index(drop=True).drop("weights", axis=1)

Each index label is repeated as many times as its weight, and `.loc` then selects those repeated labels, which produces the duplicated observations.

Let's count the number of observations after being duplicated.

>>> df.count()
location    20
values      20

Performing Statistical Operations

Now, you can use groupby and aggregate using any statistical operations. The data is now weighted.

df1 = df.groupby("location").agg(["median", "skew", "std"]).reset_index()
df2 = df.groupby("location").quantile([0.1, 0.9, 0.25, 0.75, 0.5]) \
    .unstack(level=1).reset_index()

print(df1.merge(df2, how="left", on="location"))

This gives the following output.

  location values
           median      skew       std  0.1  0.9 0.25 0.75  0.5
0        a    1.0  3.162278  1.264911  1.0  1.4  1.0  1.0  1.0
1        b    3.0  1.778781  1.686548  3.0  7.0  3.0  3.0  3.0

Interpreting the weighted statistics

Let's follow the same process as above, but instead of keeping the weights at their smallest possible values, we will gradually scale them up and compare the results. Because the weights are already at their minimum, every larger set of weights is simply a multiple of the current set. Only the following line changes.

df.loc[:, "weights"] = adjust(df["weights"])
  • adjust(df["weights"]) * 2

      location values
               median      skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.887939  1.231174  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.624466  1.641565  3.0  7.0  3.0  3.0  3.0
    
  • adjust(df["weights"]) * 3

      location values
               median     skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.80912  1.220514  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.58013  1.627352  3.0  7.0  3.0  3.0  3.0
    
  • adjust(df["weights"]) * 4

      location values
               median      skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.771708  1.215287  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.559086  1.620383  3.0  7.0  3.0  3.0  3.0
    

Repeat this process several times and we get the graph below. The statistics in the graph are not split into groups, and some other statistics are added for demonstration purposes.

[Graph: comparison of how the different statistics change as the observations are duplicated]
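
If you want to reproduce this experiment yourself, here is a minimal sketch (not part of the original code above). It assumes the minimal integer weights [9, 1, 8, 2] produced by adjust and rescales them by a factor k before duplicating and aggregating.

import pandas as pd

d = {"location": ["a", "a", "b", "b"],
     "values": [1, 5, 3, 7],
     "weights": [9, 1, 8, 2]}  # minimal integer weights from adjust()

for k in (1, 2, 3, 4):
    scaled = pd.DataFrame(d)
    # Scale the whole set of weights by the same factor.
    scaled["weights"] *= k
    # Duplicate the observations according to the scaled weights.
    expanded = scaled.loc[scaled.index.repeat(scaled["weights"])].drop("weights", axis=1)
    print(f"multiplier {k}:")
    print(expanded.groupby("location")["values"].agg(["median", "skew", "std"]), "\n")

The median stays fixed while skew and std drift slightly and converge, which matches the tables above.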

Some statistics like sample mean, median, and quantiles are always constant no matter how many times we duplicate the observations.

Some statistics, on the other hand, give different results depending on how many duplications we make. Let's call them inconsistent statistics for now.

There are two types of inconsistent statistics.

  1. Inconsistent statistics that are independent of the sample size

    For example: the sample variance, standard deviation, skewness, and kurtosis (the higher statistical moments)

    Independent here does not mean "not having the sample size in the equation". Notice how sample mean also has the sample size in its equation but it is still independent of the sample size.

    For these types of statistics, you cannot report one exact value, because the result varies with the sample size. However, you can still conclude, for example, that the standard deviation of group a is generally higher than the standard deviation of group b.

  2. Inconsistent statistics that are dependent on the sample size

    For example: standard error of the mean and sum

    Standard error, however, depends on the sample size. Let's have a look at its equation.

    SEM = s / √n, where s is the sample standard deviation and n is the sample size.

    We can view the standard error as the standard deviation divided by the square root of the sample size, so it is dependent on the sample size. The sum is also dependent on the sample size (a short sketch after this list illustrates how the standard error keeps changing while the standard deviation stabilizes).

    For these types of statistics, we cannot conclude anything because we are missing an important piece of information: the sample size.
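
As a concrete illustration (a small sketch, not part of the original answer, using pandas' built-in `Series.std` and `Series.sem`): take group a's expanded sample of nine 1s and one 5 and duplicate it several times. The standard deviation stays roughly stable, while the standard error of the mean keeps shrinking as the sample size grows.

import pandas as pd

# Group "a" expanded with its minimal integer weights: nine 1s and one 5.
base_sample = [1] * 9 + [5]

for k in (1, 2, 4, 8):
    s = pd.Series(base_sample * k)  # duplicate the whole sample k times
    print(f"n = {len(s):3d}   std = {s.std():.4f}   sem = {s.sem():.4f}")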

Troll

Instead of merging two groupby operations, use named aggregation after weighting the values:

  1. Generate weighted values using assign.
  2. Aggregate using {output_col: (input_col, agg_function), ...}.
dfs = df.assign(weighted=df.value * df.weight).groupby('location').agg(**{
    'median': ('weighted', 'median'),
    'skew': ('weighted', 'skew'),
    'std': ('weighted', 'std'),
    '0.1': ('weighted', lambda x: x.quantile(0.1)),
    '0.9': ('weighted', lambda x: x.quantile(0.9)),
    '0.25': ('weighted', lambda x: x.quantile(0.25)),
    '0.75': ('weighted', lambda x: x.quantile(0.75)),
    '0.5': ('weighted', lambda x: x.quantile(0.5)),
})

Output:

          median  skew       std   0.1   0.9  0.25  0.75  0.5
location                                                     
a            0.7   NaN  0.282843  0.54  0.86  0.60  0.80  0.7
b            1.9   NaN  0.707107  1.50  2.30  1.65  2.15  1.9
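
If you prefer not to write out each quantile lambda by hand, the same aggregation spec can be built programmatically. This is just a sketch of the same technique, assuming the `weighted` column and quantile labels used above; note the `q=q` default argument, which pins each lambda to its own quantile.

quantiles = [0.1, 0.9, 0.25, 0.75, 0.5]
agg_spec = {
    'median': ('weighted', 'median'),
    'skew': ('weighted', 'skew'),
    'std': ('weighted', 'std'),
    # One entry per quantile; q=q avoids the late-binding pitfall of lambdas in a loop.
    **{str(q): ('weighted', lambda x, q=q: x.quantile(q)) for q in quantiles},
}
dfs = df.assign(weighted=df.value * df.weight).groupby('location').agg(**agg_spec)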
tdy