I have a Metric and I am trying to create new columns containing labels that describe the quantile each Metric value falls into. However, I am finding that, for the same Metric value, I receive multiple quantile labels when I expect them to only receive one.
I have a metric that looks like this:
Metric |
---|
5.0 |
5.0 |
6.0 |
5.0 |
NaN |
5.0 |
5.0 |
2.0 |
6.0 |
3.0 |
NaN |
2.0 |
5.0 |
5.0 |
5.0 |
2.0 |
I want to create new columns that contain the quantile labels for a range of quantiles. This is what I am doing:
var_metric = ['Metric']
# Quantile range I am interested in creating labels for
quantiles_list = [2, 3, 4, 5]
# Quantile labels
quantile_2_labels = ['Quantile_2_1','Quantile_2_2']
quantile_3_labels = ['Quantile_3_1','Quantile_3_2','Quantile_3_3']
quantile_4_labels = ['Quantile_4_1','Quantile_4_2','Quantile_4_3','Quantile_4_4']
quantile_5_labels = ['Quantile_5_1','Quantile_5_2','Quantile_5_3','Quantile_5_4','Quantile_5_5']
quantiles_labels_list = [quantile_2_labels, quantile_3_labels, quantile_4_labels, quantile_5_labels]
# Loops for creating label columns
for metric in var_metric:
for quantile, labels in zip(quantiles_list, quantiles_labels_list):
df[f'{metric}_Quantile_{quantile}'] = pd.qcut(df[metric].astype('Int64').rank(method='first'), q=quantile, labels=labels, retbins=False)
The result looks like this:
Metric | Metric_Quantile_2 | Metrtic_Quantile_3 | Metric_Quantile_4 | Metric_Quantile_5 |
---|---|---|---|---|
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_1 |
5.0 | Quantile_2_1 | Quantile_3_2 | Quantile_4_2 | Quantile_5_2 |
5.0 | Quantile_2_1 | Quantile_3_2 | Quantile_4_2 | Quantile_5_3 |
6.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_5 |
5.0 | Quantile_2_1 | Quantile_3_2 | Quantile_4_2 | Quantile_5_3 |
NaN | NaN | NaN | NaN | NaN |
5.0 | Quantile_2_2 | Quantile_3_2 | Quantile_4_3 | Quantile_5_3 |
5.0 | Quantile_2_2 | Quantile_3_2 | Quantile_4_3 | Quantile_5_4 |
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_1 |
6.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_5 |
3.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_2 | Quantile_5_2 |
NaN | NaN | NaN | NaN | NaN |
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_1 |
5.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_3 | Quantile_5_4 |
5.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_4 |
5.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_5 |
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_2 |
Using column Metric_Quantile_2
as an example, I expect all cases where Metric = 5.0
to receive the same label, however you can see in the second column that the labels for the 5.0
value shift between Quantile_2_1
and Quantile_2_2
. Any idea why?
Thanks!