Show percentiles of Variable A, while the classification of percentiles is based on Variable B

Question

I have a dataset that looks like the following:

INCOME	WEALTH
10.000	100000
15.000	111000
14.200	123456
12.654	654321

I have many more rows.

I now want to find how much INCOME a household in a specific WEALTH percentile has. The following quantiles are relevant:

c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99)

I have always used the following code to get specific percentile values:

a <- quantile(WEALTH, probs = c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99))

But now I want to base my percentiles on WEALTH but get the respective INCOME. I have tried the following code but the results are not plausible:

df$percentile = ntile(df$WEALTH,100)
df <- df[df$percentile %in% c(1,5,10,25,50,75,90,95,99), ]

a <- df %>% 
  group_by(percentile) %>% 
  summarise(max = max(INCOME))

The results that I get are not consistent with other parts of the analysis that I have done. I assume that the percentiles from the "quantile" function are calculated differently than by simply taking the maximum.

I’ve provided an answer that might explain your issue. If it doesn’t, we may need more detail to understand the problem - could you [edit] your post to specify your expected output, your actual output, and your actual data (or at least, enough of a subset to demonstrate the problem)? For the data, run `dput(df)` in R and paste the result into your question. Thanks! — zephryl, Nov 27 '22 at 15:07

score 2 · Accepted Answer · answered Nov 27 '22 at 16:18

I'm not sure if I understood your question correctly, but quantile() offers several calculation methods, selected with its type argument. I, for example, always go for type 6, since this is what I was taught in my stats courses.

type: an integer between 1 and 9 selecting one of the nine quantile algorithms detailed below to be used.

Read more about the different types by running ?quantile (the help page for quantile).
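As a small illustration of how much the type argument can matter, here is a sketch on a made-up vector (not the asker's data) comparing R's default, type 7, with type 6. The vector x and the variable names q7 and q6 are assumptions chosen only for this example.

```r
# Ten evenly spaced values, chosen so the results are easy to verify by hand.
x <- 1:10

q7 <- quantile(x, probs = 0.25, type = 7)  # R's default algorithm
q6 <- quantile(x, probs = 0.25, type = 6)

unname(q7)  # 3.25
unname(q6)  # 2.75
```

On small samples the types can differ noticeably; on large samples the gap usually shrinks, so this alone may not explain a big discrepancy.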

score 0 · Answer 2 · answered Nov 27 '22 at 15:02

If you have fewer than 100 rows in your dataset, dplyr::ntile(x, 100) won't yield accurate percentiles; it will only give you bins numbered from 1 up to the total number of rows:

library(dplyr)

df %>% 
  mutate(percentile = ntile(WEALTH, 100))

# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <int>
1   10   100000          1
2   15   111000          2
3   14.2 123456          3
4   12.7 654321          4

To get true percentiles, you can rescale the result, manually or with scales::rescale():

library(scales)

df %>% 
  mutate(percentile = rescale(
    ntile(WEALTH, 100),
    c(1, 100)
  ))

# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <dbl>
1   10   100000          1
2   15   111000         34
3   14.2 123456         67
4   12.7 654321        100
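Once every household has a wealth percentile, one possible way to report INCOME per WEALTH percentile is to summarise income within each bin with a robust statistic such as the median, rather than the maximum. The sketch below uses base R and simulated data; the data, the choice of the median, and the names wanted and income_by_bin are all assumptions for illustration, not something taken from the question.

```r
# Simulated stand-in for the real data (assumption: 1000 households).
set.seed(1)
n <- 1000
df <- data.frame(
  WEALTH = rlnorm(n, meanlog = 11, sdlog = 1),
  INCOME = rlnorm(n, meanlog = 3, sdlog = 0.5)
)

# Assign each household to one of 100 equally sized bins by wealth rank.
df$percentile <- ceiling(100 * rank(df$WEALTH, ties.method = "first") / n)

# Keep only the percentiles of interest.
wanted <- c(1, 5, 10, 25, 50, 75, 90, 95, 99)
sub <- df[df$percentile %in% wanted, ]

# Median INCOME within each selected WEALTH percentile bin.
income_by_bin <- tapply(sub$INCOME, sub$percentile, median)
income_by_bin
```

The maximum within a bin is driven by a single household, so it can jump around between neighbouring percentiles; a median or mean per bin is usually more stable, though which statistic is appropriate depends on the analysis.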

Thank you for your help! I really appreciate it. However it seems to yield the same result as without rescaling. — Jakob, Nov 27 '22 at 18:48
