I have a dataset that looks like the following:

INCOME WEALTH
10.000 100000
15.000 111000
14.200 123456
12.654 654321

I have many more rows.

I now want to find how much INCOME a household at a specific WEALTH percentile has. The following quantiles are relevant:

c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99)

I have always used the following code to get specific percentile values:

a <- quantile(WEALTH, probs = c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99))

But now I want to base my percentiles on WEALTH but get the respective INCOME. I have tried the following code but the results are not plausible:

df$percentile = ntile(df$WEALTH,100)
df <- df[df$percentile %in% c(1,5,10,25,50,75,90,95,99), ]

a <- df %>% 
  group_by(percentile) %>% 
  summarise(max = max(INCOME))

The results that I get are not consistent with other parts of the analysis that I have done. I assume that the percentiles returned by the "quantile" function are calculated differently than by simply taking the maximum within each bin.

akrun
Jakob
  • I’ve provided an answer that might explain your issue. If it doesn’t, we may need more detail to understand the problem - could you [edit] your post to specify your expected output, your actual output, and your actual data (or at least, enough of a subset to demonstrate the problem)? For the data, run `dput(df)` in R and paste the result into your question. Thanks! – zephryl Nov 27 '22 at 15:07

2 Answers

I'm not sure I understood your question correctly, but quantile() offers several calculation methods, selected with the type argument (the default is type 7). I, for example, always use type 6, since that is what I was taught in my statistics courses.

type: an integer between 1 and 9 selecting one of the nine quantile algorithms detailed below to be used.

You can read more about the different types on the help page for quantile (run ?quantile).
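To illustrate how much the choice of type can matter on a small sample, here is a minimal sketch using only the four WEALTH values shown in the question. The probabilities 0.25, 0.5 and 0.75 are picked purely for illustration; with many more rows the differences between types usually shrink.

```r
# Toy data: the four WEALTH values from the question (assumed to be a plain numeric vector).
WEALTH <- c(100000, 111000, 123456, 654321)
probs <- c(0.25, 0.5, 0.75)

q7 <- quantile(WEALTH, probs = probs, type = 7)  # R's default, interpolates between observations
q6 <- quantile(WEALTH, probs = probs, type = 6)  # also interpolates, with different plotting positions
q1 <- quantile(WEALTH, probs = probs, type = 1)  # inverse empirical CDF, returns observed values only

# One row per type, one column per probability.
print(rbind(type7 = q7, type6 = q6, type1 = q1))
```

Types 1 to 3 are discontinuous and always return a value that occurs in the data, while types 4 to 9 interpolate, so an interpolated WEALTH quantile may not correspond to any actual row.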

RYann

If you have fewer than 100 rows in your dataset, dplyr::ntile(x, 100) won’t yield accurate percentiles; it can only assign bins numbered 1 through the total number of rows:

library(dplyr)

df %>% 
  mutate(percentile = ntile(WEALTH, 100))
# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <int>
1   10   100000          1
2   15   111000          2
3   14.2 123456          3
4   12.7 654321          4

To get true percentiles, you can rescale the result, manually or with scales::rescale():

library(scales)

df %>% 
  mutate(percentile = rescale(
    ntile(WEALTH, 100),
    c(1, 100)
  ))
# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <dbl>
1   10   100000          1
2   15   111000         34
3   14.2 123456         67
4   12.7 654321        100
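As an alternative to binning, one possible way to read off the INCOME of the household sitting at a given WEALTH quantile is sketched below. It is only an illustration under stated assumptions: it uses type = 1 so that each quantile is an observed WEALTH value, it assumes WEALTH has no ties (match() returns the first matching row otherwise), and the names wealth_q and result are made up for this example.

```r
# Toy data: the four rows shown in the question.
df <- data.frame(
  INCOME = c(10.000, 15.000, 14.200, 12.654),
  WEALTH = c(100000, 111000, 123456, 654321)
)
probs <- c(0.25, 0.5, 0.75)  # illustrative probabilities only

# type = 1 returns values that actually occur in df$WEALTH,
# so each quantile can be matched back to a row.
wealth_q <- quantile(df$WEALTH, probs = probs, type = 1)

# Look up the INCOME of the row whose WEALTH equals each quantile.
result <- data.frame(
  prob   = probs,
  WEALTH = unname(wealth_q),
  INCOME = df$INCOME[match(wealth_q, df$WEALTH)]
)
print(result)
```

Note that this returns the INCOME of a single household per quantile, which is not the same as the maximum INCOME within an ntile() bin.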
zephryl
  • Thank you for your help! I really appreciate it. However it seems to yield the same result as without rescaling. – Jakob Nov 27 '22 at 18:48