1

I have a dataset containing 42840 observations with a total of 119 unique months (Dataset$date). The idea is that i want to assign a quantile to every dataset$Value within each month, and 'rank' them from 1(lowest value) to 5(highest value).

 Date     Name(ID)    Value    Quantile (I want to add this column where i assign the values a quantile from 1 to 5)
 2009-03  1          35        (1-5)
 2009-04  1          20        ...
 2009-05  1          65        ...
 2009-03  2          24        ...
 2009-04  2          77        ...
 2009-03  3          110       ...
.
.
.
 2018-12  3          125       ...
 2009-03  56          24       ...
 2009-04  56          65       ...
 2009-03  57          26       ...
 2009-04  57          67       ...
 2009-03  58          99       ...

I've tried to use the Ntile function, which works great for the whole dataset but there doesn't seem to be a function where I can specify for a subset of date.

Any suggestions?

  • What package is the `Ntile` function from? Why can't you just subset your data using square bracket notation and then pass that new, subset data frame into your function? – divibisan Mar 21 '19 at 22:33

1 Answers1

1

You could use the base rank function with dplyr's group_by:

library(dplyr)

# Create some data
N <- 3
dat <- tibble(
  date = rep(1:12,N),
  value = runif(12*N, 0, 100)
)

# The rescale function we will use later to fit on your 1-5 scale
## Adapted From https://stackoverflow.com/questions/25962508/rescaling-a-variable-in-r
RESCALE <- function (x, nx1, nx2, minx, maxx) {
  nx = nx1 + (nx2 - nx1) * (x - minx)/(maxx - minx)
  return(ceiling(nx))
}

# What you want
dat %>% 
  group_by(date) %>% # Group the data by Date so that mutate fill compute the rank's for each Month
  mutate(rank_detail = rank(value), # ranks the values within each group
         rank_group = RESCALE(rank_detail, 1, 5, min(rank_detail), max(rank_detail)) ) %>%   # rescales the ranking to be on you 1 to 5 scale
  arrange(date)

# A tibble: 36 x 4
# # Groups:   date [12]
# date value rank_detail rank_group
# <int> <dbl>       <dbl>      <dbl>
# 1     1 92.7            3          5
# 2     1 53.6            2          3
# 3     1 47.8            1          1
# 4     2 24.6            2          3
# 5     2 72.2            3          5
# 6     2 11.5            1          1
Ismail Müller
  • 395
  • 1
  • 7
  • It seems that the value 1 in rank_group is only assigned if rank_detail is 1 as well. This results in 119 values in rank 1, which is the number of months (03/2009-12/2018 + Global). Based on the total number of values, each rank should have 8568 values. Do you have any suggestions for a solution? Otherwise, the code works great! – Sondre Fiskerstrand Mar 22 '19 at 10:03
  • Hi there. This is probably due to `return(ceiling(nx))` in the `RESCALE` function. Try `return(round(nx))` instead. – Ismail Müller Mar 24 '19 at 19:11
  • it now seems that the function isn't dividing the subset equally. for each month the quantiles are as follows: Quantile 1: 45 (Observations) Quantile 2: 90 (Observations) Quantile 3: 90 (Observations) Quantile 4: 90 (Observations) Quantile 5: 45 (Observations) You know why this is? – Sondre Fiskerstrand Mar 25 '19 at 10:23
  • Do you need the ranking of `Value` being done for each Month ? In this case, if you don't have the same number of observations for each month, then you won't have equal subsetting when you look at your entire dataset ! – Ismail Müller Mar 26 '19 at 18:43