The book R for Data Science goes over ranking creation functions, and I'm having trouble understanding the examples even after looking at the documentation.

Here is the example:

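library(dplyr)  # provides min_rank(), desc(), row_number(), dense_rank(), percent_rank(), cume_dist()

y <- c(1, 2, 2, NA, 3, 4)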
min_rank(y)
#> [1]  1  2  2 NA  4  5
min_rank(desc(y))
#> [1]  5  3  3 NA  2  1 
row_number(y)
#> [1]  1  2  3 NA  4  5 
dense_rank(y)
#> [1]  1  2  2 NA  3  4 
percent_rank(y)
#> [1] 0.00 0.25 0.25   NA 0.75 1.00 
cume_dist(y)
#> [1] 0.2 0.6 0.6  NA 0.8 1.0

Questions: For min_rank, where does the 5 come from, and why isn't NA last? For min_rank(desc(y)), why are there two 3s and not two 2s? For row_number, I'm still confused about where the NA is positioned, and wouldn't there be 6 rows?

  • see https://stats.stackexchange.com/a/34010 – Waldi Jul 11 '22 at 18:39
  • First Q: You have 5 non-NA observations. 1 is the lowest, 2 is the 2nd lowest and appears twice, so 3 and 4 are the fourth and fifth lowest observations. – Jon Spring Jul 11 '22 at 19:03

1 Answer

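The 5 is there because the two tied 2s both get rank 2 (the minimum of positions 2 and 3, which they occupy), but they still take up two positions. Rank 3 is therefore skipped, so 3 gets rank 4 and 4 gets rank 5.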
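One way to see where the 5 comes from is to compute the ranks by hand. This is only a sketch in base R: `min_rank_by_hand` is a hypothetical helper written for illustration, not a dplyr function, and it assumes a numeric input.

```r
y <- c(1, 2, 2, NA, 3, 4)

# Hypothetical helper: a value's minimum rank is 1 plus the number of
# non-missing values strictly below it; NA inputs stay NA.
min_rank_by_hand <- function(x) {
  sapply(x, function(v) if (is.na(v)) NA else 1 + sum(x < v, na.rm = TRUE))
}

by_hand <- min_rank_by_hand(y)
by_hand  # 1 2 2 NA 4 5

# Base R gives the same ranks when ties take the minimum and NAs are kept.
base_rank <- rank(y, ties.method = "min", na.last = "keep")
base_rank  # 1 2 2 NA 4 5
```

Three values (1, 2, 2) are below 3, so 3 gets rank 4, and four values are below 4, so 4 gets rank 5.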

It appears that min_rank and row_number are simply convenience analogs for rank with less customizability as to how NAs are handled. Instead of min_rank, you can use:

rank(y, ties.method = "min", na.last = "keep")  # na.last takes one of TRUE, FALSE, NA, or "keep"

The rank documentation says the following:

na.last:

for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if "keep" they are kept with rank NA.

So it appears that min_rank and row_number behave like na.last = "keep": each NA stays NA in its original position. If you want the NAs ranked last instead, call rank() directly with na.last = TRUE.
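To make the four na.last settings concrete, here is a small base R sketch on the same vector. The variable names are just for illustration.

```r
y <- c(1, 2, 2, NA, 3, 4)

keep_na  <- rank(y, ties.method = "min", na.last = "keep")  # NA stays NA, in place
na_last  <- rank(y, ties.method = "min", na.last = TRUE)    # NA gets the highest rank
na_first <- rank(y, ties.method = "min", na.last = FALSE)   # NA gets rank 1
na_drop  <- rank(y, ties.method = "min", na.last = NA)      # NA is removed entirely

keep_na   # 1 2 2 NA 4 5
na_last   # 1 2 2 6 4 5
na_first  # 2 3 3 1 5 6
na_drop   # 1 2 2 4 5 (only five values come back)
```

Only the "keep" setting matches the min_rank output in the question, which is why the NA is neither first nor last there.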

As for your second question about desc:

desc outputs a negative version of the numeric vector as such:

[1] -1 -2 -2 NA -3 -4

So -1 is now the largest value (rank 5), the two -2s tie for positions 3 and 4 and both get the minimum, 3, and so on down to -4, which is the smallest value (rank 1). Let me know if this answers your question.
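The descending case can be reproduced without dplyr by negating the vector yourself and ranking the result. This is a sketch that assumes a numeric vector, where negation is enough to reverse the order.

```r
y <- c(1, 2, 2, NA, 3, 4)

# Negating reverses the order: the largest original value becomes the smallest.
negated <- -y
negated  # -1 -2 -2 NA -3 -4

# Rank the negated values, ties take the minimum, NAs are kept in place.
desc_rank <- rank(negated, ties.method = "min", na.last = "keep")
desc_rank  # 5 3 3 NA 2 1
```

The original 4 becomes -4, the smallest value, so it gets rank 1. The two 2s become -2 and share rank 3 because two values (-4 and -3) are below them.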

dcsuka
  • Thanks, @deechaz. I'm still a little confused - so the output numbers correspond to the rank, not the value of the vector in each ranking (1-5 since there are 5 numbers)? If the 5 is coming from the values of the ranking, why are there still two 2's? And for the descending ranking, I get that the values change to negative, but why are the -2's ranked as 3 instead of 4 then? Just to make sure I understand what you mean about the NA's, basically if you don't add an argument it will just remain NA in the same position it's in in the vector? – Nicole Petrovic Jul 12 '22 at 15:41
  • The output numbers are the respective ranks of the input y vector. The presence of two 2s is because of the ties.method argument in rank(). Setting to "min" (the default in min_rank() it appears) gives each 2 value the minimum rank of 2. Setting to "max" would give each a rank of 3. "first" and "last" give each 2 value separate ranks based on order. I encourage you to play around with the ties.method and na.last arguments in the rank function, with random vectors. The smallest value is given the ranking of 1, unless you use desc(). – dcsuka Jul 12 '22 at 17:44
  • The values in the desc() example are given 3 instead of 4 because the ties.method default in min_rank is "min". "max" would give 4s instead. For the NAs, it depends on the default argument of the function that you use. – dcsuka Jul 12 '22 at 17:45
  • So does that answer your question? – dcsuka Jul 22 '22 at 15:55
  • There is a little green check mark. – dcsuka Jul 22 '22 at 17:06
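To follow up on the ties.method comments above, here is a short base R sketch comparing a few settings on the same vector. The variable names are only illustrative, and na.last = "keep" is passed throughout so the NA stays in place.

```r
y <- c(1, 2, 2, NA, 3, 4)

ties_min   <- rank(y, ties.method = "min",     na.last = "keep")  # both 2s get rank 2
ties_max   <- rank(y, ties.method = "max",     na.last = "keep")  # both 2s get rank 3
ties_first <- rank(y, ties.method = "first",   na.last = "keep")  # 2s ranked in order of appearance
ties_avg   <- rank(y, ties.method = "average", na.last = "keep")  # both 2s get rank 2.5

ties_min    # 1 2 2 NA 4 5
ties_max    # 1 3 3 NA 4 5
ties_first  # 1 2 3 NA 4 5
ties_avg    # 1 2.5 2.5 NA 4 5

# Descending case with "max": the tied values get 4 instead of 3.
desc_max <- rank(-y, ties.method = "max", na.last = "keep")
desc_max    # 5 4 4 NA 2 1
```

The "min" result matches min_rank(y) and the "first" result matches row_number(y) from the question.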