R - Ordering factors by another variable returns NAs: How to fix?

Question

I have a small tibble of animal_observations in R like this:

> animal_observations
# A tibble: 12 x 3
   SPECIES       n_detections detection_rate
   <fct>                <int>          <dbl>
 1 Badger                 203          0.190
 2 Blackbird              463          0.433
 3 Domestic cat           292          0.273
 4 Grey squirrel          788          0.736
 5 Hedgehog               179          0.167
 6 Nothing                960          0.897
 7 Pheasant               476          0.445
 8 Rabbit                 602          0.563
 9 Red fox                424          0.396
10 Roe Deer               621          0.580
11 Small rodent           198          0.185
12 Woodpigeon             381          0.356

Here n_detections is the number of times I've seen that animal, and detection_rate is how often that SPECIES is seen (calculated elsewhere).

Here's the dput():

structure(list(SPECIES = structure(1:12, .Label = c("Badger", 
"Blackbird", "Domestic cat", "Grey squirrel", "Hedgehog", "Nothing", 
"Pheasant", "Rabbit", "Red fox", "Roe Deer", "Small rodent", 
"Woodpigeon"), class = "factor"), n_detections = c(203L, 463L, 
292L, 788L, 179L, 960L, 476L, 602L, 424L, 621L, 198L, 381L), 
    detection_rate = c(0.189719626168224, 0.432710280373832, 
    0.272897196261682, 0.736448598130841, 0.167289719626168, 
    0.897196261682243, 0.444859813084112, 0.562616822429907, 
    0.396261682242991, 0.580373831775701, 0.185046728971963, 
    0.35607476635514)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -12L))

I want to order my animals (SPECIES, which is a factor) by detection_rate for downstream plotting with ggplot(), e.g. geom_col() with aes(x = SPECIES, y = detection_rate), so that the columns are ordered by detection_rate. Here is the line I tried to run:

animal_observations$SPECIES <- factor(animal_observations$SPECIES,
                      levels = animal_observations[order(animal_observations$detection_rate, decreasing = F), "SPECIES"])

Strangely, here is the resulting tibble:

> animal_observations
# A tibble: 12 x 3
   SPECIES n_detections detection_rate
   <fct>          <int>          <dbl>
 1 NA               203          0.190
 2 NA               463          0.433
 3 NA               292          0.273
 4 NA               788          0.736
 5 NA               179          0.167
 6 NA               960          0.897
 7 NA               476          0.445
 8 NA               602          0.563
 9 NA               424          0.396
10 NA               621          0.580
11 NA               198          0.185
12 NA               381          0.356

As you can see, all of the SPECIES values have become NA. What did I do wrong, and how do I correct it so that the levels of the SPECIES factor are ordered by detection_rate while all the animal names are retained in the SPECIES column? Thank you.

score 1 · Answer 1 · answered Jul 28 '18 at 15:08

It's this simple:

library(dplyr)
animal_observations %>% arrange(desc(detection_rate))

stevec


Thanks for your quick response, but after doing this and then using `geom_col()` to plot `detection_rate` as a function of `SPECIES`, the plot is *not* ordered by `detection_rate`; it is still ordered by `SPECIES`. How do I define the levels of the factor `SPECIES` so that the plot *will* be ordered by `detection_rate`? – hpy Jul 28 '18 at 15:17

@hpy you may find [this](https://stackoverflow.com/questions/3253641/change-the-order-of-a-discrete-x-scale) useful – stevec Jul 28 '18 at 15:41

score 1 · Answer 2 · answered Jul 28 '18 at 15:17

Another option is to use order(), like this:

# df = your dput()
with(df, df[order(detection_rate, decreasing = TRUE),])

which gives this output:

# A tibble: 12 x 3
   SPECIES       n_detections detection_rate
   <fct>                <int>          <dbl>
 1 Nothing                960          0.897
 2 Grey squirrel          788          0.736
 3 Roe Deer               621          0.580
 4 Rabbit                 602          0.563
 5 Pheasant               476          0.445
 6 Blackbird              463          0.433
 7 Red fox                424          0.396
 8 Woodpigeon             381          0.356
 9 Domestic cat           292          0.273
10 Badger                 203          0.190
11 Small rodent           198          0.185
12 Hedgehog               179          0.167
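
Note that sorting the rows this way leaves the factor levels of SPECIES untouched, and ggplot2 orders a discrete axis by factor levels rather than by row order. A minimal base-R sketch on a three-row subset of the data above (the name `obs` is made up for illustration):

```r
# Three rows of the question's data; `obs` is a hypothetical name.
obs <- data.frame(
  SPECIES = factor(c("Badger", "Blackbird", "Nothing")),
  detection_rate = c(0.190, 0.433, 0.897)
)

# Sort the rows by detection_rate, highest first.
sorted <- obs[order(obs$detection_rate, decreasing = TRUE), ]

# The row order has changed ...
as.character(sorted$SPECIES)

# ... but the levels are still in their original (alphabetical) order.
levels(sorted$SPECIES)
```

So a plot built from `sorted` would still place the species in level order, which matches what the comments on these answers report.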

Thanks! But as with the other answer, when I then use `geom_col()` to plot `detection_rate` as a function of `SPECIES`, the plot is *not* ordered by `detection_rate`; it is still ordered by `SPECIES`. How do I define the levels of the factor `SPECIES` so that the plot *will* be ordered by `detection_rate`? – hpy Jul 28 '18 at 15:21

andrew_reece · Accepted Answer · 2018-07-28T15:26:43.287

Use reorder() inside ggplot():

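library(dplyr)    # provides %>%
library(ggplot2)
# reorder() sorts the SPECIES levels by detection_rate (ascending), for this plot only
animal_observations %>% 
  ggplot(aes(reorder(SPECIES, detection_rate), detection_rate)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90))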
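
The same reorder() call can also be applied to the column itself, outside of ggplot(), so that the new level order is stored with the data. A small sketch, assuming a cut-down copy of the data (`obs` is a hypothetical name, not part of the question):

```r
# Four rows of the question's data; `obs` is a hypothetical name.
obs <- data.frame(
  SPECIES = factor(c("Badger", "Blackbird", "Hedgehog", "Nothing")),
  detection_rate = c(0.190, 0.433, 0.167, 0.897)
)

# reorder() keeps every value but sorts the levels by detection_rate
# (ascending by default).
obs$SPECIES <- reorder(obs$SPECIES, obs$detection_rate)

# Levels now run from the lowest to the highest detection_rate.
levels(obs$SPECIES)
```

For a descending order, one option is to pass the negated rate, e.g. `reorder(obs$SPECIES, -obs$detection_rate)`.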

UPDATE
To set the new order before getting into ggplot(), use mutate() to build a new factor whose levels are the SPECIES values sorted by detection_rate:

animal_observations %>% 
  mutate(species = factor(SPECIES, levels=SPECIES[order(detection_rate)])) %>%
  ggplot(aes(species, detection_rate)) +
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle=90))
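
This works because SPECIES[order(detection_rate)] inside mutate() is a plain vector. A likely reason the original attempt produced only NA is that indexing a tibble with [rows, "SPECIES"] returns a one-column tibble rather than a vector, so factor() finds none of the species names among the levels it is given. The sketch below applies the fix directly to the data with base R. Here `obs` is a hypothetical cut-down copy of the data, and `drop = FALSE` is used only to imitate the tibble behaviour on a plain data.frame:

```r
# Four rows of the question's data; `obs` is a hypothetical name.
obs <- data.frame(
  SPECIES = factor(c("Badger", "Blackbird", "Hedgehog", "Nothing")),
  detection_rate = c(0.190, 0.433, 0.167, 0.897)
)
idx <- order(obs$detection_rate)

# What the original line passed as `levels`: still a data frame, not a vector.
# (A tibble behaves like this by default; drop = FALSE imitates it here.)
bad_levels <- obs[idx, "SPECIES", drop = FALSE]
is.data.frame(bad_levels)

# Extracting the column with $ (or [[ ]]) gives a plain vector instead.
good_levels <- obs$SPECIES[idx]

# Rebuild the factor in place with levels sorted by detection_rate.
obs$SPECIES <- factor(obs$SPECIES, levels = good_levels)

levels(obs$SPECIES)
anyNA(obs$SPECIES)
```

Applied to the question's data, the equivalent change would be to take the levels from `animal_observations$SPECIES[order(animal_observations$detection_rate)]` instead of indexing the tibble with `[ , "SPECIES"]`.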

edited Jul 28 '18 at 15:26

answered Jul 28 '18 at 15:18


Thank you so much, it worked for me, too! That said, is there a way to do this directly to `animal_observations` rather than within `ggplot` (this way the order/levels will apply to other potential operations as well)? – hpy Jul 28 '18 at 15:22
Yes it worked, thank you so much! It's great to know *both* ways of doing it which provides flexibility. – hpy Jul 28 '18 at 22:33
