Banding Variable in R

Question

For the following code:

x <- data.frame(year = c(1730, 1860, 1941, 2011))

century_bands <- data.frame(min_year = c(1700, 1800, 1900, 2000),
                            max_year = c(1799, 1899, 1999, 2099),
                            century_name = c("18th", "19th", "20th", "21st"))

I'd like, for each value in x, to work out the name of the century it falls into, using the information in century_bands. I can't imagine this is difficult to achieve but I can't figure it out. Can anyone help please? Is there a way using the dplyr package (which I use quite a lot) or perhaps some other technique?

This is just a very simple example of a real-life situation where the bands aren't in nice neat 100 year steps - so any shortcuts based on dividing the year by 100 etc. won't work unfortunately.

Thank you.

score 4 · Accepted Answer · answered Dec 26 '19 at 11:45

One option utilizing fuzzyjoin could be:

fuzzy_left_join(x, century_bands, 
                by = c("year" = "min_year",
                       "year" = "max_year"),
                match_fun = list(`>=`, `<=`)) 

  year min_year max_year century_name
1 1730     1700     1799         18th
2 1860     1800     1899         19th
3 1941     1900     1999         20th
4 2011     2000     2099         21st

G. Grothendieck · Answer 2 · 2019-12-27T15:35:47.400

Here are some approaches.

1) sqldf In SQL one can do a join on complex conditions. The syntax using between matches if the year is greater than or equal to the lower bound and less than or equal to the upper bound. For a particular year the left join will result in NA being used if there is no match (although such situation does not arise in the example in the question).

library(sqldf)
sqldf("select year, century_name from x
  left join century_bands on year between min_year and max_year")
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

2) findInterval This approach uses only base R. For each component of its first argument findInterval returns the number of values in its second argument which are less or equal to it. The second argument is assumed to be sorted in ascending order. The number returned by findInterval can be used to index into century_name. findInterval tends to be quite efficient.

transform(x, year_name = 
  with(century_bands, century_name[findInterval(year, min_year)]))
##   year year_name
## 1 1730      18th
## 2 1860      19th
## 3 1941      20th
## 4 2011      21st

Although this does not occur in the question if it were possible that year were outside all bands then we could extend this without changing the code by adding additional rows to century_bands associated with a century_name of NA or else we could extend findInterval liek this:

FindInterval <- function(x, vec, upper) {
  ifelse(x < vec[1] | x > upper, NA, findInterval(x, vec))
}
transform(x, year_name = 
  with(century_bands, century_name[FindInterval(year, min_year, max(max_year))]))

We could replace transform with mutate if using dplyr anyways; otherwise, using transform eliminates that dependency.

3) sapply Another base solution is

Match <- function(x) with(century_bands, century_name[x >= min_year & x <= max_year])
transform(x, century_name = sapply(year, Match))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

This should be sufficient if all years are within the bands. This is the case in the example in the question but if this cannot be guaranteed then extend Match like this:

Match <- function(x) {
  Name <- with(century_bands, century_name[x >= min_year & x <= max_year])
  if (length(Name)) Name else NA
}

4) cut This base solution is similar to findInterval but it returns NA if year is not within any of the bands.

transform(x, year_name = with(century_bands, century_name[
    cut(year, c(min_year, max(max_year)), label = FALSE, include.lowest = TRUE)
]))

5) car::recode This function allows recoding of values as follows.

library(car)

recodes <- 
  "1700:1799='18th'; 1800:1899='19th'; 1900:1999='20th'; 2000:2099='21st'; else=NA"
transform(x, year_name = recode(year, recodes))
##   year year_name
## 1 1730      18th
## 2 1860      19th
## 3 1941      20th
## 4 2011      21st

To avoid hard coding the recodes string it could be derived from century_bands like this

recodes <- with(century_bands, 
  paste(sprintf("%d:%d='%s'", min_year, max_year, century_name), collapse = ";")
)
recodes <- paste0(recodes, "; else=NA")

6) expand bands We could expand the bands into individual years in which case we can simply perform a match. years which do not match any band result in an NA in the century_name.

century_bands2 <- with(century_bands, 
  stack(setNames(Map(seq, min_year, max_year), century_name)))
transform(x, century_name = with(century_bands2, ind[match(year, values)]))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

6a) A tidyverse variation which is largely along the lines of this would be:

library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

century_bands2 <- century_bands %>%
  { set_names(map2(.$min_year, .$max_year, seq), .$century_name) %>%
    as_tibble %>%
    pivot_longer(everything(), names_to = "century_name", values_to = "year")
  }

x %>% left_join(century_bands2, by = "year")
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

7) case_when. We could hard code the band definitions into a case_when:

library(dplyr)

x %>% mutate(century_name = case_when(
    year %in% 1700:1799 ~ "18th",
    year %in% 1800:1899 ~ "19th",
    year %in% 1900:1999 ~ "20th",
    year %in% 2000:2099 ~ "21st",
    TRUE ~ NA_character_))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

7a) Another way to express this with case_when is:

x %>% mutate(century_name = case_when(
    year < 1700 ~ NA_character_,
    year < 1800 ~ "18th",
    year < 1900 ~ "19th",
    year < 2000 ~ "20th",
    year < 2100 ~ "21st",
    TRUE ~ NA_character_))
##   year century_name
## 1 1730         18th
## 2 1860         19th
## 3 1941         20th
## 4 2011         21st

score 2 · Answer 3 · answered Dec 26 '19 at 13:01

2

Since the max_year column seems to be redundant you also could easily do:

century_bands[colSums(sapply(x$year, function(x) `>=`(x, century_bands$min_year))), 3]
# [1] "18th" "19th" "20th" "21st"

answered Dec 26 '19 at 13:01

jay.sf

60,139
8
53
110

Banding Variable in R

3 Answers3