I previously asked a similar question here about counting unique values in a data frame, but the approach I used then doesn't work (or I can't get it to work) with a list, so I need to use "lapply" instead. I have also been told that using one of the apply functions would be better.

This represents my data:

species1 <- data.frame(var_1 = c("a","a","a","b", "b", "b"), var_2 = c("c","c","d", "d", "e", "e"))

species2 <- data.frame(var_1 = c("f","f","f","g", "g", "g"), var_2 = c("h","h","i", "i", "j", "j"))

all_species <- list()

all_species[["species1"]] <- species1
all_species[["species2"]] <- species2

I want to use lapply to count the unique rows within each var_1 group for every data frame in my list. For example, I need output like:

count_all_species <- list()
count_all_species[["species1"]] <- data.frame(var_1 = c("a", "b"), unique_number = c("2", "2"))

Then the same for the second data frame in the list (species2), using the "lapply" function.

Frank
Jack Dean

2 Answers

Here is an option with the tidyverse. We loop through the list of data frames (with map), group by 'var_1', and summarise to get the number of distinct elements in 'var_2' (n_distinct).

library(dplyr)
library(purrr)
map(all_species, ~ .x %>%
                     group_by(var_1) %>% 
                     summarise(unique_number = n_distinct(var_2)))

Alternatively, keep only the distinct rows of each data frame in the list and then count the rows per 'var_1':

map(all_species, ~ .x %>% 
                     distinct() %>% 
                     dplyr::count(var_1))
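
Since the question asks for lapply specifically, the same distinct-then-count idea can also be sketched in base R. This is only an illustrative equivalent under the question's sample data; the output column name n is chosen here to mirror what dplyr::count produces.

```r
# Sample data from the question
species1 <- data.frame(var_1 = c("a", "a", "a", "b", "b", "b"),
                       var_2 = c("c", "c", "d", "d", "e", "e"))
species2 <- data.frame(var_1 = c("f", "f", "f", "g", "g", "g"),
                       var_2 = c("h", "h", "i", "i", "j", "j"))
all_species <- list(species1 = species1, species2 = species2)

distinct_counts <- lapply(all_species, function(x) {
  distinct_rows <- unique(x)             # drop duplicated rows
  tab <- table(distinct_rows$var_1)      # count the remaining rows per var_1
  data.frame(var_1 = names(tab), n = as.vector(tab))
})

distinct_counts[["species1"]]
```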

Update

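If the variable name changes, we can select the column by position in summarise_at. Because the data is grouped by 'var_1', the grouping column is skipped, so position 1 refers to the first non-grouping column.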

map(all_species, ~ .x %>%
                     group_by(var_1) %>% 
                     summarise_at(1, n_distinct))

Another option is to convert the column name string to a symbol (rlang::sym) and then unquote it for evaluation (!!):

map(all_species, ~ .x %>%
             group_by(var_1) %>% 
             summarise(unique_number = n_distinct(!! rlang::sym(names(.x)[2]))))
akrun
  • This works for the example provided above. What I didn't include, but should have, is that the var_2 name changes slightly in my real-life data, e.g. for species1 var_2 = hsapiens_gene_name and for species2 var_2 = mmusculus_gene_name. This is why I would rather use lapply, so I can use paste0, e.g. lapply(species_name_vector, function(s) paste0(s, "_gene_name")) – Jack Dean May 01 '18 at 15:38
  • @JackDean If the name changes, then you can use `summarise_at` – akrun May 01 '18 at 15:42
  • Thanks that works great! I'm assuming the "1" in `summarise_at` means summarise based on the first column? – Jack Dean May 01 '18 at 15:48
  • @JackDean The reason is that there is a group_by, so summarise_at starts indexing from the next column, i.e. if there is no `group_by`, the 2nd column is indexed with 2: `map(all_species, ~ .x %>% summarise_at(2, n_distinct))` – akrun May 01 '18 at 15:49
  • @JackDean BTW, I don't know whether this indexing will change in the future, but it is a bit confusing – akrun May 01 '18 at 15:51
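
The paste0 idea from the first comment can be sketched with lapply in base R. This assumes each list element is named after its species and that the column is called <species>_gene_name; the data frames hsapiens and mmusculus below are hypothetical stand-ins for the real data.

```r
# Hypothetical data following the naming scheme described in the comment
hsapiens <- data.frame(var_1 = c("a", "a", "a", "b", "b", "b"),
                       hsapiens_gene_name = c("c", "c", "d", "d", "e", "e"))
mmusculus <- data.frame(var_1 = c("f", "f", "f", "g", "g", "g"),
                        mmusculus_gene_name = c("h", "h", "i", "i", "j", "j"))
all_species <- list(hsapiens = hsapiens, mmusculus = mmusculus)

count_all_species <- lapply(names(all_species), function(s) {
  x <- all_species[[s]]
  gene_col <- paste0(s, "_gene_name")    # build this species' column name
  # number of distinct gene names within each var_1 group
  counts <- tapply(x[[gene_col]], x$var_1, function(v) length(unique(v)))
  data.frame(var_1 = names(counts), unique_number = as.vector(counts))
})
# lapply over names() returns an unnamed list, so restore the names
names(count_all_species) <- names(all_species)

count_all_species[["mmusculus"]]
```

Looping over names(all_species) rather than over the list itself is what makes the species name available for paste0 inside the function.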

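table would be a simple base-R solution. Note that this tabulates how often each value occurs in each column, rather than the number of distinct var_2 values per var_1.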

lapply(all_species, function(x) {
  apply(x, 2, table)
})
erocoar
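
To get output in the shape the question asks for (one row per var_1 with a unique_number column), a base-R sketch with lapply and aggregate is one possibility. It assumes the column names var_1 and var_2 from the sample data, and it returns numeric counts rather than the character values shown in the question's example.

```r
# Sample data from the question
species1 <- data.frame(var_1 = c("a", "a", "a", "b", "b", "b"),
                       var_2 = c("c", "c", "d", "d", "e", "e"))
species2 <- data.frame(var_1 = c("f", "f", "f", "g", "g", "g"),
                       var_2 = c("h", "h", "i", "i", "j", "j"))
all_species <- list(species1 = species1, species2 = species2)

count_all_species <- lapply(all_species, function(x) {
  # count the distinct var_2 values within each var_1 group
  out <- aggregate(x["var_2"], by = x["var_1"],
                   FUN = function(v) length(unique(v)))
  names(out)[2] <- "unique_number"       # rename the aggregated column
  out
})

count_all_species[["species1"]]
```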