Aggregations and frequencies from two variables in a large dataframe

Question

I know this probably is basic, but my efforts keep failing. Here's my problem.

I have a large dataframe with 50+ columns, two of which are "year" and "unique individual tag numbers". The dataframe has several 1000 entries, and most, but not all individual numbers are repeated over years. I need a synthax that:

Removes all lines with NAs in any of these two columns.
First task: Aggregates the number of unique individual numbers within each year
Second task: For each individual number, repeat the years it occurs

dataframe = df year variable = YEAR unique individual number = IND

Can you please provide an example of the input, and an example of the output desired? What does "repeat the years it occurs"? mean? — langtang, May 19 '23 at 13:39

score 3 · Answer 1 · answered May 19 '23 at 14:47

Without knowing exactly what your input data looks like and what your desired output is, here is a potential tidyverse solution using some toy data:

library(tidyverse)
df <- data.frame(year = c(2002, 2003, 2004, 2004),
                 id = c(1, 1, 2, 3))

For each year, counts how many unique IDs are there:

# number of unique IDs per year
df |> 
  group_by(year) |> 
  summarise(num_unique = n_distinct(id))

which gives:

# A tibble: 3 × 2
   year num_unique
  <dbl>      <int>
1  2002          1
2  2003          1
3  2004          2

For each ID, get a list of which year it occurs in:

# which year does each id occur in
df |> 
  group_by(id) |> 
  summarise(years_occur = paste(sort(unique(year)), collapse=", "))

which gives:

# A tibble: 3 × 2
     id years_occur
  <dbl> <chr>      
1     1 2002, 2003 
2     2 2004       
3     3 2004

What if I want to have the number of years the individual has been observed in a separate variable/column? — Dag, May 19 '23 at 23:38
You can adapt the code for the first answer: group by `id` and then summarise the number of distinct years. — nrennie, May 20 '23 at 12:33

Aggregations and frequencies from two variables in a large dataframe

1 Answers1