0

I know this probably is basic, but my efforts keep failing. Here's my problem.

I have a large dataframe with 50+ columns, two of which are "year" and "unique individual tag numbers". The dataframe has several 1000 entries, and most, but not all individual numbers are repeated over years. I need a synthax that:

  1. Removes all lines with NAs in any of these two columns.
  2. First task: Aggregates the number of unique individual numbers within each year
  3. Second task: For each individual number, repeat the years it occurs

dataframe = df year variable = YEAR unique individual number = IND

Dag
  • 569
  • 2
  • 5
  • 20
  • 1
    Can you please provide an example of the input, and an example of the output desired? What does "repeat the years it occurs"? mean? – langtang May 19 '23 at 13:39

1 Answers1

3

Without knowing exactly what your input data looks like and what your desired output is, here is a potential tidyverse solution using some toy data:

library(tidyverse)
df <- data.frame(year = c(2002, 2003, 2004, 2004),
                 id = c(1, 1, 2, 3))

For each year, counts how many unique IDs are there:

# number of unique IDs per year
df |> 
  group_by(year) |> 
  summarise(num_unique = n_distinct(id))

which gives:

# A tibble: 3 × 2
   year num_unique
  <dbl>      <int>
1  2002          1
2  2003          1
3  2004          2

For each ID, get a list of which year it occurs in:

# which year does each id occur in
df |> 
  group_by(id) |> 
  summarise(years_occur = paste(sort(unique(year)), collapse=", "))

which gives:

# A tibble: 3 × 2
     id years_occur
  <dbl> <chr>      
1     1 2002, 2003 
2     2 2004       
3     3 2004    
nrennie
  • 1,877
  • 1
  • 4
  • 14
  • What if I want to have the number of years the individual has been observed in a separate variable/column? – Dag May 19 '23 at 23:38
  • 1
    You can adapt the code for the first answer: group by `id` and then summarise the number of distinct years. – nrennie May 20 '23 at 12:33