0

the following problem occured in my work with a longitudinal dataframe.

There are several waves (years) in the study. In every wave there is an identification number (ID) for the same person and in addition some variables of interest that I want to bring together in one dataframe.

for example:

wave_1 <- data.frame(ID=c(1, 2, 3, 4, 5, 6),
                      income_wave_1=c(109, 106, 67, NA, 65, 190))
wave_2 <- data.frame(ID=c(1, 2, 3, 4, 5, 6),
                      income_wave_2=c(NA, 120, 34, 76, 69, 160))
wave_3 <- data.frame(ID=c(1, 2, 3, 4, 5, 6),
                      income_wave_3=c(34, 64, 78, NA, 170, 200))

at the end I want to calculate the mean in income of every person from wave_1, wave_2, wave_3. In every wave there are over 500 variables meassured for every person (ID). So I just want to extract the ID and the variable of interest (income). Untill now I did it manually via:

wave_1 <- read_dta("C:/User/all_waves/wave1.dta")
wave_2 <- read_dta("C:/User/all_waves/wave2.dta")
wave_3 <- read_dta("C:/User/all_waves/wave3.dta")
wave_1 <- wave_1 %>% select(ID, income_wave_1)
wave_2 <- wave_2 %>% select(ID, income_wave_2)
wave_3 <- wave_3 %>% select(ID, income_wave_3)

df_list_all_waves <- list(wave_1, wave_2, wave_3)

df_final_all_waves <- Reduce(function(x,y) full_join(x,y, all=TRUE), df_list_all_waves)

df_final <- df_final_all_waves %>% mutate(average_income_w1_to_w3 = rowMeans(df_final_all_waves [,2:4], na.rm=TRUE))  %>% select (ID,average_income_w1_to_w3 )


                   

thx a lot

Jon Spring
  • 55,165
  • 4
  • 35
  • 53
Max Herre
  • 47
  • 5
  • It's unclear to me which part you're struggling to automate. Perhaps you are looking for bind_rows with purrr or lapply? https://stackoverflow.com/questions/62309884/apply-a-rbind-or-bind-rows-conditionally-to-a-list-based-on-name – Jon Spring Nov 02 '22 at 21:57
  • 1
    Or the averaging you might use `library(tidyverse); bind_rows(wave_1, wave_2, wave_3) %>% pivot_longer(-ID) %>% filter(!is.na(value)) %>% group_by(ID) %>% summarize(avg = mean(value))` – Jon Spring Nov 02 '22 at 21:59
  • thx. its more about importing all the waves at once withe the corresponding / selected column that have other names in every year. (uncome_wave_1, income_wave_2, income_wave_3) and bringing together in one final dataframe. Or is there no other way, than manually read_dta(" ") and select(ID, Inome) and finally merging all of them? – Max Herre Nov 03 '22 at 08:42

1 Answers1

2

Regarding your questions from the comment section, both tasks (importing and column selection) can be done by matching the pattern, without giving the exact file/column names.

Import: list.files(..., pattern = 'wave[0-9]*.dta$'): all files from given directory ending in 'wave.dta', where x is any combination of any number of digits (0 or more), plus map() as a way to automate file import and selection.

Column selection: tidyselect::starts_with() for the changing column names (every column starting with 'income_wave_', assuming that this only holds true for the column you are interested in). Have a look into the package for similar helpers.

library(tidyverse)
wave_files <- list.files(
  'C:/User/all_waves/', 
  full.names = TRUE, 
  pattern = 'wave[0-9]*.dta$'
)
map(wave_files, ~ {
      data <- read_dta(.x)
      select(data, ID, starts_with('income_wave_'))
    }) %>% 
  reduce(full_join, all=TRUE)

Created on 2022-11-03 by the reprex package (v2.0.1)

maike
  • 220
  • 1
  • 6
  • Really nice! great comment. in the commans line "pattern =" I have one question. The files are named like this: top09_w_file.dta, top10_w_file.dta as an example for the years 2009 and 2010. so I changed it like this pattern = 'top[0-9]*[0-9]*_w_file.dta$' works fine. For the second part, im struggling with select(data, ID,.....) because the variable of interest is called k09d16 and k10d16. I tried select(data,ID, ends_with('d16')). how can I change that? – Max Herre Nov 03 '22 at 11:42
  • Error in reduce(., full_join, all = TRUE) : could not find function "reduce" – Max Herre Nov 03 '22 at 11:43
  • reduce is from purrr, maybe try updating and make sure it's loaded (which it should be after the tidyverse library call). Otherwise, fell free to save the result of map, which return as list and re-use your code from the original post (base Reduce version). – maike Nov 03 '22 at 12:30
  • regarding the regular expression: for the file name 'top[0-9]*_w_file.dta$' would do (the * means zero or more). For the columns, you could do 'k_[0-9]{2}d16' which would match exactly two digits after k_. You could go again with [0-9]* to also allow eg 4 digits for the year identifier. But if thats only one column ending in d16 per data set and this is what you need, your solution should also work just fine. Hope this helps otherwise feel free to comment on what is not working correctly. – maike Nov 03 '22 at 12:35