2

I am trying to move data from one column to another, due to the underlying forms being filled out incorrectly.

In the form it asks for information on a household and asks for their age(AGE) and gender(SEX) for each member, allowing up to 5 people per household. However some users have filled in information for person 1,3 and 4, but not filled in any info for person 2 because they filled out person 2 incorrectly, crossed out the details and have filled person 2 details into the person 3 boxes etc.

The data looks like this (ref 1 and 5 are correct in this data, all others are incorrect)

df <- data.frame(
  ref = c(1, 2, 3, 4, 5, 6),
  AGE1 = c(45, 36, 26, 47, 24, NA),
  AGE2 = c(NA, 24, NA, 13, 57, 28),
  AGE3 = c(NA, NA, 35, NA, NA, 26),
  AGE4 = c(NA, NA, 15, 11, NA, NA),
  AGE5 = c(NA, 15, NA, NA, NA, NA),
  SEX1 = c("M", "F", "M", "M", "M", NA),
  SEX2 = c(NA, "M", NA, "F", "F", "F"),
  SEX3 = c(NA, NA, "M", NA, NA, "M"),
  SEX4 = c(NA, NA, "F", "F", NA, NA),
  SEX5 = c(NA, "F", NA, NA, NA, NA)
)

This is what the table looks like currently (I have replaced NA with - to make reading easier)

ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
1 45 - - - - M - - - -
2 36 24 - - 15 F M - - F
3 26 - 35 15 - M - M F -
4 47 13 - 11 - M F - F -
5 24 57 - - - M F - - -
6 - 28 26 - - - F M - -

but i would like it to look like this

ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
1 45 - - - - M - - - -
2 36 24 15 - - F M F - -
3 26 35 15 - - M M F - -
4 47 13 11 - - M F F - -
5 24 57 - - - M F - - -
6 28 26 - - - F M - - -

Is there a way of correcting this using dplyr? If not, is there another way in R of correcting the data

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
NiallMcS
  • 33
  • 4
  • Have you seen this? https://stackoverflow.com/questions/23285215/shifting-non-na-cells-to-the-left – Pete Aug 13 '21 at 11:20

4 Answers4

3

Here is a way using dplyr and tidyr. The approach involves pivoting the data to longer format, sorting the NA values to the end, renumbering the column names, and the pivoting to wide form again.

library(dplyr)
library(tidyr)

df <-  data.frame(ref, AGE1, AGE2, AGE3, AGE4, AGE5,
                  SEX1, SEX2, SEX3, SEX4, SEX5)
df %>% 
  mutate(across(starts_with("AGE"), as.character)) %>% 
  pivot_longer(2:11) %>%
  separate(name, into = c("cat", "num"), 3) %>%
  arrange(is.na(value)) %>%
  group_by(ref, cat) %>%
  mutate(num = seq_along(value)) %>%
  ungroup() %>%
  arrange(cat) %>%
  unite(name, cat, num, sep = "") %>%
  pivot_wider(id_cols = ref) %>%
  mutate(across(starts_with("AGE"), as.numeric))


# A tibble: 6 x 11
    ref  AGE1  AGE2  AGE3  AGE4  AGE5 SEX1  SEX2  SEX3  SEX4  SEX5 
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1     1    45    NA    NA    NA    NA M     NA    NA    NA    NA   
2     2    36    24    15    NA    NA F     M     F     NA    NA   
3     3    26    35    15    NA    NA M     M     F     NA    NA   
4     4    47    13    11    NA    NA M     F     F     NA    NA   
5     5    24    57    NA    NA    NA M     F     NA    NA    NA   
6     6    28    26    NA    NA    NA F     M     NA    NA    NA  
Ben Norris
  • 5,639
  • 2
  • 6
  • 15
  • This works well for getting the order the same as it was initially too. Why do you need to convert the age variables into a character and then back to a numeric? – NiallMcS Aug 13 '21 at 15:20
  • 1
    The pivot gets messy if the data types are not the same. – Ben Norris Aug 13 '21 at 20:10
3

Here's a way using dplyr and tidyr library.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = -ref, 
               names_to = c('.value', 'num'), 
               names_pattern = '([A-Z]+)(\\d+)') %>%
  arrange(ref, AGE, SEX) %>%
  group_by(ref) %>%
  mutate(num = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = num, values_from = c(AGE, SEX)) 

#    ref AGE_1 AGE_2 AGE_3 AGE_4 AGE_5 SEX_1 SEX_2 SEX_3 SEX_4 SEX_5
#  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
#1     1    45    NA    NA    NA    NA M     NA    NA    NA    NA   
#2     2    15    24    36    NA    NA F     M     F     NA    NA   
#3     3    15    26    35    NA    NA F     M     M     NA    NA   
#4     4    11    13    47    NA    NA F     F     M     NA    NA   
#5     5    24    57    NA    NA    NA M     F     NA    NA    NA   
#6     6    26    28    NA    NA    NA M     F     NA    NA    NA    
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • This is so smart using `pivot_longer` + `pivot_wider`, cheers, upvoted! – ThomasIsCoding Aug 13 '21 at 11:42
  • Thomas, Ronak thanks both for your help. Because I want to keep the order of the people as the form was filled out (i.e. if I had p1,p3,p4 completed i want p3 to fill p2 and p4 to fill p3) i got rid of the arrange(ref,AGE,SEX) line of the code. It appears to work correctly, but I'm not sure if I'm missing something? The actual data has a lot more columns than what i put on here, would you recommend just selecting out the age and sex columns, doing this method and then left_joining back into the data by the reference number? – NiallMcS Aug 13 '21 at 13:10
1

Try the base code below

u1 <- reshape(
  setNames(df, sub("(\\d)", ".\\1", names(df))),
  direction = "long",
  idvar = "ref",
  varying = -1
)

u2  <- reshape(
  transform(
    u1[with(u1, order(is.na(AGE), is.na(SEX))), ],
    time = ave(time, ref, FUN = seq_along)
  ),
  direction = "wide",
  idvar = "ref"
)

out <- u2[match(names(df),sub("\\.","",names(u2)))]

and you will get

> out
    ref AGE.1 AGE.2 AGE.3 AGE.4 AGE.5 SEX.1 SEX.2 SEX.3 SEX.4 SEX.5
1.1   1    45    NA    NA    NA    NA     M  <NA>  <NA>  <NA>  <NA>
2.1   2    36    24    15    NA    NA     F     M     F  <NA>  <NA>
3.1   3    26    35    15    NA    NA     M     M     F  <NA>  <NA>
4.1   4    47    13    11    NA    NA     M     F     F  <NA>  <NA>
5.1   5    24    57    NA    NA    NA     M     F  <NA>  <NA>  <NA>
6.2   6    28    26    NA    NA    NA     F     M  <NA>  <NA>  <NA>

data

df <- data.frame(
  ref = c(1, 2, 3, 4, 5, 6),
  AGE1 = c(45, 36, 26, 47, 24, NA),
  AGE2 = c(NA, 24, NA, 13, 57, 28),
  AGE3 = c(NA, NA, 35, NA, NA, 26),
  AGE4 = c(NA, NA, 15, 11, NA, NA),
  AGE5 = c(NA, 15, NA, NA, NA, NA),
  SEX1 = c("M", "F", "M", "M", "M", NA),
  SEX2 = c(NA, "M", NA, "F", "F", "F"),
  SEX3 = c(NA, NA, "M", NA, NA, "M"),
  SEX4 = c(NA, NA, "F", "F", NA, NA),
  SEX5 = c(NA, "F", NA, NA, NA, NA)
)
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
0

Here is a solution using package dedupewider:

library(dedupewider)

df <- data.frame(
  ref = c(1, 2, 3, 4, 5, 6),
  AGE1 = c(45, 36, 26, 47, 24, NA),
  AGE2 = c(NA, 24, NA, 13, 57, 28),
  AGE3 = c(NA, NA, 35, NA, NA, 26),
  AGE4 = c(NA, NA, 15, 11, NA, NA),
  AGE5 = c(NA, 15, NA, NA, NA, NA),
  SEX1 = c("M", "F", "M", "M", "M", NA),
  SEX2 = c(NA, "M", NA, "F", "F", "F"),
  SEX3 = c(NA, NA, "M", NA, NA, "M"),
  SEX4 = c(NA, NA, "F", "F", NA, NA),
  SEX5 = c(NA, "F", NA, NA, NA, NA)
)

age_moved <- na_move(df, cols = names(df)[grepl("^AGE\\d$", names(df))]) # 'right' direction is by default

sex_moved <- na_move(age_moved, cols = names(df)[grepl("^SEX\\d$", names(df))])

sex_moved

#>   ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
#> 1   1   45   NA   NA   NA   NA    M <NA> <NA>   NA   NA
#> 2   2   36   24   15   NA   NA    F    M    F   NA   NA
#> 3   3   26   35   15   NA   NA    M    M    F   NA   NA
#> 4   4   47   13   11   NA   NA    M    F    F   NA   NA
#> 5   5   24   57   NA   NA   NA    M    F <NA>   NA   NA
#> 6   6   28   26   NA   NA   NA    F    M <NA>   NA   NA
gss
  • 1,334
  • 6
  • 11