0

I have a large dataframe with 3 blocks of incomplete data. I want to convert this dataframe from a wide to long format using R.

Example

df <- structure(list(V1 = 1234:1240, V2 = structure(1:7, .Label = c("text1","text2", "text3", "text4", "text5", "text6", "text7"), class = "factor"), V3 = structure(c(1L, 1L, 1L, 1L, NA, NA, NA), .Label = "constant1", class = "factor"), V4 = structure(c(1L, 1L, 2L, 3L, NA, NA, NA), .Label = c("VariableA1", "VariableA2", "VariableA3"), class = "factor"), V5 = structure(c(1L, 2L, 1L, 2L, NA, NA, NA), .Label = c("VariableA4", "VariableA5"), class = "factor"), V6 = structure(c(NA, NA, NA, 1L, 1L,NA, NA), .Label = "constant2", class = "factor"), V7 = structure(c(NA, NA, NA, 1L, 2L, NA, NA), .Label = c("VariableB1", "VariableB2"), class = "factor"), V8 = structure(c(NA, NA, 1L, NA, NA, 
    1L, 1L), .Label = "constant3", class = "factor"), V9 = structure(c(NA, NA, 1L, NA, NA, 1L, 2L), .Label = c("VariableC1", "VariableC2"), class = "factor"), V10 = structure(c(NA, NA, 1L, NA, NA, 2L, 1L), .Label = c("VariableC3", "VariableC4"), class = "factor")), class = "data.frame", row.names = c(NA,-7L))

The data looks like this at present

1234    text1   constant1   VariableA1  VariableA4      NA          NA         NA          NA          NA
1235    text2   constant1   VariableA1  VariableA5      NA          NA         NA          NA          NA
1236    text3   constant1   VariableA2  VariableA4      NA          NA      constant3   VariableC1  VariableC3
1237    text4   constant1   VariableA3  VariableA5  constant2   VariableB1     NA          NA          NA
1238    text5       NA          NA          NA      constant2   VariableB2     NA          NA          NA
1239    text6       NA          NA          NA              NA          NA  constant3   VariableC1  VariableC4
1240    text7       NA          NA          NA              NA          NA  constant3   VariableC2  VariableC3

What I want is

1234    text1   constant1   VariableA1  VariableA4
1235    text2   constant1   VariableA1  VariableA5
1236    text3   constant1   VariableA2  VariableA4
1236    text3   constant3   VariableC1  VariableC3
1237    text4   constant1   VariableA3  VariableA5
1237    text4   constant2   VariableB1  NA
1238    text5   constant2   VariableB2  NA
1239    text6   constant3   VariableC1  VariableC4
1240    text7   constant3   VariableC2  VariableC3

In the actual data the values in the column 1 and 2 are not as consistent as those included here. In columns 3 to 10 the constant and Variables could be 1 to 3 different character values.

This is the closest potential answer I can find so far

Edit The ultimate aim of the query is to consolidate a wide df into a long format. Columns 3:5, 6:7 and 8:10 could be viewed as blocks. The answer below works but its not elegant or consistent. I need to manually sort the column orders after each run.

AudileF
  • 436
  • 2
  • 10
  • To clarify, the accepted answer to the linked question above is closest to what you are trying to accomplish? Can you specify a bit more which variables ultimately belong in which columns, or how to interpret a single row from your target df? – Michael Roswell Sep 08 '19 at 21:08

1 Answers1

0

For some reason I am not entirely sure what the goal was. My output doesn't exactly match yours but I think it probably has the tools you need. Happy to edit but I couldn't see why the two variable columns in your final output were separate.

I used tidyr::gather to get all of the "variables" in one column and all the "constants" in a second. Then I used dplyr::group_by and dplyr::summarize to keep the unique combinations of variables, but I bet you could find something a bit more elegant in tidyverse. Ditto how I dropped the NA rows (which you may or may not want to do, I wasn't totally clear on that either).

f <- structure(list(V1 = 1234:1240, V2 = structure(1:7, .Label = c("text1","text2", "text3", "text4", "text5", "text6", "text7"), class = "factor"), V3 = structure(c(1L, 1L, 1L, 1L, NA, NA, NA), .Label = "constant1", class = "factor"), V4 = structure(c(1L, 1L, 2L, 3L, NA, NA, NA), .Label = c("VariableA1", "VariableA2", "VariableA3"), class = "factor"), V5 = structure(c(1L, 2L, 1L, 2L, NA, NA, NA), .Label = c("VariableA4", "VariableA5"), class = "factor"), V6 = structure(c(NA, NA, NA, 1L, 1L,NA, NA), .Label = "constant2", class = "factor"), V7 = structure(c(NA, NA, NA, 1L, 2L, NA, NA), .Label = c("VariableB1", "VariableB2"), class = "factor"), V8 = structure(c(NA, NA, 1L, NA, NA, 1L, 1L), .Label = "constant3", class = "factor"), V9 = structure(c(NA, NA, 1L, NA, NA, 1L, 2L), .Label = c("VariableC1", "VariableC2"), class = "factor"), V10 = structure(c(NA, NA, 1L, NA, NA, 2L, 1L), .Label = c("VariableC3", "VariableC4"), class = "factor")), class = "data.frame", row.names = c(NA,-7L))

library(tidyverse)

g<-f %>% gather(origcol, vari, 4,5,6,9,10) %>% #get all the things called "variables" into one column gather(oc2, constant, V3,V8) %>% # the things called "constants" filter(!(is.na(vari)&is.na(constant))) %>% select(-c(origcol, oc2)) %>% #filter out rows where both variable and constant are NA group_by(V1, V2, vari, constant) %>% summarize(dummycol=n()) %>% #might be a more elegant way to do this select(-dummycol)

h<-g[complete.cases(g),] #and probably an idiomatic tidyverse way to drop rows with NA, which you might not want to do quite like this anyways

Michael Roswell
  • 1,300
  • 12
  • 31
  • Hi Michael thanks for the response. In your answer I get 4 columns rather then 5. Im I am looking to treat the 3 constants and their variables as blocks. I want to gather them into a table like above. – AudileF Sep 08 '19 at 13:15
  • Hi AudileF, Do you mean that column 4 and 5 should stay separate, and also 9,10? In that case I would do something similar to above but gather in a few more steps like `gather(origcol, vari1, 4,6,9) %>% gather(oc2, vari2, 5,10)`. Does that help? – Michael Roswell Sep 08 '19 at 13:36