making dataframe by combining columns with lists with missing data, strsplit, without an index

Question

Apologies if this is obvious, I've found something for when there's an index or for when columns are missing. But I don't think either will work for this.

Example data:

df.test=data.frame( A=c("n,n,y,n"  ,"t", "j,k,k")
                   ,B=c("n,y,y,n"  ,"" , "k,k,k")
                   ,C=c("n,y,y,n,n","t", "j,k,j")
                   ,D=c(""         ,"" , "k,k,j")
                     )

df.test=lapply(df.test, function(x) as.character(x))
str(df.test)   # looks similar to my data

List of 4
 $ A: chr [1:3] "n,n,y,n" "t" "j,k,k"
 $ B: chr [1:3] "n,y,y,n" "" "k,k,k"
 $ C: chr [1:3] "n,y,y,n,n" "t" "j,k,j"
 $ D: chr [1:3] "" "" "k,k,j"

My aim is a dataframe:

A B C D
n n n NA
n y y NA
y y y NA
n n n NA
t NA t NA
j k j k
k k k k
k k j j

I'd like column A to be the reference, however it doesn't have unique values. However, it has the maximum number of values permitted from each list (I hope that makes sense). So, the fifth value in C list 1 should be dropped, ie, n y y n n -> n y y n. Also, missing values need to be added, (missing according to column A). The extra value in C is a bug from other software (which I don't have influence over). Other than those extra values they correspond to each other, eg the t's should be on the same row (if present).

The best I've done so far is to make a list of vectors, the lists have different lengths, so I can't put them together and they don't correspond.

df3=lapply(df.test, function(x) unlist(strsplit(x,',')))
str(df3)

List of 4
 $ A: chr [1:8] "n" "n" "y" "n" ...
 $ B: chr [1:7] "n" "y" "y" "n" ...
 $ C: chr [1:9] "n" "y" "y" "n" ...
 $ D: chr [1:3] "k" "k" "j"

bgoldst · Accepted Answer · 2016-07-24T01:42:43.840

Since you want column A to govern, we can first precompute the field lengths of the string elements in column A. This can be done with an initial strsplit() call on df.test[[1L]], and then taking the lengths().

Then, we can use lapply() to iterate over all columns and split them with strsplit(). To ensure we cap each split string vector according to the governing length from column A, we must index each split string vector from 1 to the length from column A, with the index vector computed via seq_len(). This can be done using Map() to iterate over both the split string vector and the precomputed lengths vector in parallel. Indexing the split string vector conveniently accomplishes two things: (1) caps the length, and (2) returns NA for tailing indexes not represented in the current column's split string vector. We can finally unlist() the resulting list of split string vectors to get a single column vector, and wrap the whole thing in as.data.frame() to coerce the list to data.frame.

ls1 <- lengths(strsplit(df.test[[1L]],','));
as.data.frame(lapply(df.test,function(x)
    unlist(Map(function(ss,l) ss[seq_len(l)],strsplit(x,','),ls1))
),stringsAsFactors=F);
##   A    B C    D
## 1 n    n n <NA>
## 2 n    y y <NA>
## 3 y    y y <NA>
## 4 n    n n <NA>
## 5 t <NA> t <NA>
## 6 j    k j    k
## 7 k    k k    k
## 8 k    k j    j

Wow. I'm a long way from coming up with anything like that. Worked on the example data and on the actual data. I managed to get a nested loop working, but glad I don't need to use it. Many thanks. — john, Jul 24 '16 at 01:58

score 0 · Answer 2 · answered Jul 24 '16 at 02:04

See the answer from bgoldst above.

Below is what I came up with, just in case it's of use to anyone. It works on the example data, but not on the actual data, was just looking into why when a far better answer was posted.

Thanks again.

df6=lapply(df.test, function(x) strsplit(x,','))

  df7=data.frame()
  df7=lapply(df7, function(x) as.character(x))

  for (i in 1:length(df6$A)){
    for (ii in 1:length(df6$A[[i]])){
      df7=rbind(df7,sapply(df6,function(x) x[[i]][ii]))
    }
  }

  str(df7)

  print(df7)

making dataframe by combining columns with lists with missing data, strsplit, without an index

2 Answers2