Extracting strings from different columns and tidying data in R

Question

I am trying to extract the genre strings from a movie data set. The data is in the following format, where the genre types were entered by different reviewers and are scattered across the columns. Luckily there are only 4 genre types (comedy, action, horror, scifi) in the dataset, but there are also repetitions. So I need to extract those strings from the dataset.

id  movie  v1        v2          v3      v4      v5  v6
1   LTR    comedy    highbudget  action  comedy  jj  horror
2   MI     newmovie  fiction     scifi   funny   xx  jhee

I am expecting an output of the following form.

id  movie   genretype1 genretype2 genretype3   genretype4
1   LTR     comedy     action     comedy       horror
2   MI      scifi      ---        ---          ---

Any suggestions?

Can you be a little more specific? I need to create new columns called genretype1, genretype2, etc., and they should get their values from the columns v1:v6. – user3570187 Aug 06 '15 at 20:03
@akrun this is just dummy data. Grouping into one column doesn't have a meaning, and it corresponds to the first genre type. – user3570187 Aug 06 '15 at 20:11
It's okay. Anyway, you got a solution. I hope I will get a chance to help you in the future. – akrun Aug 06 '15 at 20:12
Cheers @akrun. I think that solution won't work, as I need to concatenate the strings further for analysis. Let me know if you have a pseudo solution! – user3570187 Aug 06 '15 at 20:19
Perhaps `lst1 <- apply(df1[-(1:2)], 1, function(x) types[match(x, types, nomatch=0)]); data.frame(df1[1:2], do.call(rbind,lapply(lst1, `length<-`, max(lengths(lst1)))))` where types is from SenorO's post – akrun Aug 06 '15 at 20:26
Great, the first line of code worked really well. I have trouble binding the columns. I am getting the error "Error: unexpected symbol in "data.frame(dataset[1:2] do.call" – user3570187 Aug 06 '15 at 20:36
OK, it is the backquotes that caused the trouble. `do.call(rbind, lapply(lst1, 'length<-', max(lengths(lst1))))` – akrun Aug 06 '15 at 20:37
I have no words to express my happiness! Thanks a lot @akrun – user3570187 Aug 06 '15 at 20:39
Hi @akrun! I hope you are doing great. Can you answer this question, which I posted in a different thread? http://stackoverflow.com/questions/32478685/text-mining-pdf-files-issues-with-word-frequencies I made it reproducible. Thanks so much! – user3570187 Sep 10 '15 at 19:50

score 1 · Answer 1 · answered Aug 06 '15 at 20:03

This is how I would do it. It makes more sense to use a list than a data.frame:

> types = c("comedy", "action", "horror", "scifi")
> List = apply(df, 1, function(x) types[types %in% x[-c(1, 2)]])
> names(List) <- df$movie
> List
$LTR
[1] "comedy" "action" "horror"

$MI
[1] "scifi"

Alternatively, this solution could give you a tidy data.frame:

> Matrix = t(apply(df, 1, function(x) types %in% x[-c(1, 2)]))
> colnames(Matrix) = types
> cbind(df[,1:2], Matrix)
  id movie comedy action horror scifi
1  1   LTR   TRUE   TRUE   TRUE FALSE
2  2    MI  FALSE  FALSE  FALSE  TRUE
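
To reproduce the two snippets above, a minimal self-contained sketch follows. It assumes the question's sample table is stored in a data frame called `df` with plain character columns (the construction below is retyped from the question, and `stringsAsFactors = FALSE` is an assumption). The identifier columns are dropped before `apply()` rather than inside the function, which should give the same result. Note that `types %in% x` asks, for each genre, whether it occurs anywhere in the row, so each genre is reported at most once and in the order of `types`, not in row order.

```r
# Sample data retyped from the question (assumed layout, character columns)
df <- data.frame(
  id    = 1:2,
  movie = c("LTR", "MI"),
  v1 = c("comedy", "newmovie"),
  v2 = c("highbudget", "fiction"),
  v3 = c("action", "scifi"),
  v4 = c("comedy", "funny"),
  v5 = c("jj", "xx"),
  v6 = c("horror", "jhee"),
  stringsAsFactors = FALSE
)

types <- c("comedy", "action", "horror", "scifi")

# One vector of distinct genres per movie; only v1..v6 are passed to apply()
genres_by_movie <- apply(df[-(1:2)], 1, function(x) types[types %in% x])
names(genres_by_movie) <- df$movie

# Logical indicator matrix: one row per movie, one column per genre
indicator <- t(apply(df[-(1:2)], 1, function(x) types %in% x))
colnames(indicator) <- types
tidy <- cbind(df[1:2], indicator)
```

With the sample data the row lengths differ, so `apply()` returns a list; if every row contained the same number of distinct genres it would return a matrix instead.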

Señor O

17,049
2
45
47

I need to have the values listed as genretype1, genretype2, etc., as I need to concatenate them for analysis. – user3570187 Aug 06 '15 at 20:15
Also, when I list the genres for LTR I am expecting comedy action comedy horror, not comedy action horror. – user3570187 Aug 06 '15 at 20:17

akrun · Accepted Answer · 2015-08-11T20:18:17.797

We can match each row of 'df1' (excluding the first two identifier columns) against 'types'. With nomatch=0, cells that are not a genre get index 0 and are dropped when we subset 'types', so the genres come out in row order with repetitions kept. The list elements in 'lst1' may not all have the same length, so we pad the shorter elements with NA up to the length of the longest one, rbind the list elements, and create a new data.frame.

 types <- c("comedy", "action", "horror", "scifi")
 lst1 <- apply(df1[-(1:2)], 1, function(x) 
                       types[match(x, types, nomatch=0)])
 res <- data.frame(df1[1:2], do.call(rbind, lapply(lst1, 
                             'length<-', max(lengths(lst1)))))
 res
 # id movie     X1     X2     X3     X4
 #1  1   LTR comedy action comedy horror
 #2  2    MI  scifi   <NA>   <NA>   <NA>

NOTE: We can change the column names if needed.

colnames(res)[-(1:2)] <- paste0('genretype', 1:4)
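
As a hedged, self-contained sketch of the same idea, the version below rebuilds the sample data from the question (the `df1` construction is an assumption about how the data is stored) and loops over row indices with `lapply()` instead of `apply()`. That swap is a variation, not part of the answer above; it always returns a list, whereas `apply()` simplifies to a matrix or vector when every row yields the same number of genres. The final line illustrates the concatenation mentioned in the comments; `genres` is a hypothetical column name.

```r
# Sample data retyped from the question (assumed layout, character columns)
df1 <- data.frame(
  id    = 1:2,
  movie = c("LTR", "MI"),
  v1 = c("comedy", "newmovie"),
  v2 = c("highbudget", "fiction"),
  v3 = c("action", "scifi"),
  v4 = c("comedy", "funny"),
  v5 = c("jj", "xx"),
  v6 = c("horror", "jhee"),
  stringsAsFactors = FALSE
)

types <- c("comedy", "action", "horror", "scifi")

# One character vector per row: keep only cells that are known genres,
# in row order, with repetitions preserved
lst1 <- lapply(seq_len(nrow(df1)), function(i) {
  x <- unlist(df1[i, -(1:2)], use.names = FALSE)
  x[x %in% types]
})

# Pad every vector with NA up to the longest one, then stack into a matrix
n_max  <- max(lengths(lst1))
padded <- do.call(rbind, lapply(lst1, `length<-`, n_max))
colnames(padded) <- paste0("genretype", seq_len(n_max))

res <- data.frame(df1[1:2], padded, stringsAsFactors = FALSE)

# Concatenate the genres per movie from the unpadded list (no NA values)
res$genres <- vapply(lst1, paste, character(1), collapse = " ")
```

Naming the columns from `n_max` rather than a fixed `1:4` keeps the code working if a larger data set has rows with more or fewer genre entries than the sample.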
