
I am trying to extract the movie genre strings from a data set. The data is in the following format, where different reviewers have entered the genre types in arbitrary columns. Luckily, there are only 4 genre types (comedy, action, horror, scifi) in the dataset, but they can be repeated within a row. I need to extract those strings from the dataset.

id  movie v1      v2           v3       v4         v5     v6  
1   LTR   comedy  highbudget   action   comedy     jj     horror
2   MI    newmovie  fiction     scifi    funny      xx    jhee

I am expecting an output of the following form.

id  movie   genretype1 genretype2 genretype3   genretype4
1   LTR     comedy     action     comedy       horror
2   MI      scifi      ---        ---          ---

Any suggestions?

user3570187
  • Can you be a little more specific? I need to create new columns called genretype1, genretype2 etc and it should get the values from the columns (v1: v6). – user3570187 Aug 06 '15 at 20:03
  • @akrun this is just a dummy data. Grouping into one column doesn't have a meaning and it corresponds to the first genre type. – user3570187 Aug 06 '15 at 20:11
  • It's okay. Anyway, you got a solution. I hope I will get a chance to help you in the future – akrun Aug 06 '15 at 20:12
  • Cheers @akrun, I think that solution won't work, as I need to concatenate the strings further for analysis. Let me know if you have a pseudo solution! – user3570187 Aug 06 '15 at 20:19
  • Perhaps `lst1 <- apply(df1[-(1:2)], 1, function(x) types[match(x, types, nomatch=0)]); data.frame(df1[1:2], do.call(rbind,lapply(lst1, `length<-`, max(lengths(lst1)))))` where types is from SenorO's post – akrun Aug 06 '15 at 20:26
  • Great the first line of code worked really well. I have trouble in binding the columns. I am getting the error "Error: unexpected symbol in "data.frame(dataset[1:2] do.call" – user3570187 Aug 06 '15 at 20:36
  • Ok, it is the backquotes that caused the trouble. `do.call(rbind, lapply(lst1, 'length<-', max(lengths(lst1))))` – akrun Aug 06 '15 at 20:37
  • I have no words to express my happiness! Thanks a lot @akrun – user3570187 Aug 06 '15 at 20:39
  • Hi @akrun! I hope you are doing great. Can you answer this question, which I posted in a different thread? http://stackoverflow.com/questions/32478685/text-mining-pdf-files-issues-with-word-frequencies I made it reproducible. Thanks so much!! – user3570187 Sep 10 '15 at 19:50

2 Answers


This is how I would do it. It makes more sense to use a list than a data.frame, since the number of genres differs from movie to movie:

> types = c("comedy", "action", "horror", "scifi")
> List = apply(df, 1, function(x) types[types %in% x[-c(1, 2)]])
> names(List) <- df$movie
> List
$LTR
[1] "comedy" "action" "horror"

$MI
[1] "scifi"

Alternatively, this solution could give you a tidy data.frame:

> Matrix = t(apply(df, 1, function(x) types %in% x[-c(1, 2)]))
> colnames(Matrix) = types
> cbind(df[,1:2], Matrix)
  id movie comedy action horror scifi
1  1   LTR   TRUE   TRUE   TRUE FALSE
2  2    MI  FALSE  FALSE  FALSE  TRUE
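
The TRUE/FALSE table above records whether a genre appears, not how often, so the repeated "comedy" for LTR is lost. If the repetitions matter, one possible variation is to count the mentions instead. This is only a sketch: the data frame `df` is rebuilt here from the sample in the question, and `counts` is a name introduced for illustration.

```r
# Sample data rebuilt from the question (an assumption about how df was created)
df <- data.frame(
  id    = 1:2,
  movie = c("LTR", "MI"),
  v1 = c("comedy", "newmovie"),
  v2 = c("highbudget", "fiction"),
  v3 = c("action", "scifi"),
  v4 = c("comedy", "funny"),
  v5 = c("jj", "xx"),
  v6 = c("horror", "jhee"),
  stringsAsFactors = FALSE
)
types <- c("comedy", "action", "horror", "scifi")

# For each genre, count how many of the v1..v6 cells in every row equal it.
# The result is a matrix with one row per movie and one column per genre.
counts <- sapply(types, function(g) rowSums(df[-(1:2)] == g))

# Attach the identifier columns to the counts
cbind(df[1:2], counts)
```

With the sample data, the comedy count for LTR is 2 and the scifi count for MI is 1; a count greater than zero corresponds to TRUE in the logical version.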
Señor O

We can match each row of 'df1', excluding the first two identifier columns, against 'types'. The elements of the resulting list 'lst1' may differ in length, so we pad the shorter ones with NA up to the length of the longest element, rbind the list elements, and combine the result with the identifier columns in a new data.frame.

 types <- c("comedy", "action", "horror", "scifi")
 lst1 <- apply(df1[-(1:2)], 1, function(x) 
                       types[match(x, types, nomatch=0)])
 res <- data.frame(df1[1:2], do.call(rbind, lapply(lst1, 
                             'length<-', max(lengths(lst1)))))
 res
 # id movie     X1     X2     X3     X4
 #1  1   LTR comedy action comedy horror
 #2  2    MI  scifi   <NA>   <NA>   <NA>

NOTE: We can change the column names if needed.

colnames(res)[-(1:2)] <- paste0('genretype', 1:4)
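
One caveat with the approach above: `apply` returns a matrix rather than a list when every row yields the same number of matches, and the padding step would then not behave as intended. A hedged, self-contained sketch that sidesteps this by looping over row indices with `lapply` is shown below. The data frame `df1` is rebuilt from the sample in the question, `n` and `padded` are names introduced for illustration, and `lengths()` needs R 3.2.0 or later.

```r
# Sample data rebuilt from the question (an assumption about how df1 was created)
df1 <- data.frame(
  id    = 1:2,
  movie = c("LTR", "MI"),
  v1 = c("comedy", "newmovie"),
  v2 = c("highbudget", "fiction"),
  v3 = c("action", "scifi"),
  v4 = c("comedy", "funny"),
  v5 = c("jj", "xx"),
  v6 = c("horror", "jhee"),
  stringsAsFactors = FALSE
)
types <- c("comedy", "action", "horror", "scifi")

# lapply always returns a list, even if all rows have the same number of genres
lst1 <- lapply(seq_len(nrow(df1)), function(i) {
  x <- unlist(df1[i, -(1:2)], use.names = FALSE)
  types[match(x, types, nomatch = 0)]  # keeps reviewer order and repetitions
})

# Pad every element with NA up to the longest one, then stack them as rows
n <- max(lengths(lst1))
padded <- do.call(rbind, lapply(lst1, `length<-`, n))

# Name the columns from the actual width instead of hard-coding 1:4
colnames(padded) <- paste0("genretype", seq_len(n))

res <- data.frame(df1[1:2], padded, stringsAsFactors = FALSE)
res
```

In a script the backquoted `length<-` parses fine; the quoted form 'length<-' used above is equivalent and easier to paste from a comment.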
akrun