I originally tried to extract genres from the Kaggle IMDB data set:

https://www.kaggle.com/param1/d/deepmatrix/imdb-5000-movie-dataset/the-money-makers

The raw genre data comes in a format like Action_Adventure_Comedy. I used str_split to split the genres into separate columns, which gives:

V1          V2          V3    
Action      Adventure   Comedy
Adventure   Comedy      Horror
Action      Adventure   Horror   
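
For anyone wanting to reproduce the table above, here is a minimal sketch of the split step. It uses base R's `strsplit` rather than `str_split` so that it runs without extra packages; the vector `raw_genres` and the data frame name `mydf` are made up for illustration, and the sketch assumes every row has the same number of genres.

```r
# Hypothetical raw values in the underscore-separated format described above
raw_genres <- c("Action_Adventure_Comedy",
                "Adventure_Comedy_Horror",
                "Action_Adventure_Horror")

# Split each string on "_" and bind the pieces into one row per movie
parts <- strsplit(raw_genres, "_", fixed = TRUE)
mydf <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

print(mydf)  # columns are named V1, V2, V3 by default
```

If the rows had different numbers of genres, the pieces would need padding with `NA` before `rbind`, which this sketch does not handle.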

What I want to create is a dummy variable for each genre, each in its own column. For every row, this should scan V1 through V4 to see whether any of them contains the genre, and return 1 if so and 0 otherwise. The output I want is as follows:

Action      Adventure   Comedy    Horror
1           1           1         0
0           1           1         1
1           1           0         1
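
One possible way to get from the split columns to this output, as a rough base R sketch rather than a definitive answer: collect the distinct genres, then for each genre check row by row whether any column equals it. The data frame `mydf` is rebuilt here from the three example rows, and the names `genres` and `dummies` are only illustrative.

```r
# Example data copied from the split table in the question
mydf <- data.frame(
  V1 = c("Action", "Adventure", "Action"),
  V2 = c("Adventure", "Comedy", "Adventure"),
  V3 = c("Comedy", "Horror", "Horror"),
  stringsAsFactors = FALSE
)

# Every distinct genre that appears in any column, sorted alphabetically
genres <- sort(unique(unlist(mydf, use.names = FALSE)))

# For each genre: 1 if any column in the row equals it, otherwise 0.
# na.rm = TRUE lets rows padded with NA (fewer genres) still work.
dummies <- sapply(genres, function(g) {
  as.integer(rowSums(mydf == g, na.rm = TRUE) > 0)
})

print(dummies)
```

The result is an integer matrix with one column per genre, which can be attached to the original data with `cbind` if needed.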

Please note that I want one column per single genre, not per combination of genres (e.g. Action, not Action_Adventure), which is why I am unable to use model.matrix. Any help would be greatly appreciated.

Stu

Stu Richards
  • Why do you want to do this? See `?lm`, `?formula`, and `?model.matrix`. – MichaelChirico Dec 07 '16 at 04:40
  • Hmm... The chosen duplicate isn't a great duplicate for this question (but I'm sure duplicates exist). I would probably take an approach like `t(table(unlist(mydf, use.names = FALSE), rep(seq(nrow(mydf)), ncol(mydf))))`.... – A5C1D2H2I1M1N2O1R2T1 Dec 07 '16 at 05:56
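
The `table` one-liner from the comment above can be tried on the example rows as follows. This is only a sketch: `mydf` is rebuilt from the question's table, and the final clipping step is an addition not present in the comment, included on the assumption that a genre could appear twice in one row.

```r
# Example data copied from the split table in the question
mydf <- data.frame(
  V1 = c("Action", "Adventure", "Action"),
  V2 = c("Adventure", "Comedy", "Adventure"),
  V3 = c("Comedy", "Horror", "Horror"),
  stringsAsFactors = FALSE
)

# unlist() walks the data frame column by column, so the row index
# 1..nrow has to be repeated once per column to line up with the values
counts <- t(table(unlist(mydf, use.names = FALSE),
                  rep(seq(nrow(mydf)), ncol(mydf))))

# table() counts occurrences; clip to 0/1 in case a genre repeats in a row
dummies <- (counts > 0) + 0L

print(dummies)
```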