
I have a data frame with a Genre column containing values like Action,Romance. I want to split those values and create a binary vector for each row. If Action, Romance, and Drama are all the possible genres, then that row would be 1,1,0 in the output data frame.

I found two related SO posts and the CRAN documentation covering cSplit_e, but when I use it I don't get a binary data frame as output. Instead I get the original data frame with a few values scrambled.

a = cSplit_e(df4, "Genre", sep = ",", mode = "binary", type = "character", drop = TRUE, fixed = TRUE, fill = 0)

Edit: The issue appears to be that it's adding the new columns to the old data frame, instead of creating a new frame. How can I get the Genres into their own frame?

> names(a)
 [1] "Title"             "Year"              "Rated"             "Released"          "Runtime"           "Genre"             "Director"          "Writer"            "Actors"           
[10] "Plot"              "Language"          "Country"           "Awards"            "Poster"            "Metascore"         "imdbRating"        "imdbVotes"         "imdbID"           
[19] "Type"              "tomatoMeter"       "tomatoImage"       "tomatoRating"      "tomatoReviews"     "tomatoFresh"       "tomatoRotten"      "tomatoConsensus"   "tomatoUserMeter"  
[28] "tomatoUserRating"  "tomatoUserReviews" "tomatoURL"         "DVD"               "BoxOffice"         "Production"        "Website"           "Response"          "Budget"           
[37] "Domestic_Gross"    "Gross"             "Date"              "Genre_Action"      "Genre_Adult"       "Genre_Adventure"   "Genre_Animation"   "Genre_Biography"   "Genre_Comedy"     
[46] "Genre_Crime"       "Genre_Documentary" "Genre_Drama"       "Genre_Family"      "Genre_Fantasy"     "Genre_Film-Noir"   "Genre_Game-Show"   "Genre_History"     "Genre_Horror"     
[55] "Genre_Music"       "Genre_Musical"     "Genre_Mystery"     "Genre_N/A"         "Genre_News"        "Genre_Reality-TV"  "Genre_Romance"     "Genre_Sci-Fi"      "Genre_Short"      
[64] "Genre_Sport"       "Genre_Talk-Show"   "Genre_Thriller"    "Genre_War"         "Genre_Western"    
James L.
  • You'll have to show a few sample rows where that doesn't work. This works, for example: `df4 <- data.frame(Genre = c("Action,Romance", "Action,Romance,Drama"));cSplit_e(df4, "Genre", ",", mode = "binary", type = "character", fill = 0, drop = TRUE)`. – A5C1D2H2I1M1N2O1R2T1 Feb 20 '18 at 15:07
  • Awesome, thanks for that snippet! It made me double-check the returned frame, and I found that the binary data is being added, but it's modifying the old frame instead of returning a new one. Is returning a new one possible? Should I just create a separate frame from the original `Genre` data and then use `cSplit_e` to modify it, separately from the rest of the data? – James L. Feb 20 '18 at 15:19
  • Lots of info in [this link](https://stackoverflow.com/questions/42387859/dummify-character-column-and-find-unique-values) – Sotos Feb 20 '18 at 15:19
  • @JamesL., are you just asking how to get just the "Genre" columns without all the other columns? Maybe: `a[startsWith(names(a), "Genre")]`? – A5C1D2H2I1M1N2O1R2T1 Feb 20 '18 at 15:24
  • Wow that's great, thank you! Write it up into an answer and I'll accept it – James L. Feb 20 '18 at 15:28

1 Answer


The drop argument only removes the column being split; all of the other columns in the data.frame are kept. Thus, to get just the split columns afterwards, select the columns whose names start with the original column name.

Example:

> a <- cSplit_e(df4, "Genre", ",", mode = "binary", type = "character", fill = 0, drop = TRUE)
> a
  id Genre_Action Genre_Drama Genre_Romance
1  1            1           0             1
2  2            1           1             1
> a[startsWith(names(a), "Genre")]
  Genre_Action Genre_Drama Genre_Romance
1            1           0             1
2            1           1             1

Where:

df4 <- structure(list(Genre = c("Action,Romance", "Action,Romance,Drama"), id = 1:2), 
  .Names = c("Genre", "id"), row.names = 1:2, class = "data.frame")
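
If you'd rather build the genre frame on its own without going through `cSplit_e`, here is a minimal base-R sketch. This is not what `cSplit_e` does internally, just plain `strsplit` plus `%in%`. The names `parts`, `genres`, `flags` and `genre_df` are made up for illustration, and it assumes the genres are separated by a bare comma with no surrounding spaces (wrap the pieces in `trimws()` if yours have spaces).

```r
# Same sample data as in the example above.
df4 <- data.frame(
  Genre = c("Action,Romance", "Action,Romance,Drama"),
  id = 1:2,
  stringsAsFactors = FALSE
)

# Split each row's string into its individual genres.
parts <- strsplit(df4$Genre, ",", fixed = TRUE)

# All distinct genres, sorted so the column order is stable.
genres <- sort(unique(unlist(parts)))

# One row of 0/1 flags per original row; rbind keeps a matrix shape
# even when there is only a single distinct genre.
flags <- do.call(rbind, lapply(parts, function(p) as.integer(genres %in% p)))
colnames(flags) <- paste0("Genre_", genres)

# Standalone data frame holding only the indicator columns
# (genre_df is a hypothetical name).
genre_df <- as.data.frame(flags)
genre_df
```

For the two sample rows this should give the same `Genre_Action`, `Genre_Drama` and `Genre_Romance` columns as the `cSplit_e` output above, and the original `df4` is left untouched.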
A5C1D2H2I1M1N2O1R2T1