On my first read of the question (the previous edit), I thought you wanted columns of counts rather than indicators of whether each string is present. The count code is still useful, so I left it in. Here are options for both tasks, in base R and with the stringr package:
First, let's make a sample data.frame with similar data:
# stringsAsFactors = FALSE would be smart here, but let's not assume...
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))
which looks like
> df
           x
1 a, b, c, a
2    b, b, c
3       d, a
Base R
Use strsplit to make a list of vectors of separated strings, using as.character to coerce factors to a useful form,
list <- strsplit(as.character(df$x), ', ')
then make a vector of the unique strings:
lvls <- unique(unlist(list))
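To make the two intermediate objects concrete, here is a minimal self-contained sketch. The names demo and parts are illustrative only and are not part of the code above; the data is the same sample data.

```r
# Illustrative copy of the sample data (hypothetical name: demo)
demo <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))

# One character vector per row of the data.frame
parts <- strsplit(as.character(demo$x), ', ')
parts[[1]]
# [1] "a" "b" "c" "a"

# Unique strings, in order of first appearance
unique(unlist(parts))
# [1] "a" "b" "c" "d"
```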
Making Contains Columns
Loop over the rows of the data.frame/list with sapply. (Any sapply call in this answer could be replaced with a for loop, but sapply is generally considered more idiomatic in R.) For each row, test which of the unique strings are present and convert the result to integers. Because sapply returns one column per row, transpose the result before assigning it to new columns of df, one for each unique string.
df[, lvls] <- t(sapply(1:nrow(df), function(z){as.integer(lvls %in% list[[z]])}))
> df
           x a b c d
1 a, b, c, a 1 1 1 0
2    b, b, c 0 1 1 0
3       d, a 1 0 0 1
To keep values as Boolean TRUE/FALSE instead of integers, just remove as.integer.
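As a rough sketch of that logical variant, assuming the same sample data (demo and parts are illustrative names, not from the code above):

```r
# Illustrative copy of the sample data (hypothetical names: demo, parts)
demo <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))
parts <- strsplit(as.character(demo$x), ', ')
lvls <- unique(unlist(parts))

# Same membership test as above, without as.integer(),
# so the new columns stay logical
demo[, lvls] <- t(sapply(seq_along(parts), function(z) lvls %in% parts[[z]]))

demo$a
# [1]  TRUE FALSE  TRUE
```

Logical columns can still be summed or averaged directly, since R treats TRUE as 1 and FALSE as 0 in arithmetic.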
Making Count Columns
Loop over the rows of the data.frame/list with the outer sapply, while the inner one loops over the unique strings and counts the occurrences of each by summing TRUE values. Assign the transposed result to new columns of df, one for each unique string.
df[, lvls] <- t(sapply(1:nrow(df), function(z){
  sapply(seq_along(lvls), function(y){sum(lvls[y] == list[[z]])})
}))
> df
           x a b c d
1 a, b, c, a 2 1 1 0
2    b, b, c 0 2 1 0
3       d, a 1 0 0 1
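If the nested sapply feels heavy, one possible alternative (not the approach used above) is to let table() do the counting on a factor whose levels are fixed to lvls, so strings absent from a row are counted as zero. A minimal sketch, with demo, parts and counts as illustrative names:

```r
# Illustrative copy of the sample data (hypothetical names: demo, parts, counts)
demo <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))
parts <- strsplit(as.character(demo$x), ', ')
lvls <- unique(unlist(parts))

# table() on a factor with fixed levels counts every level, including zeros;
# as.vector() drops the table class so sapply builds a plain integer matrix
counts <- t(sapply(parts, function(v) as.vector(table(factor(v, levels = lvls)))))
demo[, lvls] <- counts

demo$a
# [1] 2 0 1
```

This keeps a single loop over rows and leaves the per-string counting to table().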
stringr
stringr can make these tasks much more straightforward.
First, find unique strings in df$x. Split strings with str_split (which can take a factor), flatten them into a vector with unlist, and find unique ones:
library(stringr)
lvls <- unique(unlist(str_split(df$x, ', ')))
Making Contains Columns
str_detect allows us to only loop over the unique strings, not rows:
df[, lvls] <- sapply(lvls, function(y){as.integer(str_detect(df$x, y))})
Making Count Columns
str_count simplifies our syntax dramatically, again only looping over lvls:
df[,lvls] <- sapply(lvls, function(y){str_count(df$x, y)})
Results for both are identical to those in base R above.
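One caveat worth sketching: str_detect and str_count treat each unique string as a regular expression and match it anywhere in the text, so a value such as 'a' would also match inside 'ab'. That does not matter for the single-letter sample data, but if real values can contain one another, wrapping the pattern in word boundaries is one possible workaround. The vector x and the helper pattern_for below are hypothetical, and the sketch assumes the values are plain alphanumeric strings with no regex metacharacters.

```r
library(stringr)

# Hypothetical data where one value ('a') is a substring of another ('ab')
x <- c('ab, c', 'a, ab')

# Plain pattern: also counts the 'a' inside 'ab'
str_count(x, 'a')
# [1] 1 2

# Word boundaries restrict the match to whole values
# (assumption: values contain no regex metacharacters)
pattern_for <- function(y) paste0('\\b', y, '\\b')
str_count(x, pattern_for('a'))
# [1] 0 1
```

The base R versions above do not have this issue, because they compare whole split strings rather than searching within the original text.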