1

New to R here. I have a problem to solve: I need to create some new columns that count 1 if a sub-string appears one or more times in a string column. Like this:

Existing Column         New Col (True if apple)    New Col (True if banana)
apple, apple, orange            1                              0
banana, banana, orange          0                              1
apple, banana, orange           1                              1

Anyone can help me with this? Thank you very much in advance.

gogolaygo
  • 199
  • 1
  • 12
  • I tried to insert text but it came out wrong... how do I insert a table in my question? – gogolaygo Jan 25 '16 at 00:59
  • Copy and paste, then highlight it and hit the code button. Or just indent it all by 4 spaces. – alistaire Jan 25 '16 at 01:02
  • It may be useful to first take a look at how to create a df and paste its output here - for both sample input and sample output. You can use the 'braces' to format as code, which allows us all to view it in proper format. – Gopala Jan 25 '16 at 01:03
  • Ahh, got it thanks! Yeah as seen in the post, I want to create those new columns. – gogolaygo Jan 25 '16 at 01:22

3 Answers3

3

So I thought you wanted columns of counts (not whether strings are contained) the first time I read the question (the previous edit), but it's sort of useful code anyway, so I left it. Here are options for both base R and the stringr package:

First let's make a sample data.frame with similar data

# stringsAsFactors = FALSE would be smart here, but let's not assume...
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))   

which looks like

> df
           x
1 a, b, c, a
2    b, b, c
3       d, a

Base R

Use strsplit to make a list of vectors of separated strings, using as.character to coerce factors to a useful form,

list <- strsplit(as.character(df$x), ', ')

then make a list of unique strings

lvls <- unique(unlist(list))

Making Contains Columns

Loop over the rows of the data.frame/list with sapply. (All sapply functions in this answer could be replaced with for loops, but that's generally considered poor style in R for speed reasons.) Test if the unique strings are in each, and change to integer format. Set the result (transposed) to a new column of df, one for each unique string.

df[, lvls] <- t(sapply(1:nrow(df), function(z){as.integer(lvls %in% list[[z]])}))

> df
           x a b c d
1 a, b, c, a 1 1 1 0
2    b, b, c 0 1 1 0
3       d, a 1 0 0 1

To keep values as Boolean TRUE/FALSE instead of integers, just remove as.integer.

Making Count Columns

Loop over the rows of the data.frame/list with the outside sapply, while the inner one loops over the unique strings in each, and counts the occurrences by summing TRUE values. Set the result (transposed) to a new column of df, one for each unique string.

df[, lvls] <- t(sapply(1:nrow(df), function(z){
    sapply(seq_along(lvls), function(y){sum(lvls[y] == list[[z]])})
}))

> df
           x a b c d
1 a, b, c, a 2 1 1 0
2    b, b, c 0 2 1 0
3       d, a 1 0 0 1

stringr

stringr can make these tasks much more straightforward.

First, find unique strings in df$x. Split strings with str_split (which can take a factor), flatten them into a vector with unlist, and find unique ones:

library(stringr)
lvls <- unique(unlist(str_split(df$x, ', ')))

Making Contains Columns

str_detect allows us to only loop over the unique strings, not rows:

df[, lvls] <- sapply(lvls, function(y){as.integer(str_detect(df$x, y))})

Making Count Columns

str_count simplifies our syntax dramatically, again only looping over lvls:

df[,lvls] <- sapply(lvls, function(y){str_count(df$x, y)})

Results for both are identical to those in base R above.

alistaire
  • 42,459
  • 4
  • 77
  • 117
2

So, without full details, it is very hard to know exactly what you are looking for. But, if you are looking for number of times a given string occurs and adding as a column to the original data, here is one approach that works (replicating your data input):

df <- data.frame(Fruit = c('apple,orange,orange', 'banana,banana,pear', 'apple,banana,orange'), stringsAsFactors = FALSE)

df$appleCount <- lapply(strsplit(df$Fruit, ','), function(x) sum('apple' == x))
df$bananaCount <- lapply(strsplit(df$Fruit, ','), function(x) sum('banana' == x))

This will only work when you know the specific strings you identified to add as columns. But, should give you an idea of how to split strings, count how many of a given are in that split list, etc. Hope this helps.

Output from the above code should be this:

                Fruit appleCount bananaCount
1 apple,orange,orange          1           0
2  banana,banana,pear          0           2
3 apple,banana,orange          1           1

If you are not looking for the count of times a given string occurs, but just only a true/false (0/1) of whether the string occurs or not, you can use this slightly modified code to get that result:

df <- data.frame(Fruit = c('apple,orange,orange', 'banana,banana,pear', 'apple,banana,orange'), stringsAsFactors = FALSE)
df$appleCount <- lapply(strsplit(df$Fruit, ','), function(x) 'apple' %in% x)
df$bananaCount <- lapply(strsplit(df$Fruit, ','), function(x) 'banana' %in% x)

The output will be then as follows:

            Fruit appleCount bananaCount
1 apple,orange,orange       TRUE       FALSE
2  banana,banana,pear      FALSE        TRUE
3 apple,banana,orange       TRUE        TRUE

If you really want 0/1, you can use as.integer to convert a logical column to integer value.

Gopala
  • 10,363
  • 7
  • 45
  • 77
  • Up voting and/or accepting answer is how you can thank on Stackoverflow. :) – Gopala Jan 25 '16 at 01:47
  • I did... but because my account is new, stack overflow won't let me do it just yet. Follow up question: what if I want to look for both apple or banana in the string? Which means either banana or apple is in the string, I want the new column to return true. – gogolaygo Jan 25 '16 at 01:47
  • You need to look up '|' and '&' operators to combine. For example, you can do: `sum('apple' == x | 'banana' == x)` as one possibility. – Gopala Jan 25 '16 at 01:49
2

Using "df" from @user3949008's answer, you can also try cSplit_e from my "splitstackshape" package:

library(splitstackshape)
cSplit_e(df, "Fruit", ",", type = "character", fill = 0)
#                 Fruit Fruit_apple Fruit_banana Fruit_orange Fruit_pear
# 1 apple,orange,orange           1            0            1          0
# 2  banana,banana,pear           0            1            0          1
# 3 apple,banana,orange           1            1            1          0

You can always drop the columns you're not interested in later.

If you're after counts, you can try mtabulate from "qdapTools":

library(qdapTools)
mtabulate(strsplit(df$Fruit, ","))
#   apple banana orange pear
# 1     1      0      2    0
# 2     0      2      0    1
# 3     1      1      1    0
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485