Using dplyr and stringr to replace all values that start with a given string

Question

My df:

> df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100))
>   df
          food      sold
1 fruit banana  99.47171
2  fruit apple  99.40878
3  fruit grape  99.28727
4        bread  99.15934
5         meat 100.53438

Now I want to replace every value in food that starts with "fruit" with just "fruit", then group by food and sum sold.

> df %>%
+     mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>% 
+     group_by(food) %>% 
+     summarise(sold = sum(sold))
Source: local data frame [3 x 2]

    food      sold
  (fctr)     (dbl)
1  bread  99.15934
2   meat 100.53438
3     NA 298.16776

Why is this command not working? It gives me NA instead of fruit.

Well, `food` is of factor type, convert it into character and then run your code. — Ronak Shah, May 04 '17 at 09:26

score 12 · Accepted Answer · edited Mar 07 '19 at 04:47

It works for me, so I think your food column is stored as a factor.

Set stringsAsFactors = FALSE when creating the data frame, as below, or run options(stringsAsFactors = FALSE) in your R session to avoid the problem:

df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100),stringsAsFactors = FALSE)

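library(dplyr)
library(stringr)

df %>%
  # "^fruit" anchors the match so only values that start with "fruit" are replaced
  mutate(food = replace(food, str_detect(food, "^fruit"), "fruit")) %>% 
  group_by(food) %>% 
  summarise(sold = sum(sold))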

Output:

 # A tibble: 3 × 2
       food      sold
      <chr>     <dbl>
    1 bread  99.67661
    2 fruit 300.28520
    3  meat  99.88566
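If the data frame already exists with food as a factor, the conversion suggested in the comment above can be done inside the pipeline. This is a minimal sketch, assuming dplyr and stringr are installed; the data frame df_fct and its sold values are made up for illustration and are not the data from the question.

```r
library(dplyr)
library(stringr)

# Hypothetical data: food is deliberately stored as a factor, as in the question
df_fct <- data.frame(
  food = factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat")),
  sold = c(10, 20, 30, 40, 50)  # made-up values, chosen so the sums are easy to check
)

result <- df_fct %>%
  mutate(food = as.character(food)) %>%                               # factor -> character
  mutate(food = replace(food, str_detect(food, "^fruit"), "fruit")) %>%
  group_by(food) %>%
  summarise(sold = sum(sold))

result
```

With character values, replace() can assign any string, so the three fruit rows collapse into one group.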

score 3 · Answer 2 · answered May 04 '17 at 09:33

We can do this in base R without converting to character: rename every level that contains "fruit" to "fruit", then use aggregate to get the sum.

levels(df$food)[grepl("fruit", levels(df$food))] <- "fruit"
aggregate(sold~food, df, sum)
#   food      sold
#1 bread  99.41637
#2 fruit 300.41033
#3  meat 100.84746

data

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", 
                 "bread", "meat"), sold = rnorm(5, 100))
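The level assignment works because R merges levels that end up with the same name. A small base R sketch of that behaviour, using a made-up factor f rather than the data above:

```r
# Hypothetical factor for illustration
f <- factor(c("fruit banana", "fruit apple", "bread"))
levels(f)            # levels are sorted: bread, fruit apple, fruit banana

# Renaming both fruit levels to the same name merges them into a single level
levels(f)[grepl("^fruit", levels(f))] <- "fruit"

levels(f)
as.character(f)
```

After the assignment f has only the levels bread and fruit, and both fruit rows share one level, so aggregate() treats them as one group.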

score 2 · Answer 3 · answered May 04 '17 at 12:45

Although the Q is tagged with dplyr and stringr I would like to propose an alternative solution using data.table because data.table deals with factors in a convenient and straightforward way:

library(data.table)
setDT(df)[food %like% "^fruit", food := "fruit"][, .(sold = sum(sold)), by = food]
#    food      sold
#1: fruit 300.41033
#2: bread  99.41637
#3:  meat 100.84746

Data

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), 
                 sold = rnorm(5, 100))

score 2 · Answer 4 · edited Jun 20 '20 at 09:12

Here are two alternative solutions which use forcats, stringr, and regular expressions to directly manipulate factor levels.

If I understand correctly, the issue was caused by food being a factor which is not handled appropriately by replace().

1. `fct_collapse()`

The fct_collapse() function is used to collapse all factor levels which start with "fruit " (note the trailing blank) into factor level "fruit":

library(dplyr)
library(stringr)
library(forcats)
df %>%
  group_by(food = fct_collapse(food, fruit = levels(food) %>% str_subset("^fruit "))) %>% 
  summarise(sold = sum(sold))

  food         sold
  <fct>       <dbl>
1 bread        99.4
2 egg fruits  100. 
3 fruit       300. 
4 fruity wine 100. 
5 meat        101.

Note that an enhanced sample data set is used which includes edge cases to better test the regular expression. Furthermore, the grouping variable is computed directly in group_by(), which avoids a separate mutate() call beforehand.

2. `str_replace()` with look-behind

There is an even shorter solution which uses str_replace() instead of replace() together with a more sophisticated regular expression. The regular expression uses a look-behind to delete all characters after the leading "fruit" (including the blank which follows "fruit"):

df %>%
  group_by(food = str_replace(food, "(?<=^fruit)( .*)", "")) %>% 
  summarise(sold = sum(sold))

The result is the same as above.
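To see what the look-behind does on its own, the pattern can be applied to a plain character vector. This is a small sketch, assuming stringr is installed; the vector x is a hypothetical subset of the enhanced sample data.

```r
library(stringr)

# Hypothetical values, including the two edge cases from the enhanced data
x <- c("fruit banana", "bread", "egg fruits", "fruity wine")

# (?<=^fruit) requires "fruit" at the very start of the string, and the
# group ( .*) then matches the following blank and everything after it
cleaned <- str_replace(x, "(?<=^fruit)( .*)", "")
cleaned
```

Only "fruit banana" is shortened to "fruit". "egg fruits" does not start with "fruit", and in "fruity wine" the leading "fruit" is followed by "y" rather than a blank, so both stay unchanged.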

Enhanced data sample set

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", 
                          "meat", "egg fruits", "fruity wine"), 
                 sold = rnorm(7, 100))
df

          food      sold
1 fruit banana  99.45412
2  fruit apple 100.53659
3  fruit grape 100.41962
4        bread  99.41637
5         meat 100.84746
6   egg fruits 100.26602
7  fruity wine 100.44459

Wolfgang · Answer 5 · 2017-05-04T09:36:08.807

replace does not work as intended, because column food is a factor variable and fruit is an unknown level.
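The behaviour can be reproduced in base R with a small made-up factor (f below is hypothetical, not the data from the question). This is only a sketch of the mechanism:

```r
# Hypothetical factor for illustration
f <- factor(c("fruit banana", "bread"))

# "fruit" is not an existing level, so R warns
# ("invalid factor level, NA generated") and stores NA instead
g <- suppressWarnings(replace(f, 1, "fruit"))
is.na(g[1])

# Once "fruit" is added to the levels, the same replacement succeeds
levels(f) <- c(levels(f), "fruit")
h <- replace(f, 1, "fruit")
as.character(h)
```

This is why the grouped summary in the question shows NA where "fruit" was expected.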

One possible solution is to define the data frame column food with the correct factor levels, including "fruit":

df <- data.frame(food = 
  factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), 
    levels =c("fruit banana", "fruit apple", "fruit grape", "bread", "meat", "fruit") ), 
    sold = rnorm(5, 100))

It would of course be easier to set stringsAsFactors = FALSE:

df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
             sold = rnorm(5, 100), 
             stringsAsFactors = FALSE)
