By way of example, see the extraction of Twitter handles below. The goal is a character vector shaped like tweets, where each element holds only the handles from the corresponding tweet, separated by commas. str_extract_all yields empty vectors when no matches are found, and that threw some unexpected errors further down the track.

library(purrr)
library(stringr)

tweets <- c(
  "",
  "This tweet has no handles",
  "This is a tweet for @you",
  "This is another tweet for @you and @me",
  "This, @bla, is another tweet for @me and @you"
)


mention_rx <- "@\\w+"

This was my first attempt:

map_chr(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#> Error: Result 1 must be a single string, not a character vector of length 0

Then I played around with things:

mentions <- map(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))

mentions
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> [1] "@you"
#> 
#> [[4]]
#> [1] "@you, @me"
#> 
#> [[5]]
#> [1] "@bla, @me, @you"

as.character(mentions)
#> [1] "character(0)"    "character(0)"    "@you"            "@you, @me"      
#> [5] "@bla, @me, @you"

Until it dawned on me that paste could also be used here:

map_chr(tweets, ~paste(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))
#> [1] ""                ""                "@you"            "@you, @me"       "@bla, @me, @you"

My questions are:

  • Is there a more elegant way of getting there?
  • Why doesn't str_c behave the same as paste with an identical collapse argument?
  • Why don't as.character and map_chr recognise a character vector of length zero as equivalent to an empty string but paste does?

I found some good references on str(i)_c, paste, and the difference between them; but none of these addressed the situation with empty strings.

Fons MA

1 Answer

You don't need to map over tweets; str_extract_all is vectorised over its input:

library(stringr)
str_extract_all(tweets, mention_rx)

#[[1]]
#character(0)

#[[2]]
#character(0)

#[[3]]
#[1] "@you"

#[[4]]
#[1] "@you" "@me" 

#[[5]]
#[1] "@bla" "@me"  "@you"

If you need one comma-separated string per tweet, you can use map_chr with toString:

purrr::map_chr(str_extract_all(tweets, mention_rx), toString)
#[1] ""    ""      "@you"     "@you, @me"      "@bla, @me, @you"
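
For comparison, the same result can be sketched in base R with no packages loaded. This is only an illustration and assumes the same tweets and mention_rx as in the question (they are redefined here so the snippet runs on its own); the name handles is made up for the example.

```r
# Same inputs as in the question, repeated so the snippet is self-contained
tweets <- c(
  "",
  "This tweet has no handles",
  "This is a tweet for @you",
  "This is another tweet for @you and @me",
  "This, @bla, is another tweet for @me and @you"
)
mention_rx <- "@\\w+"

# gregexpr() finds every match per element; regmatches() pulls them out
# as a list with one character vector per tweet (character(0) if no match)
matches <- regmatches(tweets, gregexpr(mention_rx, tweets))

# toString() collapses each vector with ", "; vapply() insists on exactly
# one string per element, much like map_chr() does
handles <- vapply(matches, toString, character(1))
handles
```

toString is a thin wrapper around paste(x, collapse = ", "), which is why tweets without handles come back as "" rather than as a zero-length vector.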

To answer the "why" questions, we can look at the documentation of the paste and str_c functions.

From ?paste

Vector arguments are recycled as needed, with zero-length arguments being recycled to "".

From ?str_c

Zero length arguments are removed.

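Hence, str_c drops the zero-length result of str_extract_all and returns a zero-length character vector (character(0)), not an empty string. map_chr fails on that because it requires every result to be a single string, whereas map works because it returns a list: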

map(tweets, ~str_c(str_extract_all(.x, mention_rx)[[1]], collapse = ", "))

#[[1]]
#character(0)

#[[2]]
#character(0)

#[[3]]
#[1] "@you"

#[[4]]
#[1] "@you, @me"

#[[5]]
#[1] "@bla, @me, @you"
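
The distinction between a zero-length vector and an empty string can be checked with base R alone. The snippet below is a small sketch rather than part of the original explanation; the variable names are made up for illustration, and vapply stands in for map_chr as a base function that also demands exactly one string per element.

```r
empty <- character(0)

# A zero-length vector has no elements; "" is one element that is empty
n_empty  <- length(empty)   # 0
n_string <- length("")      # 1

# paste() with collapse always returns a single string, so zero-length
# input becomes ""; toString() behaves the same way
joined   <- paste(empty, collapse = ", ")
joined_2 <- toString(empty)

# as.character() on a list deparses any element that is not a length-one
# character vector, which is where the literal "character(0)" comes from
deparsed <- as.character(list(empty, "@you"))

# vapply() with character(1) rejects a zero-length result outright
strict <- tryCatch(
  vapply(list(empty), identity, character(1)),
  error = function(e) "error: result is not length 1"
)
```

So as.character is not converting character(0) to a string value at all; it is printing the code that would recreate it. That also covers the third question: only paste (and toString) define a single-string result for zero-length input.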
Ronak Shah
  • Hey Ronak, thanks so much! I had been working with unequal numbers of patterns and strings so I got too used to this mapping over `stringr` functions... Not sure if it's justified even in that case. Do you have any inkling as to the possible answers for the "why" questions? – Fons MA Nov 06 '19 at 03:45
  • @FonsMA Updated the answer with some explanation to it. Hope it is helpful. – Ronak Shah Nov 06 '19 at 03:55