I am currently using the unnest_tokens() function from the tidytext package. It works exactly as I need it to, except that it removes ampersands (&) from the text. I would like it to keep them but leave everything else unchanged.
For example:
library(tidyverse)
library(tidytext)
d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")
d %>% unnest_tokens(word, txt, token="words")
currently returns
# A tibble: 11 x 1
   word
   <chr>
 1 let's
 2 go
 3 to
 4 the
 5 q
 6 a
 7 about
 8 b
 9 b
10 it's
11 great
but I'd like it to return
# A tibble: 9 x 1
  word
  <chr>
1 let's
2 go
3 to
4 the
5 q&a
6 about
7 b&b
8 it's
9 great
Is there a way to pass an option to unnest_tokens() to do this, or to supply the regex it currently uses, manually adjusted so that it does not split on the ampersand?
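To make the second idea concrete, here is a rough sketch of the kind of thing I have in mind, using token = "regex" with a hand-written split pattern. The pattern and the names split_pattern and out are my own assumptions; this is not the rule that token = "words" applies internally, so it may well behave differently on other text.

```r
library(tibble)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")

# Assumption: split on runs of anything that is not a letter, a digit,
# an apostrophe, or an ampersand. This is a hand-written pattern and is
# not taken from the tokenizer behind token = "words".
split_pattern <- "[^\\p{L}\\p{N}'\\&]+"

# With token = "regex", the pattern describes the separators between tokens;
# lowercasing is left at the unnest_tokens() default.
out <- unnest_tokens(d, word, txt, token = "regex", pattern = split_pattern)
print(out)
```

This keeps q&a and b&b together for the example sentence, but because it replaces the word tokenizer rather than adjusting it, things like hyphenated words or decimal numbers would presumably be split differently than they are now, which is why I would prefer an option on the existing tokenizer if one exists.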