
I am currently using the unnest_tokens() function from the tidytext package. It works exactly as I need it to, except that it splits on ampersands (&) and drops them from the text. I would like it to keep ampersands inside words but leave everything else unchanged.

For example:

library(tidyverse)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")
d %>% unnest_tokens(word, txt, token="words")

currently returns

# A tibble: 11 x 1
   word 
   <chr>
 1 let's
 2 go   
 3 to   
 4 the  
 5 q    
 6 a    
 7 about
 8 b    
 9 b    
10 it's 
11 great

but I'd like it to return

# A tibble: 9 x 1
  word 
  <chr>
1 let's
2 go   
3 to   
4 the  
5 q&a       
6 about
7 b&b
8 it's
9 great    

Is there a way to pass an option to unnest_tokens() to do this, or to supply the regex it currently uses, adjusted manually so that it no longer splits on the ampersand?

RayVelcoro

1 Answer

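We can set token = "regex" and pass a pattern listing the characters to split on (here whitespace, "!", "," and "."), so that "&" is no longer treated as a separator: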

library(tidytext)
library(dplyr)
d %>% 
   unnest_tokens(word, txt, token="regex", pattern = "[\\s!,.]")
# A tibble: 9 x 1
#  word 
#  <chr>
#1 let's
#2 go   
#3 to   
#4 the  
#5 q&a  
#6 about
#7 b&b  
#8 it's 
#9 great
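
If listing every separator by hand becomes tedious, one possible variation is to negate the character class instead: split on any run of characters that is not a letter, a digit, an apostrophe, or an ampersand. This is only a sketch under that assumption. It is not the rule that token = "words" uses internally, so results may differ on other inputs (for example, hyphenated words or curly apostrophes). The name tokens_regex is just illustrative.

```r
library(dplyr)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")

# Split on runs of anything that is NOT a letter (\p{L}), a digit (\p{N}),
# a straight apostrophe, or an ampersand. This is an assumed approximation
# of the default word tokenizer, not its actual rule.
tokens_regex <- d %>%
  unnest_tokens(word, txt, token = "regex", pattern = "[^\\p{L}\\p{N}'\\&]+")

tokens_regex
```

On the example sentence this should give the same nine tokens as above, and a sentence-final period would be dropped as well, since it falls outside the kept characters.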
akrun
  • This works, but it will leave in punctuation as well (for example if we added in another sentence, it would carry along the period). The punctuation removal for token="words" was quite good. Do you think my best bet is to send through token="regex" with pattern along the lines of = "[\\s,.]"? – RayVelcoro Apr 21 '20 at 20:24
  • @RayVelcoro can you please update your post with that new case so that I can test it – akrun Apr 21 '20 at 20:25
  • @RayVelcoro that seems to work `unnest_tokens(word, txt, token="regex", pattern = "[ ,.]")`, but it may require some more test cases – akrun Apr 21 '20 at 20:26
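
Following up on the punctuation concern in the comments: if the goal is to keep the punctuation handling of token = "words" exactly as it is, another option might be to hide the ampersands before tokenizing and restore them afterwards. The sketch below assumes stringr is available and uses a made-up, letters-only placeholder ("xampersandx") that must not occur anywhere in the real text. The names placeholder and tokens_words are hypothetical.

```r
library(dplyr)
library(stringr)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")

# Hypothetical placeholder: lowercase letters only, so the word tokenizer
# keeps it inside the surrounding word and lowercasing does not alter it.
# Assumed never to appear in the actual text.
placeholder <- "xampersandx"

tokens_words <- d %>%
  # 1. Hide every "&" behind the placeholder
  mutate(txt = str_replace_all(txt, fixed("&"), placeholder)) %>%
  # 2. Tokenize with the default word tokenizer, unchanged
  unnest_tokens(word, txt, token = "words") %>%
  # 3. Put the ampersands back
  mutate(word = str_replace_all(word, fixed(placeholder), "&"))

tokens_words
```

The trade-off is an extra pass over the text and the need to pick a placeholder that is safe for the corpus, in exchange for leaving the default tokenization rules untouched.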