Tokenization in R tidytext, leaving in ampersands

Question

I am currently using the unnest_tokens() function from the tidytext package. It works exactly as I need it to, except that it removes ampersands (&) from the text. I would like it to keep them while leaving everything else unchanged.

For example:

library(tidyverse)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")
d %>% unnest_tokens(word, txt, token="words")

currently returns

# A tibble: 11 x 1
   word 
   <chr>
 1 let's
 2 go   
 3 to   
 4 the  
 5 q    
 6 a    
 7 about
 8 b    
 9 b    
10 it's 
11 great

but I'd like it to return

# A tibble: 9 x 1
  word 
  <chr>
1 let's
2 go   
3 to   
4 the  
5 q&a
6 about
7 b&b
8 it's
9 great

Is there a way to send an option to unnest_tokens() to do this, or send in the regex that it currently uses and manually adjust it to not include the ampersand?

akrun · Accepted Answer · 2020-04-21T20:33:31.240


We can set token = "regex" and pass a pattern listing the characters to split on, leaving & out of it:

library(tidytext)
library(dplyr)
d %>% 
   unnest_tokens(word, txt, token="regex", pattern = "[\\s!,.]")
# A tibble: 9 x 1
#  word 
#  <chr>
#1 let's
#2 go   
#3 to   
#4 the  
#5 q&a  
#6 about
#7 b&b  
#8 it's 
#9 great

edited Apr 21 '20 at 20:33

answered Apr 21 '20 at 20:01


This works, but it will leave in punctuation as well (for example, if we added another sentence, it would carry along the period). The punctuation removal for token="words" was quite good. Do you think my best bet is to use token="regex" with a pattern along the lines of "[\\s,.]"? – RayVelcoro Apr 21 '20 at 20:24
@RayVelcoro can you please update your post with that new case so that I can test it – akrun Apr 21 '20 at 20:25
@RayVelcoro that seems to work `unnest_tokens(word, txt, token="regex", pattern = "[ ,.]")`, but it may require some more test cases – akrun Apr 21 '20 at 20:26
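
Following up on the comments above: rather than listing separators one by one, a possible variation is to split on any run of whitespace or Unicode punctuation while exempting & and the apostrophe. This is only a sketch, not the method from the answer. The names d2 and split_pattern and the two-sentence input are made up for illustration, and it assumes that token = "regex" hands the pattern to an ICU (stringi) regex engine, where \p{P} and lookaheads are available.

```r
library(dplyr)
library(tidytext)

# Hypothetical input with a period, a semicolon and parentheses,
# to exercise the extra-punctuation case raised in the comments.
d2 <- tibble(txt = "Let's go to the Q&A about B&B. It's great; really (truly) great!")

# Separator = one or more characters that are whitespace or Unicode
# punctuation (\p{P}); the negative lookahead (?![&']) stops '&' and the
# ASCII apostrophe from ever being treated as separators.
split_pattern <- "(?:(?![&'])[\\s\\p{P}])+"

tokens <- d2 %>%
  unnest_tokens(word, txt, token = "regex", pattern = split_pattern)

tokens$word
```

With this input, q&a, b&b, let's and it's should come out as single tokens, with the period, semicolon, parentheses and exclamation mark dropped. Some caveats: this is not identical to token = "words". Any punctuation inside a token other than & and ' becomes a split point, so a number such as 3.14 would be split at the decimal point, a typographic apostrophe (’) would split a contraction, and symbols outside \p{P} (for example $ or +) stay attached to their neighbours. It may need more test cases, as noted above.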
