I'm working with text that contains character combinations such as "3/8" and "5/8", which refer to sizes of things, and I'm building bigrams to help analyze it. I'd like to keep the "/" character, but unnest_tokens() removes it and I haven't found a way to prevent that. Here is an example:

library(tidyverse)
library(tidytext)

tibble(text="My example is 3/8 pipe and 5/8 wrench") %>%
  unnest_tokens(bigrams,text,token="ngrams",n=2)

Here is the output:

# A tibble: 9 x 1
  bigrams   
  <chr>     
1 my example
2 example is
3 is 3      
4 3 8       
5 8 pipe    
6 pipe and  
7 and 5     
8 5 8       
9 8 wrench 

Thank you for your input.

Edit: I've found one way around this, but it is crude, and I would love to hear more elegant solutions.

library(tidyverse)
library(tidytext)
library(stringr)

tibble(text="My example is 3/8 pipe and 5/8 wrench") %>%
  mutate(text=str_replace_all(text,"\\/","forwardslash")) %>%
  unnest_tokens(bigrams,text,token="ngrams",n=2) %>%
  mutate(bigrams=str_replace_all(bigrams,"forwardslash","/"))

Output:

# A tibble: 7 x 1
  bigrams   
  <chr>     
1 my example
2 example is
3 is 3/8    
4 3/8 pipe  
5 pipe and  
6 and 5/8   
7 5/8 wrench
Nickerbocker
  • This is probably how I would approach this, if the only kind of punctuation I wanted to keep was this particular pattern. – Julia Silge Oct 15 '21 at 18:45

1 Answer

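We may also use chartr() for the replacement: swap "/" for a character the tokenizer keeps (here "_") before tokenizing, then swap it back afterwards. Note that any underscores already present in the text would also come back as slashes.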

library(dplyr)     # provides tibble(), mutate(), and %>%
library(tidytext)
tibble(text="My example is 3/8 pipe and 5/8 wrench") %>%
   mutate(text = chartr("/", "_", text)) %>% 
   unnest_tokens(bigrams, text, token = "ngrams",  n = 2) %>% 
   mutate(bigrams = chartr("_", "/", bigrams))

Output:

# A tibble: 7 × 1
  bigrams   
  <chr>     
1 my example
2 example is
3 is 3/8    
4 3/8 pipe  
5 pipe and  
6 and 5/8   
7 5/8 wrench
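
If the placeholder round trip feels fragile, another option is to build the bigrams yourself by splitting on whitespace only. The following is a minimal base R sketch; `slash_bigrams` is a hypothetical helper name, not part of tidytext. It assumes whitespace-delimited words are what you want, so other punctuation (commas, periods) stays attached to its word as well.

```r
# Hypothetical helper (not a tidytext function): whitespace-delimited bigrams.
# Takes a character vector and returns a list with one character vector of
# bigrams per input string. Characters such as "/" stay inside their token.
slash_bigrams <- function(x) {
  lapply(x, function(s) {
    # Lowercase, trim, and split on runs of whitespace only
    words <- strsplit(tolower(trimws(s)), "[[:space:]]+")[[1]]
    words <- words[nzchar(words)]
    # Fewer than two words means there is no bigram to form
    if (length(words) < 2) return(character(0))
    # Pair each word with the word that follows it
    paste(head(words, -1), tail(words, -1))
  })
}

slash_bigrams("My example is 3/8 pipe and 5/8 wrench")[[1]]
```

The unnest_tokens() documentation says `token` may be a custom function that takes a character vector and returns a list of character vectors of the same length, so `unnest_tokens(bigrams, text, token = slash_bigrams)` should slot into the pipeline above; treat that as an assumption to check against your installed tidytext version.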
akrun