Adding RegEx to specify character ngrams for a corpus in R

Question

I'm having trouble using a RegEx on a corpus.

I read in a couple of text documents that I converted to a corpus. I want to display it in a TermDocumentMatrix after some pre-processing.

First, I want to restrict the tokens with the RegEx "(\b([a-z]*)\B)". For example, "the host" -> "th", "hos".

Then I want to use character n-grams with n = 1:3, so the previous example becomes "t", "th", "h", "ho", "hos". In other words, I want every character sequence that starts at the beginning of a word but does not include the word's last character.
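
To make the target output concrete, here is a minimal base R sketch of the transformation described above. It does not use quanteda or the RegEx from the question; `typed_prefixes` is a hypothetical helper name, and the sketch assumes plain a-z words.

```r
# Hypothetical helper (not part of quanteda or tm): word-initial character
# n-grams that never include the last character of the word.
typed_prefixes <- function(text, n = 1:3) {
  text <- tolower(text)
  # Extract runs of letters; assumes plain a-z input as in the question.
  words <- regmatches(text, gregexpr("[a-z]+", text))[[1]]
  out <- lapply(words, function(w) {
    # Keep only prefix lengths that leave at least the final character off.
    len <- n[n < nchar(w)]
    vapply(len, function(k) substr(w, 1, k), character(1))
  })
  as.character(unlist(out))
}

typed_prefixes("the host")
# "t" "th" "h" "ho" "hos"
```

Because the prefix length is capped at `nchar(w) - 1`, one-letter words produce no output at all, which matches the "do not include the last character" requirement.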

My code so far gives me a TermDocumentMatrix with n = 1:3 on the whole corpus. However, none of my attempts to add the RegEx have worked.

I was wondering if there's a way to include it in the call: typedPrefix <- tokens()...

Here's the code:

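  # packages used below (tm provides as.TermDocumentMatrix and weightTf)
  library(readtext)
  library(quanteda)
  library(tm)

  # read documents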
  FILEDIR <- (path)
  txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
  my_corpus <- corpus(txts)

  #start processing 
  typedPrefix <- my_corpus 
  typedPrefix <- tokens(gsub("\\s", "_", typedPrefix), "character", ngrams=1:3, conc="", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
  dfm2 <- dfm(typedPrefix)
  tdm2 <- as.TermDocumentMatrix(t(dfm2), weighting=weightTf)
  as.matrix(tdm2)

  #write output file 
  write.csv2(as.matrix(tdm2), file = "typedPrefix.csv")
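
One possible way to get from word prefixes to a term-document count matrix, sketched in base R only. This is not the quanteda/tm pipeline above: `docs` is a hypothetical stand-in for the corpus texts, and `typed_prefixes` is a hypothetical helper, not a library function.

```r
# Hypothetical stand-in for the corpus: a named character vector of documents.
docs <- c(doc1 = "the host", doc2 = "the hat that")

# Hypothetical helper: word-initial n-grams that stop before the last letter.
typed_prefixes <- function(text, n = 1:3) {
  text <- tolower(text)
  words <- regmatches(text, gregexpr("[a-z]+", text))[[1]]
  out <- lapply(words, function(w) {
    len <- n[n < nchar(w)]
    vapply(len, function(k) substr(w, 1, k), character(1))
  })
  as.character(unlist(out))
}

# One vector of prefixes per document, then the shared vocabulary.
prefix_list <- lapply(docs, typed_prefixes)
terms <- sort(unique(unlist(prefix_list)))

# Count each term per document; rows are terms and columns are documents,
# the same orientation as a TermDocumentMatrix.
tdm <- vapply(
  prefix_list,
  function(p) as.integer(table(factor(p, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms

tdm
```

The resulting plain matrix can be written out with `write.csv2()` just like the matrix in the code above. If the quanteda pipeline is preferred, `prefix_list` could presumably be passed to `as.tokens()` and then to `dfm()`, but that is untested here.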

So are you looking for a regex that given "the host" will give you back "t" "th", "h", "ho", "hos"? — Francesco B., Mar 18 '18 at 13:56
Yes! If I'm not mistaken, the RegEx should be "(\b([a-z]*)\B)", taking " the host " just as an example. So only the first n characters of the word, where the word is n+1 characters long. (\b marks the beginning and \B the end of words, if I'm not mistaken.) However, my main problem is how to apply a RegEx to the corpus or DocumentTermMatrix in general. — J.B., Mar 18 '18 at 14:19
I tested `(\b([a-z]*)\B)` [here](https://regex101.com/r/shMVLx/1) but it seems to return only the entire word deprived of the last letter... — Francesco B., Mar 18 '18 at 14:28
I thought I could get the other versions using ngrams=1:3. I see the question can be confusing regarding this. I'll edit it. Thanks! — J.B., Mar 18 '18 at 14:37
@neznidalibor does it select all groups (1, 2, 3 letters...) or just the one he had already done? — Francesco B., Mar 18 '18 at 18:50
