I'm having trouble using a RegEx on a corpus.

I read in a couple of text documents that I converted to a corpus. I want to display it in a TermDocumentMatrix after some pre-processing.

First I want to restrict the terms with the RegEx `(\b([a-z]*)\B)`, i.e. each word without its last character. For example, "the host" -> "th", "hos".

Then I want to use character n-grams with n = 1:3 that start at the beginning of each word, so the previous example gives "t", "th", "h", "ho", "hos". Hence I want all character sequences that start a word but do not include its last character.
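To make the target output concrete, here is a minimal base R sketch of the two steps described above. It does not use quanteda; `typed_prefixes` is a hypothetical helper name, and the pattern uses `[a-z]+` instead of `[a-z]*` (an assumption, chosen so that a match is never empty).

```r
# Hypothetical helper (not part of quanteda): for every lowercase word,
# return its leading character n-grams that stop before the last character.
typed_prefixes <- function(text, n = 1:3) {
  # Each match is a word without its final character, e.g. "the" -> "th".
  stems <- regmatches(text, gregexpr("\\b[a-z]+\\B", text, perl = TRUE))[[1]]
  prefixes <- lapply(stems, function(stem) {
    # Keep only the n-gram sizes that fit inside this stem.
    sizes <- n[n <= nchar(stem)]
    vapply(sizes, function(k) substr(stem, 1, k), character(1))
  })
  # as.character() turns the NULL from an empty input into character(0).
  as.character(unlist(prefixes))
}

typed_prefixes("the host")
# [1] "t"   "th"  "h"   "ho"  "hos"
```

Single-letter words produce no prefixes here, because nothing is left once the last character is dropped.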

My code so far gives me a TermDocumentMatrix with n = 1:3 over the whole corpus. However, none of my attempts to add the RegEx have worked so far.

I was wondering if there's a way to include it in: typedPrefix <- tokens()...

Here's the code:

# read documents
FILEDIR <- (path)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
my_corpus <- corpus(txts)

# start processing
typedPrefix <- my_corpus
typedPrefix <- tokens(gsub("\\s", "_", typedPrefix), what = "character",
                      ngrams = 1:3, concatenator = "",
                      remove_punct = TRUE, remove_numbers = TRUE,
                      remove_symbols = TRUE)
dfm2 <- dfm(typedPrefix)
tdm2 <- as.TermDocumentMatrix(t(dfm2), weighting = weightTf)
as.matrix(tdm2)

# write output file
write.csv2(as.matrix(tdm2), file = "typedPrefix.csv")
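
For comparison, this is a rough base R sketch of the kind of term-document matrix I am after, built without quanteda or tm. The two documents are made-up examples, and the pattern `[a-z]+` and the fixed maximum prefix length of 3 are assumptions for illustration only.

```r
# Made-up example documents standing in for the real corpus.
docs <- c(doc1 = "the host", doc2 = "the hotel")

# For each document: words without their last character, then the
# prefixes of length 1 to 3 of each of those stems.
prefixes_per_doc <- lapply(docs, function(text) {
  stems <- regmatches(text, gregexpr("\\b[a-z]+\\B", text, perl = TRUE))[[1]]
  unlist(lapply(stems, function(stem) {
    vapply(seq_len(min(3, nchar(stem))),
           function(k) substr(stem, 1, k), character(1))
  }))
})

# Long format (one row per prefix occurrence), then cross-tabulate
# terms against documents to get plain term frequencies.
long <- data.frame(
  term = unlist(prefixes_per_doc, use.names = FALSE),
  doc  = rep(names(prefixes_per_doc), lengths(prefixes_per_doc))
)
tdm <- unclass(table(long$term, long$doc))
tdm
```

The result is an ordinary matrix with prefixes as rows and documents as columns, so it could be passed to `write.csv2()` in the same way as `as.matrix(tdm2)` above.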
J.B.
  • So are you looking for a regex that given "the host" will give you back "t" "th", "h", "ho", "hos"? – Francesco B. Mar 18 '18 at 13:56
  • Yes! If I'm not mistaken, the RegEx should be "(\b([a-z]*)\B)", since it's " the host " (just as an example). So only the first n characters of the word, and the word should be n+1 long. (\b marks the beginning and \B the end of words, if I'm not mistaken.) However, my main problem is applying a RegEx in general to the corpus or DocumentTermMatrix. – J.B. Mar 18 '18 at 14:19
  • I tested `(\b([a-z]*)\B)` [here](https://regex101.com/r/shMVLx/1) but it seems to return only the entire word deprived of the last letter... – Francesco B. Mar 18 '18 at 14:28
  • I thought I could get the other versions using ngrams=1:3. I see the question can be confusing regarding this. I'll edit it. Thanks! – J.B. Mar 18 '18 at 14:37
  • `(\b(([a-z])*)\B)` seems to do the trick. – neznidalibor Mar 18 '18 at 15:52
  • @neznidalibor do you have a demo to test it? – Francesco B. Mar 18 '18 at 15:57
  • @FrancescoB. tested on your example on regex101 above. – neznidalibor Mar 18 '18 at 18:29
  • @neznidalibor does it select all groups (1, 2, 3 letters...) or just the one he had already done? – Francesco B. Mar 18 '18 at 18:50

0 Answers