
I am trying to work out the frequency of phrases made up of one to eight words. I have been reading about text mining for phrases here and elsewhere, and found that n-gram tokenization seems to be the best way to go.

However, when I copy and paste text from a .txt file, I get an unidentified symbol error on multiple lines. Is it possible to use the `readLines` function in place of X in an n-gram tokenizer call? E.g.:

`Bigram_Tokenizer<-function(X(readLines(file.choose())(Ngram_tokenizer(X(readLines(file.choose(),WekaControl(min=#,max=#)`, following the example given by tomkauffman at GitHubGist (1)?

When I copy the `readLines` printout into the code, I get 'unexpected [ in ['. Do I need to include the same text in both "X" entries?

Thank you, Ben M.

  • `readLines` is just a function for reading the lines of a file, not for tokenizing. Read in your text first, then use a tokenizer like `tokenizers::tokenize_ngrams` (accessible through `tidytext::unnest_tokens`, if you like). – alistaire May 16 '18 at 02:51
  • Thank you alistaire, but do you mean I should use a `read text` or `read file` function and then try the `tokenizers::tokenize_ngrams` or `tidytext` functions? – Benjamin Mehrtens May 16 '18 at 05:45
  • Yes, use `readLines` or the like to import your text into R, and then a tokenizer to tokenize. – alistaire May 16 '18 at 06:52
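
The two steps suggested in the comments (import with `readLines`, then tokenize) might look roughly like the sketch below. It assumes the `tokenizers` package is installed. The temporary file and its two sample lines are made up for illustration and stand in for the file you would pick with `file.choose()`. Counting with `table` is one simple option, not the only one.

```r
library(tokenizers)  # assumes install.packages("tokenizers") has been run

# Hypothetical stand-in for the real file: two sample lines in a temporary .txt.
# With real data, use path <- file.choose() instead.
path <- tempfile(fileext = ".txt")
writeLines(c("the cat sat on the mat", "the cat ran"), path)

# Step 1: import the text; readLines returns one string per line of the file.
lines <- readLines(path)

# Step 2: tokenize each line into n-grams of one to eight words.
# The text is lowercased by default.
ngrams <- tokenize_ngrams(lines, n = 8, n_min = 1)

# Step 3: count how often each phrase occurs across all lines.
freq <- sort(table(unlist(ngrams)), decreasing = TRUE)
head(freq)
```

Each line is tokenized separately here, so no phrase spans a line break. If phrases should run across lines, collapsing the text first with `paste(lines, collapse = " ")` is one way to do it.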
