2

I have a data column of the following format:

Text

Hello world  
Hello  
How are you today  
I love stackoverflow  
blah blah blahdy  

I would like to compute the 3-grams for each row in this dataset by perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the ngrams for the entire column. How can I apply this function to each observation in my data separately?
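The answers below refer to a character vector named `Text`; a minimal way to construct it from the sample rows (my assumption about the layout, not part of the original question):

```r
# Hypothetical construction of the example column as a character vector,
# one element per row; the answers refer to it as 'Text'
Text <- c("Hello world",
          "Hello",
          "How are you today",
          "I love stackoverflow",
          "blah blah blahdy")
length(Text)  # → 5, one element per row
```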

Arun
Brian
  • @TylerRinker Thank you Tyler. However, sapply didn't work. I used it like this: > trigram_title <- sapply(eta_dedup$title, textcnt(eta_dedup$title, method = "ngram")) Error in match.fun(FUN) : 'textcnt(eta_dedup$title, method = "ngram")' is not a function, character or symbol – Brian Jul 09 '13 at 19:01
  • 1
    It's better to *show what you did* rather than mentioning it. – Arun Jul 09 '13 at 20:20
  • May be this post helps you... http://stackoverflow.com/questions/37291984/find-the-most-frequently-occuring-words-in-a-text-in-r/37292306#37292306 – Manoj Kumar Jun 26 '16 at 08:07

4 Answers

6

Is this what you're after?

library("RWeka")
library("tm")

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' object here
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)), 
                          control = list(tokenize = TrigramTokenizer))

inspect(tdm)

A term-document matrix (4 terms, 5 documents)

Non-/sparse entries: 4/16
Sparsity           : 80%
Maximal term length: 20 
Weighting          : term frequency (tf)

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0
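Since the follow-up comments mention computing similarity between strings, here is a hedged sketch (not from the original answer) of cosine similarity between document columns. It uses a plain matrix shaped like the `inspect()` output above; with the real object you would start from `m <- as.matrix(tdm)`:

```r
# Stand-in for as.matrix(tdm): rows = trigrams, columns = documents 1..5
m <- matrix(c(0, 0, 1, 0, 0,
              0, 0, 0, 0, 1,
              0, 0, 1, 0, 0,
              0, 0, 0, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("are you today", "blah blah blahdy",
                              "how are you", "i love stackoverflow"),
                            as.character(1:5)))

# Cosine similarity between two count vectors
# (note: documents with no trigrams give 0/0 = NaN)
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_sim(m[, 3], m[, 3])  # → 1, identical documents
cosine_sim(m[, 3], m[, 4])  # → 0, no shared trigrams
```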
Ben
  • Thanks Ben. This allows me to easily compute token similarity between strings – Brian Jul 09 '13 at 20:39
  • I got the following error when trying the "tdm <-" line: Error in .jnew(name) : java.lang.ClassNotFoundException – Max Ghenis Jan 21 '14 at 05:34
  • Sounds like a problem with your installation of Java, perhaps the path isn't set properly. – Ben Jan 21 '14 at 06:35
4

Here's an ngram approach using the qdap package:

## Text <- readLines(n=5)
## Hello world
## Hello
## How are you today
## I love stackoverflow
## blah blah blahdy

library(qdap)
ngrams(Text, seq_along(Text), 3)

It's a list and you can access the components with typical list indexing.
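As an illustration of that indexing on a plain list (base R only; the exact structure of the object qdap returns may differ, so treat this as a sketch):

```r
# A plain list standing in for per-document ngram output:
# one element per row of Text that yields trigrams
res <- list(c("how are you", "are you today"),
            c("i love stackoverflow"))

res[[1]]      # all trigrams for the first document
res[[1]][2]   # second trigram of the first document: "are you today"
```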

Edit:

As for your first approach, try it like this:

library(tau)
sapply(Text, textcnt, method = "ngram")

## sapply(eta_dedup$title, textcnt, method = "ngram")
Tyler Rinker
  • Thanks Tyler! I will explore your qdap package. I think for now I will use the RWeka/tm solution by Ben since it presents the data in a way where I can easily compute similarity. – Brian Jul 09 '13 at 20:38
3

Here's how to do it using the quanteda package:

txt <- c("Hello world", "Hello", "How are you today", "I love stackoverflow", "blah blah blahdy")

require(quanteda)
dfm(txt, ngrams = 3, concatenator = " ", verbose = FALSE)
## Document-feature matrix of: 5 documents, 4 features.
## 5 x 4 sparse Matrix of class "dfmSparse"
##   features
## docs    how are you are you today i love stackoverflow blah blah blahdy
##  text1           0             0                    0                0
##  text2           0             0                    0                0
##  text3           1             1                    0                0
##  text4           0             0                    1                0
##  text5           0             0                    0                1
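Note: in more recent quanteda releases the `ngrams` argument to `dfm()` was removed, so if the call above errors, the tokens-based equivalent should be (written from memory for newer versions, so check your installed release):

```r
library(quanteda)

txt <- c("Hello world", "Hello", "How are you today",
         "I love stackoverflow", "blah blah blahdy")

# Tokenize first, then form trigrams, then build the document-feature matrix
toks <- tokens(txt)
dfm(tokens_ngrams(toks, n = 3, concatenator = " "))
```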
Ken Benoit
3

The OP asked about tau, but the other answers don't use that package. Here's how to do it in tau:

data <- "Hello world\nHello\nHow are you today\nI love stackoverflow\nblah blah blahdy"

library(tau)
bigram_tau <- textcnt(data, n = 2L, method = "string", recursive = TRUE)

The result is stored as a trie, but you can format it as a more classic data-frame type with tokens and counts:

r <- data.frame(counts = unclass(bigram_tau), size = nchar(names(bigram_tau)))
format(r)

I highly recommend tau because it performs really well with large data. I have used it to create bigrams of a 1 GB corpus, and it was both fast and smooth.
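Since the question asks for 3-grams specifically, the same call with `n = 3L` should do it (an untested sketch along the lines of the bigram call above):

```r
library(tau)

data <- "Hello world\nHello\nHow are you today\nI love stackoverflow\nblah blah blahdy"

# Word trigrams; recursive = TRUE also includes the lower-order ngrams
trigram_tau <- textcnt(data, n = 3L, method = "string", recursive = TRUE)

# Most frequent entries first
head(sort(unclass(trigram_tau), decreasing = TRUE))
```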

ambodi