
I have a data frame with only one column, "text":

"text"
"User Interfaces"
"Twitter"
"Text Normalization"
"Term weighting"
"Teenagers"
"Team member replacement"

I would like to get a data frame with the frequency of every phrase, like this:

 "User Interfaces",1
 "Twitter",1
 "Text Normalization",1
 "Term weighting",1
 "Teenagers",1
 "Team member replacement",1

In order to do that, I use this:

library(tm)
df <- read.csv("C:/Users/acel/Desktop/myphr.csv", header = TRUE, sep = ",")
# build a corpus from the text column and clean it up
corpusD <- Corpus(VectorSource(df$text))
corpusD <- tm_map(corpusD, tolower)
corpusD <- tm_map(corpusD, removeWords, stopwords('english'))
corpusD <- tm_map(corpusD, removeNumbers)
corpusD <- tm_map(corpusD, stripWhitespace)
corpusD <- tm_map(corpusD, PlainTextDocument)
corpusD <- tm_map(corpusD, stemDocument, language = "english")
# build a term-document matrix and keep the 30 most frequent terms
corpusC <- Corpus(VectorSource(corpusD))
matrixD <- TermDocumentMatrix(corpusC)
matrixD <- removeSparseTerms(matrixD, 0.75)
MatrixDfreq <- rowSums(as.matrix(matrixD))
MatrixDfreq <- sort(MatrixDfreq, decreasing = TRUE)
MatrixDtop30 <- MatrixDfreq[1:30]

But when I check the result in `MatrixDtop30`, I see individual words counted, like `user,1` and `interface,1`, instead of the whole phrase `"user interface",1`.
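
Here is a minimal example that reproduces the splitting, in case it helps:

library(tm)
# a single two-word phrase goes in...
tdm <- TermDocumentMatrix(Corpus(VectorSource("User Interfaces")))
# ...but two separate single-word terms come out
Terms(tdm)
# [1] "interfaces" "user"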

Any idea why this is happening?


2 Answers


I think this would be a lot easier using data.table operations.

library(data.table)
df = data.frame(text = c("test", "test" ,"test" , "test2", "test3", "test2"))

> df
   text
1  test
2  test
3  test
4 test2
5 test3
6 test2

setDT(df)
df = df[ , .(Number = .N), by = .(text)]

> df
    text Number
1:  test      3
2: test2      2
3: test3      1
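
If you want the most frequent phrases first, you can chain an ordering step onto the same data.table:

# sort descending by count
df[order(-Number)]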

Edit

We can include stemming like this:

library(data.table)
library(SnowballC)
df = data.frame(text = c("test", "testing" ,"test" , "test2", "test3", "test2"))

> df
     text
1    test
2 testing
3    test
4   test2
5   test3
6   test2

df$text = wordStem(df$text, language = "porter")

> df
   text
1  test
2  test
3  test
4 test2
5 test3
6 test2

setDT(df)
df = df[ , .(Number = .N), by = .(text)]

> df
    text Number
1:  test      3
2: test2      2
3: test3      1
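
One caveat: `wordStem` stems each element of the vector as if it were a single word, so a multi-word phrase like "Team member replacement" won't be stemmed word by word. A rough sketch of one way around that (`stem_phrase` is just a helper name I'm introducing here, not part of SnowballC):

library(SnowballC)

# split each phrase into words, stem every word, then paste it back together
stem_phrase <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(w) {
    paste(wordStem(w, language = "porter"), collapse = " ")
  }, character(1))
}

stem_phrase("Team member replacement")
# [1] "Team member replac"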
Kristofersen
  • Thank you. Very nice workaround, but I would like to have the stemming, and when I try to convert corpusD to a dataframe using `data.frame(text = sapply(corpusD, as.character), stringsAsFactors = FALSE)` I receive this error: `Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 6, 7`; that's why I tried the frequencies from tm – Keri May 12 '17 at 20:42
  • @Keri I don't have any experience using that package. An alternative would be to run `library(SnowballC); wordStem(df$text, language = "porter")` – Kristofersen May 12 '17 at 20:44
  • save it to the df obviously, but that will stem each word in the data.frame – Kristofersen May 12 '17 at 20:44
  • Avoid struggling with the tm package's limitations and syntax quirks; use ***quanteda*** instead. See [my advice on using quanteda for DTMs](http://stackoverflow.com/a/42953161/202229) – smci May 12 '17 at 21:04
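
Regarding the "differing number of rows" error in the comments above: I can't reproduce it without the original file, but a common culprit is the `tm_map(corpusD, PlainTextDocument)` step. A sketch that instead collapses each document to a single string before building the data frame (untested on the original data):

# one row per document, multi-line documents flattened to one string
df <- data.frame(
  text = vapply(corpusD, function(d) paste(as.character(d), collapse = " "),
                character(1)),
  stringsAsFactors = FALSE
)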

From your example output it doesn't look like you're performing any transformations on the text, such as lowercasing or removing stopwords; you're just keeping the phrases as they are. If so, you can easily count the number of unique phrases using the tidyverse.

library(dplyr)
library(readr)

df <- data_frame(text = c("User Interfaces", "Twitter", "Text Normalization", "Term weighting", "Teenagers", "Team member replacement")
count(df, text)
                     text     n
                    <chr> <int>
1 Team member replacement     1
2               Teenagers     1
3          Term weighting     1
4      Text Normalization     1
5                 Twitter     1
6         User Interfaces     1

or

text_df <- read_csv("C:/Users/acel/Desktop/myphr.csv")
count(text_df, text, sort = TRUE)

If you need to perform transformations on the text, look at the stringr and tidytext packages.
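
For example, lowercasing and trimming the phrases before counting might look like this (a sketch; swap in whatever transformations you need):

library(dplyr)
library(stringr)

text_df %>%
  mutate(text = str_to_lower(str_trim(text))) %>%
  count(text, sort = TRUE)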

beigel