How to combine rows into one row in TermDocumentMatrix?

Question

I am trying to combine several rows into one row in a TermDocumentMatrix.

(I know that each row represents one word.)

e.g. cabin, staff -> crews

Because 'cabin', 'staff' and 'crew' mean the same thing, I am trying to combine the rows that represent 'cabin' and 'staff' into the single row that represents 'crews'.

But it doesn't work at all.

R says: argument "weighting" is missing, with no default

The code I typed is below.

library(httr)   # GET
library(rvest)  # read_html, html_nodes, html_text
library(tm)     # Corpus, tm_map, TermDocumentMatrix
library(slam)   # rollup

r = GET('http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/')
base_url = 'http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/'
h <- read_html(base_url)

all.reviews = c()

# collect the review texts from the first 10 pages
for (i in 1:10) {
  print(i)
  url = paste(base_url, 'page/', i, '/', sep="")
  r = GET(url)
  h = read_html(r)
  comment_area = html_nodes(h, '.tc_mobile')
  comments = html_nodes(comment_area, '.text_content')
  reviews = html_text(comments)
  all.reviews = c(all.reviews, reviews)
}

cps <- Corpus(VectorSource(all.reviews))
cps <- tm_map(cps, content_transformer(tolower)) 
cps <- tm_map(cps, content_transformer(stripWhitespace))
cps <- tm_map(cps, content_transformer(removePunctuation))
cps <- tm_map(cps, content_transformer(removeNumbers))
cps <- tm_map(cps, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(cps, control = list(
  wordLengths = c(3, 20),
  weighting = weightTf))

rows.cabin = grep('cabin|staff', row.names(tdm))
rows.cabin
# [1]  235 1594
count.cabin = as.array(rollup(tdm[rows.cabin,], 1)) 
count.cabin
# Docs
# Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# 1 0 1 1 0 0 2 2 0 0  1  1  0  4  0  1  0  1  0  2  1  0  0  1  3  1  4  2  0  3  0  1  1  4  0  0  2  1  0  0  2  1  0  2  1  3  3  1
# Docs
# Terms 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
# 1  0  1  0  1  2  3  2  2  1  1  0  2  0  0  0  0  0  2  0  1  0  0  4  0  2  2  1  3  1  1  1  1  0  0  0  5  3  0  2  1  0  1  0  0
# Docs
# Terms 92 93 94 95 96 97 98 99 100
# 1  1  5  2  1  0  0  0  1   0
row.crews = grep('crews', row.names(tdm))
row.crews
#[1] 408
tdm[row.crews,] = count.cabin
rows.cabin = setdiff(rows.cabin, row.crews) # ok
tdm = tdm[-rows.cabin,] # ok

dtm = as.DocumentTermMatrix(tdm)
# Error in .TermDocumentMatrix(t(x), weighting) :
# argument "weighting" is missing, with no default

Maybe this is not the right approach to combining rows in a TermDocumentMatrix.

Please fix this code or suggest a better approach to solve this problem.

Thanks in advance.

Answer

Hmm, I wonder why you stick with your approach, which obviously does not work, instead of just copying, pasting and adjusting* my suggestion from here?

library(tm)
library(httr)
library(rvest)
library(slam)
# [...] # your code
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
#        Docs
# Terms   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#   cabin 0 0 0 0 0 1 1 0 0  1  0  0  3  0  0
#   crew  0 0 0 1 1 1 1 0 2  1  0  1  0  2  0
#   crews 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0
#   staff 0 1 1 0 0 1 1 0 0  0  1  0  1  0  1

dict <- list(
  "CREW" = grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE, value = TRUE)
)
terms <- Terms(tdm)
for (x in seq_along(dict)) 
  terms[terms %in% dict[[x]] ] <- names(dict)[x]
tdm <- slam::rollup(tdm, 1, terms, sum)
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
#       Docs
# Terms  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#   CREW 0 1 1 1 1 3 3 0 2  2  1  1  4  2  1

*I only adjusted the line inside the dict definition...
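
The same relabel-and-roll-up idea can be tried offline on a tiny corpus before running it on the scraped reviews. This is only a sketch: the three review sentences and the second dictionary entry (`MEAL`) are made up for illustration and are not part of the original data.

```r
library(tm)
library(slam)

# Toy reviews (made up for illustration, not taken from the scraped data)
docs <- c("the cabin crew was friendly",
          "staff were helpful and the cabin was clean",
          "great food and a good seat")

tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)),
                          control = list(wordLengths = c(3, 20)))

# Hypothetical dictionary: each name is the label that replaces its members
dict <- list(CREW = c("cabin", "staff", "crew", "crews"),
             MEAL = c("food", "meal"))

# Relabel the synonym terms, then sum all rows that share a label
terms <- Terms(tdm)
for (label in names(dict)) {
  terms[terms %in% dict[[label]]] <- label
}
merged <- slam::rollup(tdm, 1, terms, sum)

# Dense view of the merged counts; show only the two merged rows
counts <- as.matrix(merged)
counts[c("CREW", "MEAL"), ]
```

Using exact term lists in the dictionary instead of a regular expression such as "cabin|staff|crew" avoids accidentally merging longer words that merely contain one of the patterns.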

edited May 23 '17 at 12:16

answered Oct 03 '16 at 11:33

lukeA

But there is still a problem: after I combined the rows in the way you suggested, I lost the `weighting` (e.g. tf, tfidf) in the `TermDocumentMatrix`. I tried many things, but I don't know why this happens or how to solve it. Do you have any idea? – Ju Whan Lim Oct 12 '16 at 20:51
In theory: Add `attributes(tdm) <- c(attributes(tdm), list(weighting = unlist(attributes(weightTf)[c("name", "acronym")], F,F)))` and the standard term frequency weighting is back (name & acronym). – lukeA Oct 12 '16 at 21:00
Since I am new to LDA and R, I seem to have a lot of basic questions. But thanks to you, I solved the problem. I really appreciate it! – Ju Whan Lim Oct 13 '16 at 05:00
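
For readers who would rather not patch attributes by hand, one possible alternative (not the method used in the answer) is to merge the rows on a dense copy with base R's `rowsum()` and then rebuild the matrix while stating the weighting explicitly. This is a sketch on made-up sentences; it assumes the matrix is small enough to be held densely in memory, which may not be true for a large corpus.

```r
library(tm)
library(slam)

# Toy reviews (made up for illustration)
docs <- c("the cabin crew was friendly",
          "staff were helpful and the cabin was clean")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)),
                          control = list(wordLengths = c(3, 20)))

# Relabel the synonyms, then merge rows on a dense copy with base R
terms <- Terms(tdm)
terms[terms %in% c("cabin", "staff", "crew", "crews")] <- "CREW"
merged <- rowsum(as.matrix(tdm), group = terms)

# Rebuild the term-document matrix, passing the weighting explicitly
tdm2 <- as.TermDocumentMatrix(as.simple_triplet_matrix(merged),
                              weighting = weightTf)

# The conversion that failed in the question
dtm <- as.DocumentTermMatrix(tdm2)
```

Because the weighting is supplied when the matrix is rebuilt, the later conversion to a document-term matrix does not have to look for an attribute that was dropped along the way.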
