
I know this has been asked multiple times. For example

Finding 2 & 3 word Phrases Using R TM Package

However, I don't know why none of these solutions work with my data. The result is always unigrams, no matter which n I choose for the n-grams (2, 3 or 4).

Does anybody know the reason why? I suspect the encoding might be the reason.

Edit: here is a small part of the data.

comments <- c("Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into problem_70918\n", 
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into tm-247\n", 
"Merge branch 'php5.3-upgrade-sprint6-7' of git.internal.net:/git/pn-project/LegacyCodebase into release2012.08\n", 
"Merge remote-tracking branch 'dmann1/p71148-s3-callplan_mapping' into lcst-operational-changes\n", 
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into TASK-360148\n", 
"Merge remote-tracking branch 'grockett/rpr-pre' into rpr-lite\n"
)
cleanCorpus <- function(vector){
  corpus <- Corpus(VectorSource(vector), readerControl = list(language = "en_US"))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, tolower)
  #corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  #corpus <- tm_map(corpus, PlainTextDocument)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  return(corpus)
}
# this function was provided by a team member (from the link I posted above)
test <- function(keywords_doc){

  BigramTokenizer <-  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  # create the term-document matrix
  keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))

  # remove sparse terms 
  keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.99)

  # Frequency of the words appearing
  keyword.freq <- rowSums(as.matrix(keywords_naremoval))
  subsetkeyword.freq <- subset(keyword.freq, keyword.freq >= 20)
  frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq) 

  # Sorting of the words
  frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
  frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
  frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]

  # Printing of the words
  # wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2"))
  return(frequentKeywordDF)
}

corpus <- cleanCorpus(comments)
t <- test(corpus)
> head(t)
             term freq
added       added    6
html         html    6
tracking tracking    6
common     common    4
emails     emails    4
template template    4
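
For what it's worth, a small check like the one below narrows things down: it runs the bigram logic by hand, outside TermDocumentMatrix(), and prints the class of the corpus. The remark about SimpleCorpus is only an assumption, not something confirmed in this thread.

# Diagnostic sketch (assumption: not guaranteed to change the result).
# Step 1: apply the bigram logic directly to one cleaned document to see
#         whether the tokenizer itself can produce bigrams at all.
library(tm)   # also attaches NLP, which provides ngrams()

toks <- strsplit(as.character(corpus[[1]]), "\\s+")[[1]]
unlist(lapply(ngrams(toks, 2), paste, collapse = " "))

# Step 2: check which corpus class Corpus() actually built.
# Recent tm versions may return a SimpleCorpus here, which the tm docs
# describe as having limited functionality; comparing against a VCorpus
# built from the same data is a cheap experiment.
class(corpus)
vcorp <- VCorpus(VectorSource(comments), readerControl = list(language = "en_US"))
class(vcorp)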

Thanks,

Duy Bui
  • It is helpful to post sample data and desired output. If the data set is large, then post a small portion of it using `dput(head(df1))`. – manotheshark Aug 23 '17 at 18:15
  • I have updated the question with a small portion of the data. I think the encoding could be the reason. I tried using the tm package with tokenization, and it works with other datasets. P/S: why is there so much hate on this forum? – Duy Bui Aug 24 '17 at 08:37
  • Please also share what code you are using to obtain the ngrams. – Imran Ali Aug 24 '17 at 08:48
  • That was updated. I have tried the method from the tm FAQ as well. – Duy Bui Aug 24 '17 at 09:13

1 Answer


I haven't found the reason either, but if you are only interested in the counts, regardless of which documents the bigrams occurred in, you could get them via this alternative pipeline:

library(tm)
library(dplyr)
library(quanteda)

# ..construct the corpus as in your post ...

corpus %>% 
  unlist() %>%                                  # flatten the tm corpus to a character vector
  tokens() %>%                                  # quanteda tokenization
  tokens_ngrams(2:2, concatenator = " ") %>%    # build bigrams only
  unlist() %>% 
  as.data.frame() %>% 
  group_by_(".") %>%                            # group by the bigram column (named ".")
  summarize(cnt = n()) %>%
  arrange(desc(cnt))                            # most frequent bigrams first
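
Regarding the 2:2 in tokens_ngrams(): as far as I can tell from the quanteda docs, the n argument is a vector of n-gram sizes, so 2:2 (equivalently just 2) keeps bigrams only, while something like n = 2:3 returns bigrams and trigrams together. A minimal illustration:

library(quanteda)

toks <- tokens("merge branch master into release")
tokens_ngrams(toks, n = 2, concatenator = " ")    # bigrams only (same as 2:2)
tokens_ngrams(toks, n = 2:3, concatenator = " ")  # bigrams and trigrams together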
knb
  • Your answer looks great. At least it helps me find the most frequent 2-gram words. What is the 2:2 in `tokens_ngrams`? Is it because the parameter is a vector of n-gram sizes, so you have to put 2:2 to restrict it to bigrams only? – Duy Bui Aug 24 '17 at 13:59
  • Not sure if there will be more answers to this, but yours solves my problem, even though it doesn't tell me what is wrong with the dataset. I have therefore marked your answer as the accepted one. Thanks. – Duy Bui Aug 24 '17 at 15:59