I want to use LDA (latent Dirichlet allocation) to generate topics from my text at the level of phrases rather than at the level of words. How can I do that in R?

LDA treats each document as a bag of words and produces topics made up of individual words. For example, the text "Arsenal won FA cup in two consecutive years in 2014 and 2015. They are the kings of North London." could yield the topic [Arsenal - 50%, FA - 20%, cup - 10%, london - 10%, king - 10%].

I want it to return the topic at the level of phrases, e.g., [Arsenal, fa cup, north london].

carora3
  • Your question's a bit vague at the moment. Can you give more detail, preferably with some sample data and desired output? – Nick Kennedy Jun 22 '15 at 14:27
  • @NickK I've made the suggested change and have added an example. My question is very simple: how do I perform LDA at the level of phrases, so that topics come out as a distribution of phrases rather than words? – carora3 Jun 22 '15 at 14:48
  • `openNLP` pkg has routines that will tag each word with grammar type (noun, adjective, etc) – hrbrmstr Jun 22 '15 at 15:00
  • I understand what openNLP does, and I can generate phrases / chunks from there. But how can I make topicmodels generate topics at the level of phrases? – carora3 Jun 22 '15 at 15:09

1 Answer

I'm not aware of any way of pulling out the phrases automatically within R. However, you could change the input text so that each phrase is held together by an underscore or another character, which makes a whitespace tokenizer treat it as a single word. For example, you could do the following:

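example <- "Arsenal won FA cup in two consecutive years in 2014 and 2015. They are the kings of North London."

# Phrases that should be treated as single tokens
phrases <- c("FA cup", "North London")
# The same phrases with spaces replaced by underscores
phrasesJoined <- gsub(" ", "_", phrases, fixed = TRUE)
# Swap each phrase in the text for its underscore-joined form
for (i in seq_along(phrases)) {
  example <- gsub(phrases[i], phrasesJoined[i], example, fixed = TRUE)
}
# Tokenize with the lda package; each joined phrase is now one token
lda::lexicalize(example)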
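
As a rough sketch of the same idea using base R only, the replacement can be wrapped in a small helper and checked with a plain whitespace tokenizer. The helper name `join_phrases` and its `sep` argument are made up for illustration and do not come from any package; the phrase list is still supplied by hand.

```r
# Hypothetical helper (not part of any package): joins each known phrase
# with `sep` so that a whitespace tokenizer keeps it as a single token.
join_phrases <- function(text, phrases, sep = "_") {
  # Replace longer phrases first so a shorter phrase cannot split a longer one
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    text <- gsub(p, gsub(" ", sep, p, fixed = TRUE), text, fixed = TRUE)
  }
  text
}

example <- "Arsenal won FA cup in two consecutive years in 2014 and 2015. They are the kings of North London."
joined <- join_phrases(example, c("FA cup", "North London"))

# Drop punctuation except the underscore, lowercase, then split on spaces
tokens <- strsplit(tolower(gsub("[^[:alnum:]_ ]", "", joined)), " +")[[1]]
print(tokens)
```

Note that the matching here is exact and case-sensitive, and that any later preprocessing step that strips punctuation (the underscore belongs to `[[:punct:]]`) would likely undo the joining, so the separator or the order of the steps may need adjusting.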
Nick Kennedy