My ultimate goal is to create a quanteda dictionary to use for topic classification on text data.

However, my topic keywords are stored in a different format: one column of about 4000 keywords and a second column giving the topic each keyword belongs to. Note that the number of keywords varies from topic to topic. My data looks like this:

     keywords      topic
[1]  "one"         "number"
[2]  "two"         "number"
[3]  "three"       "number"
[4]  "triangle"    "form"
[5]  "circle"      "form"
[...]

How can I transform my keywords into a (quanteda) dictionary format, i.e. a named list with one character vector per topic, each holding the keywords for that topic?

The list should look like this:

list(number = c("one","two","three"),
     form = c("triangle","circle"))

Any help much appreciated!

My approach so far is below, but it doesn't look right to me (and it doesn't work):

# 1) Initialize an empty list with one element per topic and add the topic names
#    ("topic_names" is a character vector of length 88 holding the topic names)

mydictionary <- vector(mode = "list", length = 88)
names(mydictionary) <- topic_names

# 2) Create a loop that checks which topic each keyword matches and adds it to the respective vector of my dictionary

# I got it working for one keyword like this:
if (names(mydictionary[1]) == keyword_list$topic[1]) { # if topic of keyword matches topic vector name
  mydictionary[[1]] <- c(mydictionary[[1]], keyword_list$keywords[1]) # add keyword to topic vector
}

# However, I don't know how to turn this into a loop, since it has to check every index of keyword_list against every index of mydictionary, and I don't know how to achieve this...

Julian

1 Answer

If your data is in a data.frame like topics (see the Data section), you can get it into the list you want in one step with the base R function split(), which groups the keywords by topic.

my_dictionary <- split(topics$keywords, topics$topic)
my_dictionary

$form
[1] "triangle" "circle"  

$number
[1] "one"   "two"   "three"

Data:

topics <- structure(list(keywords = c("one", "two", "three", "triangle", 
"circle"), topic = c("number", "number", "number", "form", "form"
)), class = "data.frame", row.names = c(NA, -5L))
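To get from the plain list to an actual quanteda dictionary, something like the following should work. This is only a sketch on the toy data above: it assumes the quanteda package is installed (the call is skipped otherwise), and the names topic_order and quanteda_dict are just made up for illustration. Note that split() sorts the topics alphabetically, so the factor is only there to keep them in order of first appearance.

```r
# Toy data with the same column names as in the question
topics <- data.frame(
  keywords = c("one", "two", "three", "triangle", "circle"),
  topic    = c("number", "number", "number", "form", "form"),
  stringsAsFactors = FALSE
)

# A factor with explicit levels keeps the topics in order of first appearance
# (a plain character vector would be sorted alphabetically by split())
topic_order <- factor(topics$topic, levels = unique(topics$topic))
my_dictionary <- split(topics$keywords, topic_order)

# Wrap the named list in a quanteda dictionary; skipped if quanteda is missing
if (requireNamespace("quanteda", quietly = TRUE)) {
  quanteda_dict <- quanteda::dictionary(my_dictionary)
  print(quanteda_dict)
}
```

If the topic order doesn't matter to you, the one-line split() above is all you need before calling dictionary().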
phiver
  • Perfect, thank you! It worked straight away! I knew I was making the whole thing way too complicated. – Julian Aug 12 '21 at 15:48
  • Just a little follow-up question: when using split() I am losing one topic. My list contains only 87 vectors instead of 88, even though my original data.frame contained 88 unique topics. Any idea why and how this is happening? – Julian Aug 12 '21 at 16:01
  • Do you have a topic with an NA value? If you do, split() will drop it, together with the keywords assigned to it. – phiver Aug 12 '21 at 16:24
  • no, every topic has at least one assigned keyword – Julian Aug 13 '21 at 08:53
  • @Julian, just out of curiosity, what was the issue? – phiver Aug 13 '21 at 10:56
  • It wasn't about split(). I checked for NAs in my keyword column but there were none. I checked the number of topics with unique() and got 88. I just didn't follow your advice correctly: there was indeed a topic with an NA value, i.e. I had only 87 topics in my data (I expected 88 because I had inserted 88, but apparently lost one earlier in the process). – Julian Aug 13 '21 at 15:28
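For anyone hitting the same 88 vs. 87 mismatch from the comments: unique() counts NA as a value of its own, while split() silently drops rows whose grouping value is NA. A small sketch on made-up data (topics_na and the "orphan" keyword are purely hypothetical) that reproduces the effect and shows one way to spot it:

```r
# Made-up data: one keyword has no topic assigned
topics_na <- data.frame(
  keywords = c("one", "two", "triangle", "orphan"),
  topic    = c("number", "number", "form", NA),
  stringsAsFactors = FALSE
)

# unique() treats NA as a distinct value, so it is counted as a "topic"
n_unique <- length(unique(topics_na$topic))

# split() drops the rows whose grouping value is NA
n_split <- length(split(topics_na$keywords, topics_na$topic))

# Show the rows that got dropped
topics_na[is.na(topics_na$topic), ]

# If the NA group should be kept, make NA an explicit factor level first
with_na <- split(topics_na$keywords, addNA(factor(topics_na$topic)))
```

Here n_unique is one larger than n_split, which is exactly the kind of off-by-one described above. Checking sum(is.na(topics$topic)) on the topic column (not just the keyword column) before splitting would have caught it.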