There's a get_dictionary() function in the fastrtext package, and I thought it would return all the words in the dictionary. However, when I set wordNgrams to 2 or 3, it returned exactly the same list of words as what I got when setting wordNgrams to 1. Can someone tell me what's going on here? Thanks!
Welcome to stackoverflow.com. Please read https://stackoverflow.com/help/how-to-ask – fgamess Jul 23 '18 at 05:21
1 Answer
When you increase n in wordNgrams, fastText keeps building its dictionary from the same individual words, which is why get_dictionary() returns the same list in every case: word n-grams are hashed into a separate bucket space (the -bucket option) rather than added to the dictionary. The difference is in training: instead of learning only from separate words ("I", "love", "NY"), the model also learns from concatenations of adjacent words ("I love", "love NY" are bigrams). For the sake of demonstration I trained on 5-grams (pentagrams). Of course, the bigger n is, the longer the computation takes, but the syntactic structure is captured better.
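To make the bigram idea concrete, here is a minimal sketch in plain R (no fastText involved; tokens is just a toy sentence for illustration) showing how a sentence decomposes into word bigrams:

tokens <- c("i", "love", "ny")
# pair each word with its successor to form the bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
# [1] "i love"  "love ny"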
library(fastrtext)
data("train_sentences")
data("test_sentences")
# prepare data
tmp_file_model <- tempfile()
train_labels <- paste0("__label__", train_sentences[,"class.text"])
train_texts <- tolower(train_sentences[,"text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
writeLines(text = train_to_write, con = train_tmp_file_txt)
test_labels <- paste0("__label__", test_sentences[,"class.text"])
test_texts <- tolower(test_sentences[,"text"])
test_to_write <- paste(test_labels, test_texts)
# learn model 1: 1-grams
library(microbenchmark)
microbenchmark(execute(commands = c("supervised", "-input", train_tmp_file_txt,
"-output", tmp_file_model, "-dim", 20, "-lr", 1,
"-epoch", 20, "-wordNgrams", 1, "-verbose", 1)), times = 5)
# mean time: 1.229228 seconds
model1 <- load_model(tmp_file_model)
# learn model 2: 5-grams
microbenchmark(execute(commands = c("supervised", "-input", train_tmp_file_txt,
"-output", tmp_file_model, "-dim", 20, "-lr", 1,
"-epoch", 20, "-wordNgrams", 5, "-verbose", 1)), times = 5)
# mean time: 2.659191 seconds
model2 <- load_model(tmp_file_model)
str(get_dictionary(model1))
# chr [1:5060] "the" "</s>" "of" "to" "and" "in" "a" "that" "is" "for" ...
str(get_dictionary(model2))
# chr [1:5060] "the" "</s>" "of" "to" "and" "in" "a" "that" "is" "for" ...
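Even though the dictionaries are identical, the two models are not: the larger n-grams change what the classifier learns. One way to check, reusing the test_to_write and test_labels objects prepared above, is to compare test-set accuracy; the predict call below follows the pattern from the fastrtext vignette, and the exact numbers will vary between runs:

preds1 <- predict(model1, sentences = test_to_write)
preds2 <- predict(model2, sentences = test_to_write)
# proportion of test sentences whose top predicted label matches the truth
mean(names(unlist(preds1)) == test_labels)
mean(names(unlist(preds2)) == test_labels)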

– Artem