
There's a get_dictionary() function in the fastrtext package, and I thought it would return all the words in the dictionary. However, when I set wordNgrams to 2 or 3, it returned exactly the same list of words as when I set wordNgrams to 1. Can someone tell me what's going on here? Thanks!


1 Answer


When you increase the n in wordNgrams, fastText keeps working with the same word dictionary in every case: get_dictionary() only ever returns individual words. Word n-grams are never stored as dictionary entries; fastText hashes them into a fixed number of buckets (the -bucket parameter), which is why the returned word list is identical whether wordNgrams is 1, 2, or 3. What changes is the input the model trains on: instead of only separate words ("I", "love", "NY"), it also uses concatenations of adjacent words ("I love", "love NY" for bigrams). For the sake of demonstration I trained on 5-grams (pentagrams ;)). Of course, the bigger the n in n-grams, the longer the computation, but syntactic structure is captured better.
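A minimal sketch (plain R, not fastText's internal code) of how word bigrams are built from a tokenized sentence:

tokens <- c("i", "love", "ny")
# pair each word with its successor to get the bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
# [1] "i love" "love ny"

The full 1-gram vs 5-gram demonstration: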

library(fastrtext)

data("train_sentences")
data("test_sentences")

# prepare data
tmp_file_model <- tempfile()

train_labels <- paste0("__label__", train_sentences[,"class.text"])
train_texts <- tolower(train_sentences[,"text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
writeLines(text = train_to_write, con = train_tmp_file_txt)

test_labels <- paste0("__label__", test_sentences[,"class.text"])
test_texts <- tolower(test_sentences[,"text"])
test_to_write <- paste(test_labels, test_texts)

# learn model 1: 1-grams
library(microbenchmark)
microbenchmark(execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 20, "-lr", 1,
                     "-epoch", 20, "-wordNgrams", 1, "-verbose", 1)), times = 5)

# mean time: 1.229228 seconds

model1 <- load_model(tmp_file_model)

# learn model 2: 5-grams
microbenchmark(execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 20, "-lr", 1,
                     "-epoch", 20, "-wordNgrams", 5, "-verbose", 1)), times = 5)

# mean time: 2.659191 seconds

model2 <- load_model(tmp_file_model)
str(get_dictionary(model1))
# chr [1:5060] "the" "</s>" "of" "to" "and" "in" "a" "that" "is" "for" ...
str(get_dictionary(model2))
# chr [1:5060] "the" "</s>" "of" "to" "and" "in" "a" "that" "is" "for" ...
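
Although the dictionaries are identical, the two models are not: model2's input representation also includes the hashed n-gram buckets, so its predictions can differ. A hedged check on the held-out sentences prepared above (assuming, as in the package README, that fastrtext's predict() returns a list of probability vectors named by label, with the __label__ prefix stripped):

test_labels_plain <- test_sentences[, "class.text"]
pred1 <- predict(model1, sentences = test_texts)
pred2 <- predict(model2, sentences = test_texts)
# accuracy: share of top-1 predicted labels matching the true class
mean(sapply(pred1, function(p) names(p)[1]) == test_labels_plain)
mean(sapply(pred2, function(p) names(p)[1]) == test_labels_plain)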