
I would like to use a pre-trained model with text2vec. My understanding is that the benefit of such models is that they have already been trained on a huge volume of data, e.g. the Google News model.

From the text2vec documentation, it looks like the getting-started code reads in text data and then trains a model on it:

library(text2vec)
text8_file = "~/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
  unzip ("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)

The documentation then proceeds to show one how to create tokens and a vocab:

# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

Then, this looks like the step to fit the model:

glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)

My question is: can the well-known Google pre-trained word2vec model be used here, without relying on my own vocab or on my own local device to train the model? If yes, how could I read it in and use it in R?

I think I'm misunderstanding or missing something here. Can I use text2vec for this task?

Doug Fir

2 Answers


At the moment text2vec doesn't provide any functionality for downloading or manipulating pre-trained word embeddings. I have a draft for adding such utilities in the next release.

On the other hand, you can easily do it manually with standard R tools. For example, here is how to read fastText vectors:

con = url("https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz", "r")
con = gzcon(con)
wv = readLines(con, n = 10)

Then you just need to parse it: strsplit and rbind are your friends.
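As a rough sketch of that parsing step, assuming the usual `.vec` text layout (an optional header line with two integers, then one line per word: the token followed by space-separated numbers). The helper name `parse_word_vectors` and the sample lines are made up for illustration; with the real file you would pass `wv` instead.

```r
# Hypothetical helper: turn lines of a .vec-style file into a numeric matrix
# with one row per word. Assumes tokens themselves contain no spaces.
parse_word_vectors <- function(lines) {
  parts <- strsplit(lines, " ", fixed = TRUE)
  # Assumption: a first line with exactly two fields is the
  # "<n_words> <dim>" header, so it is dropped
  if (length(parts) > 0 && length(parts[[1]]) == 2) {
    parts <- parts[-1]
  }
  words <- vapply(parts, function(p) p[1], character(1))
  # Drop the token, convert the rest to numbers, and stack the rows
  mat <- do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
  rownames(mat) <- words
  mat
}

# Made-up sample in the same layout (real vectors have 300 values per word)
sample_lines <- c(
  "2 3",
  "die -0.0006 -0.2826 -0.0075",
  "en 0.0096 0.0292 -0.0061"
)
word_vectors <- parse_word_vectors(sample_lines)
word_vectors["die", ]
```

This is only meant to show the idea; stacking rows this way can be slow for a full vocabulary of a few million words.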

Dmitriy Selivanov
  • Hi, I'm revisiting this. When you say "Then, you just need to parse it - strsplit and rbind are your friends", could you expand on that? For example, after running your three lines of code I can see vectors: `wv[2]` starts with `die -0.0006 -0.2826 -0.0075 0.0096 0.0292 -0.0061`. If I wanted to transform a column of text in a data frame I'm working on into word vectors, would the idea be to read in all of con (dropping n = 10), and then what? Maybe create a list of words, using a regex to extract "die", and then take the mean of the vector of numbers? – Doug Fir Nov 06 '18 at 18:21
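One common way to do what this comment describes is to average the vectors of the words in each text. A minimal sketch, assuming the embeddings are already in a numeric matrix with the words as row names. The toy matrix `emb`, the helper `average_vectors`, and the example data frame are invented for illustration, and words missing from the embeddings are simply skipped.

```r
# Toy embedding matrix standing in for the parsed pre-trained vectors
emb <- rbind(
  die = c(1, 0, 2),
  en  = c(3, 2, 0),
  kat = c(0, 4, 4)
)

# Hypothetical helper: one row per text, holding the mean of its word vectors
average_vectors <- function(texts, emb) {
  # Naive tokenisation: lowercase, then split on whitespace
  tokens <- strsplit(tolower(texts), "\\s+")
  t(vapply(tokens, function(tok) {
    known <- tok[tok %in% rownames(emb)]
    if (length(known) == 0) {
      return(rep(NA_real_, ncol(emb)))  # no known words in this text
    }
    colMeans(emb[known, , drop = FALSE])
  }, numeric(ncol(emb))))
}

df <- data.frame(text = c("die kat", "en die onbekend", "xyz"),
                 stringsAsFactors = FALSE)
doc_vectors <- average_vectors(df$text, emb)
```

Here `doc_vectors` has one row per row of `df`; a text with no known words gets a row of `NA`. Plain averaging ignores word order, so treat it as a baseline rather than the only option.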

This comes a bit late, but might be of interest to other users. Taylor Van Anne provides a short tutorial on how to use pre-trained GloVe vector models with text2vec here: https://gist.github.com/tjvananne/8b0e7df7dcad414e8e6d5bf3947439a9

thieled
  • Welcome to StackOverflow! Please edit your answer to include the _relevant_ parts of the tutorial you linked, as well as an *explanation* —in your own words—of whatever you've copied. Answers that only reference off-site material are generally removed, because if the link breaks the answer becomes less than useless. – Das_Geek Sep 17 '19 at 18:56