How do I get the word-embedding matrix from ft_word2vec (sparklyr-package)?

Question

I have another question in the word2vec universe. I am using the sparklyr package and call its ft_word2vec() function. I have trouble understanding the output: however many sentences/paragraphs I pass to ft_word2vec(), I get back exactly that many vectors, even when I have more sentences/paragraphs than words. To me, that looks like I am getting paragraph vectors rather than word vectors. Maybe a code example helps to explain my problem?

library(sparklyr)
library(dplyr)

# add your spark_connection here as 'spark_connection = '

# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
  "It is followed by the second sentence",
  "At the end there is the last sentence"))

# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)

# prepare data for ft_word2vec (sentences have to be tokenized: a list of words instead of one string in each row)
# input_col must name the text column of the data frame, which is "sentences" here
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")

# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456) 
FK_train <- partitions$training
FK_test <- partitions$test

# given a training data set (FK_train) with a column "tokens" (each row = a list of strings)
mymodel = ft_word2vec(
  FK_train,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))

# I tried to get the data from spark with:
myemb = mymodel %>% sparklyr::collect()

Has somebody had similar experiences? Can someone explain what exactly the ft_word2vec() function returns? Do you have an example on how to get the word embedding vectors with this function? Or does the returned column indeed contain the paragraph vectors?

score 2 · Accepted Answer · answered Dec 10 '20 at 07:34

My colleague found a solution! Once you know how to do it, the instructions really begin to make sense.

library(sparklyr)
library(dplyr)

# add your spark_connection here as 'spark_connection = '

# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
  "It is followed by the second sentence",
  "At the end there is the last sentence"))

# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)

# prepare data for ft_word2vec (sentences have to be tokenized: a list of words instead of one string in each row)
# input_col must name the text column of the data frame, which is "sentences" here
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")

# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456) 
FK_train <- partitions$training
FK_test <- partitions$test

# CHANGES FOLLOW HERE:
# We have to pass the spark connection instead of the data. For me this was the confusing part, since I thought no data -> no model.
# Maybe we can think of this step as an initialization: it creates an unfitted word2vec estimator.
mymodel = ft_word2vec(
  spark_connection,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))

# now that we have our model initialized, we fit it on the tokenized training data to learn the word embeddings
w2v_model = ml_fit(mymodel, FK_train)

# now we can collect the embedding vectors
emb = w2v_model$vectors %>% collect()
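
To go from the collected `emb` table to an actual word-embedding matrix, something like the following sketch may help. It does not need a Spark connection: it uses a small hand-made stand-in for `emb`, and it assumes the collected result has a character column `word` and a list column `vector` (the column names Spark uses for the word vectors). The three words and their numbers are made up for illustration, so check the structure of your own `emb` with str(emb) first.

```r
# Stand-in for the collected `emb` table (assumption: one row per vocabulary
# word, a character column `word`, and a list column `vector` holding
# numeric vectors of length vector_size). The values are made up.
emb <- data.frame(word = c("first", "second", "sentence"),
                  stringsAsFactors = FALSE)
emb$vector <- list(c(0.1, 0.2, 0.3),
                   c(0.4, 0.5, 0.6),
                   c(0.5, 0.6, 0.9))

# Stack the per-word vectors into a (words x dimensions) matrix
emb_matrix <- do.call(rbind, emb$vector)
rownames(emb_matrix) <- emb$word

# Look up the embedding of a single word
emb_matrix["second", ]

# Averaging the word vectors of one tokenized sentence gives a single
# vector for that sentence
tokens <- c("first", "sentence")
sentence_vec <- colMeans(emb_matrix[tokens, , drop = FALSE])
```

The last step also hints at what the question observed. According to the Spark documentation, a fitted Word2Vec model transforms each document into the average of its word vectors, so calling ft_word2vec() directly on a data frame returns one vector per row (sentence) and not one per word.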
