I have another question about word2vec. I am using the sparklyr package and calling its ft_word2vec() function, but I have trouble understanding the output: however many sentences/paragraphs I pass to ft_word2vec(), I always get back exactly that many vectors, even if I have more sentences/paragraphs than distinct words. To me, that looks like I am getting paragraph vectors rather than word vectors. Maybe a code example helps to explain my problem:
library(sparklyr)
library(dplyr)
# add your Spark connection here as 'spark_connection', e.g. spark_connection <- spark_connect(master = "local")
# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
                                   "It is followed by the second sentence",
                                   "At the end there is the last sentence"))
# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)
# prepare data for ft_word2vec (sentences have to be tokenized: each row holds a list of words instead of one string)
# input_col must name the text column created above ("sentences")
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")
# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456)
FK_train <- partitions$training
FK_test <- partitions$test
# given a training data set (FK_train) with a column "tokens" (each row is a list of strings)
mymodel = ft_word2vec(
  FK_train,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))
# I tried to get the data back from Spark with:
myemb = mymodel %>% sparklyr::collect()
Has anybody had a similar experience? Can someone explain what exactly ft_word2vec() returns? Is there an example of how to get the word embedding vectors with this function? Or does the returned column indeed contain paragraph vectors?