
Running Spark's Word2Vec example, I realized that it takes in an array of strings and gives out a single vector. My question is: shouldn't it return a matrix instead of a vector? I was expecting one vector per input word, but it returns one vector, period!

Or maybe it should have accepted a single string (one word) as input, instead of an array of strings. Then, sure, it could return one vector as output. But accepting an array of strings and returning one single vector does not make sense to me.

[UPDATE]

Per @Shaido's request, here's the code with my minor change to print the schema for the output:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JavaWord2VecExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("JavaWord2VecExample")
                .getOrCreate();

        // $example on$
        // Input data: Each row is a bag of words from a sentence or document.
        List<Row> data = Arrays.asList(
                RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
                RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
                RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        );
        StructType schema = new StructType(new StructField[]{
                new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        Dataset<Row> documentDF = spark.createDataFrame(data, schema);

        // Learn a mapping from words to Vectors.
        Word2Vec word2Vec = new Word2Vec()
                .setInputCol("text")
                .setOutputCol("result")
                .setVectorSize(7)
                .setMinCount(0);

        Word2VecModel model = word2Vec.fit(documentDF);
        Dataset<Row> result = model.transform(documentDF);

        for (Row row : result.collectAsList()) {
            List<String> text = row.getList(0);
            System.out.println("Schema: " + row.schema());
            Vector vector = (Vector) row.get(1);
            System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
        }
        // $example off$

        spark.stop();
    }
}

And it prints:

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Hi, I, heard, about, Spark] => 
Vector: [-0.0033279924420639875,-0.0024428479373455048,0.01406305879354477,0.030621735751628878,0.00792500376701355,0.02839711122214794,-0.02286271695047617]

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [I, wish, Java, could, use, case, classes] => 
Vector: [-9.96453288410391E-4,-0.013741840076233658,0.013064394239336252,-0.01155538750546319,-0.010510949650779366,0.004538436819400106,-0.0036846946126648356]

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Logistic, regression, models, are, neat] => 
Vector: [0.012510885251685977,-0.014472834207117558,0.002779599279165268,0.0022389178164303304,0.012743516173213721,-0.02409198731184006,0.017409833287820222]

Please correct me if I'm wrong, but the input is an array of strings and the output is a single vector, whereas I was expecting each word to be mapped to its own vector.


2 Answers


This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...

To start with, how exactly individual word embeddings should be combined is not in principle a feature of the Word2Vec model itself (which is about, well, individual words), but an issue of concern to "higher order" models, such as Sentence2Vec, Paragraph2Vec, Doc2Vec, Wikipedia2Vec etc (you could name a few more, I guess...).

Having said that, it turns out that a very first approach to combining word vectors in order to get vector representations of larger pieces of text (phrases, sentences, tweets etc.) is indeed to simply average the vector representations of the constituent words, as Spark ML does.
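To make that combination rule concrete, here is a plain-Java sketch of the element-wise averaging (this is just an illustration of the idea, not Spark's actual implementation, which lives in `Word2VecModel.transform`):

```java
import java.util.Arrays;
import java.util.List;

public class AverageVectors {
    // Element-wise mean of equally-sized word vectors -- the same
    // combination rule Spark ML applies to the words of a sentence.
    static double[] average(List<double[]> wordVectors) {
        int size = wordVectors.get(0).length;
        double[] mean = new double[size];
        for (double[] v : wordVectors) {
            for (int i = 0; i < size; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < size; i++) {
            mean[i] /= wordVectors.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        List<double[]> vectors = Arrays.asList(
                new double[]{1.0, 2.0},
                new double[]{3.0, 4.0});
        System.out.println(Arrays.toString(average(vectors)));  // [2.0, 3.0]
    }
}
```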

Starting from the practitioner community, we have:

How to concatenate word vectors to form sentence vector (SO answer):

There are at least three common ways to combine embedding vectors; (a) summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and dm_mean - it allows you to use any of those three options

Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):

So what’s first thing that comes to your mind when you have word vectors and need to calculate sentence vector.

Just average them?

Yes that’s what we are going to do here.

Sentence2Vec (Github repo):

Word2Vec can help to find other words with similar semantic meaning. However, Word2Vec can only take 1 word each time, while a sentence consists of multiple words. To solve this, I write the Sentence2Vec, which is actually a wrapper to Word2Vec. To obtain the vector of a sentence, I simply get the averaged vector sum of each word in the sentence.

It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.

An expected counter-argument here is that blog posts and SO answers are arguably not that credible sources; what about the researchers and the relevant scientific literature? Well, it turns out that this simple averaging is far from uncommon here, too:

From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

[image: excerpt from the paper]

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

[image: excerpt from the paper]


It should be clear by now that the particular design choice in Spark ML is far from arbitrary, or even uncommon; I have blogged about what certainly seem like absurd design choices in Spark ML (see Classification in Spark 2.0: “Input validation failed” and other wondrous tales), but it seems that this is not such a case...

desertnaut
  • Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again. – Mehran Nov 28 '18 at 23:30

To see the vector corresponding to each word, you can run `model.getVectors`. For the DataFrame in the question (with a vector size of 3 instead of 7) this gives:

+----------+-----------------------------------------------------------------+
|word      |vector                                                           |
+----------+-----------------------------------------------------------------+
|heard     |[0.14950960874557495,-0.11237259954214096,-0.03993036597967148]  |
|are       |[-0.16390761733055115,-0.14509087800979614,0.11349033564329147]  |
|neat      |[0.13949351012706757,0.08127426356077194,0.15970033407211304]    |
|classes   |[0.03703496977686882,0.05841822177171707,-0.02267565205693245]   |
|I         |[-0.018915412947535515,-0.13099457323551178,0.14300788938999176] |
|regression|[0.1529865264892578,0.060659825801849365,0.07735282927751541]    |
|Logistic  |[-0.12702016532421112,0.09839040040969849,-0.10370948910713196]  |
|Spark     |[-0.053579315543174744,0.14673036336898804,-0.002033260650932789]|
|could     |[0.12216471135616302,-0.031169598922133446,-0.1427609771490097]  |
|use       |[0.08246973901987076,0.002503493567928672,-0.0796264186501503]   |
|Hi        |[0.16548289358615875,0.06477408856153488,0.09229831397533417]    |
|models    |[-0.05683165416121483,0.009706663899123669,-0.033789146691560745]|
|case      |[0.11626788973808289,0.10363516956567764,-0.07028932124376297]   |
|about     |[-0.1500445008277893,-0.049380943179130554,0.03307584300637245]  |
|Java      |[-0.04074851796030998,0.02809843420982361,-0.16281810402870178]  |
|wish      |[0.11882393807172775,0.13347993791103363,0.14399205148220062]    |
+----------+-----------------------------------------------------------------+

So each word does have its own representation. However, what happens when you input a sentence (an array of strings) to the model is that all the vectors of the words in the sentence get averaged together.
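Since the per-word vectors are available this way, the per-sentence matrix the question asks for can be assembled in user code; a minimal plain-Java sketch, where the lookup `Map` is hypothetical stand-in data for what `model.getVectors` would give you:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SentenceMatrix {
    // Build a |words| x vectorSize matrix: one row per word,
    // each row being that word's vector from the lookup map.
    static double[][] toMatrix(String[] words, Map<String, double[]> vectors) {
        double[][] matrix = new double[words.length][];
        for (int i = 0; i < words.length; i++) {
            matrix[i] = vectors.get(words[i]);
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Hypothetical 2-dimensional word vectors for illustration only.
        Map<String, double[]> vectors = new HashMap<>();
        vectors.put("Hi", new double[]{0.1, 0.2});
        vectors.put("Spark", new double[]{0.3, 0.4});
        double[][] m = toMatrix(new String[]{"Hi", "Spark"}, vectors);
        System.out.println(Arrays.deepToString(m));  // [[0.1, 0.2], [0.3, 0.4]]
    }
}
```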

From the github implementation:

/**
 * Transform a sentence column to a vector column to represent the whole sentence. The transform
 * is performed by averaging all word vectors it contains.
 */
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
...

This can easily be confirmed, for example:

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.011055880039930344,0.020988055132329465,0.042608972638845444]

The first element is computed by taking the average of the first elements of the vectors of the five words involved,

(-0.12702016532421112 + 0.1529865264892578 -0.05683165416121483 -0.16390761733055115 + 0.13949351012706757) / 5

which equals -0.011055880039930344.
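That arithmetic can be checked directly with a throwaway snippet (the five literals are the first components of the word vectors in the table above):

```java
public class CheckAverage {
    public static void main(String[] args) {
        // First vector components of Logistic, regression, models, are, neat
        double avg = (-0.12702016532421112 + 0.1529865264892578
                - 0.05683165416121483 - 0.16390761733055115
                + 0.13949351012706757) / 5;
        System.out.println(avg);  // ~ -0.01105588003993...
    }
}
```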

Shaido
  • I'm sold but that does not make sense to me. Why the array is averaged? I mean the average has no value. Each input needs to be transformed into a matrix, not a vector. It seems to me this implementation is absolutely useless! – Mehran Nov 13 '18 at 13:42
  • 1
@Mehran: Instead of inputting a sentence to transform, simply split it up into words beforehand and input the words separately. Then you will have a matrix. – Shaido Nov 14 '18 at 01:08
  • I believe what you mean is that each row should hold one word (a column of type `String[]` with exactly one element). While I thought of the same but I realized that this is not gonna work. You see, if you are going to pass the output of Word2Vec to an RNN (a common scenario) you are interested in sentences as input (words won't do). And if you split up each of your sentences into words beforehand, you cannot go back to sentences since you don't know where the previous sentence ends and where the next one starts. Again, to me, this implementation seems useless. Unless I'm missing something. – Mehran Nov 14 '18 at 01:22
  • 1
@Mehran: Maybe you could have an id column marking which sentence the word belongs to (however, then the order will be lost). I don't think there is any easy way to do this natively in Spark... currently the implementation seems more focused on finding word synonyms and summarizing documents. – Shaido Nov 14 '18 at 02:28
  • The reason why I asked this question is that this is absolutely unbelievable. AFAIK, what I'm asking is the default behaviour of Word2Vec and that's what it is designed for. It seems someone has added one extra step and messed up a very useful model. I mean this averaging step could still be added if they had returned a matrix. Going from matrix to vector (if someone needs it) should be implemented outside of the model. Anyways, thanks. – Mehran Nov 14 '18 at 04:04
  • 1
    @Mehran I concur (and +1 for your question), but IMHO this is not reason for *not* accepting a nice answer (which, at the end of the day, did answer the question and pointed out the reason of this behavior, irrespectively if we like the rationale of Spark people or not) - cheers... – desertnaut Nov 28 '18 at 18:24
  • @desertnaut I appreciate the time and effort but if I accept the given answer, it means I'm happy with the one I've got and I expect no more which is not true. I'm hoping someone could come up with an answer explaining the rationale. It's not a matter of points, I'm looking for someone to prove me wrong. Maybe my expectations are wrong. The given answer explains the "how" not the "why" as asked in the OP. – Mehran Nov 28 '18 at 18:42
  • @Mehran although the question as you have just framed it is arguably off-topic for SO, I have attempted an answer; hope it helps... – desertnaut Nov 28 '18 at 23:08