I am running a logistic regression with the code from the Spark ML Pipeline guide at https://spark.apache.org/docs/2.2.0/ml-pipeline.html (Example: Pipeline).
Original code from the link:
```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.transform(test).show()
```
Two things are happening here that I'd like help understanding.
If I reduce setNumFeatures to 10, i.e. setNumFeatures(10), the algorithm predicts 1.0 for id 5 in the test set. I suspect this is caused by hashing collisions.
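Collisions are indeed guaranteed once the vocabulary exceeds the bucket count. The sketch below uses `scala.util.hashing.MurmurHash3` as a stand-in for Spark's internal term hash (the exact bucket indices Spark computes will differ), so with 15 distinct terms and only 10 buckets, at least two terms must share a bucket and become indistinguishable to the model:

```scala
import scala.util.hashing.MurmurHash3

// Illustrative stand-in for HashingTF's term -> bucket mapping.
// (Spark's hash function differs, but the pigeonhole argument is identical.)
val numFeatures = 10
val words = ('a' to 'o').map(_.toString)  // 15 distinct terms

// Non-negative bucket index in [0, numFeatures), as a hashing vectorizer would compute.
val buckets = words.map(w => math.floorMod(MurmurHash3.stringHash(w), numFeatures))

println(words.zip(buckets).mkString(", "))
println(s"distinct buckets used: ${buckets.distinct.size} (out of ${words.size} terms)")
```

Because 15 terms cannot occupy 10 buckets without overlap, at least one pair of unrelated words contributes to the same feature, which can flip a prediction for a short document.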
When I change the code to use Word2Vec instead of HashingTF:
```scala
import org.apache.spark.ml.feature.Word2Vec

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val word2Vec = new Word2Vec().setInputCol(tokenizer.getOutputCol).setOutputCol("features").setVectorSize(1000).setMinCount(0)
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, word2Vec, lr))
val model = pipeline.fit(training)
model.transform(test).show()
```
This also predicts 1.0 for id 5, even with setVectorSize(1000). I also noticed that the "features" column is all zeros for id 5; none of its words ("l", "m", "n") appear in the training data. When I change the test data to the following, it predicts correctly:
```scala
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l d"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")
```
Questions:
1. What is the best way to run a logistic regression when my test data may contain words that do not appear in the training data?
2. In a case like this, would HashingTF be better than Word2Vec?
3. What is the logic for choosing setNumFeatures and setVectorSize?