I am new to Spark ML, which has a MinHash implementation for Jaccard distance; see the documentation at https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance. In the sample code there, the input data for comparison are already vectors, and I have no questions about that code. But when I use text documents as input and convert them to vectors via Word2Vec, I get a Jaccard distance of 0 for every pair. I do not know what is wrong with my code, so there must be something I have not understood. Thanks in advance for any help.
// Imports needed by the snippet below.
import static org.apache.spark.sql.functions.col;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
SparkSession spark = SparkSession.builder().appName("TestMinHashLSH").config("spark.master", "local").getOrCreate();
List<Row> data1 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" "))));
List<Row> data2 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Scala".split(" "))),
RowFactory.create(Arrays.asList("I wish python could also use case classes".split(" "))));
StructType schema4word = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) });
Dataset<Row> documentDF1 = spark.createDataFrame(data1, schema4word);
// Learn a mapping from words to Vectors.
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(30).setMinCount(0);
Word2VecModel w2vModel1 = word2Vec.fit(documentDF1);
Dataset<Row> result1 = w2vModel1.transform(documentDF1);
List<Row> myDataList1 = new ArrayList<>();
int id = 0;
for (Row row : result1.collectAsList()) {
List<String> text = row.getList(0);
Vector vector = (Vector) row.get(1);
myDataList1.add(RowFactory.create(id++, vector));
}
StructType schema1 = new StructType(
new StructField[] { new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) });
Dataset<Row> df1 = spark.createDataFrame(myDataList1, schema1);
Dataset<Row> documentDF2 = spark.createDataFrame(data2, schema4word);
Word2VecModel w2vModel2 = word2Vec.fit(documentDF2);
Dataset<Row> result2 = w2vModel2.transform(documentDF2);
List<Row> myDataList2 = new ArrayList<>();
id = 10;
for (Row row : result2.collectAsList()) {
List<String> text = row.getList(0);
Vector vector = (Vector) row.get(1);
System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
myDataList2.add(RowFactory.create(id++, vector));
}
Dataset<Row> df2 = spark.createDataFrame(myDataList2, schema1);
MinHashLSH mh = new MinHashLSH().setNumHashTables(5).setInputCol("features").setOutputCol("hashes");
MinHashLSHModel model = mh.fit(df1);
// Feature Transformation
System.out.println("The hashed dataset where hashed values are stored in the column 'hashes':");
model.transform(df1).show();
// Compute the locality sensitive hashes for the input rows, then perform an
// approximate similarity join.
// We could avoid computing hashes by passing in the already-transformed
// dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
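System.out.println("Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:");
// Note: the threshold actually passed here is 1.6, not the 0.6 in the message above.
// Jaccard distance is at most 1, so this threshold filters out no candidate pair.
model.approxSimilarityJoin(df1, df2, 1.6, "JaccardDistance")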
.select(col("datasetA.id").alias("id1"), col("datasetB.id").alias("id2"), col("JaccardDistance"))
.show();
spark.stop();
Word2Vec gives me different vectors for different documents, so I would expect some non-zero JaccardDistance values when comparing two different documents. Instead, I get all 0s. This is the output when I run the program:
Text: [Hi, I, heard, about, Scala] =>
Vector: [0.005808539432473481,-0.001387741044163704,0.007890049391426146,... ,04969391227]

Text: [I, wish, python, could, also, use, case, classes] =>
Vector: [-0.0022146602132124826,0.0032128597667906433,-0.00658524181926623,...,-3.716901264851913E-4]

Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:
+---+---+---------------+
|id1|id2|JaccardDistance|
+---+---+---------------+
|  1| 11|            0.0|
|  0| 10|            0.0|
|  2| 11|            0.0|
|  0| 11|            0.0|
|  1| 10|            0.0|
|  2| 10|            0.0|
+---+---+---------------+
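One thing that may be relevant: the MinHash section of the Spark documentation linked above describes its input sets as binary vectors, where a non-zero entry marks an element as present in the set. Under that reading, a dense Word2Vec vector with every entry non-zero would stand for the full index set, which may be why every pair comes out at distance 0. Below is a minimal plain-Java sketch of Jaccard distance over non-zero indices. The class `JaccardSketch` and its methods are hypothetical names for illustration, not Spark APIs, and the short arrays are made-up stand-ins for the real 30-dimensional vectors.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Spark.
class JaccardSketch {

    // Assumption: a vector stands for the set of indices of its non-zero entries.
    static Set<Integer> nonZeroIndices(double[] v) {
        Set<Integer> indices = new HashSet<>();
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {
                indices.add(i);
            }
        }
        return indices;
    }

    // Jaccard distance = 1 - |A intersect B| / |A union B|.
    static double jaccardDistance(double[] a, double[] b) {
        Set<Integer> setA = nonZeroIndices(a);
        Set<Integer> setB = nonZeroIndices(b);
        Set<Integer> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Integer> union = new HashSet<>(setA);
        union.addAll(setB);
        if (union.isEmpty()) {
            throw new IllegalArgumentException("both vectors are all zeros");
        }
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Dense vectors with different values but no zero entries:
        // both map to the index set {0, 1, 2}.
        double[] dense1 = {0.0058, -0.0014, 0.0079};
        double[] dense2 = {-0.0022, 0.0032, -0.0066};
        System.out.println(jaccardDistance(dense1, dense2)); // prints 0.0

        // Binary vectors: index sets {0, 1, 3} and {0, 2, 3}.
        double[] binary1 = {1.0, 1.0, 0.0, 1.0};
        double[] binary2 = {1.0, 0.0, 1.0, 1.0};
        System.out.println(jaccardDistance(binary1, binary2)); // prints 0.5
    }
}
```

In this sketch the actual values in the dense vectors never influence the result; only which positions are non-zero does. The binary pair shares 2 of 4 indices, giving a distance of 0.5.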