I've used LSH after running ALS in PySpark, and everything seemed to work fine until, while exploring the results, I accidentally noticed that some rows were missing. The implementation follows the Spark LSH documentation example: https://spark.apache.org/docs/latest/ml-features.html#tab_scala_28
When I specifically filter for the rows where idA == 1, I can find them. But when I do repartition(1).write.csv, or sort the table, all the rows with idA == 1 are gone. Can someone explain how that is possible? (A sketch of these checks follows the code snippet below.)
I'm using the Python API for Spark v2.2.0, with Python 3.6.

A bit of code:
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql.functions import col

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=10.0, numHashTables=3)
model = brp.fit(Pred_Factors)  # Pred_Factors has "id" and "features" columns

# Self-join: all pairs within Euclidean distance 10
table = model.approxSimilarityJoin(Pred_Factors, Pred_Factors, threshold=10.0, distCol="EuclideanDistance") \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("EuclideanDistance")).cache()
P.S. I've even written the result to CSV and searched it for these ids and their EuclideanDistance values - as described, without success. Far too many ids go missing for this to be a fluke (it's not just id = 1). Maybe I'm missing some specifics of the LSH algorithm, but I can't make sense of Spark's LSH behavior on my own.
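Roughly how I checked the scale of the loss (reusing the placeholder CSV path from above):

# Distinct ids in the cached table vs. in the file written from it
print(table.select("idA").distinct().count())
written = spark.read.csv("/tmp/lsh_table", header=True)
print(written.select("idA").distinct().count())  # noticeably smaller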