
I used LSH after the ALS algorithm in PySpark, and everything seemed to work fine until I accidentally noticed that some rows were missing while exploring the result. Everything was implemented following the example in the Spark LSH documentation: https://spark.apache.org/docs/latest/ml-features.html#tab_scala_28

When I specifically filter for the row where idA == 1, I can find it. But when I do repartition(1).write.csv, or sort the table, all the rows with idA == 1 are gone. Can someone explain how that is possible?

I'm using the Python API for Spark 2.2.0 with Python 3.6.

A bit of code:

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql.functions import col

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=10.0, numHashTables=3)
model = brp.fit(Pred_Factors)  # fit the LSH model on the factor vectors
table = model.approxSimilarityJoin(Pred_Factors, Pred_Factors, threshold=10.0, distCol="EuclideanDistance") \
            .select(col("datasetA.id").alias("idA"),
                    col("datasetB.id").alias("idB"),
                    col("EuclideanDistance")).cache()


P.S. I even wrote the table to CSV and searched for these ids and their EuclideanDistance values, also without success. There are far too many lost ids for this to be a coincidence (it's not only id = 1). Maybe I don't understand some specifics of the LSH algorithm, but I can't figure out the logic of Spark's LSH behavior on my own.
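The CSV check mentioned above was roughly the following (the output path is illustrative):

table.repartition(1).write.csv("lsh_table", header=True)
# searching the resulting file for idA == 1 finds nothing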

Ivan Shelonik

1 Answer


You got this problem because the join result is randomly partitioned: rows with the same idA end up scattered across partitions, so a partial view of the table can miss them. Either write the table partitioned by idA using partitionBy('idA'), or use table.orderBy('idA') to get all rows for an id together.
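A minimal sketch of both options, assuming the table DataFrame from the question (output paths are illustrative):

# option 1: partition the written output by idA so all rows for an id land together
table.write.partitionBy("idA").csv("lsh_by_idA", header=True)

# option 2: globally sort by idA before writing or inspecting
table.orderBy("idA").write.csv("lsh_sorted", header=True)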

Sahil Desai