
I used LSH after the ALS algorithm in PySpark, and everything seemed to work fine until I accidentally noticed that some rows were missing while exploring the result. Everything was implemented following the example in the Spark LSH documentation: https://spark.apache.org/docs/latest/ml-features.html#tab_scala_28

When I specifically filter for the row where idA == 1, I can find it. But when I do repartition(1).write.csv, or sort the table, all the rows with idA == 1 are gone. Can someone explain how that is possible?

I'm using the Python API for Spark 2.2.0 with Python 3.6.

A bit of code:

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql.functions import col

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=10.0, numHashTables=3)
model = brp.fit(Pred_Factors)  # fit the LSH model on the factor vectors
table = model.approxSimilarityJoin(Pred_Factors, Pred_Factors, threshold=10.0, distCol="EuclideanDistance") \
            .select(col("datasetA.id").alias("idA"),
                    col("datasetB.id").alias("idB"),
                    col("EuclideanDistance")).cache()


P.S. I even wrote the table to CSV and searched for these ids and their EuclideanDistance values, also without success. There are far too many lost ids for this to be a coincidence (it's not only id = 1). Maybe I don't understand some specifics of the LSH algorithm, but I can't figure out the logic of Spark's LSH behavior on my own.
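The CSV check mentioned above was roughly the following (the output path is illustrative):

table.repartition(1).write.csv("lsh_table", header=True)
# searching the resulting file for idA == 1 finds nothing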

Ivan Shelonik

1 Answer


You got this problem because the join result is randomly partitioned: rows with the same idA end up scattered across partitions, so a partial view of the table can miss them. Either write the table partitioned by idA using partitionBy('idA'), or use table.orderBy('idA') to get all rows for an id together.
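A minimal sketch of both options, assuming the table DataFrame from the question (output paths are illustrative):

# option 1: partition the written output by idA so all rows for an id land together
table.write.partitionBy("idA").csv("lsh_by_idA", header=True)

# option 2: globally sort by idA before writing or inspecting
table.orderBy("idA").write.csv("lsh_sorted", header=True)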

Sahil Desai