PySpark ApproxSimilarityJoin Missing Results

Question

I am trying to do a similarity join between two dataframes by applying MinHashLSH on the bigrams of metaphone representations of names. This works well in most cases but does not appear to handle short substring cases.

For example, I want to look up names with metaphones similar to "LTSNKK". The result of the approx similarity join looks like this:

| Metaphone        | Confidence |  
|------------------|------------|  
| LTSNKK           | 0.000      |  
| MLTSNKK          | 0.166      |  
| LTSNK            | 0.199      |  
| PLTSSNKK         | 0.285      |  
| LTSNKT           | 0.333      |  
| AFLNKNKPRSNLTRNR | 0.812      |

However, there is another name that does not get caught by the join, "LTS". I expected that "LTS" would appear with a confidence somewhere around 0.2, but that is not happening.
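To make the expected value concrete, here is a minimal sketch of the exact Jaccard distance that the join approximates. It assumes the features are the distinct character bigrams of each metaphone string; the helper names `bigrams` and `jaccard_distance` are made up for illustration and are not part of PySpark.

```python
def bigrams(s):
    """Return the set of distinct character bigrams in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over the bigram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    if not union:
        # Both strings are too short to have any bigram; treat as identical.
        return 0.0
    return 1.0 - len(x & y) / len(union)


query = "LTSNKK"
names = ["LTSNKK", "MLTSNKK", "LTSNK", "PLTSSNKK", "LTSNKT",
         "AFLNKNKPRSNLTRNR", "LTS"]
for name in names:
    print(name, jaccard_distance(query, name))
```

Under that assumption the six names in the table come out at the listed values (up to truncation in the last digit), and "LTS" shares 2 of 5 distinct bigrams with "LTSNKK", giving a distance of 0.6 rather than roughly 0.2. That is still below the 1.0 threshold, so the threshold alone would not explain the missing row.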

My join is configured with a max confidence of 1.0; raising the limit to higher values has had no effect.

approxSimilarityJoin(hashedInputFrame, hashedReferenceFrame, 1.0, "confidence")

Is there some hidden limit in PySpark's approx similarity join that would cause it to ignore "LTS" but consider "LTSNK"?
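One thing that may be relevant, offered as a sketch and not a confirmed diagnosis: the join is approximate, so a pair is only compared if the two rows collide in at least one hash table. Assuming each MinHash table collides with probability equal to the Jaccard similarity of the two sets, and that a pair is kept as a candidate when any one table matches, the chance of a pair being considered at all can be estimated as below. The function name `candidate_probability` is hypothetical, the similarity of 0.4 assumes "LTS" and "LTSNKK" share 2 of 5 distinct character bigrams, and the number of hash tables used in my model is not shown above.

```python
def candidate_probability(similarity, num_hash_tables):
    """Chance that two sets collide in at least one of n independent MinHash tables."""
    return 1.0 - (1.0 - similarity) ** num_hash_tables


# "LTS" vs "LTSNKK": assumed Jaccard similarity 2/5 = 0.4
# "LTSNK" vs "LTSNKK": assumed Jaccard similarity 4/5 = 0.8
for n in (1, 3, 5, 10):
    print(n, candidate_probability(0.4, n), candidate_probability(0.8, n))
```

With a single hash table, a pair at similarity 0.4 would be missed more often than it is found, while a pair at similarity 0.8 would usually survive. If that model holds, a larger `numHashTables` on `MinHashLSH` would raise the odds of "LTS" appearing, at the cost of more computation.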

Why "confidence"? It's the distance (of the two strings in your case). https://doc.lucidworks.com/fusion-server/4.0/spark-guide/2.2/api/java/org/apache/spark/ml/feature/MinHashLSHModel.html#approxSimilarityJoin-org.apache.spark.sql.Dataset-org.apache.spark.sql.Dataset-double-java.lang.String- — Giovanni, Nov 21 '19 at 11:31
That's an artifact of the examples I found online. A better name would be JaccardDistance, but I wasn't as aware of that at the time. — Daniel Bishop, Nov 22 '19 at 17:46
