PySpark Cosine Similarity between two vectors of TF-IDF values the Cosine Similarity using SparseMatrix + koalas or Pandas API on Spark

Question

I do try to implement this Name Matching Cosine Similarity approach/functions get_matches_df in pyspark and pandas_on_spark(koalas) and struggling with optimizing this function (I do try to avoid conversion toPandas() for dataframes because will overload driver so I want to optimize this function and to scale it, so basically a batch approach will work perfect as in this example, or use pandas_udfs or simple UDFs that takes 1 vector and 2 dataframes:

>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
...     return pdf[pdf.a > 1]  # allow arbitrary length
...
>>> psdf.pandas_on_spark.apply_batch(pandas_plus)

this is the function I do work on optimizing (everything else I converted and created custom tfidfvectorizer, scaling cosine, pyspark sparsematrix generator and all I have left to optimize is this part (because uses loc and not sure how does work, I don't mind to have it behave as pandas aka all dataframe to driver but ideally will be

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size
    
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    
    return pd.DataFrame({'left_side': left_side,
                          'right_side': right_side,
                           'similairity': similairity})

Are pandas udf a possible solution to your problem? I give you some potentially useful links: [1](https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), [2](https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/udf-python-pandas), [3](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html) — Ric S, Dec 16 '21 at 08:22
@RicS, thank you, I do have it done using koalas and works as expected but I do force in “batch” format for name_vector for very large datasets due the “pandas format” index. Didn’t considered pandasUDFs due index (currently I get name_vector with index made as column and joining them but that is a behavior pure pandas and not an efficient for large data, unless I take batches and distribute matches). — n1tk, Dec 16 '21 at 18:32

PySpark Cosine Similarity between two vectors of TF-IDF values the Cosine Similarity using SparseMatrix + koalas or Pandas API on Spark

0 Answers0