
I am using PySpark 2.4.0.

I have a list of floats:

[0.6067762380157111,
 0.4595708424660512,
 0.20093402090021173,
 0.5736288504883545,
 0.46593043507957116,
 0.5734057882715504,
 0.6940067723754003,
 0.30921836906829625,
 0.768595041322314...]

which is the result of a phonetic transcription of a list of words.

I also have multiple PySpark DataFrames like this one:

+------------------------+--------------------------+---------------------------------------------+
|com                     |split                     |phoned                                       |
+------------------------+--------------------------+---------------------------------------------+
|sans option             |[sans, option]            |[0.6832268970698724, 0.6248699945979845]     |
|                        |[]                        |[]                                           |
|fermer l hiver          |[fermer, l, hiver]        |[0.3154196179245309, 0.5, 0.3828829842720629]|
+------------------------+--------------------------+---------------------------------------------+

The idea is, for every value in the "phoned" arrays across all of my DataFrames, to find whether that value is "close" to a value in the list (i.e. within a given threshold, which I can change).

So for every DataFrame: loop through the values in the "phoned" column, loop through each array, compute the difference between each value and every element of the list, and whenever a difference is below the threshold, put the corresponding word in another column. If possible, I would like to get all the values that fall within the given threshold.

For example, if my list of words were this one:

["sans",
 "opssion",
 "test",
 "ferme",...]

Its phonetic transcription would be:

[0.6832268970698724,
0.625416705239052,
0.7390120210368145,
0.3154165838771569,...]

And for a threshold of 0.01, the result I would like is:

+------------------------+--------------------------+---------------------------------------------+--------------+
|com                     |split                     |phoned                                       |result        |
+------------------------+--------------------------+---------------------------------------------+--------------+
|sans option             |[sans, option]            |[0.6832268970698724, 0.6248699945979845]     |[sans, opssion]|
|                        |[]                        |[]                                           |[]            |
|fermer l hiver          |[fermer, l, hiver]        |[0.3154196179245309, 0.5, 0.3828829842720629]|[ferme]       |
+------------------------+--------------------------+---------------------------------------------+--------------+
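Row by row, the desired mapping can be sketched in plain Python (the names `match_words`, `ref_words` and `ref_phoned` below are illustrative, not from any existing code):

```python
# Illustrative sketch: for each phonetic value of a row, collect every
# word from the reference list whose transcription is within
# `threshold` of that value.
def match_words(phoned, ref_words, ref_phoned, threshold=0.01):
    result = []
    for value in phoned:
        for word, ref_value in zip(ref_words, ref_phoned):
            if abs(value - ref_value) <= threshold:
                result.append(word)
    return result

ref_words = ["sans", "opssion", "test", "ferme"]
ref_phoned = [0.6832268970698724, 0.625416705239052,
              0.7390120210368145, 0.3154165838771569]

print(match_words([0.6832268970698724, 0.6248699945979845],
                  ref_words, ref_phoned))  # ['sans', 'opssion']
print(match_words([0.3154196179245309, 0.5, 0.3828829842720629],
                  ref_words, ref_phoned))  # ['ferme']
```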

I've tried my way around with some UDFs, but I couldn't find a solution that gives this kind of result. I have a dozen DataFrames, some with multiple "com" columns and ~1 million records in some of them, so I can't handle this in Pandas.

Thanks !

1 Answer


Your DataFrame (df_1):

+--------------+------------------+---------------------------------------------+
|com           |split             |phoned                                       |
+--------------+------------------+---------------------------------------------+
|sans option   |[sans, option]    |[0.6832268970698724, 0.6248699945979845]     |
|              |[]                |[]                                           |
|fermer l hiver|[fermer, l, hiver]|[0.3154196179245309, 0.5, 0.3828829842720629]|
+--------------+------------------+---------------------------------------------+

Your constants:

threshold = 0.01

match = [0.6067762380157111,0.4595708424660512,
         0.20093402090021173,0.5736288504883545,
         0.46593043507957116,0.5734057882715504,
         0.6940067723754003,0.30921836906829625,0.768595041322314]

Create the UDF:

def between_threshold(element, threshold): 
    if element is None:
        return False
    return any([abs(element - match_element) <= threshold for match_element in match])
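As a quick sanity check outside Spark (re-using the `match` list and threshold above), the helper already behaves the way the final output suggests:

```python
# Self-contained sanity check of the helper; no Spark needed.
match = [0.6067762380157111, 0.4595708424660512,
         0.20093402090021173, 0.5736288504883545,
         0.46593043507957116, 0.5734057882715504,
         0.6940067723754003, 0.30921836906829625, 0.768595041322314]

def between_threshold(element, threshold):
    if element is None:
        return False
    return any([abs(element - match_element) <= threshold for match_element in match])

print(between_threshold(0.3154196179245309, 0.01))  # True  (close to 0.30921836...)
print(between_threshold(0.6832268970698724, 0.01))  # False (nearest value is ~0.0108 away)
print(between_threshold(None, 0.01))                # False
```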

Import the necessary packages and register the UDF:

from pyspark.sql.window import Window
from pyspark.sql.functions import lit, col, udf, row_number, posexplode_outer, collect_list
from pyspark.sql.types import BooleanType

between_threshold_udf = udf(between_threshold, BooleanType())

1. Create a unique row identifier:

row_window_spec = Window.orderBy(lit(1))

df_2 = df_1.withColumn("row_num", row_number().over(row_window_spec))
2. Use the UDF to flag whether each element in "phoned" is within the threshold:
df_3 = df_2.select("row_num", posexplode_outer("split").alias("index", "split"), "phoned")

df_4 = df_3.withColumn("phoned", col("phoned")[col("index")])

df_5 = df_4.withColumn("between_threshold", between_threshold_udf("phoned", lit(threshold)))
3. Filter and join with the original DataFrame:
df_6 = df_5.filter(col("between_threshold") == True) \
        .groupBy("row_num") \
        .agg(collect_list("split").alias("result"))

df_2.join(df_6, "row_num", "left").drop("row_num").show(truncate=False)

Output

+--------------+------------------+---------------------------------------------+--------+
|com           |split             |phoned                                       |result  |
+--------------+------------------+---------------------------------------------+--------+
|sans option   |[sans, option]    |[0.6832268970698724, 0.6248699945979845]     |null    |
|              |[]                |[]                                           |null    |
|fermer l hiver|[fermer, l, hiver]|[0.3154196179245309, 0.5, 0.3828829842720629]|[fermer]|
+--------------+------------------+---------------------------------------------+--------+

Note: in this dataset, only fermer matches within the 0.01 threshold.

arudsekaberne
  • This works perfectly! I had to create the UDF as between_threshold_udf = F.udf(between_threshold, BooleanType()), but that's it. Also, the row_num computed on every row seemed to change each time I ran my cell, so the last join wasn't giving the right values. Instead I used an ID column I already had on the table and the problem went away. Thanks a lot! – Antoine Fernandes Mar 29 '23 at 08:56
  • Yes, I added `row_num` to identify each row; if you already have one you can use it instead :) – arudsekaberne Mar 29 '23 at 09:33
  • Also, you can improve the UDF by breaking out of the loop as soon as you find a matching number within the threshold limit. – arudsekaberne Mar 29 '23 at 09:34
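The early-exit improvement from the last comment can be sketched as follows: `any` over a generator expression stops at the first value within the threshold, whereas the list comprehension in the original builds every comparison first.

```python
match = [0.6067762380157111, 0.4595708424660512,
         0.20093402090021173, 0.5736288504883545,
         0.46593043507957116, 0.5734057882715504,
         0.6940067723754003, 0.30921836906829625, 0.768595041322314]

def between_threshold(element, threshold):
    if element is None:
        return False
    # Generator expression: `any` short-circuits on the first match,
    # so remaining reference values are never compared.
    return any(abs(element - match_element) <= threshold
               for match_element in match)
```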