
I seek to standardize header names of my DataFrames given a reference table.

My reference table is a DataFrame with variables in rows, and the standard name plus all possible variant names as columns:

+-------------+---------+---------+
|Standard_name|Variant_1|Variant_2|
+-------------+---------+---------+
|     Pressure|    Press|  Press_1|
|        Speed|   Speed_|     Rate|
+-------------+---------+---------+

Say I have a DataFrame of data with these column names:

['Pressure', 'Rate', 'Altitude']

I want to look up each of these variable names in my reference DataFrame and return the corresponding Standard_name if it exists, or keep the original variable name if it is not yet referenced in the table.

Thus, the expected outcome for the dummy example above should be:

['Pressure', 'Speed', 'Altitude']

This is easy to do in regular Python pandas, but I have no idea how to do it in Spark, where you are not supposed to think in terms of row indices.
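For reference, here is a minimal pandas sketch of the lookup I mean (the variable names are just illustrative):

import pandas as pd

# Dummy reference table and data headers from the example above
ref = pd.DataFrame({'Standard_name': ['Pressure', 'Speed'],
                    'Variant_1': ['Press', 'Speed_'],
                    'Variant_2': ['Press_1', 'Rate']})
headers = ['Pressure', 'Rate', 'Altitude']

# Build a variant -> standard lookup, then fall back to the original name
lookup = {row[v]: row['Standard_name']
          for _, row in ref.iterrows()
          for v in ['Variant_1', 'Variant_2']}
print([lookup.get(h, h) for h in headers])  # ['Pressure', 'Speed', 'Altitude']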

Many thanks in advance for the help.

Benjamin

1 Answer


Although I agree with mayank agrawal's comment above, I tried to solve this using only DataFrame transformations.

I adapted this solution to extract all pairwise correspondences between each variant and the standard name into one big dictionary. I then mapped that dictionary onto the dataset's headers to create a new column of standardized headers.

Thus, the solution is:

from pyspark.sql.types import MapType, StringType
import pyspark.sql.functions as F
from itertools import chain


# Merge a list of maps into a single map
key_value_map = F.udf(lambda maps: {key: f[key] for f in maps for key in f},
                      MapType(StringType(), StringType()))


# Build one (variant -> Standard_name) map per row of the reference table,
# then collapse all row maps into a single dictionary.
# Note: columns[1:] assumes the layout shown above, i.e. Standard_name first
# and the variant columns after it.
map_df = (variable_df
    .agg(F.collect_list(F.create_map(list(chain.from_iterable(
        [[key, 'Standard_name'] for key in variable_df.columns[1:]])))).alias('maps'))
    .agg(F.collect_list(key_value_map('maps'))))

result_dict = map_df.collect()
ref_dict = result_dict[0][0][0]

# header_df holds the current column names in a column 'old_header';
# look each one up in the dictionary, falling back to the original name.
corresp_df = (header_df
    .withColumn('new_header',
                F.create_map([F.lit(x) for x in chain(*ref_dict.items())])
                 .getItem(F.col('old_header')))
    .withColumn('new_header', F.coalesce(F.col('new_header'), F.col('old_header'))))

new_columns = corresp_df.select('new_header').rdd.flatMap(lambda row: row).collect()

renamed_df = data_df.toDF(*new_columns)
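To check this end to end, here is a minimal sketch of the dummy example from the question (it assumes a SparkSession named spark; variable_df, header_df and data_df match the names used in the snippet above):

# Minimal setup for the dummy example (assumes a SparkSession named `spark`)
variable_df = spark.createDataFrame(
    [('Pressure', 'Press', 'Press_1'),
     ('Speed', 'Speed_', 'Rate')],
    ['Standard_name', 'Variant_1', 'Variant_2'])

data_df = spark.createDataFrame(
    [(1.0, 2.0, 3.0)],
    ['Pressure', 'Rate', 'Altitude'])

# One row per current column name, in column order
header_df = spark.createDataFrame(
    [(c,) for c in data_df.columns], ['old_header'])

# After running the snippet above:
# renamed_df.columns == ['Pressure', 'Speed', 'Altitude']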

References:

Dataframe pyspark to dict

pyspark create new column with mapping from a dict
