
I seek to standardize header names of my DataFrames given a reference table.

My reference table is a DataFrame with variables in rows, and the standard name plus all possible variant names as columns:

+-------------+---------+---------+
|Standard_name|Variant_1|Variant_2|
+-------------+---------+---------+
|     Pressure|    Press|  Press_1|
|        Speed|   Speed_|     Rate|
+-------------+---------+---------+

Say I have a DataFrame of data with these column names:

['Pressure', 'Rate', 'Altitude']

I want to look up each of these variable names in my reference DataFrame and return the corresponding Standard_name if it exists, or keep the original variable name if it is not yet referenced in the table.

Thus, the expected outcome for the dummy example above should be:

['Pressure', 'Speed', 'Altitude']

This is easy to do in regular Python pandas, but I have no idea how to do it in Spark, where you are not supposed to think in terms of row indices.
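For reference, here is a minimal pandas sketch of the lookup I mean (the variable names are just illustrative):

import pandas as pd

# Dummy reference table and data headers from the example above
ref = pd.DataFrame({'Standard_name': ['Pressure', 'Speed'],
                    'Variant_1': ['Press', 'Speed_'],
                    'Variant_2': ['Press_1', 'Rate']})
headers = ['Pressure', 'Rate', 'Altitude']

# Build a variant -> standard lookup, then fall back to the original name
lookup = {row[v]: row['Standard_name']
          for _, row in ref.iterrows()
          for v in ['Variant_1', 'Variant_2']}
print([lookup.get(h, h) for h in headers])  # ['Pressure', 'Speed', 'Altitude']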

Many thanks in advance for the help.

Benjamin

1 Answer


Although I agree with mayank agrawal's comment above, I tried to solve this using only DataFrame transformations.

I adapted this solution to extract all pairwise correspondences between each variant and the standard name into one big dictionary. I then mapped that dictionary onto the dataset's headers to create a new column of standardized headers.

Thus, the solution is:

from pyspark.sql.types import MapType, StringType
import pyspark.sql.functions as F
from itertools import chain


# Merge a list of maps into a single map
key_value_map = F.udf(lambda maps: {key: f[key] for f in maps for key in f},
                      MapType(StringType(), StringType()))


# Build one (variant -> Standard_name) map per row of the reference table,
# then collapse all row maps into a single dictionary.
# Note: columns[1:] assumes the layout shown above, i.e. Standard_name first
# and the variant columns after it.
map_df = (variable_df
    .agg(F.collect_list(F.create_map(list(chain.from_iterable(
        [[key, 'Standard_name'] for key in variable_df.columns[1:]])))).alias('maps'))
    .agg(F.collect_list(key_value_map('maps'))))

result_dict = map_df.collect()
ref_dict = result_dict[0][0][0]

# header_df holds the current column names in a column 'old_header';
# look each one up in the dictionary, falling back to the original name.
corresp_df = (header_df
    .withColumn('new_header',
                F.create_map([F.lit(x) for x in chain(*ref_dict.items())])
                 .getItem(F.col('old_header')))
    .withColumn('new_header', F.coalesce(F.col('new_header'), F.col('old_header'))))

new_columns = corresp_df.select('new_header').rdd.flatMap(lambda row: row).collect()

renamed_df = data_df.toDF(*new_columns)
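To check this end to end, here is a minimal sketch of the dummy example from the question (it assumes a SparkSession named spark; variable_df, header_df and data_df match the names used in the snippet above):

# Minimal setup for the dummy example (assumes a SparkSession named `spark`)
variable_df = spark.createDataFrame(
    [('Pressure', 'Press', 'Press_1'),
     ('Speed', 'Speed_', 'Rate')],
    ['Standard_name', 'Variant_1', 'Variant_2'])

data_df = spark.createDataFrame(
    [(1.0, 2.0, 3.0)],
    ['Pressure', 'Rate', 'Altitude'])

# One row per current column name, in column order
header_df = spark.createDataFrame(
    [(c,) for c in data_df.columns], ['old_header'])

# After running the snippet above:
# renamed_df.columns == ['Pressure', 'Speed', 'Altitude']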

References:

Dataframe pyspark to dict

pyspark create new column with mapping from a dict
