0

I use this function to find if the pattern is in the column and replace it with the replacement but it does not give.

Can someone tell where I make the mistakes?

patterns = [
    '15/19',
    '14/11',
    'HTP',
    'VTP'
]

replacements = [
    'S15/19',
    'S11/14',
    'HTP',
    'VTP'
]
def formate_column(output_column, df, patterns, replacements):
    for p, r in zip(patterns, replacements):
        df = (
            df.withColumn(output_column, F.when(F.col("column").contains(p), F.regexp_replace(F.col("column"), p, r)).otherwise(F.col("column")))
            
    )
    return df
ZygD
  • 22,092
  • 39
  • 79
  • 102
elokema
  • 107
  • 6
  • see [`.rlike()`](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/api/pyspark.sql.Column.rlike.html#pyspark-sql-column-rlike) if needed – samkart Oct 04 '22 at 18:39
  • I don't think you need `when` and `contains`, `regex_replace` already have that (replaced when matched). However you need to make sure to write in regex (https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.regexp_replace.html). – Emma Oct 04 '22 at 18:59

1 Answers1

1

You could move your both lists into one dictionary. Then, looping will be easy and efficient. You don't need when/otherwise. It's enough to use regexp_replace, because if a match is not found, nothing is being replaced in the string.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame([("xx_15/19_",), ("_14/11111",), ("no",)], ["column"])
df.show()
# +---------+
# |   column|
# +---------+
# |xx_15/19_|
# |_14/11111|
# |       no|
# +---------+

Script:

patterns = {
    r'15/19'  :'S15/19',
    r'14/11'  :'S11/14',
    r'HTP'    :'HTP',
    r'VTP'    :'VTP'
}

for k, v in patterns.items():
    df = df.withColumn("column", F.regexp_replace("column", k, v))
df.show()
# +----------+
# |    column|
# +----------+
# |xx_S15/19_|
# |_S11/14111|
# |        no|
# +----------+

df.explain()
# == Physical Plan ==
# *(1) Project [regexp_replace(regexp_replace(regexp_replace(regexp_replace(column#619, 15/19, S15/19, 1), 14/11, S11/14, 1), HTP, HTP, 1), VTP, VTP, 1) AS column#632]
# +- *(1) Scan ExistingRDD[column#619]
ZygD
  • 22,092
  • 39
  • 79
  • 102