71

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example this dataframe:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln

Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln
Luke
  • 6,699
  • 13
  • 50
  • 88

2 Answers2

170

For Spark 1.5 or later, you can use the functions package:

from pyspark.sql.functions import *
newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

  • The function withColumn is called to add (or replace, if the name exists) a column to the data frame.
  • The function regexp_replace will generate a new column by replacing all substrings that match the pattern.
CheTesta
  • 577
  • 4
  • 17
Daniel de Paula
  • 17,362
  • 9
  • 71
  • 72
  • 32
    Just remember that the first parameter of regexp_replace refers to the column being changed, the second is the regex to find and the last is how to replace it. – lfvv Aug 10 '18 at 11:12
  • can I use regexp_replace inside a pipeline? Thanks – Kailegh Nov 08 '18 at 12:16
  • 4
    Can we change more than one item in this code? –  Jan 21 '21 at 10:11
  • @elham you can change any value that fits a regexp **expression** for one column using this function: https://spark.apache.org/docs/2.2.0/api/R/regexp_replace.html – gbeaven Mar 08 '21 at 20:46
  • How does it work for subtracting two [string columns within a single dataframe](https://stackoverflow.com/q/69540138/10452700) in PySpark? – Mario Oct 12 '21 at 12:04
  • 1
    Can this be adapted to replace only if entire string is matched and not substring? i.e., if I wanted to replace 'lane' by 'ln' but keep 'skylane' unchanged? – GreenEye Nov 08 '21 at 17:48
8

For scala

import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.functions.col
data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))
loneStar
  • 3,780
  • 23
  • 40