PySpark find if pattern in one column is present in another column

Question

I have two PySpark DataFrames. One contains a FullAddress field (say col1), and the other contains the name of a city/town/suburb in one of its columns (say col2). I want to compare col2 with col1 and return col2 if there is a match.

Additionally, the suburb column can hold a list of suburb names.

DataFrame that contains the postcode, district and suburb names:

+--------+--------+----------------------------------------------------------+
|Postcode|District|City/ Town/ Suburb                                        |
+--------+--------+----------------------------------------------------------+
|2000    |Sydney  |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks  |
|2001    |Sydney  |Sydney                                                    |
|2113    |Sydney  |North Ryde                                                |
+--------+--------+----------------------------------------------------------+

DataFrame that contains the full address:

+-----------------------------------------------------------+
|FullAddress                                                |
+-----------------------------------------------------------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia               |
| HAY STREET HAYMARKET 2000, NSW, Australia                 |
| SMART STREET FAIRFIELD 2165, NSW, Australia               |
|CLARENCE STREET SYDNEY 2000, NSW, Australia                |
+-----------------------------------------------------------+

I would like to have something like this:

+-----------------------------------------------------------+------------+
|FullAddress                                                |suburb      |
+-----------------------------------------------------------+------------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia               |NORTH RYDE  |
| HAY STREET HAYMARKET 2000, NSW, Australia                 |HAYMARKET   |
| SMART STREET FAIRFIELD 2165, NSW, Australia               |NULL        |
|CLARENCE STREET SYDNEY 2000, NSW, Australia                |SYDNEY      |
+-----------------------------------------------------------+------------+

Do you need to join the dataframes on Postcode? – Giorgos Myrianthous, Mar 30 '19 at 13:56

score 1 · Accepted Answer · answered Mar 30 '19 at 21:59

There are two DataFrames -

DataFrame 1: DataFrame containing the complete address.

DataFrame 2: DataFrame containing the base data - Postcode, District & City / Town / Suburb.

The aim is to extract the appropriate suburb for each row of DataFrame 1 from DataFrame 2. Although the OP has not explicitly specified the key on which to join the two DataFrames, Postcode seems to be the only reasonable choice.

# Importing requisite functions
from pyspark.sql.functions import col,regexp_extract,split,udf
from pyspark.sql.types import StringType

Let's create DataFrame 1 as df. We need to extract the Postcode from this DataFrame. In Australia, all postcodes are 4 digits long, so we use regexp_extract() to extract a 4-digit number from the string column.

df = sqlContext.createDataFrame([('BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia ',),
                                 ('HAY STREET HAYMARKET 2000, NSW, Australia',),
                                 ('SMART STREET FAIRFIELD 2165, NSW, Australia',),
                                 ('CLARENCE STREET SYDNEY 2000, NSW, Australia',)],
                                 ('FullAddress',))
df = df.withColumn('Postcode', regexp_extract('FullAddress', "(\\d{4})" , 1 ))
df.show(truncate=False)
+---------------------------------------------+--------+
|FullAddress                                  |Postcode|
+---------------------------------------------+--------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |2113    |
|HAY STREET HAYMARKET 2000, NSW, Australia    |2000    |
|SMART STREET FAIRFIELD 2165, NSW, Australia  |2165    |
|CLARENCE STREET SYDNEY 2000, NSW, Australia  |2000    |
+---------------------------------------------+--------+
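One caveat with the pattern above: (\d{4}) returns the first run of four digits, so an address that starts with a four-digit street number would yield the wrong value. The plain-Python sketch below (no Spark session needed) illustrates the difference. It assumes the postcode is always followed directly by a comma, as in the sample data; the helper name extract_postcode and the address with a street number are made up for illustration.

```python
import re

# Hypothetical helper, for illustration only. It mirrors regexp_extract(..., 1),
# which returns an empty string when the pattern does not match.
def extract_postcode(address, pattern):
    match = re.search(pattern, address)
    return match.group(1) if match else ""

FIRST_FOUR_DIGITS = r"(\d{4})"    # pattern used in the answer
BEFORE_COMMA = r"\b(\d{4})\s*,"   # assumption: the postcode directly precedes a comma

sample = "HAY STREET HAYMARKET 2000, NSW, Australia"
with_number = "1234 HAY STREET HAYMARKET 2000, NSW, Australia"  # made-up address

print(extract_postcode(sample, FIRST_FOUR_DIGITS))       # 2000
print(extract_postcode(with_number, FIRST_FOUR_DIGITS))  # 1234
print(extract_postcode(with_number, BEFORE_COMMA))       # 2000
```

Spark's regexp_extract() uses Java regular expressions, which also support \b, \d and \s, so the stricter pattern should carry over unchanged if the data calls for it.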

Now that we have extracted the Postcode, we have the key on which to join the two DataFrames. Let's create DataFrame 2, from which we need to extract the respective suburb.

df_City_Town_Suburb = sqlContext.createDataFrame([(2000,'Sydney','Dawes Point, Haymarket, Millers Point, Sydney, The Rocks'),
                                             (2001,'Sydney','Sydney'),(2113,'Sydney','North Ryde')],
                                             ('Postcode','District','City_Town_Suburb'))
df_City_Town_Suburb.show(truncate=False)

+--------+--------+--------------------------------------------------------+
|Postcode|District|City_Town_Suburb                                        |
+--------+--------+--------------------------------------------------------+
|2000    |Sydney  |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2001    |Sydney  |Sydney                                                  |
|2113    |Sydney  |North Ryde                                              |
+--------+--------+--------------------------------------------------------+

Joining the two DataFrames with left join -

df = df.join(df_City_Town_Suburb.select('Postcode','City_Town_Suburb'), ['Postcode'],how='left')
df.show(truncate=False)
+--------+---------------------------------------------+--------------------------------------------------------+
|Postcode|FullAddress                                  |City_Town_Suburb                                        |
+--------+---------------------------------------------+--------------------------------------------------------+
|2113    |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |North Ryde                                              |
|2165    |SMART STREET FAIRFIELD 2165, NSW, Australia  |null                                                    |
|2000    |HAY STREET HAYMARKET 2000, NSW, Australia    |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2000    |CLARENCE STREET SYDNEY 2000, NSW, Australia  |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
+--------+---------------------------------------------+--------------------------------------------------------+

Splitting the column City_Town_Suburb into an array using the split() function (note the raw string, which keeps \s from being treated as an invalid Python escape sequence) -

df = df.select('Postcode','FullAddress',split(col("City_Town_Suburb"), r",\s*").alias("City_Town_Suburb"))
df.show(truncate=False)
+--------+---------------------------------------------+----------------------------------------------------------+
|Postcode|FullAddress                                  |City_Town_Suburb                                          |
+--------+---------------------------------------------+----------------------------------------------------------+
|2113    |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |[North Ryde]                                              |
|2165    |SMART STREET FAIRFIELD 2165, NSW, Australia  |null                                                      |
|2000    |HAY STREET HAYMARKET 2000, NSW, Australia    |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
|2000    |CLARENCE STREET SYDNEY 2000, NSW, Australia  |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
+--------+---------------------------------------------+----------------------------------------------------------+
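The delimiter pattern ,\s* consumes the comma together with any whitespace after it, so the array elements come out without leading spaces. A quick way to check the pattern outside Spark is Python's re.split, which behaves the same way for this input (the variable name is illustrative):

```python
import re

suburbs = "Dawes Point, Haymarket, Millers Point, Sydney, The Rocks"

# Comma plus optional whitespace as the delimiter: elements come out trimmed.
print(re.split(r",\s*", suburbs))

# Splitting on the bare comma leaves a leading space on every element but the first.
print(suburbs.split(","))
```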

Finally, we create a UDF that checks whether each element of the array City_Town_Suburb occurs in the column FullAddress. If a match is found, it is returned immediately; otherwise None is returned.

def suburb(FullAddress, City_Town_Suburb):
    # Guard against rows without an array (no join match) or without an address;
    # otherwise iterating or searching would raise an error.
    if FullAddress is None or City_Town_Suburb is None:
        return None
    # Check each array element against 'FullAddress' (case-insensitively)
    # and return the first match immediately.
    address = FullAddress.upper()
    for sub in City_Town_Suburb:
        candidate = sub.strip().upper()
        if candidate in address:
            return candidate
    return None

suburb_udf = udf(suburb, StringType())

Applying this UDF -

df = df.withColumn('suburb', suburb_udf(col('FullAddress'),col('City_Town_Suburb'))).drop('City_Town_Suburb')
df.show(truncate=False)
+--------+---------------------------------------------+----------+
|Postcode|FullAddress                                  |suburb    |
+--------+---------------------------------------------+----------+
|2113    |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE|
|2165    |SMART STREET FAIRFIELD 2165, NSW, Australia  |null      |
|2000    |HAY STREET HAYMARKET 2000, NSW, Australia    |HAYMARKET |
|2000    |CLARENCE STREET SYDNEY 2000, NSW, Australia  |SYDNEY    |
+--------+---------------------------------------------+----------+
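Since the UDF wraps an ordinary Python function, the matching logic can be exercised without a Spark session. The sketch below restates that logic under a different, hypothetical name (first_matching_suburb) so that it is self-contained. It also shows one limitation of plain substring matching: the first suburb in list order wins, even when the hit is actually part of a street name. The last address is made up to demonstrate this.

```python
def first_matching_suburb(full_address, suburbs):
    """Return the first suburb (upper-cased) that occurs in full_address, else None."""
    if full_address is None or suburbs is None:
        return None
    address = full_address.upper()
    for sub in suburbs:
        candidate = sub.strip().upper()
        # Skip empty entries, which would otherwise match every address.
        if candidate and candidate in address:
            return candidate
    return None

sydney_2000 = ["Dawes Point", "Haymarket", "Millers Point", "Sydney", "The Rocks"]

# Matches the expected output for the sample rows.
print(first_matching_suburb("HAY STREET HAYMARKET 2000, NSW, Australia", sydney_2000))    # HAYMARKET
print(first_matching_suburb("CLARENCE STREET SYDNEY 2000, NSW, Australia", sydney_2000))  # SYDNEY
# No array after the left join (postcode 2165 has no entry).
print(first_matching_suburb("SMART STREET FAIRFIELD 2165, NSW, Australia", None))         # None
# Made-up address: the street name contains a suburb that precedes SYDNEY in the list.
print(first_matching_suburb("HAYMARKET LANE SYDNEY 2000, NSW, Australia", sydney_2000))   # HAYMARKET
```

If such collisions matter for the real data, one option would be to prefer the match that sits closest to the postcode, but that goes beyond what the question asks for.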
