There are two DataFrames:
DataFrame 1: a DataFrame containing the complete address.
DataFrame 2: a DataFrame containing the base data: Postcode, District and City / Town / Suburb.
The aim is to extract the appropriate suburb for DataFrame 1 from DataFrame 2. The OP has not explicitly specified the key on which to join the two DataFrames, but Postcode seems to be the only reasonable choice.
# Importing requisite functions
from pyspark.sql.functions import col,regexp_extract,split,udf
from pyspark.sql.types import StringType
Let's create DataFrame 1 as df. In this DataFrame we need to extract the Postcode. In Australia, all postcodes are 4 digits long, so we use regexp_extract() to extract a 4-digit number from the string column.
df = sqlContext.createDataFrame([('BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia ',),
('HAY STREET HAYMARKET 2000, NSW, Australia',),
('SMART STREET FAIRFIELD 2165, NSW, Australia',),
('CLARENCE STREET SYDNEY 2000, NSW, Australia',)],
('FullAddress',))
df = df.withColumn('Postcode', regexp_extract('FullAddress', "(\\d{4})" , 1 ))
df.show(truncate=False)
+---------------------------------------------+--------+
|FullAddress |Postcode|
+---------------------------------------------+--------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |2113 |
|HAY STREET HAYMARKET 2000, NSW, Australia |2000 |
|SMART STREET FAIRFIELD 2165, NSW, Australia |2165 |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |2000 |
+---------------------------------------------+--------+
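One caveat with the pattern above: it captures the first run of four digits, so an address that starts with a 4- or 5-digit street number would yield the wrong Postcode. The following is a minimal plain-Python sketch of a stricter pattern, assuming (as in the sample data) that the Postcode is always followed by a comma. The helper name extract_postcode is hypothetical and only used for illustration. The same pattern should also be usable with regexp_extract(), but that is not tested here.

```python
import re

# Hypothetical helper: a standalone 4-digit number that is directly
# followed by a comma (assumption taken from the sample addresses).
POSTCODE_PATTERN = r"\b(\d{4})\s*,"

def extract_postcode(full_address):
    match = re.search(POSTCODE_PATTERN, full_address)
    # Mirror regexp_extract(), which returns an empty string when nothing matches.
    return match.group(1) if match else ""

print(extract_postcode("1234 HAY STREET HAYMARKET 2000, NSW, Australia"))
```

With the original pattern, the address in the last line would give 1234 instead of 2000.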
Now that we have extracted the Postcode, we have the key to join the two DataFrames. Let's create DataFrame 2, from which we need to extract the respective suburb.
df_City_Town_Suburb = sqlContext.createDataFrame([(2000,'Sydney','Dawes Point, Haymarket, Millers Point, Sydney, The Rocks'),
(2001,'Sydney','Sydney'),(2113,'Sydney','North Ryde')],
('Postcode','District','City_Town_Suburb'))
df_City_Town_Suburb.show(truncate=False)
+--------+--------+--------------------------------------------------------+
|Postcode|District|City_Town_Suburb |
+--------+--------+--------------------------------------------------------+
|2000 |Sydney |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2001 |Sydney |Sydney |
|2113 |Sydney |North Ryde |
+--------+--------+--------------------------------------------------------+
Joining the two DataFrames with a left join -
df = df.join(df_City_Town_Suburb.select('Postcode','City_Town_Suburb'), ['Postcode'],how='left')
df.show(truncate=False)
+--------+---------------------------------------------+--------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+--------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |North Ryde |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
+--------+---------------------------------------------+--------------------------------------------------------+
Splitting the column City_Town_Suburb into an array using the split() function -
df = df.select('Postcode','FullAddress',split(col("City_Town_Suburb"), ",\\s*").alias("City_Town_Suburb"))
df.show(truncate=False)
+--------+---------------------------------------------+----------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+----------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |[North Ryde] |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
+--------+---------------------------------------------+----------------------------------------------------------+
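To see what the split pattern does without a Spark session, here is a small plain-Python sketch of the same idea: split on a comma followed by any amount of whitespace, and pass a missing value through unchanged, as in the null row above. The function name split_suburbs is hypothetical, and Python's re module stands in for Spark's Java regex engine; for this simple pattern the two are assumed to behave alike.

```python
import re

def split_suburbs(city_town_suburb):
    # A null column value stays null, matching the row with no join match.
    if city_town_suburb is None:
        return None
    # A comma plus optional whitespace is the separator,
    # so the elements come out without leading spaces.
    return re.split(r",\s*", city_town_suburb)

print(split_suburbs("Dawes Point, Haymarket, Millers Point, Sydney, The Rocks"))
```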
Finally, we create a UDF that checks each element of the array City_Town_Suburb to see whether it occurs in the column FullAddress. If a match is found, it is returned immediately; otherwise None is returned.
def suburb(FullAddress, City_Town_Suburb):
    # Handle the case where there is no array (no match in the join); otherwise we would get an error
    if City_Town_Suburb is None:
        return None
    # Check whether each array element occurs in 'FullAddress';
    # the first match found is returned immediately.
    for sub in City_Town_Suburb:
        if sub.strip().upper() in FullAddress:
            return sub.upper()
    return None

suburb_udf = udf(suburb, StringType())
Applying this UDF -
df = df.withColumn('suburb', suburb_udf(col('FullAddress'),col('City_Town_Suburb'))).drop('City_Town_Suburb')
df.show(truncate=False)
+--------+---------------------------------------------+----------+
|Postcode|FullAddress |suburb |
+--------+---------------------------------------------+----------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE|
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |HAYMARKET |
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |SYDNEY |
+--------+---------------------------------------------+----------+
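One limitation of the substring check is that a street named after another suburb in the same Postcode can match first. For example, 'HAYMARKET STREET SYDNEY 2000' would return HAYMARKET, because Haymarket comes before Sydney in the list. The following plain-Python sketch shows one possible refinement, assuming the suburb always appears directly before the Postcode, as it does in the sample addresses. The function name suburb_before_postcode and its third parameter are hypothetical and not part of the solution above.

```python
def suburb_before_postcode(full_address, city_town_suburb, postcode):
    # No array means the join found no match for this Postcode.
    if city_town_suburb is None:
        return None
    address = full_address.upper()
    for sub in city_town_suburb:
        candidate = sub.strip().upper()
        # Assumption: the address is laid out as '<SUBURB> <POSTCODE>'.
        if f"{candidate} {postcode}" in address:
            return candidate
    return None

suburbs_2000 = ["Dawes Point", "Haymarket", "Millers Point", "Sydney", "The Rocks"]
print(suburb_before_postcode("HAYMARKET STREET SYDNEY 2000, NSW, Australia", suburbs_2000, "2000"))
```

If this logic were wrapped in a UDF, it would take the Postcode column as a third argument. It still cannot tell apart suburbs where one name is the ending of another within the same Postcode.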