I have 2 pyspark dataframes that I am trying to join where some of the values in the columns have parenthesis. For example one of the values is
"Mangy (Dog)"
If I try joining like so:
df1.join(df2 expr("df1.animal rlike df2.animal_stat")
I don't get any results.
So I tried filtering using rlike
just to see if I am able to capture the values.
The filtering worked on all values except those with parenthesis. For example when i try to filter like so:
df.filter(col('animal').rlike("Mangy (Dog)")).show()
I don't get any results.
However, if I filter with rlike("Mangy")
or rlike("(Dog)"
it seems to work. Even though I specified parenthesis in (Dog)
.
Is there a way to make rlike to include parenthesis in its matches?
EDIT:
I have 2 dataframes df1
and df2
like so:
+-----------------+-------+
| animal| origin|
+-----------------+-------+
| mangy (dog)|Streets|
| Cat| house|
|[Bumbling] Bufoon| Utopia|
| Cheetah| Congo|
|(Sprawling) Snake| Amazon|
+-----------------+-------+
+-------------------+-----------+
| animal_stat|destination|
+-------------------+-----------+
| ^dog$| House|
| ^Cat$| Streets|
|^[Bumbling] Bufoon$| Circus|
| ^Cheetah$| Zoo|
| ^(Sprawling)$| Glass Box|
+-------------------+-----------+
I am trying to join the two using rlike
using the following method:
dff1=df1.alias('dff1')
dff2=df2.alias('dff2')
combine=dff1.join(dff2, expr("dff1.animal rlike dff2.animal_stat"), how='left')
.drop(dff2.animal_stat)
I would like the output dataframe to be like so:
+-----------------+-------+-----------+
| animal| origin|destination|
+-----------------+-------+-----------+
| mangy (dog)|Streets| House|
| Cat| house| Streets|
|[Bumbling] Bufoon| Utopia| Circus|
| Cheetah| Congo| Zoo|
|(Sprawling) Snake| Amazon| Glass Box|
+-----------------+-------+-----------+