0

I have 2 pyspark dataframes that I am trying to join where some of the values in the columns have parenthesis. For example one of the values is

"Mangy (Dog)"

If I try joining like so:

df1.join(df2 expr("df1.animal rlike df2.animal_stat")

I don't get any results. So I tried filtering using rlike just to see if I am able to capture the values.

The filtering worked on all values except those with parenthesis. For example when i try to filter like so:

df.filter(col('animal').rlike("Mangy (Dog)")).show()

I don't get any results.

However, if I filter with rlike("Mangy") or rlike("(Dog)" it seems to work. Even though I specified parenthesis in (Dog).

Is there a way to make rlike to include parenthesis in its matches?

EDIT: I have 2 dataframes df1 and df2 like so:

+-----------------+-------+
|           animal| origin|
+-----------------+-------+
|      mangy (dog)|Streets|
|              Cat|  house|
|[Bumbling] Bufoon| Utopia|
|          Cheetah|  Congo|
|(Sprawling) Snake| Amazon|
+-----------------+-------+

+-------------------+-----------+
|        animal_stat|destination|
+-------------------+-----------+
|              ^dog$|      House|
|              ^Cat$|    Streets|
|^[Bumbling] Bufoon$|     Circus|
|          ^Cheetah$|        Zoo|
|      ^(Sprawling)$|  Glass Box|
+-------------------+-----------+

I am trying to join the two using rlike using the following method:

dff1=df1.alias('dff1')
dff2=df2.alias('dff2')

combine=dff1.join(dff2, expr("dff1.animal rlike dff2.animal_stat"), how='left')
.drop(dff2.animal_stat)

I would like the output dataframe to be like so:

+-----------------+-------+-----------+
|           animal| origin|destination|
+-----------------+-------+-----------+
|      mangy (dog)|Streets|      House|
|              Cat|  house|    Streets|
|[Bumbling] Bufoon| Utopia|     Circus|
|          Cheetah|  Congo|        Zoo|
|(Sprawling) Snake| Amazon|  Glass Box|
+-----------------+-------+-----------+
thentangler
  • 1,048
  • 2
  • 12
  • 38

2 Answers2

2

Edit:

combine = df1.alias('df1').join(
    df2.withColumn('animal_stat', F.regexp_replace(F.regexp_replace(F.regexp_replace(F.regexp_replace('animal_stat', '\\(', '\\\\('), '\\)', '\\\\)'), '\\[', '\\\\['), '\\]', '\\\\]')).alias('df2'), 
    F.expr('df1.animal rlike df2.animal_stat'),
    'left'
)

If you're not using any regex, you probably want to use like instead of rlike. For example, you can do

df1.join(df2, expr("df1.animal like concat('%', df2.animal_stat, '%')"))

To do a filter, you can try

df.filter(col('animal').like("%Mangy (Dog)%")).show()
mck
  • 40,932
  • 13
  • 35
  • 50
  • Thank you for the answer. I am actually using regex. Is there a way to use the `rlike` in joins where you dont specify parts of the string? I was filtering in order to troubleshoot, but my main use for `rlike` is to join on columns that have regex. I have edited my post to show the same. – thentangler Mar 19 '21 at 15:41
  • the regex does not match for the first and last row – mck Mar 19 '21 at 15:51
  • also how could you tell whether the brackets are part of the regular expression or not? like, maybe you're trying to match any character in `[Bumbling]` because you enclosed it in square brackets? – mck Mar 19 '21 at 15:56
  • In my use case brackets (parentheses or square) will never be part of a regular expression. It should be part of the string to be matched to – thentangler Mar 19 '21 at 20:54
  • @thentangler see the edited answer? I think you can escape the parentheses using backslashes before doing an `rlike` join. – mck Mar 20 '21 at 09:36
  • Thank You I will try that. But can we replace the parentehsis with a blank instead? Like `F.regexp_replace('animal_stat', '\\(', ' ')` ? – thentangler Mar 21 '21 at 17:22
  • 1
    You could, it depends on whether you want to keep the parenthesis in the pattern to match – mck Mar 21 '21 at 17:27
0

.rlike() is the same as .like() except it uses regex. You need to escape the parentheses. Try filtering like this:

df.filter(col('animal').rlike("Mangy \(Dog\)")).show()

Not sure I can help with the original join issue without some sample data.