
I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

But this throws:

TypeError: 'Column' object is not callable

How do I go about filtering my df properly?

gaatjeniksaan

4 Answers


Spark 2.2 onwards

df.filter(df.location.contains('google.com'))

Spark 2.2 documentation link


Spark 2.1 and before

You can use plain SQL in filter

df.filter("location like '%google.com%'")

or with DataFrame column methods

df.filter(df.location.like('%google.com%'))

Spark 2.1 documentation link

mrsrinivas
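
For reference, a minimal, self-contained sketch of both variants; the SparkSession setup and the sample URLs below are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, only to make the snippet runnable
df = spark.createDataFrame(
    [("https://www.google.com/search?q=spark",),
     ("https://example.org/",)],
    ["location"],
)

# Spark 2.2+: Column.contains
df.filter(df.location.contains('google.com')).show()

# Spark 2.1 and earlier: SQL LIKE, as a string expression or via Column.like
df.filter("location like '%google.com%'").show()
df.filter(df.location.like('%google.com%')).show()
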
  • Hi Srinivas, what if we had to check for two words, let's say `google.com` and `amazon.com` using `like`? How can we do so? – cph_sto Oct 18 '18 at 14:48
  • @cph_sto: There are many ways to do it. **SQL** `df.filter("location like '%google.com%' AND location like '%amazon.com%'")` or **DataFrame** `df.filter("location like '%google.com%'").filter("location like '%amazon.com%'")` – mrsrinivas Oct 23 '18 at 05:33
  • @mrsrinivas, what if we want to search `like 'ID'` across all columns? For instance, keep all columns of the dataframe that include the word "ID". – jgtrz May 21 '20 at 18:43
  • For multiple substrings use rlike with a join like so: `df.filter(F.col("yourcol").rlike('|'.join(substrings)))` where substrings is a list of substrings like `substrings = ['google.com','amazon.com']` – anky Mar 17 '22 at 10:19
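
To make the approaches from the comments above concrete, a short sketch (the substrings list is only an example):

import pyspark.sql.functions as F

substrings = ['google.com', 'amazon.com']  # example substrings

# Keep rows whose location matches ANY of the substrings (regex alternation)
any_match = df.filter(F.col('location').rlike('|'.join(substrings)))

# Keep rows whose location contains ALL of the substrings (chained filters)
all_match = (df.filter(F.col('location').contains('google.com'))
               .filter(F.col('location').contains('amazon.com')))
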

pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.

df.where(df.location.contains('google.com'))
joaofbsm

When filtering a DataFrame on string values, I find that the pyspark.sql.functions lower and upper functions come in handy if your data could have column entries like "foo" and "Foo":

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
caffreyd
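
For example, with a made-up source_df (and assuming an existing SparkSession named spark), both casings are kept:

import pyspark.sql.functions as sql_fun

# Hypothetical sample data for illustration
source_df = spark.createDataFrame(
    [("Foo bar",), ("FOO baz",), ("no match",)],
    ["col_name"],
)

result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
result.show()  # keeps "Foo bar" and "FOO baz"
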

You can try the following expression, which helps you search for multiple strings at the same time:

df.filter(""" location rlike 'google.com|amazon.com|github.com' """)
Rakesh Chintha
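
One caveat: rlike interprets its argument as a regular expression, so the dots in 'google.com' also match any character. If literal matching matters, escaping the substrings first is one option; a sketch with an example list:

import re
import pyspark.sql.functions as F

substrings = ['google.com', 'amazon.com', 'github.com']  # example values
pattern = '|'.join(re.escape(s) for s in substrings)

df.filter(F.col('location').rlike(pattern)).show()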