
How can I use a Spark SQL filter as a case-insensitive filter?

For example:

dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet"));

It only returns rows whose 'vendor' column is equal to 'fortinet', but I want rows whose 'vendor' column is equal to 'fortinet' or 'Fortinet' or 'foRtinet' or ...


3 Answers

You can either use a case-insensitive regex:

val df = sc.parallelize(Seq(
  (1L, "Fortinet"), (2L, "foRtinet"), (3L, "foo")
)).toDF("k", "v")

df.where($"v".rlike("(?i)^fortinet$")).show
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

or simple equality with lower / upper:

import org.apache.spark.sql.functions.{lower, upper}

df.where(lower($"v") === "fortinet")
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

df.where(upper($"v") === "FORTINET")
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

For simple filters I would prefer rlike, although performance should be similar; for join conditions, equality is a much better choice. See How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion? for details.
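For example, here is a minimal sketch of a case-insensitive join built on that principle, reusing df and the lower import from above (the vendors lookup table and its columns are hypothetical, for illustration only):

val vendors = sc.parallelize(Seq(
  ("fortinet", "firewall")
)).toDF("name", "category")

// Lower-casing both sides keeps this an equi-join, which Spark can
// plan as a hash or sort-merge join rather than a Cartesian product.
df.join(vendors, lower($"v") === lower($"name")).show
// Matches k=1 (Fortinet) and k=2 (foRtinet); k=3 (foo) is dropped.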

  • @zero323 I have this problem with all tables. I mean, I need my application to provide case-insensitive results. Is there any solution that gives SQL Server-like results (where case is ignored every time)? – Parth Vishvajit Nov 28 '17 at 11:51
  • For PySpark, `lower` is available via `from pyspark.sql.functions import lower`. – akki Feb 20 '19 at 15:22
  • For future viewers confused by the `(?i)` as I was: that is the syntax for Scala regex flags ([case-insensitivity in this instance](https://stackoverflow.com/a/57119261)). – Excel Help Feb 02 '21 at 16:29
  • dataFrame.filter(dataFrame.col("vendor").rlike("(?i)^fortinet$")) – Gautam Jul 09 '23 at 07:47

Try using the lower/upper string functions:

import org.apache.spark.sql.functions.{lower, upper}

dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))

or

dataFrame.filter(upper(dataFrame.col("vendor")).equalTo("FORTINET"))
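
The same condition can also be written as a SQL expression string, which filter accepts directly; a sketch of the equivalent call:

// Equivalent to the lower(...) === "fortinet" comparison above.
dataFrame.filter("lower(vendor) = 'fortinet'")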

Another alternative, which saves a couple of sets of parentheses:

import pyspark.sql.functions as f

df.filter(f.upper("vendor") == "FORTINET")