
How can I use a Spark SQL filter as a case-insensitive filter?

For example:

dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet"));

It only returns rows whose 'vendor' column is equal to 'fortinet', but I want rows whose 'vendor' column is equal to 'fortinet' or 'Fortinet' or 'foRtinet' or ...


3 Answers

You can either use a case-insensitive regex:

val df = sc.parallelize(Seq(
  (1L, "Fortinet"), (2L, "foRtinet"), (3L, "foo")
)).toDF("k", "v")

df.where($"v".rlike("(?i)^fortinet$")).show
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

or simple equality with lower / upper:

import org.apache.spark.sql.functions.{lower, upper}

df.where(lower($"v") === "fortinet")
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

df.where(upper($"v") === "FORTINET")
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

For simple filters I would prefer rlike, although performance should be similar; for join conditions, equality is a much better choice. See How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion? for details.
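For example, here is a minimal sketch of a case-insensitive join built on that principle, reusing df and the lower import from above (the vendors lookup table and its columns are hypothetical, for illustration only):

val vendors = sc.parallelize(Seq(
  ("fortinet", "firewall")
)).toDF("name", "category")

// Lower-casing both sides keeps this an equi-join, which Spark can
// plan as a hash or sort-merge join rather than a Cartesian product.
df.join(vendors, lower($"v") === lower($"name")).show
// Matches k=1 (Fortinet) and k=2 (foRtinet); k=3 (foo) is dropped.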

  • @zero323 I have this problem with all tables. I mean, I need my application to provide case-insensitive results. Is there any solution that gives SQL Server-like results (where case is ignored every time)? – Parth Vishvajit Nov 28 '17 at 11:51
  • For PySpark, `lower` is available via `from pyspark.sql.functions import lower`. – akki Feb 20 '19 at 15:22
  • For future viewers confused by the `(?i)` as I was: that is the syntax for Scala regex flags ([case-insensitivity in this instance](https://stackoverflow.com/a/57119261)). – Excel Help Feb 02 '21 at 16:29
  • dataFrame.filter(dataFrame.col("vendor").rlike("(?i)^fortinet$")) – Gautam Jul 09 '23 at 07:47

Try using the lower/upper string functions:

import org.apache.spark.sql.functions.{lower, upper}

dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))

or

dataFrame.filter(upper(dataFrame.col("vendor")).equalTo("FORTINET"))
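
The same condition can also be written as a SQL expression string, which filter accepts directly; a sketch of the equivalent call:

// Equivalent to the lower(...) === "fortinet" comparison above.
dataFrame.filter("lower(vendor) = 'fortinet'")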

Another alternative, which saves a couple of sets of parentheses:

import pyspark.sql.functions as f

df.filter(f.upper("vendor") == "FORTINET")