7

I am trying to apply the PySpark SQL `hash` function to every row in two dataframes to identify the differences between them. The hash is case sensitive, i.e. if a column contains 'APPLE' and 'Apple' they are treated as two different values, so I want to convert both dataframes to either upper or lower case. I am able to achieve this only for the dataframe headers, not for the dataframe values. Please help.

# Code for the dataframe column headers
self.df_db1 = self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Jack

3 Answers

13

Assuming `df` is your dataframe, this should do the job:

from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn(col, F.lower(F.col(col)))
Steven
8

Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val stringFields = sourceDF.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = sourceDF.schema.fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))

val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)

Now the types stay correct even when you have non-string (e.g. numeric) fields. If you know that every column is of string type, use one of the other answers - they are correct in those cases :)

Python code in PySpark:

from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
T. Gawęda
  • According to OP, he wants to create a Hash from string columns ... therefore, they're all supposed to be stringType, no need to check the type. – Steven Feb 01 '18 at 14:59
  • 2
    @Steven We can assume it in this case. It's only an additional answer, if someone has similar problem, but DataFrame also has with numeric columns ;) – T. Gawęda Feb 01 '18 at 15:02
  • My dataframes contain all kinds of datatypes (string, numeric, date & many more). I am going with hash-based matching since table key information is not available. Please share it in Python if possible. Thanks a ton – Jack Feb 01 '18 at 19:51
  • @Jack Sorry, I was busy in work. I will try to change to Python today or tomorrow afternoon :) – T. Gawęda Feb 01 '18 at 19:52
  • 1
    Thanks a ton T.Gaweda for helping me in resolving this problem – Jack Feb 02 '18 at 03:00
  • I am getting an error in this line `stringFieldsTransformed = stringFields.map(lambda f: col(f.name)).map(lambda f: lower(col(f)).alias(f))` – Jack Feb 02 '18 at 16:42
  • Error is AttributeError: 'filter' object has no attribute 'map' – Jack Feb 02 '18 at 16:44
  • @Jack Please try again with the code in the bottom of the answer – T. Gawęda Feb 02 '18 at 16:58
  • Thanks a lot for helping me out in your busy schedule. There are no changes in the dataframe `df = sourceDF.select(allFields)` since we are passing columns to the same dataframe. Please correct me if I am wrong. – Jack Feb 02 '18 at 17:22
  • 1
    @Jack You are right, changed it. Now I checked it:) – T. Gawęda Feb 02 '18 at 17:47
6

You can generate an expression using list comprehension:

from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]

And then just apply it to your existing dataframe:

>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+

>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
philantrovert
  • The psf function works for me on a small data set; is it safe on a 40-million-record table? My client is now asking to upper-case all the data. – Koppula Mar 26 '21 at 15:29