7

I am trying to apply the PySpark SQL `hash` function to every row in two dataframes to identify the differences between them. The hash is case sensitive, i.e. if a column contains 'APPLE' and 'Apple' they are treated as two different values, so I want to convert both dataframes to either upper or lower case. I am able to achieve this only for the dataframe headers, not for the dataframe values. Please help.

# Code for the dataframe column headers
self.df_db1 = self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Jack

3 Answers

13

Assuming `df` is your dataframe, this should do the job:

from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn(col, F.lower(F.col(col)))
Steven
8

Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val stringFields = sourceDF.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = sourceDF.schema.fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))

val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)

Now the types stay correct even when you have non-string (e.g. numeric) fields. If you know that every column is of string type, use one of the other answers - they are correct in those cases :)

Python code in PySpark:

from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
T. Gawęda
  • According to OP, he wants to create a Hash from string columns ... therefore, they're all supposed to be stringType, no need to check the type. – Steven Feb 01 '18 at 14:59
  • 2
    @Steven We can assume it in this case. It's only an additional answer, if someone has similar problem, but DataFrame also has with numeric columns ;) – T. Gawęda Feb 01 '18 at 15:02
  • My dataframes contain all kinds of datatypes (string, numeric, date & many more). I am going with hash-based matching since table key information is not available. Please share it in Python if possible. Thanks a ton – Jack Feb 01 '18 at 19:51
  • @Jack Sorry, I was busy in work. I will try to change to Python today or tomorrow afternoon :) – T. Gawęda Feb 01 '18 at 19:52
  • 1
    Thanks a ton T.Gaweda for helping me in resolving this problem – Jack Feb 02 '18 at 03:00
  • I am getting an error in this line `stringFieldsTransformed = stringFields.map(lambda f: col(f.name)).map(lambda f: lower(col(f)).alias(f))` – Jack Feb 02 '18 at 16:42
  • Error is AttributeError: 'filter' object has no attribute 'map' – Jack Feb 02 '18 at 16:44
  • @Jack Please try again with the code in the bottom of the answer – T. Gawęda Feb 02 '18 at 16:58
  • Thanks a lot for helping me out in your busy schedule. There are no changes in the dataframe `df = sourceDF.select(allFields)` since we are passing columns to the same dataframe. Please correct me if I am wrong. – Jack Feb 02 '18 at 17:22
  • 1
    @Jack You are right, changed it. Now I checked it:) – T. Gawęda Feb 02 '18 at 17:47
6

You can generate an expression using list comprehension:

from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]

And then just apply it to your existing dataframe:

>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+

>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
philantrovert
  • The psf function works for me on a small data set; is it safe on a 40-million-record table? My client is now asking to upper-case all the data. – Koppula Mar 26 '21 at 15:29