0

I want to trim extra characters from a given column "name". For that, I am using expr function within which I am passing a SQL expression to trim extra characters.

from pyspark.sql import Row
from pyspark.sql.functions import expr

data = [
    Row(id = 1, name = "Lisa Brenan", phone = Row(home = "+1 23456789", personal = None), projects = ["CIBC", "Shell"], salary = 11000),
    Row(id = 2, name = " Thomas Kingston", phone = Row(home = "+1 98765432", personal = "+1 2345665432"), projects = ["BMW"], salary = 15000),
    Row(id = 3, name = "[Lucy Pierson]", phone = Row(home = None, personal = None), projects= None, salary = 20000)
]

df = spark.createDataFrame(data)

df.\
    withColumn("correct_name", expr("rtrim(TRAILING ']' FROM name)")).\
    select("correct_name").\
    show()

I am receiving the below error message:

ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'FROM'.(line 1, pos 19)

== SQL ==
rtrim(TRAILING ']' FROM name)
-------------------^^^

Please let me know the cause and correct solution for the same.

James Z
  • 12,209
  • 10
  • 24
  • 44

2 Answers2

0

Here's what you're looking for. Note that the FROM statement is not needed in expr(). Let me know if this helped.

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName('ss').getOrCreate()

data = [
    Row(id = 1, name = "Lisa Brenan", phone = Row(home = "+1 23456789", personal = None), projects = ["CIBC", "Shell"], salary = 11000),
    Row(id = 2, name = " Thomas Kingston", phone = Row(home = "+1 98765432", personal = "+1 2345665432"), projects = ["BMW"], salary = 15000),
    Row(id = 3, name = "[Lucy Pierson]", phone = Row(home = None, personal = None), projects= None, salary = 20000)
]

df = spark.createDataFrame(data=data)

df = (
    df
    .withColumn("correct_name", expr("rtrim(']', name)"))
    .select("correct_name")
    )

df.show()

Returns:

+----------------+
|    correct_name|
+----------------+
|     Lisa Brenan|
| Thomas Kingston|
|   [Lucy Pierson|
+----------------+
Simon David
  • 663
  • 3
  • 13
  • You are using a deprecated method of rtrim with 2 parameters. It is acceptable. But doesn't answer to the current situation. I want to use rtrim that takes TRAILING or LEADING as one parameter. – aws_sangram May 13 '23 at 10:16
0

I have found answer to my own question.

from pyspark.sql import Row
from pyspark.sql.functions import expr

data = [
    Row(id = 1, name = "Lisa Brenan", phone = Row(home = "+1 23456789", personal = None), projects = ["CIBC", "Shell"], salary = 11000),
    Row(id = 2, name = " Thomas Kingston", phone = Row(home = "+1 98765432", personal = "+1 2345665432"), projects = ["BMW"], salary = 15000),
    Row(id = 3, name = "[Lucy Pierson]", phone = Row(home = None, personal = None), projects= None, salary = 20000)
]

df = spark.createDataFrame(data)

df.\
    withColumn("correct_name", expr("trim(TRAILING ']' FROM name)")).\
    select("correct_name").\
    show()

You cannot use TRAILING or LEADING simultaneously with rtrim and ltrim. That was the reason it was failing.