
I'm using PySpark 2.4, and PyCharm flags some methods from the pyspark.sql.functions module, such as trim and col, as undefined. However, the tasks I wrote with these methods run correctly in my local PySpark 2.4 environment and produce the expected results. Why is that?

Here is my environment setup:

from pyspark.sql import SparkSession

def create_env():
    spark = SparkSession.builder \
        .appName("HiveTest") \
        .master("local") \
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
        .config("spark.hadoop.hive.metastore.uris", "thrift://master:9083") \
        .config("spark.hadoop.hive.exec.scratchdir", "/user/hive/tmp") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    return spark

And here is an excerpt of my SparkSQL code:

df = spark.table("ods.t_ctp20_department_d").select(
    trim(col("departmentid")).alias("branch_id"),
    trim(col("departmentid")).alias("branch_no"),
    trim(col("departmentname")).alias("branch_name"),
    when(trim(col("departmentid")) == 'FU', '00')
    .when(length(trim(col("departmentid"))) == 2, 'FU')
    .when(length(trim(col("departmentid"))) == 4, substring(trim(col("departmentid")), 1, 2))
    .when(length(trim(col("departmentid"))) == 6, substring(trim(col("departmentid")), 1, 4))
    .otherwise(substring(trim(col("departmentid")), 1, 6)).alias("up_branch_no"),
    lit('0').alias("branch_type"),
    lit('00').alias("data_source"),
    col("brokerid").alias("brokers_id"),
    lit(busi_date).alias("ds_date")
)

To restate: I used the trim and col methods from the pyspark.sql.functions module in my PySpark 2.4 code. Even though PyCharm highlighted these methods as undefined, the code still executed successfully in the local PySpark 2.4 environment and produced the expected results.

I run the script either by executing "python3 xx.python" or through a remote interpreter in PyCharm. The remote interpreter is a virtual environment with only the pyspark 2.4 package installed.

When I run the script in PyCharm, it executes fine. However, the editor reports "function is not defined" when I reference these parts of the pyspark 2.4 API.

I would like to understand the reason behind this error. Is there any additional configuration required in PyCharm when using pyspark2.4? Thank you for your assistance!
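The symptom described above can be reproduced without Spark at all. The following is a minimal, hypothetical stand-in (the names `trim`, `col`, and `lit` here are placeholders, not the real Spark functions): names injected into a module's namespace in a loop at import time work fine at runtime, yet a static analyzer such as PyCharm's cannot see them.

```python
# Hypothetical demo: module-level names created dynamically in a loop.
# They exist at runtime, but never appear as `def trim(...)` in the
# source, so an IDE's static analysis may flag them as undefined.
_names = ["trim", "col", "lit"]

for _name in _names:
    # Each generated function just tags its argument, for demonstration.
    globals()[_name] = (lambda n: (lambda x: f"{n}({x})"))(_name)

# Runs fine even though PyCharm may highlight `trim` as unresolved:
print(trim("departmentid"))  # -> trim(departmentid)
```

This is purely an illustration of the IDE-versus-runtime mismatch, not Spark's actual code.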

Simon Mau
  • see spark function list. the functions in question were introduced in the initial versions. I guess pycharm needs to be configured correctly. – samkart Jul 06 '23 at 05:11
  • Did you try to include them `from pyspark.sql.functions import trim, col`? If so there is a possibility that you might be able to access trim only via `expr` like this `expr("trim(your_var)")` – abiratsis Jul 06 '23 at 11:33

1 Answer


This is because `col`, `lit`, `trim`, and a number of other functions are bound dynamically: in PySpark 2.x, `pyspark.sql.functions` generates them at import time from dictionaries of function names rather than defining them with ordinary `def` statements. This pattern goes back to the very early versions of Spark and appears to exist for version-compatibility reasons. Since the names never appear as plain definitions in the source, PyCharm's static analysis cannot resolve them, even though they exist at runtime.
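A simplified sketch of that registration pattern (the real `pyspark.sql.functions` delegates to the JVM and uses different docstrings and decorators; this just mimics the shape):

```python
# Sketch of how pyspark.sql.functions ~2.x registers functions at
# import time (simplified, not the actual Spark source).
def _create_function(name, doc=""):
    """Build a thin wrapper; real Spark would call into the JVM here."""
    def _(col):
        return f"{name}({col})"  # placeholder for the JVM call
    _.__name__ = name
    _.__doc__ = doc
    return _

_functions = {
    "trim": "Trim the spaces from both ends of a string column.",
    "lower": "Convert a string column to lower case.",
}

# Turn the dict of names into real module-level functions:
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

# The functions exist at runtime...
print(trim("departmentid"))  # -> trim(departmentid)
# ...but because they are assigned via globals() rather than defined
# with `def`, a static analyzer reports them as unresolved.
```

This is why the code runs correctly while the IDE complains: the error is a limitation of static analysis, not a missing API. Explicitly importing the names (`from pyspark.sql.functions import trim, col`) or accessing them through a module alias (`import pyspark.sql.functions as F`) often quiets the warning.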

shay__