
I was implementing the answer mentioned here. This is my schema, and I want to add a new column to the structs inside the array.

root
 |-- shops: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- epoch: double (nullable = true)
 |    |    |-- request: string (nullable = true)

So I executed this

from pyspark.sql import functions as F
df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
df.printSchema()

But I get this error

TypeError                                 Traceback (most recent call last)
<ipython-input-47-1749b2131995> in <module>
      1 from pyspark.sql import functions as F
----> 2 df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
      3 df.printSchema()

TypeError: 'Column' object is not callable

EDIT: My versions are Python 3.9 and Spark 3.0.3 (the maximum possible).

Blue Clouds

2 Answers


Try the `transform` higher-order function, since you are trying to add a new field to the structs inside an array.

Example:

from pyspark.sql.functions import *
jsn_str="""{"shop_time":[{"seconds":10,"shop":"Texmex"},{"seconds":5,"shop":"Tex"}]}"""

df = spark.read.json(sc.parallelize([jsn_str]), multiLine=True)
df.\
  withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
    show(10,False)
#+------------------------------+
#|shop_time                     |
#+------------------------------+
#|[{10, Texmex, 1}, {5, Tex, 1}]|
#+------------------------------+

df.withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
    printSchema()
#root
# |-- shop_time: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- seconds: long (nullable = true)
# |    |    |-- shop: string (nullable = true)
# |    |    |-- diff_sec: integer (nullable = false)

UPDATE:

Using Spark-sql:

df.createOrReplaceTempView("tmp")
spark.sql("select transform(shop_time,x -> struct(1 as diff_sec, x.seconds,x.shop)) as shop_time from tmp").\
  show(10,False)
#+------------------------------+
#|shop_time                     |
#+------------------------------+
#|[{1, 10, Texmex}, {1, 5, Tex}]|
#+------------------------------+
notNull
  • Do you know how to implement same solution in spark sql ? is withField available in spark sql ? – Srinivas Aug 10 '23 at 12:17
  • Yes, we need to use `struct` for this case, check my update section of the answer! – notNull Aug 10 '23 at 12:31
  • ok, you are extracting columns from the struct & then re-creating the struct again. is there any other function or method without extracting columns, like `withField`? – Srinivas Aug 10 '23 at 12:55
  • spark-sql error `AnalysisException: cannot resolve 'struct(namedlambdavariable().`seconds`, namedlambdavariable().`shop`, 1)' due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder,NamePlaceholder; line 1 pos 39;` – Blue Clouds Aug 10 '23 at 16:03
1

Your issue is that you're calling the withField method on a column (your shops column) that is of type ArrayType, not StructType.

You can fix this by using the transform function from pyspark.sql.functions. From the docs:

Returns an array of elements after applying a transformation to each element in the input array.

So let's first create some input data:

from pyspark.sql.types import StringType, StructType, StructField, ArrayType, DoubleType
from pyspark.sql import functions as F

schema = StructType(
    [
        StructField(
            "shops",
            ArrayType(
                StructType(
                    [
                        StructField("epoch", DoubleType()),
                        StructField("request", StringType()),
                    ]
                )
            ),
        )
    ]
)

df = spark.createDataFrame(
    [
        [[(5.0, "haha")]],
        [[(6.0, "hoho")]],
    ],
    schema=schema,
)

And now use the transform function to apply your withField operation to each element of the shops column.

new_df = df.withColumn(
    "state", F.transform(F.col("shops"), lambda x: x.withField("a", F.lit(1)))
)

>>> new_df.printSchema()
root
 |-- shops: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- epoch: double (nullable = true)
 |    |    |-- request: string (nullable = true)
 |-- state: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- epoch: double (nullable = true)
 |    |    |-- request: string (nullable = true)
 |    |    |-- a: integer (nullable = false)
Koedlt