I'm getting an error when trying to cast a StringType column to an IntegerType on a PySpark DataFrame:

joint = aggregates.join(df_data_3,aggregates.year==df_data_3.year)
joint2 = joint.filter(joint.CountyCode==999).filter(joint.CropName=='WOOL')\
    .select(aggregates.year,'Production')\
    .withColumn("ProductionTmp", df_data_3.Production.cast(IntegerType))\
    .drop("Production")\
    .withColumnRenamed("ProductionTmp", "Production")

I'm getting:

TypeError Traceback (most recent call last) in ()
      1 joint = aggregates.join(df_data_3,aggregates.year==df_data_3.year)
----> 2 joint2 = joint.filter(joint.CountyCode==999).filter(joint.CropName=='WOOL')
    .select(aggregates.year,'Production')
    .withColumn("ProductionTmp", df_data_3.Production.cast(IntegerType))
    .drop("Production")
    .withColumnRenamed("ProductionTmp", "Production")

/usr/local/src/spark20master/spark/python/pyspark/sql/column.py in cast(self, dataType)
    335             jc = self._jc.cast(jdt)
    336         else:
--> 337             raise TypeError("unexpected type: %s" % type(dataType))
    338         return Column(jc)
    339

TypeError: unexpected type:

Romeo Kienzler

1 Answer

PySpark SQL data types are no longer singletons (they were before 1.3). You have to pass an instance, not the class itself:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col

col("foo").cast(IntegerType())
Column<b'CAST(foo AS INT)'>

In contrast to:

col("foo").cast(IntegerType)
TypeError  
   ...
TypeError: unexpected type: <class 'type'>
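
To see why the class is rejected, here is a minimal pure-Python sketch of the kind of check the traceback points at. The `DataType` and `IntegerType` classes and the `cast` function below are hypothetical stand-ins written for illustration, not PySpark's actual code; the only detail taken from the traceback is the `TypeError` raised with `type(dataType)` in the message.

```python
class DataType:
    """Hypothetical stand-in for pyspark.sql.types.DataType."""


class IntegerType(DataType):
    """Hypothetical stand-in for pyspark.sql.types.IntegerType."""


def cast(data_type):
    """Simplified stand-in for Column.cast: accepts a string or a DataType instance."""
    if isinstance(data_type, str):
        # A type name given as a string is accepted as-is.
        return "CAST(foo AS %s)" % data_type.upper()
    if isinstance(data_type, DataType):
        # An instance of a DataType subclass passes the isinstance check.
        return "CAST(foo AS %s)" % type(data_type).__name__
    # The bare class IntegerType is an instance of `type`, not of DataType,
    # so it falls through to the error branch.
    raise TypeError("unexpected type: %s" % type(data_type))


ok = cast(IntegerType())      # instance: accepted

try:
    cast(IntegerType)         # class object: rejected
except TypeError as exc:
    message = str(exc)
```

The point is only that `isinstance(IntegerType, DataType)` is false while `isinstance(IntegerType(), DataType)` is true, which is why the parentheses matter.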

The cast method also accepts a string naming the type:

col("foo").cast("integer")
Column<b'CAST(foo AS INT)'>

For an overview of the supported data types in Spark SQL and DataFrames, see the data types page of the Spark SQL documentation.

Bebeerna
zero323
  • Does anyone know where you can find a list of the valid strings to pass to `cast()`? I've looked and found things like [this](https://spark.apache.org/docs/latest/sql-reference.html) or [this](https://docs.databricks.com/spark/1.6/sparkr/functions/cast.html), but none of them explicitly answers the question. – seth127 Jul 05 '19 at 15:53
  • @seth127 it's better to pass an instance of spark data types, listed here: https://spark.apache.org/docs/latest/sql-ref-datatypes.html. – Reza Keshavarz Oct 10 '22 at 13:29