I posted this as a comment on this semi-related question, but I felt it needed a post of its own.
Does anyone know where you can find a list of the valid strings to pass to the dataType argument of cast()? I've looked and found things like this or this, but none of them explicitly answer the question.
Also, I've found through trial and error that you can pass things like bigint or tinyint and they seem to work, even though they aren't listed anywhere as valid Spark data types, at least not anywhere I can find. Any ideas?
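(For what it's worth, the closest I've come to an answer on my own: each class in pyspark.sql.types has a simpleString() method, and those short names seem to be the strings that cast() accepts. I'm not sure this is exhaustive or the intended way to find them, but here's a minimal sketch; the list below is just the atomic types I know of:)

from pyspark.sql import types as T

# Print the short name Spark itself uses for each atomic type --
# e.g. LongType -> 'bigint', ByteType -> 'tinyint', ShortType -> 'smallint'.
for t in [T.ByteType, T.ShortType, T.IntegerType, T.LongType,
          T.FloatType, T.DoubleType, T.StringType, T.BooleanType,
          T.DateType, T.TimestampType, T.BinaryType]:
    print(t().simpleString())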
For some reproducibility:
# Assumes an existing SparkSession named `spark` (e.g. the pyspark shell).
df = spark.createDataFrame(
    [
        [18786, "attr1", 0.9743],
        [65747, "attr1", 0.4568],
        [56465, "attr1", 0.6289],
        [18786, "attr2", 0.2976],
        [65747, "attr2", 0.4869],
        [56465, "attr2", 0.8464],
    ],
    ["id", "attr", "val"],
)
print(df)
This gives you DataFrame[id: bigint, attr: string, val: double], presumably because the schema is inferred by default.
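(Side note: if you'd rather not rely on inference, I believe createDataFrame also takes an explicit schema, either a StructType or, in newer Spark versions, a DDL-style string. A sketch with the same kind of data; df2 is just a name I made up here:)

# Same data, but with the schema given explicitly as a DDL string
# instead of being inferred (I think this string form needs Spark 2.3+).
df2 = spark.createDataFrame(
    [[18786, "attr1", 0.9743], [65747, "attr1", 0.4568]],
    "id bigint, attr string, val double",
)
print(df2)  # DataFrame[id: bigint, attr: string, val: double]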
Then you can do something like this to re-cast the types:
from pyspark.sql.functions import col

# Cast every column to the type named for it in fielddef
fielddef = {'id': 'smallint', 'attr': 'string', 'val': 'long'}
df = df.select([col(c).cast(fielddef[c]) for c in df.columns])
print(df)
And now I get DataFrame[id: smallint, attr: string, val: bigint], so apparently 'long' converts to 'bigint'. I'm sure there are other conversions like that.
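(One workaround I've been playing with is to skip the strings entirely and hand cast() the DataType objects from pyspark.sql.types, which at least are documented. A sketch of the same select as above, with fielddef mapping to type instances instead of strings:)

from pyspark.sql.functions import col
from pyspark.sql.types import ShortType, StringType, LongType

# Same re-cast as above, but using DataType instances instead of name strings.
fielddef = {'id': ShortType(), 'attr': StringType(), 'val': LongType()}
df = df.select([col(c).cast(fielddef[c]) for c in df.columns])
print(df)  # DataFrame[id: smallint, attr: string, val: bigint]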
Also, I had this weird feeling that it would just silently ignore invalid strings you pass it, but that's not the case. When I tried passing 'attr': 'varchar' in the fielddef dict, I got a DataType varchar is not supported... error.
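So at least invalid strings fail loudly rather than being ignored. To double-check, I wrapped the cast in a try/except; I'm just catching Exception here because the exact exception class seems to vary between Spark versions:

from pyspark.sql.functions import col

try:
    df.select(col('attr').cast('varchar'))
except Exception as e:  # exact exception type differs across Spark versions
    print(e)  # in my case: "DataType varchar is not supported..."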
Any help is much appreciated!