I posted this as a comment on this semi-related question, but I felt it needed a post of its own.
Does anyone know where you can find a list of the valid strings to pass to the dataType argument of cast()? I've looked and found things like this or this, but none of them explicitly answer the question.
Also, I've found through trial and error that you can pass things like bigint or tinyint and they seem to work, even though they aren't listed anywhere as valid Spark data types, at least not anywhere I can find. Any ideas?
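(For what it's worth, the closest I've come to an answer on my own: each class in pyspark.sql.types has a simpleString() method, and those short names seem to be the strings that cast() accepts. I'm not sure this is exhaustive or the intended way to find them, but here's a minimal sketch; the list below is just the atomic types I know of:)

from pyspark.sql import types as T

# Print the short name Spark itself uses for each atomic type --
# e.g. LongType -> 'bigint', ByteType -> 'tinyint', ShortType -> 'smallint'.
for t in [T.ByteType, T.ShortType, T.IntegerType, T.LongType,
          T.FloatType, T.DoubleType, T.StringType, T.BooleanType,
          T.DateType, T.TimestampType, T.BinaryType]:
    print(t().simpleString())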
For some reproducibility:
# Assumes an existing SparkSession named `spark` (e.g. the pyspark shell).
df = spark.createDataFrame(
    [
        [18786, "attr1", 0.9743],
        [65747, "attr1", 0.4568],
        [56465, "attr1", 0.6289],
        [18786, "attr2", 0.2976],
        [65747, "attr2", 0.4869],
        [56465, "attr2", 0.8464],
    ],
    ["id", "attr", "val"],
)
print(df)
This gives you DataFrame[id: bigint, attr: string, val: double], presumably because the schema is inferred by default.
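(Side note: if you'd rather not rely on inference, I believe createDataFrame also takes an explicit schema, either a StructType or, in newer Spark versions, a DDL-style string. A sketch with the same kind of data; df2 is just a name I made up here:)

# Same data, but with the schema given explicitly as a DDL string
# instead of being inferred (I think this string form needs Spark 2.3+).
df2 = spark.createDataFrame(
    [[18786, "attr1", 0.9743], [65747, "attr1", 0.4568]],
    "id bigint, attr string, val double",
)
print(df2)  # DataFrame[id: bigint, attr: string, val: double]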
Then you can do something like this to re-cast the types:
from pyspark.sql.functions import col

# Cast every column to the type named for it in fielddef
fielddef = {'id': 'smallint', 'attr': 'string', 'val': 'long'}
df = df.select([col(c).cast(fielddef[c]) for c in df.columns])
print(df)
And now I get DataFrame[id: smallint, attr: string, val: bigint], so apparently 'long' converts to 'bigint'. I'm sure there are other conversions like that.
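(One workaround I've been playing with is to skip the strings entirely and hand cast() the DataType objects from pyspark.sql.types, which at least are documented. A sketch of the same select as above, with fielddef mapping to type instances instead of strings:)

from pyspark.sql.functions import col
from pyspark.sql.types import ShortType, StringType, LongType

# Same re-cast as above, but using DataType instances instead of name strings.
fielddef = {'id': ShortType(), 'attr': StringType(), 'val': LongType()}
df = df.select([col(c).cast(fielddef[c]) for c in df.columns])
print(df)  # DataFrame[id: smallint, attr: string, val: bigint]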
Also, I had this weird feeling that it would just silently ignore invalid strings you pass it, but that's not the case. When I tried passing 'attr': 'varchar' in the fielddef dict, I got a DataType varchar is not supported... error.
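So at least invalid strings fail loudly rather than being ignored. To double-check, I wrapped the cast in a try/except; I'm just catching Exception here because the exact exception class seems to vary between Spark versions:

from pyspark.sql.functions import col

try:
    df.select(col('attr').cast('varchar'))
except Exception as e:  # exact exception type differs across Spark versions
    print(e)  # in my case: "DataType varchar is not supported..."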
Any help is much appreciated!