I have the following simple implicit class:
import java.text.SimpleDateFormat
import java.time.Instant

object Utils {
  implicit class StringUtils(s: String) {
    private val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
    def toUTC: Instant = iso8601.parse(s).toInstant
  }
}
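For example (the timestamp value here is just illustrative):

import java.time.Instant
import Utils._

// parses a millisecond-precision ISO 8601 string into a java.time.Instant
val ts: Instant = "2015-08-20T14:30:00.000Z".toUTC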
I'd like to use it on a DataFrame that has several untyped timestamps (i.e. of type String), and I would like to create new columns holding the typed timestamps while keeping the original ones (just in case).
I looked at DataFrame.withColumn, but its signature requires a Column, and my toUTC method only works on Strings. So this obviously does not work:
df.withColumn("typed_ts", df("stringy_ts").toUTC)
I know I can do it in the SQLContext by writing dynamic SQL, but I'd hate to do that because it quickly becomes messy.
Is there a way to make this work? Is this even recommended?
Note: I'm stuck on Spark 1.4.1.
Edit: Thanks to this question I now know that I have to create a udf[Instant, String]. However, Instant is not supported.
java.lang.UnsupportedOperationException: Schema for type java.time.Instant is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:152)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:63)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28)
at org.apache.spark.sql.functions$.udf(functions.scala:1363)
Any ideas on what time representation to use? In the end I need to dump the data into Hive, so ideally something that works well with ORC/Parquet...
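For what it's worth, one representation I'm considering (untested) is java.sql.Timestamp, since that's a type Spark SQL's reflection does know how to map; a sketch along the same lines:

import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import Utils._

// Timestamp, unlike Instant, has a Catalyst mapping (TimestampType),
// which should also line up with Hive's TIMESTAMP in ORC/Parquet
val toUtcTs = udf[Timestamp, String](s => Timestamp.from(s.toUTC))
df.withColumn("typed_ts", toUtcTs(df("stringy_ts")))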