
I have the following simple implicit class:

import java.text.SimpleDateFormat
import java.time.Instant

object Utils {
  implicit class StringUtils(s: String) {
    private val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
    def toUTC: Instant = iso8601.parse(s).toInstant
  }
}
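
For reference, this is how the extension method is used. A minimal, self-contained check (the example timestamp and epoch value are mine, chosen for illustration):

```scala
import java.text.SimpleDateFormat
import java.time.Instant

object Utils {
  implicit class StringUtils(s: String) {
    private val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
    def toUTC: Instant = iso8601.parse(s).toInstant
  }
}

import Utils._
// "X" parses the zone designator, so a trailing "Z" is treated as UTC.
val ts: Instant = "2016-04-22T09:20:00.000Z".toUTC
println(ts.toEpochMilli)  // 1461316800000
```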

I'd like to use it on a DataFrame that has several timestamps stored as plain Strings, and I would like to create new, properly typed columns from them while keeping the original ones (just in case).

I looked at DataFrame.withColumn, but its signature requires a Column, and my toUTC method only works on Strings. So this obviously does not work:

df.withColumn("typed_ts", df("stringy_ts").toUTC)

I know I could do it through the SQLContext by writing dynamic SQL, but I'd hate to do that because it quickly becomes messy.

Is there a way to make this work? Is this even recommended?

Note: I'm stuck on Spark 1.4.1.

Edit: Thanks to this question I now know that I have to create a udf[Instant, String]. However, Instant is not supported:

java.lang.UnsupportedOperationException: Schema for type java.time.Instant is not supported
    at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:152)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28)
    at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:63)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28)
    at org.apache.spark.sql.functions$.udf(functions.scala:1363)

Any ideas on what time representation to use? In the end I need to dump the data into Hive, so ideally something that works well with ORC/Parquet...
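
For reference, here is a sketch of the direction that seems to work: have the UDF return java.sql.Timestamp instead of Instant, since that is the external type Catalyst maps to TimestampType (which Hive/ORC/Parquet handle natively). The parsing helper below is runnable on its own; the Spark 1.4.1 wiring is shown in comments because it needs a live SQLContext, and names like toUtcUdf are my own:

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

object TimeUtils {
  // SimpleDateFormat is not thread-safe, so build a fresh instance per call
  // (a UDF runs concurrently across partitions).
  private def iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")

  // java.sql.Timestamp is the external type Catalyst accepts for TimestampType.
  def toSqlTimestamp(s: String): Timestamp =
    new Timestamp(iso8601.parse(s).getTime)
}

// Spark 1.4.1 wiring (sketch; assumes an existing DataFrame `df`):
//   import org.apache.spark.sql.functions.udf
//   val toUtcUdf = udf(TimeUtils.toSqlTimestamp _)
//   val typed = df.withColumn("typed_ts", toUtcUdf(df("stringy_ts")))

println(TimeUtils.toSqlTimestamp("2016-04-22T09:20:00.000Z").getTime)  // 1461316800000
```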

  • I guess the simplest way is to use UDF instead of implicit. See example [here](http://stackoverflow.com/questions/36751265/sparksql-cut-string-after-special-position/36754202#36754202) – Vitalii Kotliarenko Apr 22 '16 at 09:20
  • True. However, that still does not solve the problem with `Instant`. If I leave off the `toInstant` call (so I end up with `java.util.Date`), I also get a problem with reflection in Catalyst. I think I need to use `java.sql.Timestamp`: see the [doc](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala). Bummer. – Ian Apr 22 '16 at 09:58
