I have a DataFrame that includes a timestamp column. To aggregate by time (minute, hour, or day), I have tried the following:

import org.apache.spark.sql.functions.udf

val toSegment = udf((timestamp: String) => {
  val asLong = timestamp.toLong
  asLong - asLong % 3600000 // period = 1 hour, in milliseconds
})

val df: DataFrame = ??? // the DataFrame, with a timestamp column of epoch milliseconds
df.groupBy(toSegment($"timestamp")).count()

This works fine.

My question is how to generalize the UDF `toSegment` as follows:

val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})

I have tried the following, but it doesn't work:

df.groupBy(toSegmentGeneralized($"timestamp", $"3600000")).count()

It seems to look for a column named `3600000`.

A possible solution would be to use a constant column, but I couldn't find how to create one.

– emesday

1 Answer

You can use `org.apache.spark.sql.functions.lit()` to create the constant column:

import org.apache.spark.sql.functions._

df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count()
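
For completeness, here is a minimal end-to-end sketch of the pattern (the day-length period and the repeated grouping are illustrative assumptions, not part of the original answer):

import org.apache.spark.sql.functions._

val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period // truncate to the start of the containing period
})

// Reuse the one UDF for different granularities by passing different literals:
df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count()  // hourly
df.groupBy(toSegmentGeneralized($"timestamp", lit(86400000))).count() // daily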

– Spiro Michaylov
    The lit function works great if you have a string or int to pass in. Fails miserably with something like an Array/List. Any ideas on what to do there? – J Calbreath Jun 24 '15 at 17:46
  • That package also has a function called `array()` that you may be able to use to combine a bunch of literal columns -- I haven't tried it. It may not be too hard to create an analogous function for lists, especially if you look at the implementation of `array()` in [functions.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala) -- one doesn't seem to exist. – Spiro Michaylov Jun 24 '15 at 20:35
  • Now having tried using `array()`, I should point out that the corresponding UDF parameter needs to be of type `ArrayBuffer[T]` for some `T`. – Spiro Michaylov Jul 03 '15 at 06:34
  • 1
    **Spark 1.5.0 Note:** Passing `array()` now seems to result in a `WrappedArray` being passed into the UDF. This means you can make the UDF parameter type be something like `Seq` or `IndexedSeq`. – Spiro Michaylov Sep 20 '15 at 15:34
  • @SpiroMichaylov Do you know how to pass map? I am not able to pass map to udf. http://stackoverflow.com/questions/40598890/spark-udf-how-to-pass-map-as-a-parameter – nir Nov 15 '16 at 00:13
  • @nir Are you looking for a function that would construct a map from pre-existing columns, or do you want to simply pass a column that already contains a map? I don't think the former exists, but the latter should be more a matter of "how do you access it inside the UDF?" or "how do you declare the UDF?" rather than "how do you pass it?" You may want to post a new question showing what you're trying to achieve, and link to it here. – Spiro Michaylov Nov 18 '16 at 15:43
  • @SpiroMichaylov I think it's the former that I'm looking for. http://stackoverflow.com/questions/40598890/spark-udf-how-to-convert-map-to-column?noredirect=1#comment68442055_40598890. I have a temporary workaround: converting the map to an array first and then passing the array as a UDF parameter. I can't use a closure as I am running an interactive shell. – nir Nov 18 '16 at 18:43
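
Following up on the `array()` discussion in the comments above, here is a sketch of passing a list of literal values into a UDF. Per the Spark 1.5.0 note, the array column arrives in the UDF as a `WrappedArray`, so the parameter can be declared as `Seq[Int]` (the UDF name and the chosen periods are illustrative assumptions):

import org.apache.spark.sql.functions._

// Compute the segment start for each of several period lengths at once.
val toSegments = udf((timestamp: String, periods: Seq[Int]) => {
  val asLong = timestamp.toLong
  periods.map(p => asLong - asLong % p)
})

// array() combines the literal columns into a single array column:
df.select(toSegments($"timestamp", array(lit(3600000), lit(86400000))))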