
I have a list of events, generated from a Spark DataFrame, that looks something like the one shown below. I am using Spark 2.2.0 with Scala.

import org.apache.spark.sql.functions._
import spark.implicits._  // assuming an active SparkSession named spark

val events = df.select($"event", hour($"time") as "hour", to_date($"time", "yyyy-MM-dd") as "day")

+------+----+----------+
| event|hour|       day|
+------+----+----------+
|event1|  18|2015-02-05|
|event1|  17|2015-02-19|
|event5|  18|2015-02-02|
|event5|  19|2015-02-02|
|event1|   1|2015-03-17|
|event1|   0|2015-02-03|
|event1|  20|2015-02-02|
|event1|  22|2015-02-02|
|event1|  23|2015-02-02|
|event1|  18|2015-02-09|
|event1|  19|2015-02-09|
|event1|  21|2015-02-09|
|event1|  21|2015-04-06|
|event1|  23|2015-02-09|
|event1|  20|2015-02-16|
|event2|  19|2015-02-12|
|event3|  18|2015-02-18|
|event1|  22|2015-02-16|
|event2|  17|2015-02-04|
|event1|  23|2015-02-16|
+------+----+----------+
only showing top 20 rows

I need to create hourly buckets and count how many events happen in each hour. So my approach was to create 24 buckets and count the events falling into each hour slot, as shown below.

import org.apache.spark.ml.feature.Bucketizer

val splits = (0 to 24).map(_ * 1.0).toArray
val bucketizer = new Bucketizer()
  .setInputCol("hour")
  .setOutputCol("bucket")
  .setSplits(splits)

val bucket = bucketizer.transform(events)

val result = bucket.groupBy($"day", $"bucket").agg(count($"event").as("count")).orderBy(asc("bucket"))

result.filter($"day" === "2015-05-21").orderBy(asc("bucket")).show()

The result of the above code is:

+----------+------+-----+
|       day|bucket|count|
+----------+------+-----+
|2015-05-21|   0.0|    1|
|2015-05-21|   2.0|    1|
|2015-05-21|  11.0|    1|
|2015-05-21|  17.0|    1|
|2015-05-21|  18.0|    4|
|2015-05-21|  19.0|    4|
|2015-05-21|  21.0|    1|
|2015-05-21|  22.0|    3|
|2015-05-21|  23.0|    1|
+----------+------+-----+

This is correct. However, what I was expecting as output is something like this:

+----------+------+-----+
|       day|bucket|count|
+----------+------+-----+
|2015-05-21|   0.0|    1|
|2015-05-21|   1.0|    0|
|2015-05-21|   2.0|    1|
|2015-05-21|   3.0|    0|
|2015-05-21|   4.0|    0|
|2015-05-21|   5.0|    0|
|     :    |   :  |  :  |
|2015-05-21|  11.0|    1|
|2015-05-21|  12.0|    0|
|2015-05-21|  13.0|    0|
|     :    |   :  |  :  |
|2015-05-21|  17.0|    1|
|2015-05-21|  18.0|    4|
|2015-05-21|  19.0|    4|
|2015-05-21|  20.0|    0|
|2015-05-21|  21.0|    1|
|2015-05-21|  22.0|    3|
|2015-05-21|  23.0|    1|
+----------+------+-----+

Basically, the bins (buckets) that have no events should be populated with 0. Any idea how to achieve that?

Thank you!

  • As of now, your hour column and bucket column should contain the same value, i.e. the bucketing step is unnecessary. – Shaido Sep 28 '17 at 07:21
  • @Shaido I was probably not clear enough in my request. Because events may not happen in a certain hour (e.g. above, between 1 and 2 there were no events), my dataframe will not contain the hour labeled 1. That's why I wanted to force the creation of all 24 hours and group by bucket. – geo Sep 28 '17 at 07:29

1 Answer


This is my current solution without using Bucketizer (I admit it's not very pretty):

val events = df.select($"event", hour($"time") as "hour", to_date($"time", "yyyy-MM-dd") as "day")

// One row per hour label; note (0 to 24) also yields an hour-24 row that can never occur
val left = (0 to 24).toDF.withColumnRenamed("value", "hour")
// Per-hour event counts for the chosen day
val right = events.filter($"day" === "2015-05-21").groupBy($"hour").agg(count($"event").as("count")).orderBy(asc("hour"))

// The left outer join keeps every hour; hours with no events get a null count, which na.fill replaces with 0
left.join(right, Seq("hour"), "left_outer").na.fill(0, Seq("count")).show()

This code returns the output shown below:

+----+-----+
|hour|count|
+----+-----+
|   0|    1|
|   1|    0|
|   2|    1|
|   3|    0|
|   4|    0|
|   5|    0|
|   6|    0|
|   7|    0|
|   8|    0|
|   9|    0|
|  10|    0|
|  11|    1|
|  12|    0|
|  13|    0|
|  14|    0|
|  15|    0|
|  16|    0|
|  17|    1|
|  18|    4|
|  19|    4|
|  20|    0|
|  21|    1|
|  22|    3|
|  23|    1|
|  24|    0|
+----+-----+

This was my expected result. If anyone can come up with a nicer solution, I will accept that answer.
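For reference, here is a sketch of one possibly nicer variant along the same lines: instead of filtering a single day, it builds the full day × hour grid with a cross join and then left-joins the actual counts, so every day gets all 24 zero-filled hours at once. The names allHours, days, grid, and counts are illustrative and not from the original code.

// Sketch only: generalize the left-join idea to every day at once
val allHours = (0 to 23).toDF("hour")          // exactly the 24 valid hour labels
val days     = events.select($"day").distinct  // every day present in the data

val grid   = days.crossJoin(allHours)          // every (day, hour) pair
val counts = events.groupBy($"day", $"hour").agg(count($"event").as("count"))

grid.join(counts, Seq("day", "hour"), "left_outer")
    .na.fill(0, Seq("count"))
    .orderBy($"day", $"hour")
    .show()

Compared with the per-day filter above, this matches the day-and-bucket shape asked for in the question directly, at the cost of a cross join whose size is only 24 × (number of distinct days).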

Thank you!
