
I have an RDD containing data like this: (downloadId: String, date: LocalDate, downloadCount: Int). Each (downloadId, date) pair is unique, and the download-count is the count for that date.

What I've been trying to accomplish is to get the number of consecutive days (going backwards from the current date) that a download-id was in the top 100 of all download-ids. So if a given download was in the top 100 today, yesterday and the day before, its streak would be 3.

In SQL, I guess this could be solved using window functions. I've seen similar questions, like this one: How to add a running count to rows in a 'streak' of consecutive days.

(I'm rather new to Spark and wasn't sure how to map-reduce an RDD to even begin solving a problem like this.)

Some more information: the dates span the last 30 days and there are approximately 4M unique download-ids per day.

Mridang Agarwalla

2 Answers


Using an approach similar to the one in the linked PostgreSQL question, you can apply window functions in Spark as well. Spark's DataFrame API doesn't have an encoder for java.time.LocalDate, so you'll need to convert it to, say, java.sql.Date.

Here are the steps: first, transform the RDD into a DataFrame with a supported date type; next, create a UDF to compute a baseDate, which takes the date and a per-id chronological row number (generated with a window function) as parameters. A second window function then computes a row number per (id, baseDate), which is the wanted streak value:

import java.time.LocalDate
import org.apache.spark.sql.functions._  // row_number, udf

val rdd = sc.parallelize(Seq(
  (1, LocalDate.parse("2017-12-13"), 2),
  (1, LocalDate.parse("2017-12-16"), 1),
  (1, LocalDate.parse("2017-12-17"), 1),
  (1, LocalDate.parse("2017-12-18"), 2),
  (1, LocalDate.parse("2017-12-20"), 1),
  (1, LocalDate.parse("2017-12-21"), 3),
  (2, LocalDate.parse("2017-12-15"), 2),
  (2, LocalDate.parse("2017-12-16"), 1),
  (2, LocalDate.parse("2017-12-19"), 1),
  (2, LocalDate.parse("2017-12-20"), 1),
  (2, LocalDate.parse("2017-12-21"), 2),
  (2, LocalDate.parse("2017-12-23"), 1)
))

val df = rdd.map{ case (id, date, count) => (id, java.sql.Date.valueOf(date), count) }.
  toDF("downloadId", "date", "downloadCount")

def baseDate = udf( (d: java.sql.Date, n: Long) =>
  new java.sql.Date(d.getTime - n * 24L * 60 * 60 * 1000)  // shift the date back by n days
)

import org.apache.spark.sql.expressions.Window

val dfStreak = df.withColumn("rowNum", row_number.over(
    Window.partitionBy($"downloadId").orderBy($"date")
  )
).withColumn(
  "baseDate", baseDate($"date", $"rowNum")
).select(
  $"downloadId", $"date", $"downloadCount", row_number.over(
    Window.partitionBy($"downloadId", $"baseDate").orderBy($"date")
  ).as("streak")
).orderBy($"downloadId", $"date")

dfStreak.show
+----------+----------+-------------+------+
|downloadId|      date|downloadCount|streak|
+----------+----------+-------------+------+
|         1|2017-12-13|            2|     1|
|         1|2017-12-16|            1|     1|
|         1|2017-12-17|            1|     2|
|         1|2017-12-18|            2|     3|
|         1|2017-12-20|            1|     1|
|         1|2017-12-21|            3|     2|
|         2|2017-12-15|            2|     1|
|         2|2017-12-16|            1|     2|
|         2|2017-12-19|            1|     1|
|         2|2017-12-20|            1|     2|
|         2|2017-12-21|            2|     3|
|         2|2017-12-23|            1|     1|
+----------+----------+-------------+------+
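
To answer the original question directly (the streak as of the current date), a final step can keep just each id's row for that date; an id with no row on the current date has no active streak. A minimal sketch, taking "2017-12-23" as the stand-in for the current date:

val currentDate = java.sql.Date.valueOf("2017-12-23")  // assumed "current" date

// Each id's streak ending on the current date; ids absent that day drop out (streak 0)
val currentStreaks = dfStreak
  .filter($"date" === currentDate)
  .select($"downloadId", $"streak")

As a side note, the baseDate UDF could probably be replaced with Spark's built-in date arithmetic, e.g. expr("date_sub(date, rowNum)"), which avoids the manual millisecond math.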
Leo C

I suggest you work with DataFrames, as they are much easier to use than RDDs. Leo's answer is shorter, but I couldn't find where it filters for the top 100 downloads, so I decided to post my answer as well. It doesn't depend on window functions, but it is bounded by the number of days in the past you want to compute the streak over. Since you said you only use the last 30 days' data, that shouldn't be a problem.

As a first step, I wrote some code to generate a DF similar to what you described. You don't need to run this first block (if you do, reduce the number of rows unless you have a cluster to try it on; it's heavy on memory). You can see how to transform the RDD (theData) into a DF (baseData). You should define a schema for it, as I did.

import java.time.LocalDate
import scala.util.Random
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

val maxId = 10000
val numRows = 15000000
val lastDate = LocalDate.of(2017, 12, 31)

// Generates the data. As a convenience for working with Dataframes, I converted the dates to epoch days.

val theData = sc.parallelize(1.to(numRows).map{
  _ => {
    val id = Random.nextInt(maxId)
    val nDownloads = Random.nextInt((id / 1000 + 1))
    Row(id, lastDate.minusDays(Random.nextInt(30)).toEpochDay, nDownloads)
  }
})

// Working with DataFrames is much simpler, so I'll generate a DF named baseData from the RDD

val schema = StructType(
    StructField("downloadId", IntegerType, false) ::
    StructField("date", LongType, false) ::
    StructField("downloadCount", IntegerType, false) :: Nil)

val baseData = sparkSession.createDataFrame(theData, schema)
  .groupBy($"downloadId", $"date")
  .agg(sum($"downloadCount").as("downloadCount"))
  .cache()

Now you have the data you want in a DF called baseData. The next step is to restrict it to the top 100 for each day; you should discard the data you don't need before doing any additional heavy transformations.

def filterOnlyTopN(data: DataFrame, n: Int = 100): DataFrame = {
  // For each day in the data, let's find the cutoff # of downloads to make it into the top N
  val getTopNCutoff = udf((downloads: Seq[Long]) => {
    val reverseSortedDownloads = downloads.sortBy{- _ }
    if (reverseSortedDownloads.length >= n)
      reverseSortedDownloads.drop(n - 1).head
    else
      reverseSortedDownloads.last
  })

  val topNLimitsByDate = data.groupBy($"date").agg(collect_set($"downloadCount").as("downloads"))
          .select($"date", getTopNCutoff($"downloads").as("cutoff"))

  // And then, let's throw away the records below the top N cutoff
  data.join(topNLimitsByDate, Seq("date"))
    .filter($"downloadCount" >= $"cutoff")
    .drop("cutoff", "downloadCount")
}

val relevantData = filterOnlyTopN(baseData)
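
For comparison, the same per-day top-N restriction could also be written with a window rank instead of the cutoff UDF. This variant is just a sketch of mine (tie handling at the boundary may differ slightly from the cutoff approach):

import org.apache.spark.sql.expressions.Window

def filterOnlyTopNByRank(data: DataFrame, n: Int = 100): DataFrame = {
  // Rank each day's rows by download count, highest first, and keep the top n (boundary ties included)
  val byDayDesc = Window.partitionBy($"date").orderBy($"downloadCount".desc)
  data.withColumn("rnk", rank().over(byDayDesc))
    .filter($"rnk" <= n)
    .drop("rnk", "downloadCount")
}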

Now that you have the relevantData DF with only the data you need, you can calculate the streak for each id. I have left the ids with no streak as streak 0; you can filter those out with streaks.filter($"streak" > lit(0)).
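
To see the core counting trick in isolation before the full function, here it is on a plain Scala list; the dates are made-up epoch days, sorted descending, with the query date first:

// Hypothetical epoch days, descending; the first element is the date we query from
val dates = Seq(17523L, 17522L, 17521L, 17518L)

val streak = dates.sliding(2)                  // adjacent pairs of days
  .takeWhile { case Seq(a, b) => b == a - 1 }  // keep pairs that are consecutive
  .length + 1                                  // consecutive pairs + the starting day

// streak == 3: 17523, 17522, 17521 run consecutively; 17518 breaks the run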

def getStreak(df: DataFrame, fromDate: Long): DataFrame = {
  val calcStreak = udf((dateList: Seq[Long]) => {
    if (!dateList.contains(fromDate))
      0                                  // Not in the top N on the starting day: no streak
    else {
      val relevantDates = dateList.sortBy{ - _ }  // Order the dates descending
        .dropWhile(_ != fromDate)        // And drop everything before the starting day we are interested in
      if (relevantDates.length == 1)     // If only the starting day is left, it's a one-day streak
        1
      else                               // Otherwise, count the streak length from the consecutive pairs
        relevantDates.sliding(2)         // Take days by pairs
          .takeWhile{ twoDays => twoDays(1) == twoDays(0) - 1 }  // While the pair is of consecutive days
          .length + 1                    // The streak is the number of consecutive pairs + 1 (the starting day)
    }
  })
  df.groupBy($"downloadId")
    .agg(collect_list($"date").as("dates"))
    .select($"downloadId", calcStreak($"dates").as("streak"))
}
val streaks = getStreak(relevantData, lastDate.toEpochDay)
streaks.show()

+------------+--------+
| downloadId | streak |
+------------+--------+
|       8086 |      0 |
|       9852 |      0 |
|       7253 |      0 |
|       9376 |      0 |
|       7833 |      0 |
|       9465 |      1 |
|       7880 |      0 |
|       9900 |      1 |
|       7993 |      0 |
|       9427 |      1 |
|       8389 |      1 |
|       8638 |      1 |
|       8592 |      1 |
|       6397 |      0 |
|       7754 |      1 |
|       7982 |      0 |
|       7554 |      0 |
|       6357 |      1 |
|       7340 |      0 |
|       6336 |      0 |
+------------+--------+

And there you have the streaks DF with the data you need.

bobbruno