
I have 2 DataFrames:

Users (~29,000,000 entries)

|-- userId: string (nullable = true)

Impressions (~1000 entries)

|-- modules: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- content: array (nullable = true)
|    |    |    |-- element: string (containsNull = true)
|    |    |-- id: string (nullable = true)

I want to walk through all the Users and attach one Impression from these ~1000 entries to each User, cycling through the Impressions: every ~1000th User would get the same Impression, i.e. once the loop over the Impressions reaches the end it starts from the beginning and assigns the same ~1000 Impressions to the next ~1000 Users. In the end I want a DataFrame with the combined data. Reusing the Users DataFrame by adding the Impressions' columns would work as a result, as would a newly created one.
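To pin down the intended round-robin pairing, here is a sketch in plain Scala over tiny in-memory stand-ins (the names `u1`/`impA` etc. are made up for illustration; this is just the pairing logic, not the Spark solution):

```scala
// Hypothetical tiny stand-ins for the two datasets.
val users = Seq("u1", "u2", "u3", "u4", "u5", "u6", "u7")
val impressions = Seq("impA", "impB", "impC")

// The user at position i receives impression i % impressions.size,
// so the impressions repeat in a cycle over all the users.
val paired = users.zipWithIndex.map { case (u, i) =>
  (u, impressions(i % impressions.size))
}
// paired: Seq((u1,impA), (u2,impB), (u3,impC), (u4,impA), ...)
```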

Do you have any ideas what a good solution would be here?

esbej
    The reason why this is so difficult in spark is because you're trying to carry state outside of the operation you are trying to carry out and that breaks functional programming principles. Having said that, are you allowed to use RDDs? Because this can be solved using the zip function of the RDD api. – Lyuben Todorov Nov 13 '17 at 12:37
  • If there is no faster alternative, I could also go with RDD-s, yes. – esbej Nov 13 '17 at 12:43

2 Answers


What I would do is use the old trick of adding a monotonically increasing ID to both dataframes, then create a new column on your LARGER dataframe (Users) that contains the modulo of each row's ID and the size of the smaller dataframe.

This new column then provides a rolling matching key against the items in the Impressions dataframe.

This is a minimal (tested) example to give you the idea. It will obviously work the same way when you have ~1000 impressions to join against:

import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._  // for toDF and the $"..." column syntax

var users = Seq("user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8", "user9").toDF("users")
var impressions = Seq("a", "b", "c").toDF("impressions").withColumn("id", monotonically_increasing_id())

val cnt = impressions.count

users = users.withColumn("id", monotonically_increasing_id())
       .withColumn("mod", $"id" mod cnt)
       .join(impressions, $"mod" === impressions("id"))
       .drop("mod")

users.show


+-----+---+-----------+---+
|users| id|impressions| id|
+-----+---+-----------+---+
|user1|  0|          a|  0|
|user2|  1|          b|  1|
|user3|  2|          c|  2|
|user4|  3|          a|  0|
|user5|  4|          b|  1|
|user6|  5|          c|  2|
|user7|  6|          a|  0|
|user8|  7|          b|  1|
|user9|  8|          c|  2|
+-----+---+-----------+---+
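One caveat with this approach: monotonically_increasing_id guarantees IDs that are increasing and unique, but not consecutive (the partition ID is encoded in the upper bits), so on a large multi-partition DataFrame such as the ~29M-row Users the modulo will not cycle cleanly through 0..n-1. An untested sketch of a variant that builds truly consecutive indices via the RDD zipWithIndex API (withConsecutiveId is a hypothetical helper name; users and impressions are the DataFrames from the question):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: appends a truly consecutive 0-based Long index,
// using RDD.zipWithIndex (consecutive across partitions, unlike
// monotonically_increasing_id).
def withConsecutiveId(df: DataFrame, colName: String): DataFrame = {
  val rows = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ idx)
  }
  df.sparkSession.createDataFrame(
    rows, StructType(df.schema.fields :+ StructField(colName, LongType)))
}

val cnt = impressions.count  // number of impressions, e.g. ~1000

val joined = withConsecutiveId(users, "uid")
  .withColumn("mod", col("uid") % cnt)
  .join(withConsecutiveId(impressions, "iid"), col("mod") === col("iid"))
  .drop("mod", "uid", "iid")
```

zipWithIndex triggers an extra Spark job to compute partition offsets, so this costs one extra pass over the data compared to monotonically_increasing_id.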
Chondrops
  • fails on join: An exception occurred while running Spark App! org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (dedupeMode, geo, modules, targetedOpenTime, vars); at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219) at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218) at org.apache.spark.sql.Dataset.col(Dataset.scala:1073) at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059) – esbej Nov 13 '17 at 15:30
  • This was tested on Spark 2.2. What are you running? – Chondrops Nov 13 '17 at 15:44
  • My mistake. I had missed adding the id column to Impressions. This worked for me. Thanks! – esbej Nov 13 '17 at 15:50
  • I would add that you also want to drop the 'id' column in the end, not only the 'mod'. – esbej Nov 13 '17 at 15:58

Sketch of the idea:

  • Add a monotonically increasing id to both dataframes, Users and Impressions, via

    val indexedUsersDF = usersDf.withColumn("index", monotonicallyIncreasingId) 
    val indexedImpressionsDF = impressionsDf.withColumn("index", monotonicallyIncreasingId) 
    

    (see spark dataframe :how to add a index Column )

  • Determine the number of rows in Impressions via count and store it (count returns a Long), e.g.

    val numberOfImpressions = ... 
    
  • Apply a UDF to the index column in indexedUsersDF that computes the modulo into a separate column (e.g. moduloIndex)

    val moduloIndexedUsersDF = indexedUsersDF.select(...)
    
  • Join moduloIndexedUsersDF and indexedImpressionsDF on

    moduloIndexedUsersDF("moduloIndex")===indexedImpressionsDF("index")
    
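The last two steps can be spelled out roughly as follows (untested sketch; note that a plain column expression suffices for the modulo, so no UDF is strictly needed, and numberOfImpressions comes straight from count):

```scala
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

val indexedUsersDF       = usersDf.withColumn("index", monotonically_increasing_id())
val indexedImpressionsDF = impressionsDf.withColumn("index", monotonically_increasing_id())

// Number of impressions, brought back to the driver (count returns a Long).
val numberOfImpressions = impressionsDf.count

// Modulo as a plain column expression in place of a UDF.
val moduloIndexedUsersDF = indexedUsersDF
  .withColumn("moduloIndex", col("index") % numberOfImpressions)

// Join on the rolling key and drop the helper columns from the result.
val result = moduloIndexedUsersDF
  .join(indexedImpressionsDF,
        moduloIndexedUsersDF("moduloIndex") === indexedImpressionsDF("index"))
  .drop("index", "moduloIndex")
```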
Martin Senne
  • Could you help out with the details of the last 2 steps? I am new to Scala and Spark... – esbej Nov 13 '17 at 15:33