
I have a table containing the following two columns:

Device-Id    Account-Id
d1           a1   
d2           a1
d1           a2
d2           a3
d3           a4
d3           a5 
d4           a6
d1           a4

Device-Id is the unique Id of the device on which my app is installed, and Account-Id is the id of a user account. A user can have multiple devices and can create multiple accounts on the same device (e.g. device d1 has accounts a1, a2 and a4 set up).

I want to find the unique actual users (to be represented as a new column with some unique UUID in the generated table), and the transformation I'm looking for generates the following table:

Unique-User-Id    Devices-Used    Accounts-Used
uuid1             [d1, d2, d3]    [a1, a2, a3, a4, a5]   
uuid2             [d4]            [a6]

The idea behind the generated table above is that an actual user, uuid1, has account a1 set up on their devices d1 and d2, which essentially means that both of these devices belong to uuid1, and all other accounts set up on d1 and d2 also map to the same user uuid1. Similarly, d1 also has an account a4 which is also set up on d3, so d3 is also uuid1's device, and every account on it should get mapped to uuid1.
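
In graph terms, devices and accounts form a bipartite graph in which each actual user corresponds to one connected component. As a minimal plain-Python union-find sketch of this grouping logic (illustration only, using the sample rows above; not a scalable solution):

from collections import defaultdict

# the (Device-Id, Account-Id) rows from the table above
pairs = [("d1","a1"), ("d2","a1"), ("d1","a2"), ("d2","a3"),
         ("d3","a4"), ("d3","a5"), ("d4","a6"), ("d1","a4")]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# every row links a device to an account, so union them
for device, account in pairs:
    union(device, account)

# group all ids by their root: each group is one actual user
groups = defaultdict(set)
for node in parent:
    groups[find(node)].add(node)
print(list(groups.values()))
# e.g. [{'d1','d2','d3','a1','a2','a3','a4','a5'}, {'d4','a6'}]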

How can I achieve the above-mentioned transformation in SQL/Spark/GraphFrames (by Databricks), where the numbers of Device-Ids and Account-Ids can both be in the millions?

Aman Gill
  • What is the logic to find which device and account belong to which user? – pissall Nov 25 '19 at 19:44
  • What's your Spark version, 2.4+ or below? Each device can have only one owner (user_id); is this the logic? – jxc Nov 26 '19 at 03:32
  • @pissall The context is that one account (e.g. a1) will belong to only one user (but the converse isn't true; a user may have multiple accounts), and since this same account can be configured on multiple devices, all these devices and all the accounts set up on those devices will be mapped to a single user (note: a user != an account). The logic behind combining two devices is that they both share a common account set up on them. – Aman Gill Nov 26 '19 at 05:31
  • @jxc 2.4+; nope, an actual user will have many devices (Android, iOS, tablet, etc.) and will have many accounts registered on them. An account, however, will only point to one user, and I need to find the unique users. Example: Ben has two devices, d1 and d2, with a1, a2 and a2, a3 set up on them respectively. Now, since a1 will correspond to one unique user, let's call that uuid1, I need some transformation which figures out that since a1 was on d1, all accounts on d1 also map to uuid1, and since a2, being uuid1's account, is also there on d2, every account on d2 should also get mapped to uuid1. – Aman Gill Nov 26 '19 at 05:37
  • @AmanGill, so I was actually right: accounts on the same device should belong to the same user (or uuid), as you mentioned `all accounts on d1 also maps to uuid1`, and that's how you link d2 to d1. – jxc Nov 26 '19 at 12:29
  • @jxc d1 and d2 both share account a1; that's how they are linked. – Aman Gill Nov 26 '19 at 12:41

2 Answers


I'm not really proud of this solution because I think there is probably a more efficient one, but I will leave it here either way. Hope it helps.

import org.apache.spark.sql.functions._
import spark.implicits._  // for .toDF on a local Seq (pre-imported in spark-shell)

// flatten an array of arrays, then drop duplicates, as a single Column => Column
val flatten_distinct = (array_distinct _) compose (flatten _)

val df = Seq(
  ("d1","a1"),  
  ("d2","a1"),
  ("d1","a2"),
  ("d2","a3"),
  ("d3","a4"),
  ("d3","a5"),
  ("d4","a6")
).toDF("d_id","u_id")


// for each account, collect the list of devices it is set up on
val userDevices = df
  .groupBy("u_id")
  .agg(collect_list("d_id").alias("d_id_list"))

//+----+---------+
//|u_id|d_id_list|
//+----+---------+
//|  a5|     [d3]|
//|  a3|     [d2]|
//|  a4|     [d3]|
//|  a2|     [d1]|
//|  a1| [d1, d2]|
//|  a6|     [d4]|
//+----+---------+


// for each device, collect the list of accounts set up on it
val accountsByDevice = df
  .groupBy("d_id")
  .agg(collect_list("u_id").alias("u_id_list"))

//+----+---------+
//|d_id|u_id_list|
//+----+---------+
//|  d2| [a3, a1]|
//|  d3| [a4, a5]|
//|  d1| [a1, a2]|
//|  d4|     [a6]|
//+----+---------+


// join each account's device list with the accounts found on those devices,
// merge the account lists per device group, then explode back to one row
// per device, keeping the size of its merged account set
val ungroupedDf = userDevices
  .join(accountsByDevice, expr("array_contains(d_id_list,d_id)"))
  .groupBy("d_id_list")
  .agg(collect_set("u_id_list") as "set")
  .select(col("d_id_list") as "d_id", flatten_distinct(col("set")) as "u_id")
  .select(explode(col("d_id")) as "d_id", col("u_id"), size(col("u_id")) as "size")

//+----+------------+----+
//|d_id|        u_id|size|
//+----+------------+----+
//|  d2|    [a1, a3]|   2|
//|  d1|[a1, a3, a2]|   3|
//|  d2|[a1, a3, a2]|   3|
//|  d3|    [a4, a5]|   2|
//|  d1|    [a1, a2]|   2|
//|  d4|        [a6]|   1|
//+----+------------+----+


// for each device keep only the row with its largest account set, then
// group by account set to collect the devices belonging to one user and
// assign each group a unique id
val finalDf = ungroupedDf
  .join(ungroupedDf.groupBy("d_id").agg(max(col("size")) as "size"), Seq("size","d_id"))
  .groupBy("u_id")
  .agg(collect_set("d_id") as "d_id")
  .withColumn("unique_id", monotonically_increasing_id())

//+------------+--------+-------------+
//|        u_id|    d_id|    unique_id|
//+------------+--------+-------------+
//|[a1, a2, a3]|[d1, d2]|1228360646656|
//|    [a4, a5]|    [d3]|1297080123392|
//|        [a6]|    [d4]|1520418422784|
//+------------+--------+-------------+
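
Note that monotonically_increasing_id() produces a unique (but non-consecutive) 64-bit integer per row rather than an actual UUID; if a real UUID string is required, Spark's built-in SQL uuid() function (available since Spark 2.3) should work as a drop-in replacement, e.g. expr("uuid()") instead of monotonically_increasing_id().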
André Machado
  • I also thought of something very similar and I wasn't proud of it either. I'll wait for a few more days to see if something else comes up; if not, I'll mark this as a solution. – Aman Gill Nov 26 '19 at 05:40

You can try GraphFrame.connectedComponents, adding a prefix to all Device-Ids so that they can be split from the Account-Ids in the post-processing step:

from graphframes import GraphFrame
from pyspark.sql.functions import collect_set, expr

df = spark.createDataFrame([
         ("d1","a1"), ("d2","a1"), ("d1","a2"), ("d1","a4"),
         ("d2","a3"), ("d3","a4"), ("d3","a5"), ("d4","a6")  
], ["Device-Id","Account-Id"])

# set a checkpoint directory, required by GraphFrames' connectedComponents
spark.sparkContext.setCheckpointDir("/tmp/111")

# for testing purposes, use a small number of shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", 2)

# set up edges and vertices, adding an underscore as a prefix to each Device-Id
edges = df.withColumn('Device-Id', expr('concat("_", `Device-Id`)')).toDF('src', 'dst')
vertices = edges.selectExpr('src as id').distinct().union(edges.select('dst').distinct())

# set up the graph
g = GraphFrame(vertices, edges)

# compute the connected components and group resultset by component
# and collect corresponding ids using collect_set(id)
df1 = g.connectedComponents().groupby('component').agg(collect_set('id').alias('ids'))
df1.show(truncate=False)
+------------+-----------------------------------+
|component   |ids                                |
+------------+-----------------------------------+
|309237645312|[a6, _d4]                          |
|85899345920 |[_d1, a4, a1, _d3, a3, a5, a2, _d2]|
+------------+-----------------------------------+

# split the ids based on the prefix we predefined when creating edges.
df1.selectExpr(
      'transform(filter(ids, x -> left(x,1) = "_"), y -> substr(y,2)) AS `Devices-Used`'
    , 'filter(ids, x -> left(x,1) != "_") AS `Accounts-Used`'
    , 'component AS `Unique-User-Id`'
).show()
+------------+--------------------+--------------+
|Devices-Used|       Accounts-Used|Unique-User-Id|
+------------+--------------------+--------------+
|[d1, d3, d2]|[a4, a1, a3, a5, a2]|   85899345920|
|        [d4]|                [a6]|  309237645312|
+------------+--------------------+--------------+
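
The question asks for a UUID rather than the numeric component id; assuming Spark 2.3+ (which ships a nondeterministic SQL uuid() function), a generated UUID can be attached per component, e.g.:

# sketch: swap the numeric component id for a generated UUID string
# (uuid() is nondeterministic, so persist the result if it is reused)
df2 = df1.withColumn('Unique-User-Id', expr('uuid()')).drop('component')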

Edit: The above method is less efficient in that it creates a large list of edges/vertices, which is unnecessary; using a self-join to create the edges list should be a better choice (inspired by this post):

edges = df.alias('d1').join(df.alias('d2'), ["Account-Id"]) \
    .filter("d1.`Device-Id` > d2.`Device-Id`") \
    .toDF("account", "src", "dst")
+-------+---+---+
|account|src|dst|
+-------+---+---+
|     a1| d2| d1|
|     a4| d3| d1|
+-------+---+---+

vertices = df.selectExpr('`Device-Id` as id', "`Account-Id` as acct_id")
g = GraphFrame(vertices, edges)

df1 = g.connectedComponents() \
    .groupby('component') \
    .agg(
       collect_set('id').alias('Device-Ids'),
       collect_set('acct_id').alias('Account-Ids')
     )
+---------+------------+--------------------+
|component|  Device-Ids|         Account-Ids|
+---------+------------+--------------------+
|        0|[d1, d2, d3]|[a4, a1, a3, a5, a2]|
|        1|        [d4]|                [a6]|
+---------+------------+--------------------+
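
As a design note: the vertices DataFrame here deliberately keeps one row per (Device-Id, Account-Id) pair rather than deduplicating by id. After connectedComponents, each of those duplicate rows carries its acct_id into its component, which is what allows collect_set('acct_id') to recover all the accounts in the final aggregation; it also means devices that share no account with any other device (like d4) still surface as their own single-device component.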
jxc
  • This looks interesting, but in the last table you attached (where you included (d1, a4)), the first two rows should get consolidated into one, i.e. ((a1, a2, a3, a4, a5), (d1, d2, d3)) – Aman Gill Nov 26 '19 at 15:47
  • @AmanGill that's why I think your example is too simple, and I suggest you add a more general case `("d1", "a4")` as a side note to your existing example so people don't get confused. – jxc Nov 26 '19 at 16:43
  • @AmanGill, as a side-note, both answers have the same issue. I doubt you can get any useful answer if you stick to a simple example when people do not have a full view of the real problem. – jxc Nov 26 '19 at 17:50
  • Edited the question. Thanks for the suggestion. – Aman Gill Nov 27 '19 at 06:53