Spark: Flatten simple multi-column DataFrame

Question

How can I flatten a simple DataFrame (i.e. one with no nested structures) into a list? My problem is detecting all the node pairs that have been changed, added, or removed in a table of node pairs.

This means I have a "before" and an "after" table to compare. Combining the before and after DataFrames yields rows that describe where a pair appears in one DataFrame but not the other.

Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1  |after.id2  |
+-----------+-----------+-----------+-----------+
|       null|       null|         E2|         E3|
|         B3|         B1|       null|       null|
|         I1|         I2|       null|       null|
|         A2|         A3|       null|       null|
|       null|       null|         G3|         G4|
+-----------+-----------+-----------+-----------+
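
For context, a frame of this shape could plausibly come from a full outer join of the two tables, keeping only rows where one side is missing. This is only a sketch under assumptions: the question does not show how the frames were combined, the name `changedPairs` and the sample rows are hypothetical, and `id1` is assumed never to be null in the input tables. The joined columns here are simply named `id1`, `id2`, `id1`, `id2`; the `before.id1`-style headers above are not reproduced.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DiffPairs {
  // Full outer join on both ids; rows with a null on either side are
  // pairs that exist in only one of the two tables.
  // Assumption: id1 is never null in the input tables.
  def changedPairs(before: DataFrame, after: DataFrame): DataFrame = {
    val b = before.as("before")
    val a = after.as("after")
    b.join(
        a,
        col("before.id1") === col("after.id1") && col("before.id2") === col("after.id2"),
        "full_outer")
      .where(col("before.id1").isNull || col("after.id1").isNull)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("diff-pairs").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, not taken from the question.
    val before = Seq(("A1", "A2"), ("B3", "B1")).toDF("id1", "id2")
    val after  = Seq(("A1", "A2"), ("E2", "E3")).toDF("id1", "id2")

    changedPairs(before, after).show()
    spark.stop()
  }
}
```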

The goal is to get a list of all the (distinct) nodes in the entire DataFrame, which would look like:

{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}

Potential approaches:

Union all the columns separately and distinct
flatMap and distinct
map and flatten
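
The first option in the list above might be sketched roughly as follows, staying in the DataFrame API. This assumes every column is a string column; `distinctNodes` and the sample rows are made up for illustration. Column names are wrapped in backticks because names containing dots would otherwise be parsed as struct field access.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object UnionColumns {
  // Select each column on its own as "node", union the single-column
  // frames, then drop nulls and duplicates.
  def distinctNodes(df: DataFrame): DataFrame =
    df.columns
      .map(c => df.select(col(s"`$c`").as("node"))) // backticks: names may contain dots
      .reduce(_ union _)
      .na.drop()
      .distinct()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("union-columns").getOrCreate()
    import spark.implicits._

    // Two rows borrowed from the example table in the question.
    val df = Seq[(String, String, String, String)](
      (null, null, "E2", "E3"),
      ("B3", "B1", null, null)
    ).toDF("before.id1", "before.id2", "after.id1", "after.id2")

    val nodes: Set[String] = distinctNodes(df).as[String].collect().toSet
    println(nodes)
    spark.stop()
  }
}
```

The result stays a DataFrame until the final `collect()`, so the deduplication happens in Spark rather than on the driver.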

Since the structure is well known and simple, it seems like there should be an equally straightforward solution. Which of these approaches, or which other one, would be the simplest?

Other notes

Order of the id1-id2 pair is only important for change detection
Order in the resulting list is not important
DataFrame is between 10k and 100k rows
Distinct values in the resulting list are nice to have but not required; I assume this is trivial with the distinct operation

So why is there no timestamp then? Order of tuples is thus significant? Many files to process in 1 run? — thebluephantom, Nov 02 '18 at 18:06
@thebluephantom timestamp is not needed since there is essentially a "before" and "after" table. Order of tuples is significant but somewhat out of scope since I have a dataframe of 4 columns with comparable values. No files to run, only this dataframe but it is considerable in size. — joynoele, Nov 02 '18 at 18:16
Where I come from some form of timestamping is always req'd - many projects with that approach you just mentioned mean -> non-deterministic outcomes. Good luck though — thebluephantom, Nov 02 '18 at 18:35
Timestamping in similar situations may reduce the need for even solving this scenario in the first place - but all the same I'd like to know how to flatten a simple dataframe like this. — joynoele, Nov 02 '18 at 19:18

score 1 · Accepted Answer · answered Nov 03 '18 at 09:55

Try the following: convert each row into a Seq, collect all rows to the driver, then flatten the data and remove the null value:

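import spark.implicits._  // needed for toDF; assumes a SparkSession named spark, as in spark-shell
val df = Seq(("A","B"),(null,"A")).toDF
// Row -> List, collect to the driver, flatten, dedupe via Set, drop null
val result = df.rdd.map(_.toSeq.toList)
   .collect().toList.flatten.toSet - null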
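
If the frame were much larger than the 10k to 100k rows mentioned in the question, it might be preferable to deduplicate on the cluster and only collect the distinct values. A minimal sketch of that variant, using the flatMap-and-distinct option from the question and assuming all cells are strings (`distinctNodes` is a hypothetical helper name):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object FlatMapNodes {
  // flatMap each Row into its non-null string cells, dedupe in Spark,
  // then bring only the distinct values back to the driver.
  def distinctNodes(df: DataFrame): Set[String] = {
    val spark = df.sparkSession
    import spark.implicits._ // provides the Encoder[String] that flatMap needs
    df.flatMap((row: Row) => row.toSeq.collect { case s: String => s }) // a null cell never matches `s: String`
      .distinct()
      .collect()
      .toSet
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("flatmap-nodes").getOrCreate()
    import spark.implicits._

    val df = Seq(("A", "B"), (null, "A")).toDF
    println(distinctNodes(df))
    spark.stop()
  }
}
```

Unlike the snippet above, this yields a `Set[String]` rather than a `Set[Any]`, because the pattern match keeps only string cells.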

Anurag Sharma