
How do I flatten a simple (i.e. no nested structures) dataframe into a list? My problem is detecting all the node pairs that have been changed, added, or removed from a table of node pairs.

This means I have a "before" and an "after" table to compare. Combining the before and after dataframes yields rows that describe where a pair appears in one dataframe but not the other.

Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1  |after.id2  |
+-----------+-----------+-----------+-----------+
|       null|       null|         E2|         E3|
|         B3|         B1|       null|       null|
|         I1|         I2|       null|       null|
|         A2|         A3|       null|       null|
|       null|       null|         G3|         G4|
+-----------+-----------+-----------+-----------+

The goal is to get a list of all the (distinct) nodes in the entire dataframe, which would look like:

{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}

Potential approaches:

  • Union all the columns separately and distinct
  • flatMap and distinct
  • map and flatten

Since the structure is well known and simple, it seems like there should be an equally straightforward solution. Which of these approaches, or which other one, would be the simplest?

Other notes

  • Order of the id1-id2 pair is only important for change detection
  • Order in the resulting list is not important
  • DataFrame is between 10k and 100k rows
  • Distinct values in the resulting list are nice to have, but not required; I assume this is trivial with the distinct operation
joynoele
  • So why is there no timestamp then? Order of tuples is thus significant? Many files to process in 1 run? – thebluephantom Nov 02 '18 at 18:06
  • @thebluephantom timestamp is not needed since there is essentially a "before" and "after" table. Order of tuples is significant but somewhat out of scope since I have a dataframe of 4 columns with comparable values. No files to run, only this dataframe but it is considerable in size. – joynoele Nov 02 '18 at 18:16
  • Where I come from some form of timestamping is always req'd - many projects with that approach you just mentioned mean -> non-deterministic outcomes. Good luck though – thebluephantom Nov 02 '18 at 18:35
  • Timestamping in similar situations may reduce the need for even solving this scenario in the first place - but all the same I'd like to know how to flatten a simple dataframe like this. – joynoele Nov 02 '18 at 19:18

1 Answer


Try the following: convert each row into a Seq, collect all the rows to the driver, flatten them, and remove the null value:

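import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell
val df = Seq(("A", "B"), (null, "A")).toDF
// Collect every row to the driver, flatten the cells into one Set[Any], then drop null
val result = df.rdd.map(_.toSeq.toList)
  .collect().toList.flatten.toSet - null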
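
The snippet above pulls every row to the driver before flattening and deduplicating. A possible alternative, sketched here under assumptions, is the "union all the columns and distinct" approach from the question, which lets Spark drop nulls and duplicates before anything is collected. The object name FlattenNodes, the helper distinctNodes, and the underscore column names (stand-ins for before.id1 and so on, since dotted names would need backtick quoting) are all hypothetical and not from the original post.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}

object FlattenNodes {
  // Hypothetical helper: stack every column into a single "node" column,
  // drop nulls, deduplicate in Spark, then collect the (small) result.
  // Assumes every column is a string column and column names contain no dots.
  def distinctNodes(df: DataFrame): Set[String] =
    df.columns
      .map(c => df.select(df.col(c).as("node"))) // one single-column frame per column
      .reduce(_ union _)                         // stack them vertically
      .na.drop()                                 // remove null cells
      .distinct()
      .as[String](Encoders.STRING)
      .collect()
      .toSet

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-nodes")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question; column names are assumed.
    val df = Seq[(String, String, String, String)](
      (null, null, "E2", "E3"),
      ("B3", "B1", null, null),
      ("I1", "I2", null, null),
      ("A2", "A3", null, null),
      (null, null, "G3", "G4")
    ).toDF("before_id1", "before_id2", "after_id1", "after_id2")

    val nodes = distinctNodes(df)
    println(nodes.toSeq.sorted.mkString("{", ",", "}"))

    spark.stop()
  }
}
```

For 10k to 100k rows either version should be workable; the main difference is that this one returns a Set[String] rather than a Set[Any] and only ships the distinct values to the driver.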
Anurag Sharma