PySpark: How to compare two dataframes

Question

I have two dataframes which I've loaded from two csv files. Examples:

old
+--------+---------+----------+
|HOTEL ID|GB       |US        |
+--------+---------+----------+
|   80341|     0.78|       0.7|
|  255836|      0.6|       0.6|
|  245281|     0.78|      0.99|
|  229166|      0.0|       0.7|
+--------+---------+----------+

new
+--------+---------+----------+
|HOTEL ID|GB       |US        |
+--------+---------+----------+
|   80341|     1   |       0.7|
|  255836|      0.6|       1  |
|  245281|     0.78|      0.99|
|  333   |      0.0|       0.7|
+--------+---------+----------+

and I would like to get:

expected result
+--------+---------+----------+
|HOTEL ID|GB       |US        |
+--------+---------+----------+
|   80341|     1   |      None|
|  255836|     None|       1  |
|  333   |      0.0|       0.7|
+--------+---------+----------+

I have been fiddling with the dataframe foreach method, but failing to get it to work. As a Spark newbie, I would be grateful for any clues.

Cheers!

Rafael

actually can get the last |333|0.0|0.7| row by using subtract(), still clueless about the cell by cell comparison though. — Rafael, Apr 25 '16 at 18:26

score -1 · Answer 1 · edited Apr 25 '16 at 20:53

Can you give more detail regarding the operation that you are running on old and new to get the expected result?

Are you also doing some arithmetic operation on GB and US columns between the old and new dataframes?

If not, a join seems like what you might be looking for. If the row order is not the same in the two dataframes, you would have to do a join first.

# rename the columns of new for convenience, so they do not clash with old
newDF = new.toDF('HOTEL ID', 'N_GB', 'N_US')
# do an inner join (look up SQL joins for the type of join you need)
old.join(newDF, 'HOTEL ID', 'inner')

This will give you a table with the schema

| HOTEL ID | GB | US | N_GB | N_US |
|----------|----|----|------|------|
| 80341    |0.78| 0.7|1     | 0.7  |
| 255836   |0.6 | 0.6|0.6   | 1    |
| 245281   |0.78|0.99|0.78  | 0.99 |

thanks for your reply, the operation(s) on old and new to get the expected result is what I am after :). I would like to keep the cell values in new, replacing the ones in old, and putting a null or empty values when the values in old and new are the same (for the same cell). The resulting dataframe should have the same columns as old and new. Cheers. — Rafael, Apr 26 '16 at 14:13
