I have two dataframes which I've loaded from two csv files. Examples:

old
+--------+---------+----------+
|HOTEL ID|GB       |US        |
+--------+---------+----------+
|   80341|     0.78|       0.7|
|  255836|      0.6|       0.6|
|  245281|     0.78|      0.99|
|  229166|      0.0|       0.7|
+--------+---------+----------+

new
+--------+---------+----------+
|HOTEL ID|GB       |US        |
+--------+---------+----------+
|   80341|     1   |       0.7|
|  255836|      0.6|       1  |
|  245281|     0.78|      0.99|
|  333   |      0.0|       0.7|
+--------+---------+----------+

and I would like to get:

expected result
+--------+---------+----------+
|HOTEL ID|GB       |US        |
+--------+---------+----------+
|   80341|     1   |      None|
|  255836|     None|       1  |
|  333   |      0.0|       0.7|
+--------+---------+----------+

I have been fiddling with the DataFrame foreach method but can't get it to work. As a Spark newbie, I would be grateful for any clues.

Cheers!

Rafael

  • Actually, I can get the last |333|0.0|0.7| row by using subtract(), but I'm still clueless about the cell-by-cell comparison. – Rafael Apr 25 '16 at 18:26

1 Answer

Can you give more detail regarding the operation that you are running on old and new to get the expected result?

Are you also doing some arithmetic operation on GB and US columns between the old and new dataframes?

If not, a join seems like what you are looking for. If the row order is not the same in the two dataframes, you would have to do a join first:

# Rename the value columns of new so they do not clash with old's columns after the join
newDF = new.toDF('HOTEL ID', 'N_GB', 'N_US')
# Inner join on the key column (look up SQL joins for the type of join you need)
old.join(newDF, 'HOTEL ID', 'inner')

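This will give you a table with the following schema (rows filled in from the sample data above; row order is not guaranteed):

| HOTEL ID | GB   | US   | N_GB | N_US |
|----------|------|------|------|------|
| 80341    | 0.78 | 0.7  | 1    | 0.7  |
| 255836   | 0.6  | 0.6  | 0.6  | 1    |
| 245281   | 0.78 | 0.99 | 0.78 | 0.99 |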
MaxU - stand with Ukraine
  • Thanks for your reply. The operation(s) on old and new that produce the expected result are exactly what I am after :). I would like to keep the cell values from new, replacing the ones in old, and put a null or empty value wherever old and new hold the same value (for the same cell). The resulting dataframe should have the same columns as old and new. Cheers. – Rafael Apr 26 '16 at 14:13
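
To pin down the rule described in this comment, here is a minimal plain-Python sketch run on the sample rows from the question. It is not Spark code, and the function name `diff_cells` and the dict-of-dicts layout are assumptions made only for illustration. It also assumes, based on the expected result in the question, that rows present only in new are kept whole and that rows with no changed cell are dropped. In Spark, the same rule would roughly correspond to a left join from new onto old on HOTEL ID followed by a per-column conditional (for example `when(...).otherwise(...)` from `pyspark.sql.functions`); that translation is not shown here.

```python
# Plain-Python sketch of the cell-by-cell comparison rule (not Spark code).
# Rows are keyed by HOTEL ID; the values are copied from the question.
old = {
    80341: {"GB": 0.78, "US": 0.7},
    255836: {"GB": 0.6, "US": 0.6},
    245281: {"GB": 0.78, "US": 0.99},
    229166: {"GB": 0.0, "US": 0.7},
}
new = {
    80341: {"GB": 1.0, "US": 0.7},
    255836: {"GB": 0.6, "US": 1.0},
    245281: {"GB": 0.78, "US": 0.99},
    333: {"GB": 0.0, "US": 0.7},
}


def diff_cells(old_rows, new_rows):
    """Hypothetical helper: keep changed cells from new, None elsewhere."""
    result = {}
    for hotel_id, new_row in new_rows.items():
        old_row = old_rows.get(hotel_id)
        if old_row is None:
            # Hotel only exists in new: keep the whole row unchanged.
            result[hotel_id] = dict(new_row)
            continue
        # Keep the new value where it differs from old, otherwise None.
        changed = {
            col: (val if val != old_row.get(col) else None)
            for col, val in new_row.items()
        }
        # Drop rows in which no cell changed (assumed from the expected result).
        if any(val is not None for val in changed.values()):
            result[hotel_id] = changed
    return result


result = diff_cells(old, new)
for hotel_id, row in result.items():
    print(hotel_id, row)
```

Hotels that appear only in old (229166 here) are ignored, because the loop iterates over new; that matches the expected result table, which does not contain that row.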