
I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). In df2, only the columns with updates are populated; the other columns are null. What I want to do is to update the rows in df1 with the corresponding rows from dataframe df2 in the following way:

  • if a column in df2 is null, it should not cause any changes in df1
  • if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
  • otherwise the value in column in df1 should get replaced with the value from df2

How can I best do it? Can it be done in a generic way, without listing all the columns but rather iterating over them? Can it be done using the dataframe API, or do I need to switch to RDDs?

(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)

Example

Let's say the schema is: id: Int, name: String, age: Int.

df1 is:

1,"Greg",18
2,"Kate",25
3,"Chris",30

df2 is:

1,"Gregory",null
2,~,26

The updated dataframe should look like this:

1,"Gregory",18
2,null,26
3,"Chris",30
Tom

3 Answers


You can also use case/when or coalesce with a full outer join to merge the two dataframes. See the link below for an explanation: Spark incremental loading overwrite old record
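
A sketch of that approach (my own rendering, with the question's "~" handling layered on top; the alias names cur/upd are assumptions):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val cur = df1.alias("cur")
val upd = df2.alias("upd")

// Full outer join on id, then per column: prefer the df2 value when present,
// treat "~" in string columns as an explicit null, and fall back to df1.
val merged = cur.join(upd, col("cur.id") === col("upd.id"), "full_outer")

val resultCols = coalesce(col("cur.id"), col("upd.id")).alias("id") +:
  df1.schema.filterNot(_.name == "id").map { f =>
    val c = f.name
    val picked =
      if (f.dataType == StringType)
        when(col(s"upd.$c") === "~", lit(null))
          .otherwise(coalesce(col(s"upd.$c"), col(s"cur.$c")))
      else
        coalesce(col(s"upd.$c"), col(s"cur.$c"))
    picked.alias(c)
  }

val updatedDF = merged.select(resultCols: _*)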

vikrant rana

I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).

val idsToEdits = df2.rdd.map { row =>
  // Keep only the columns that are populated in this edit row, and map each
  // column index to its new value, translating the "~" marker into null.
  val populated = row.schema.fieldNames
    .filterNot(colName => row.isNullAt(row.fieldIndex(colName)))
  (row(0),
   row.getValuesMap[AnyVal](populated)
      .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()

Broadcast that map and define an editRow function that updates a row.

val idsToEditsBr = sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row

// If there are edits recorded for this row's id, apply each (index, value)
// update to the row; otherwise return the row unchanged.
val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) {
        case (rowSeq, (idx, newValue)) => rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}

Finally, use that function on an RDD derived from df1 and convert the result back to a dataframe.

val updatedDF=spark.createDataFrame(df1.rdd.map(editRow),df1.schema)
Tom
  • The above solves the problem, but if anyone can show me something analogous but done only using dataframes or SQL, I will accept their answer. – Tom Dec 21 '17 at 17:07

It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf function or dataframe functions to perform your logic after joining.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// For string columns, delegate to doLogicUdf (which can handle the "~" marker);
// for every other type, take the df2 value only when it is not null.
val cols = df1.schema.filterNot(_.name == "id").map { x =>
    if (x.dataType == StringType) {
        doLogicUdf(col(x.name), col(x.name + "2"))
    } else {
        when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name))
    }
} :+ col("id")
// Rename df2's columns so the two sides can be told apart after the join.
val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)
// A left join keeps the df1 rows that have no matching edits.
df1.join(df2Renamed, col("id") === col("id2"), "left_outer").select(cols: _*)
ayplam
  • Interesting approach @ayplam. I don't mind iterating through all the columns and that is in fact what you do here with map. I just don't want to list all the columns explicitly, that's it. Your assumption that all column names from df2 are the same as from df1 is correct. I thought my statement "with the same schema" covered that. – Tom Dec 22 '17 at 11:40
  • I am a bit confused by the order of lines in your code snippet. Didn't you want the first line to come last? – Tom Dec 22 '17 at 11:43
  • Did you test that code? I suppose the underlying function of doLogicUdf should accept 2 arguments of type Any or AnyVal at least as different columns can be of different types. I don't think that arguments of type Any worked for me with UDFs when I tried that before... I'd be grateful if you could test it on columns of different types e.g. on something as simple as the example I provided to confirm your approach is valid. If it is, I will accept it as it would be the most elegant solution. – Tom Dec 22 '17 at 11:45
  • I didn't get a chance to test the code, but it should get you pretty close to your desired solution. The first line just pairs up columns from df1 and df2 within doLogicUdf, so as long as it comes before the ".select" it doesn't matter, as nothing is executed. I also haven't been able to do AnyVal UDFs. I've edited it slightly, since the "tilde" logic automatically makes the column StringType, so all other columns only use the rules from bullets (1) + (3). – ayplam Dec 22 '17 at 18:53