
I have a DataFrame (df) in the following format:

df.show()
X1 | X2   | X3   | ... | Xn   | id_1 | id_2 | ... | id_23
1  | ok   | good | ... | john | null | null | ... | null
2  | rick | good | ... | ryan | null | null | ... | null
...

This DataFrame (df) has a lot of columns that I need to edit. I have two maps: m1 (Integer -> Integer) and m2 (Integer -> String).

For each row I need to take the value in column X1 and look it up in m1, which gives a number in the range [1, 23] (say 5), and also look it up in m2, which gives a column name (say X8). I then need to write the value of column X8 into column id_5. I have the following code, but I can't get it to work.

val dfEdited = df.map( (row) => {
  val mapValue = row.getAs("X1")
  row.getAs("id_"+m1.get(mapValue)) = row.getAs(m2.get(mapValue)
})
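
For reference, m1 and m2 might look something like this (these entries are just made-up examples; the real maps come from elsewhere in my code):

val m1: Map[Int, Int]    = Map(1 -> 5, 2 -> 7)       // X1 value -> index of the id_ column
val m2: Map[Int, String] = Map(1 -> "X8", 2 -> "X3") // X1 value -> name of the source column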

1 Answer


What you are doing in row.getAs("id_"+m1.get(mapValue)) = row.getAs(m2.get(mapValue) does not make sense.

First of all, you are assigning a value to the result of getAs("id_" + m1.get(mapValue)), which gives you an immutable value; a Row cannot be modified in place. Secondly, you are not using the getAs method correctly, since you need to specify the data type it returns.
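
For example, getAs is meant to be used with an explicit type parameter, and only for reading values (a quick sketch, assuming X1 holds integers):

// Reading a value with an explicit type parameter: fine.
val mapValue = row.getAs[Int]("X1")

// Trying to assign to it: not possible, Rows are immutable and getAs just returns a value.
// row.getAs("id_" + m1.get(mapValue)) = ...   // does not compile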

I am not sure whether I understood correctly what you want to do; I guess you are missing some details. Anyway, here is what I got, and it works fine.

Of course, I have commented each line of code so that you can easily understand it.

// First of all we need to create a case class to wrap the content of each row.
case class Schema(X1: Int, X2: String, X3: String, X4: String, id_1: Option[String], id_2: Option[String], id_3: Option[String])


val dfEdited = ds.map( row => {
  // We use the getInt method to get the value of a field which is expected to be Int
  val mapValue = row.getInt(row.fieldIndex("X1"))

  // fieldIndex gives you the position inside the row of the field you are looking for.
  // Regarding m1(mapValue), a NoSuchElementException will be thrown if mapValue is not a key of that Map.
  // You need to implement a mechanism to deal with that (for example, an if...else clause, or the getOrElse method).
  val indexToModify = row.fieldIndex("id_" + m1(mapValue)) 

  // We convert the row to a sequence, and pair each element with its index.
  // Then, with the map method, we generate a new sequence in which the element
  // at position indexToModify is replaced by the value of the column whose name
  // m2(mapValue) gives you. In addition, null values are converted to Option
  // (None), which is necessary for the pattern matching in the next step.
  val seq = row.toSeq.zipWithIndex.map { case (value, index) =>
    if (index == indexToModify) Option(row.getAs[String](m2(mapValue)))
    else if (value == null) None
    else value
  }


  // Finally, you have to create the Schema object by using pattern matching.
  seq match {
    case Seq(x1: Int, x2: String, x3: String, x4: String,
             id_1: Option[String], id_2: Option[String], id_3: Option[String]) =>
      Schema(x1, x2, x3, x4, id_1, id_2, id_3)
  }
})
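
For completeness, this is roughly how the ds object could be obtained from your original DataFrame. It is only a sketch: it assumes a SparkSession named spark, and that the column names and types of df match the simplified Schema case class above (X1 as Int, the rest as nullable Strings).

import org.apache.spark.sql.Dataset
import spark.implicits._

val ds: Dataset[Schema] = df.as[Schema]
// ds can now be passed through the map shown above to obtain dfEdited.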

Some comments:

  • The ds object is a Dataset. Datasets must have a structure, so you cannot modify the Row objects inside the map method and return them: Spark would not know the structure of the resulting dataset. For this reason, I am returning a case class object, since it provides a structure to the Dataset object.

  • Bear in mind that you might have problems with null values and missing keys. This code will throw exceptions if you do not establish mechanisms to deal with cases in which, for example, the value of X1 is not a key of m1 (see the sketch below).
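
For instance, a safe lookup could look something like this (a sketch with made-up map contents, independent of the schema above):

// Hypothetical m1; the real one comes from your own code.
val m1: Map[Int, Int] = Map(1 -> 5, 2 -> 7)

// Option-based lookup: None instead of an exception when the key is missing.
m1.get(3).map(i => "id_" + i)               // None
m1.get(1).map(i => "id_" + i)               // Some("id_5")

// Or with an explicit default:
val idColumn = "id_" + m1.getOrElse(3, 1)   // falls back to "id_1" if the key is absent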

Hope it works.

Álvaro Valencia