
Here is my dataframe:

[image: a dataframe with columns name, age and city]

I am looking for the right way to replace the city value based on the name, for example: `case name when 'Alice' then 'New York' when 'Alex' then 'LA' when 'Aaron' then 'Beijing' when ...`

Thank you.

mdivk
  • if it's a couple of rules, you could `map` a function to do this, otherwise you probably want to keep your mapping rules in another dataframe and do a `join` – maxymoo Jul 25 '16 at 04:25
  • Thanks. Any examples? or links? – mdivk Jul 25 '16 at 12:02
  • isn't there a way to update a record based on criteria? like in SQL: `update table set col1 = 'value2' where id = 1` – mdivk Jul 25 '16 at 14:09

1 Answer


Probably the simplest way would be to store your mappings in a second dataframe and join them together:

// the original data: one row, with the city to be replaced
val df = sqlContext.read.json(sc.parallelize(
  """{"name":"Alice", "age":21, "city":"Toronto"}""" :: Nil))

// the mapping rules: name -> replacement city
val mapping = sqlContext.read.json(sc.parallelize(
  """{"name":"Alice", "newcity":"New York"}""" :: Nil))

// join on name and select the new city
df.join(mapping, "name").select("name", "age", "newcity").collect
// Array[org.apache.spark.sql.Row] = Array([Alice,21,New York])
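
One caveat worth noting (my addition, not part of the original answer): the inner join drops any row whose name has no entry in the mapping. Assuming Spark 1.6+, a left join plus `coalesce` keeps the original city as a fallback:

import org.apache.spark.sql.functions.coalesce

// left join keeps unmapped names; coalesce falls back to the old city
df.join(mapping, Seq("name"), "left_outer")
  .select(df("name"), df("age"),
    coalesce(mapping("newcity"), df("city")).as("city"))
  .collect
// Array[org.apache.spark.sql.Row] = Array([Alice,21,New York])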

Another option is to just write SQL; you'll have to phrase your operation as a SELECT rather than an UPDATE. For a simple mapping you can use CASE, and if you have many mappings a JOIN is better:

df.registerTempTable("df")
sqlContext.sql("""select name,
                         age,
                         case when name = 'Alice' then 'New York' end as city
                  from df""").collect()
// Array[org.apache.spark.sql.Row] = Array([Alice,21,New York])
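
For completeness, the same CASE logic can be expressed with the DataFrame API's `when`/`otherwise` (a sketch along the same lines, not from the original answer, reusing the `df` defined above; names without a rule keep their existing city):

import org.apache.spark.sql.functions.when

df.select(df("name"), df("age"),
    when(df("name") === "Alice", "New York")
      .when(df("name") === "Alex", "LA")
      .otherwise(df("city")).as("city"))
  .collect
// Array[org.apache.spark.sql.Row] = Array([Alice,21,New York])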
maxymoo
  • Hi @maxymoo, how could this be done using indexes of your Spark DataFrame? For example, changing the 1st row of the 3rd column from 'Aaron' to 'Albert'? Something similar to Python's pandas: `df.iloc[1,3] = 'Albert'` or `df.loc[1,"city"] = 'Albert'`. Thanks in advance for your response. – NuValue May 17 '18 at 12:19
  • @NuValue spark dataframes don't have a concept of index; you could do something like https://stackoverflow.com/questions/43406887/spark-dataframe-how-to-add-a-index-column if you want this. But more generally you might perform value replacements with a function that you map over your entire dataframe rather than editing a single element. Or you could temporarily convert to a pandas dataframe and perform your substitution. – maxymoo May 18 '18 at 00:48
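
A minimal sketch of the approach linked in the comment above (an assumption of mine, not from the answer, and using the Spark 2.x name `monotonically_increasing_id`). The generated ids are unique and increasing but not guaranteed consecutive, so this only emulates a positional index:

import org.apache.spark.sql.functions.{monotonically_increasing_id, when}

// attach a surrogate index column, then use it as an update criterion
val indexed = df.withColumn("idx", monotonically_increasing_id())
indexed.withColumn("city",
  when(indexed("idx") === 0, "Albert").otherwise(indexed("city")))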