
Suppose I have a list new_id_acc = [6,8,1,2,4] and a PySpark DataFrame like

id_acc  |  name  | 
  10    |  ABC   |
  20    |  XYZ   |
  21    |  KBC   |
  34    |  RAH   |
  19    |  SPD   |

I want to replace the id_acc column of the PySpark DataFrame with the values from new_id_acc. How can I achieve this? I found that lit() can be used for a constant value, but I couldn't find anything about how to do it for a list.
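For reference, here is what I tried with lit(); it only writes a single constant into every row (a minimal sketch on my DataFrame above), so it doesn't work for a list:

from pyspark.sql.functions import lit

# lit() broadcasts one constant, so every row gets the same value
df.withColumn('id_acc', lit(6)).show()   # all five rows end up with id_acc = 6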

After the replacement, I want my PySpark DataFrame to look like this:

id_acc  |  name  | 
   6    |  ABC   |
   8    |  XYZ   |
   1    |  KBC   |
   2    |  RAH   |
   4    |  SPD   |
  • What should be the logic for the replacement, or do you want [6,8,1,2,4] to be populated for all rows? Please post how your dataframe should look after the replacement. – Shantanu Sharma May 14 '19 at 10:40
  • I have added how my new dataframe should look; any help is really appreciated – RAHUL VISHWAKARMA May 14 '19 at 10:58
  • When your dataframe is distributed across multiple machines, there is no guarantee that the order of its rows will remain the same. Are you fine with the first item of the list replacing the first record of the dataframe (the order of records may change every time, as there is no sorting) and the second item replacing the second record, or do you have a logic to order/sort the dataframe records? – Shantanu Sharma May 14 '19 at 11:32

2 Answers


Probably a long answer, but it works.

# Note: this answer pairs 34 with 'ABC' (not 'RAH' as in the question), so 'name' has a duplicate
df = spark.sparkContext.parallelize([(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'ABC'),(19,'SPD')]).toDF(('id_acc', 'name'))
df.show()
+------+----+
|id_acc|name|
+------+----+
|    10| ABC|
|    20| XYZ|
|    21| KBC|
|    34| ABC|
|    19| SPD|
+------+----+
new_id_acc = [6,8,1,2,4]
# The names the new ids should attach to (note 'ABC' appears twice, matching df)
indx = ['ABC','XYZ','KBC','ABC','SPD']
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
myschema = StructType([StructField("indx", StringType(), True), StructField("new_id_ac", IntegerType(), True)])
df1 = spark.createDataFrame(zip(indx, new_id_acc), schema=myschema)
df1.show()
+----+---------+
|indx|new_id_ac|
+----+---------+
| ABC|        6|
| XYZ|        8|
| KBC|        1|
| ABC|        2|
| SPD|        4|
+----+---------+
dfnew = (df.join(df1, df.name == df1.indx, how='left')  # attach the new ids by name
           .drop(df1.indx)
           .select('new_id_ac', 'name')
           .sort('name')
           .dropDuplicates(['new_id_ac']))  # the duplicate 'ABC' fans the join out; keep one row per new id
dfnew.show()
+---------+----+
|new_id_ac|name|
+---------+----+
|        1| KBC|
|        6| ABC|
|        4| SPD|
|        8| XYZ|
|        2| ABC|
+---------+----+
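Worth noting: because 'ABC' appears twice on both sides of the join, the left join briefly fans out before dropDuplicates trims it back. A quick sanity check, assuming the df and df1 built above:

# Each of the two 'ABC' rows in df matches both 'ABC' entries in df1,
# so the join yields 2*2 + 3 = 7 rows before dropDuplicates reduces them
df.join(df1, df.name == df1.indx, how='left').count()   # 7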
Prathik Kini

The idea is to create a column of consecutive serial/row numbers and then use them to get the corresponding values from the list.

# Creating the requisite DataFrame
from pyspark.sql.functions import row_number,lit, udf
from pyspark.sql.window import Window
valuesCol = [(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'RAH'),(19,'SPD')]
df = spark.createDataFrame(valuesCol,['id_acc','name'])
df.show()
+------+----+ 
|id_acc|name| 
+------+----+ 
|    10| ABC| 
|    20| XYZ| 
|    21| KBC| 
|    34| RAH| 
|    19| SPD| 
+------+----+ 

You can create row/serial numbers as shown below.

Note that 'A' below is just a dummy value, as we don't need to order the values. We just want the row number.

w = Window().orderBy(lit('A'))
df = df.withColumn('serial_number', row_number().over(w))
df.show()
+------+----+-------------+ 
|id_acc|name|serial_number| 
+------+----+-------------+ 
|    10| ABC|            1| 
|    20| XYZ|            2| 
|    21| KBC|            3| 
|    34| RAH|            4| 
|    19| SPD|            5| 
+------+----+-------------+
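As noted in the comments under the question, orderBy(lit('A')) does not guarantee a deterministic row order across partitions. If you want to number rows in their current physical order instead, one possible alternative (a sketch, applied to the original df before the Window step) is zipWithIndex on the underlying RDD:

from pyspark.sql.types import StructField, LongType

# zipWithIndex numbers rows 0..n-1 in the RDD's current partition order;
# shift to 1-based so it matches row_number() above
indexed = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1] + 1,))
df = spark.createDataFrame(indexed, df.schema.add(StructField('serial_number', LongType(), False)))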

As a final step, we access the elements of the list provided by the OP using the row number. For this, we use a udf.

from pyspark.sql.types import IntegerType

new_id_acc = [6,8,1,2,4]
# serial_number is 1-based while list indices are 0-based, hence x-1;
# declare IntegerType so the replaced column stays an integer, not a string
mapping = udf(lambda x: new_id_acc[x-1], IntegerType())
df = df.withColumn('id_acc', mapping(df.serial_number)).drop('serial_number')
df.show()
+------+----+ 
|id_acc|name| 
+------+----+ 
|     6| ABC| 
|     8| XYZ| 
|     1| KBC| 
|     2| RAH| 
|     4| SPD| 
+------+----+
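As a side note, on Spark 2.4+ the udf can be avoided by materializing the list as an array column and indexing it with element_at (a sketch under that version assumption, starting from the DataFrame while serial_number is still attached):

from pyspark.sql.functions import array, lit, expr

# element_at is 1-based, so serial_number can index the array directly
df = df.withColumn('new_ids', array(*[lit(x) for x in new_id_acc])) \
       .withColumn('id_acc', expr('element_at(new_ids, serial_number)')) \
       .drop('serial_number', 'new_ids')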
cph_sto