All,

I am processing vendor data files and adding a few additional fields (enrichment). The requirement is that I always maintain the original order of the records in the files.

To achieve this, I am adding a sequence ID using monotonically_increasing_id(). How can I ensure this operation is executed with a single partition (partitions = 1), so that the IDs are consecutive and never repeated? I am open to alternate suggestions.

    val srcDF = spark.read.textFile(PATH).withColumn("idCol", monotonically_increasing_id())
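
A minimal sketch of the single-partition approach the question describes (not from the original post): coalescing to one partition before adding the column makes monotonically_increasing_id produce consecutive IDs 0, 1, 2, ..., at the cost of losing all parallelism for the read.

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // sketch: with exactly one partition the IDs are consecutive 0..n-1,
    // but processing is fully serialized
    val srcDF = spark.read.textFile(PATH)
      .coalesce(1)
      .withColumn("idCol", monotonically_increasing_id())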

1 Answer


Instead of monotonically_increasing_id, use the row_number window function.

  • Use spark.read.csv if you are reading a delimited file.

Example:

//sample data

$cat t1.txt
NAME|AGE|COUNTRY
d|18|USA
a|18|USA
b|20|Germany
c|23|USA

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val w = Window.orderBy("NAME")

spark.
read.
option("header",true).
option("delimiter","|").
csv("t1.txt").
withColumn("idCol",row_number().over(w)).
show()

//+----+---+-------+-----+
//|NAME|AGE|COUNTRY|idCol|
//+----+---+-------+-----+
//|   a| 18|    USA|    1|
//|   b| 20|Germany|    2|
//|   c| 23|    USA|    3|
//|   d| 18|    USA|    4|
//+----+---+-------+-----+

We order by the NAME column, so row_number assigns idCol to every row without repetition. Note that this numbers (and returns) the rows in NAME order rather than the original file order, and a window without a partitionBy clause moves all rows into a single partition.

In addition, if there is no column to order by (for example, when you want to keep the rows in their original order), order the window by a literal instead. Spark will warn that no partition is defined for the window operation and move all data into a single partition, which yields one consecutive sequence but can degrade performance on large inputs:

val w = Window.orderBy(lit("1"))

spark.
read.
option("header",true).
option("delimiter","|").
csv("t1.txt").
withColumn("idCol",row_number().over(w)).
show()

//+----+---+-------+-----+
//|NAME|AGE|COUNTRY|idCol|
//+----+---+-------+-----+
//|   d| 18|    USA|    1|
//|   a| 18|    USA|    2|
//|   b| 20|Germany|    3|
//|   c| 23|    USA|    4|
//+----+---+-------+-----+
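
Since the question mentions being open to alternatives: if the goal is to keep the exact input order without pulling everything through a single-partition window, one option worth sketching (not part of the original answer) is RDD zipWithIndex, which assigns consecutive indices in partition order while keeping the data distributed:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

val df = spark.
read.
option("header",true).
option("delimiter","|").
csv("t1.txt")

// zipWithIndex numbers rows 0..n-1 across partitions; it runs one extra
// job to compute per-partition offsets but does not coalesce the data
val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}

val schema = df.schema.add(StructField("idCol", LongType, nullable = false))
spark.createDataFrame(indexed, schema).show()

For a single input file the RDD row order generally follows the file's byte order, but that is an assumption to verify for your input format.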