Instead of monotonically_increasing_id, use the row_number window function: monotonically_increasing_id only guarantees ids that are unique and monotonically increasing, not consecutive (a contrast sketch follows the example below).
- Use spark.read.csv if you are reading a delimited file.
Example:
//sample data
$cat t1.txt
NAME|AGE|COUNTRY
d|18|USA
a|18|USA
b|20|Germany
c|23|USA
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("NAME")
spark.
read.
option("header",true).
option("delimiter","|").
csv("t1.txt").
withColumn("idCol",row_number().over(w)).
show()
//+----+---+-------+-----+
//|NAME|AGE|COUNTRY|idCol|
//+----+---+-------+-----+
//| a| 18| USA| 1|
//| b| 20|Germany| 2|
//| c| 23| USA| 3|
//| d| 18| USA| 4|
//+----+---+-------+-----+
We order by the NAME column, so idCol is assigned sequentially to every row, with no gaps and no repetition.
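For contrast, here is a minimal sketch of what monotonically_increasing_id produces on the same file; the repartition(2) call is only there to force more than one partition, and the exact ids depend on how the rows end up distributed:

import org.apache.spark.sql.functions.monotonically_increasing_id

spark.
read.
option("header",true).
option("delimiter","|").
csv("t1.txt").
repartition(2). //force multiple partitions to make the gaps visible
withColumn("idCol",monotonically_increasing_id()).
show()
//ids are unique and increasing within each partition, but jump between
//partitions, e.g. 0,1 in one partition and 8589934592,8589934593 in the next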
If there is no natural column to order by, order by a literal instead; the ids then simply follow the order in which the rows are read:

import org.apache.spark.sql.functions.lit

val w = Window.orderBy(lit("1"))
spark.
read.
option("header",true).
option("delimiter","|").
csv("t1.txt").
withColumn("idCol",row_number().over(w)).
show()
//+----+---+-------+-----+
//|NAME|AGE|COUNTRY|idCol|
//+----+---+-------+-----+
//| d| 18| USA| 1|
//| a| 18| USA| 2|
//| b| 20|Germany| 3|
//| c| 23| USA| 4|
//+----+---+-------+-----+
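One caveat: a window with orderBy but no partitionBy moves every row into a single partition, so Spark logs a WindowExec warning about the performance degradation this can cause. That is harmless for a small file like t1.txt, but for large data a rough alternative is to zip the underlying RDD with its index, a sketch assuming the round trip through the RDD API is acceptable:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

val df = spark.
read.
option("header",true).
option("delimiter","|").
csv("t1.txt")

//append a sequential id without collapsing the data to one partition;
//note zipWithIndex starts at 0, whereas row_number starts at 1
val withId = spark.createDataFrame(
  df.rdd.zipWithIndex.map{ case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  df.schema.add(StructField("idCol", LongType, nullable = false))
)
withId.show()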