One hot encoding in RDD in scala

Question

I have user data from the MovieLens ml-100k dataset.

Sample rows:

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213

I read the data into an RDD as follows:

scala> val user_data =  sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29

scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician,    43537), Array(5, 33, F, other, 15213))


Then I indexed the distinct professions with zipWithIndex:
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31

scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7),  (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))

I want to one-hot encode the Occupation column.

Expected output:

 userId  Age  Gender  Occupation  Zipcode  technician  other  writer
 1       24   M       technician  85711    1           0      0
 2       53   F       other       94043    0           1      0
 3       23   M       writer      32067    0           0      1
 4       24   M       technician  43537    1           0      0
 5       33   F       other       15213    0           1      0

How can I achieve this on an RDD in Scala? I want to operate on the RDD directly, without converting it to a DataFrame.

Any help is appreciated.

Thanks

Before downvoting, please let the user post the complete question. An incomplete question was posted unintentionally, after which my internet connection dropped. – r4sn4, Dec 10 '16 at 06:09
Is there any reason you would not want to use Spark's built-in one-hot encoder? See http://stackoverflow.com/questions/31872396/how-to-encode-categorical-features-in-apache-spark or, in the Spark 2 DataFrame API, https://spark.apache.org/docs/2.0.2/ml-features.html#onehotencoder – GPI, Dec 12 '16 at 12:42

score 0 · Accepted Answer · answered Dec 12 '16 at 12:21

I did it in the following way:

1) Read the user data:

scala> val user_data =  sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))

2) Show 5 rows of the data:

scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician,    43537), Array(5, 33, F, other, 15213))

3) Create a map from each profession to its index:

scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()

scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)

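4) Create an encode function that one-hot encodes a profession:

scala> def encode(x: String) =
 |{
 | // One slot per distinct profession (21 for this dataset), all zeros
 | val encodeArray = Array.fill(indexed_profession.size)(0)
 | // Set the slot at this profession's index to 1
 | encodeArray(indexed_profession(x).toInt) = 1
 | encodeArray
 |}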

5) Apply the encode function to the user data:

scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}

6) Show the encoded data:

scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
(1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
(2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
(3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
(4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
(5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
(6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
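
For reference, here is a minimal sketch of the same indexing-and-encoding idea using plain Scala collections, so it can be tried without a Spark context. The names OneHotSketch, buildIndex and the hard-coded sample rows are illustrative assumptions, not part of the steps above. Unlike the REPL version, it sizes the array from the index and returns an all-zero array for an unseen category instead of throwing.

```scala
// Sketch only: one-hot encoding with plain Scala collections (no Spark needed).
object OneHotSketch {

  // Build a sorted, zero-based index of the distinct categories,
  // mirroring distinct().sortBy(...).zipWithIndex() on the RDD.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Encode one category as an array of 0s with a single 1.
  // A category missing from the index yields all zeros.
  def encode(index: Map[String, Int])(value: String): Array[Int] = {
    val encoded = Array.fill(index.size)(0)
    index.get(value).foreach(i => encoded(i) = 1)
    encoded
  }

  def main(args: Array[String]): Unit = {
    // Sample rows copied from the question.
    val rows = Seq(
      "1|24|M|technician|85711",
      "2|53|F|other|94043",
      "3|23|M|writer|32067"
    ).map(_.split('|'))

    val index = buildIndex(rows.map(_(3)))
    val encodeOccupation: String => Array[Int] = encode(index)

    // Append the one-hot columns to the original fields.
    rows.foreach { r =>
      println((r ++ encodeOccupation(r(3)).map(_.toString)).mkString(","))
    }
  }
}
```

On an RDD, the same encode function could be applied inside map after collecting the index with collectAsMap(); for a larger index, wrapping it in sc.broadcast avoids shipping the map with every task.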

Is there a better solution? If so, please post it. – r4sn4, Dec 12 '16 at 12:22

Ankita Mehta · Answer 2 · 2020-07-30T18:06:13.550

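[My solution is for a DataFrame.] The code below converts categorical columns to one-hot columns. You first have to create a map, catMap, whose keys are column names and whose values are lists of categories.

    import org.apache.spark.sql.functions.{lower, when}

    // catMap: column name -> categories. Categories must be lowercase,
    // because the column value is lowercased before the comparison.
    var OutputDf = df
    for (cat <- catMap.keys) {
      val categories = catMap(cat)
      for (oneHotVal <- categories) {
        // Add a 0/1 column named after the category.
        OutputDf = OutputDf.withColumn(oneHotVal,
          when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
      }
    }
    OutputDf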
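
To make the expected shape of catMap concrete, here is a small sketch that models the same per-category logic on plain Scala maps instead of a DataFrame. The object name CatMapSketch, the Row alias, and the example categories are assumptions for illustration; only the catMap idea comes from the answer.

```scala
// Sketch only: the catMap loop modeled on plain Scala maps (no Spark needed).
object CatMapSketch {
  // A row is modeled as column name -> value (an assumption for this sketch).
  type Row = Map[String, String]

  // Hypothetical catMap: column name -> categories to expand.
  // Categories are lowercase because values are lowercased before comparing.
  val catMap: Map[String, Seq[String]] =
    Map("occupation" -> Seq("technician", "other", "writer"))

  // For each category, add a column named after it holding "1" or "0".
  def oneHot(row: Row): Row =
    catMap.foldLeft(row) { case (acc, (col, categories)) =>
      categories.foldLeft(acc) { (r, category) =>
        val hit = r.get(col).exists(_.toLowerCase == category)
        r + (category -> (if (hit) "1" else "0"))
      }
    }
}
```

For example, CatMapSketch.oneHot(Map("userId" -> "1", "occupation" -> "Technician")) adds technician -> "1", other -> "0" and writer -> "0" to the row.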
