
I have user data from the MovieLens ml-100k dataset.

Sample rows are -

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213

I read the data into an RDD as follows -

scala> val user_data =  sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29

scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))


# Encode the distinct professions with zipWithIndex -
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31

scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))

I want to one-hot encode the Occupation column.

Expected output is -

 userId   Age  Gender  Occupation   Zipcodes technician  other  writer 
 1        24    M      technician   85711      1           0     0
 2        53    F      other        94043      0           1     0
 3        23    M      writer       32067      0           0     1
 4        24    M      technician   43537      1           0     0
 5        33    F      other        15213      0           1     0

How can I achieve this on an RDD in Scala? I want to operate on the RDD directly, without converting it to a DataFrame.

Any help would be appreciated.

Thanks

r4sn4
  • Before down voting please let the user post complete question. Incomplete question was posted unintentionally after which internet got disconnected. – r4sn4 Dec 10 '16 at 06:09
  • Any reason why you would not want to use Spark's default one-hot encoder? See: http://stackoverflow.com/questions/31872396/how-to-encode-categorical-features-in-apache-spark or, in the Spark 2 DataFrame API: https://spark.apache.org/docs/2.0.2/ml-features.html#onehotencoder – GPI Dec 12 '16 at 12:42
  • Somehow I skipped this thread... Will try this approach – r4sn4 Dec 12 '16 at 17:18

2 Answers


I did this in the following way -

1) Read user data -

scala> val user_data =  sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))

2) show 5 rows of data-

scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))

3) Create map of profession by indexing-

scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()

scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)

4) create encode function which does one hot encoding of profession

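scala> def encode(x: String) =
 |{
 | // one slot per distinct profession (21 in this dataset)
 | val encodeArray = Array.fill(21)(0)
 | // set the slot at this profession's index from the map built in step 3
 | encodeArray(indexed_profession(x).toInt) = 1
 | encodeArray
 |}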
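The function above hard-codes the array length and throws if a profession is missing from the map. A minimal sketch of a more defensive variant follows, written in plain Scala so it runs without Spark. The names `OneHotSketch`, `buildIndex`, and the two-argument `encode` are hypothetical and not part of the original answer; the map type matches what `collectAsMap()` returns in step 3.

```scala
object OneHotSketch {
  // Hypothetical helper: category -> index over the distinct, sorted values,
  // mirroring distinct().sortBy(...).zipWithIndex() from step 3.
  def buildIndex(categories: Seq[String]): Map[String, Long] =
    categories.distinct.sorted.zipWithIndex.map { case (c, i) => c -> i.toLong }.toMap

  // One-hot encode a single value. The vector length comes from the map size,
  // and a value that is not in the map yields an all-zero vector.
  def encode(index: scala.collection.Map[String, Long], value: String): Array[Int] = {
    val encoded = Array.fill(index.size)(0)
    index.get(value).foreach { i => encoded(i.toInt) = 1 }
    encoded
  }

  def main(args: Array[String]): Unit = {
    val index = buildIndex(Seq("technician", "other", "writer", "technician", "other"))
    println(encode(index, "writer").mkString(","))  // prints 0,0,1
  }
}
```

On the RDD this could presumably be applied as `user_data.map(x => OneHotSketch.encode(indexed_profession, x(3)))`; for a large map, broadcasting it first may be worth considering.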

5) Apply encode function to user data -

scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}

6) show encoded data -

scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
(1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
(2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
(3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
(4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
(5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
(6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
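The question asked for one column per occupation rather than a nested array. A minimal sketch of flattening each record into that layout is below, again in plain Scala so it runs without Spark. `FlattenSketch`, `header`, and `flatten` are hypothetical names, and the three-entry map in `main` is sample data standing in for the full 21-entry map from step 3.

```scala
object FlattenSketch {
  // Header: the five original fields, then one column per occupation,
  // ordered by the occupation's index in the map.
  def header(index: scala.collection.Map[String, Long]): List[String] =
    List("userId", "age", "gender", "occupation", "zipcode") ++
      index.toList.sortBy(_._2).map(_._1)

  // Row: the original fields followed by one "0"/"1" flag per occupation.
  // fields(3) is the occupation, as in the u.user layout shown earlier.
  def flatten(fields: Array[String], index: scala.collection.Map[String, Long]): List[String] = {
    val flags = Array.fill(index.size)("0")
    index.get(fields(3)).foreach { i => flags(i.toInt) = "1" }
    fields.toList ++ flags.toList
  }

  def main(args: Array[String]): Unit = {
    // Sample index for illustration only
    val index = Map("other" -> 0L, "technician" -> 1L, "writer" -> 2L)
    println(header(index).mkString("\t"))
    println(flatten(Array("1", "24", "M", "technician", "85711"), index).mkString("\t"))
  }
}
```

Applied to the RDD, this would presumably look like `user_data.map(x => FlattenSketch.flatten(x, indexed_profession))`, giving an `RDD[List[String]]` whose columns line up with `header(indexed_profession)`.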
r4sn4

[My solution is for a DataFrame.] The code below converts categorical columns to one-hot columns. You need to create a map, catMap, whose keys are column names and whose values are the lists of categories for each column.

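    import org.apache.spark.sql.functions.{lower, when}

    // Add one 0/1 column per category value of each categorical column
    var OutputDf = df
    for (cat <- catMap.keys) {
      val categories = catMap(cat)
      for (oneHotVal <- categories) {
        OutputDf = OutputDf.withColumn(oneHotVal,
          when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
      }
    }
    OutputDf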
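The loop assumes `catMap` already exists. A minimal sketch of how it might be built is below, in plain Scala. `CatMapSketch` and `categoriesOf` are hypothetical names, and the `"occupation"` key is an assumed column name. The values are lower-cased because the loop compares each category against `lower(column)`, so a mixed-case category would never match.

```scala
object CatMapSketch {
  // Distinct category values for one column, lower-cased and sorted.
  def categoriesOf(values: Seq[String]): List[String] =
    values.map(_.toLowerCase).distinct.sorted.toList

  def main(args: Array[String]): Unit = {
    // Sample occupation values; with Spark these would come from the column itself
    val occupations = Seq("technician", "other", "writer", "technician", "other")
    // Key: column name (assumed here). Value: that column's categories.
    val catMap: Map[String, List[String]] = Map("occupation" -> categoriesOf(occupations))
    println(catMap)
  }
}
```

Since each category value also becomes a new column name in the loop, the values need to be unique across all columns listed in `catMap`.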
Ankita Mehta