I followed this solution for one-hot encoding. Now I want to expand the last element of each tuple in my RDD (an array of integers) so that each one-hot-encoded value gets its own column.

My current RDD is:

scala> encode_cars
res2: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Array[Int])] = MapPartitionsRDD[17] at map at <console>:27

and ideally I would want something like:

res2: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Int, Int, Int, Int, Int, Int, Int)] = MapPartitionsRDD[17] at map at <console>:27

I know that this could be done using `map` / `flatMap`, but I am not sure how to do it.

  • [My answer to this question](https://stackoverflow.com/a/71717000/2743131) details the use of the `array` function, which can join columns into one. – tjheslin1 Apr 08 '22 at 11:34
  • @tjheslin1 Thanks, but I was wondering if this can be done without changing my RDD to a dataframe – Kyriacos Xanthos Apr 08 '22 at 11:47

1 Answer

I found an easy solution by just indexing the array and using the map function:

// Scala arrays are zero-indexed, so x._5(0) is the first element of the array.
encode_cars.map(x => (x._1, x._2, x._3, x._4, x._5(1), x._5(2), x._5(3)))
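
For the seven-column shape asked for in the question, the same idea can be written with a pattern match instead of positional indexing. This is only a sketch: the `Row` alias, the `flatten` helper, and the `sample` data are made up for illustration, and a local `Seq` stands in for the RDD so that it runs without Spark. It also assumes every array has exactly seven elements.

```scala
// Hypothetical alias for one row of encode_cars: four doubles plus the one-hot array.
type Row = (Double, Double, Double, Double, Array[Int])

// Expand the array into seven separate Int fields.
// The Array(...) pattern throws a MatchError if the array does not hold exactly 7 elements.
def flatten(row: Row): (Double, Double, Double, Double, Int, Int, Int, Int, Int, Int, Int) =
  row match {
    case (a, b, c, d, Array(e0, e1, e2, e3, e4, e5, e6)) =>
      (a, b, c, d, e0, e1, e2, e3, e4, e5, e6)
  }

// Made-up sample rows; a local Seq is used in place of the RDD.
val sample: Seq[Row] = Seq(
  (1.0, 2.0, 3.0, 4.0, Array(0, 0, 1, 0, 0, 0, 0)),
  (5.0, 6.0, 7.0, 8.0, Array(1, 0, 0, 0, 0, 0, 0))
)

val flattened = sample.map(flatten)
// On the RDD the call would have the same shape: encode_cars.map(flatten)
```

Compared with indexing, the pattern match fails loudly when an array has an unexpected length rather than silently dropping or misaligning columns.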