How to convert a mix of text and numerical data to feature data in Apache Spark

Question

I have a CSV file containing both textual and numerical data. I need to convert it to feature vectors (Double values) in Spark. Is there a way to do that?

I have seen some examples where each keyword is mapped to a Double value and that mapping is used for the conversion. However, when there are many distinct keywords, this is difficult to do by hand.

Is there another way? I see that Spark provides feature extractors that convert data into feature vectors. Could someone please give an example?

48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K
42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K

Have you checked [spark-csv](https://github.com/databricks/spark-csv)? — Rockie Yang, Jul 18 '16 at 07:00
Have a look at the StringIndexer (is ML allowed, or are you strictly MLLIB?) http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer — wmoco_6725, Jul 18 '16 at 07:04
This data has a mix and match of textual data and numerical data. Is there any way to convert this to a feature vector ? — Charls Joseph, Jul 18 '16 at 12:50

Charls Joseph · Accepted Answer · 2016-07-20T10:10:06.130

In the end I did it this way: I iterate over the distinct values of each categorical column and build a map from each value to an incrementing Double counter.

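import org.apache.spark.rdd.RDD

// Map each value to a Double index (1.0, 2.0, ...), in the order collect() returns them.
// collect() brings all values to the driver, so pass only the distinct values of a
// low-cardinality column.
def createMap(data: RDD[String]): Map[String, Double] =
  data.collect().zipWithIndex.map { case (item, i) => item -> (i + 1).toDouble }.toMap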

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Binary label: 0.0 for "<=50K", 1.0 for ">50K"; fail loudly on anything else.
def getLabelValue(input: String): Double = input match {
  case "<=50K" => 0.0
  case ">50K"  => 1.0
  case other   => throw new IllegalArgumentException(s"Unexpected label: $other")
}

val census = sc.textFile("/user/cloudera/census_data.txt")

// Distinct values of each categorical column, selected by field position.
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct

val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)

// Field 14 (the salary range) is used only as the label. Putting it into the
// feature vector as well would leak the label into the features.
val featureVector = census.map { line =>
  val fields = line.split(", ")
  LabeledPoint(
    getLabelValue(fields(14)),
    Vectors.dense(
      fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
      gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
      jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
      genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
      fields(12).toDouble, countryMap(fields(13))
    )
  )
}
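
As an alternative to hand-built maps, the StringIndexer suggested in the comments can do the indexing inside a spark.ml pipeline. The following is only a minimal sketch, assuming Spark 2.x or later with spark.ml available. The column names (age, workclass, education, hoursPerWeek, income), the object name FeatureSketch and its helper methods are hypothetical, since the question does not name its columns, and only a few fields of the two sample rows are used.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object FeatureSketch {
  // Hypothetical column names; the question does not name its columns.
  val numericCols: Seq[String] = Seq("age", "hoursPerWeek")
  val categoricalCols: Seq[String] = Seq("workclass", "education")

  // A few fields taken from the two sample rows in the question.
  def sampleRows(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (48.0, "Private", "9th", 40.0, ">50K"),
      (42.0, "Private", "Some-college", 45.0, "<=50K")
    ).toDF("age", "workclass", "education", "hoursPerWeek", "income")
  }

  // Index every string column, index the label, then assemble one vector column.
  def assemble(df: DataFrame): DataFrame = {
    val indexers: Seq[PipelineStage] = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "Idx")
    }
    val labelIndexer = new StringIndexer().setInputCol("income").setOutputCol("label")
    val assembler = new VectorAssembler()
      .setInputCols((numericCols ++ categoricalCols.map(_ + "Idx")).toArray)
      .setOutputCol("features")

    val stages: Seq[PipelineStage] = indexers ++ Seq(labelIndexer, assembler)
    new Pipeline().setStages(stages.toArray).fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("feature-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    assemble(sampleRows(spark)).select("features", "label").show(false)
    spark.stop()
  }
}
```

By default StringIndexer assigns indices by label frequency, so the numbers are arbitrary codes rather than meaningful magnitudes. Depending on the model, a one-hot encoding step after the indexers may be a better fit than feeding the raw indices.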
