Calculating vector distance for classification with mixed features

Question

I'm doing a project comparing the effectiveness of various classification algorithms, but I'm stuck on a frustrating point. The data may be found here: http://archive.ics.uci.edu/ml/datasets/Adult The classification problem is whether or not a person makes over 50k a year based on their census data.

Two example entries are as follows:

45, Private, 98092, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K

50, Self-emp-not-inc, 386397, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K

I'm familiar with using Euclidean distance to calculate the difference between vectors, but I'm not sure how to work with a mix of continuous and discrete attributes. Are there any effective methods for representing the difference between two vectors in a meaningful way? I'm having a hard time wrapping my head around how large values like the third attribute (a weight calculated by the people who extracted the data set based on factors, so that similar weights should have similar attributes) and differences between it can preserve meaning from discrete features like male or female, which is only a Euclidean distance of 1 if I understand the method correctly. I'm sure some categories could be removed, but I don't want to remove something that factors into classification significantly. I'm tackling k-NN first once I get this figured out, then a Bayesian classifier, and finally a decision tree model like C4.5 or ID3 if I have the time.

score 1 · Answer 1 · answered Nov 25 '13 at 21:02

Sure, you can extend Euclidean distance in any number of ways. The simplest extension would be the following rule:

distance = 0 in that coordinate if there's a match, 1 otherwise

The challenge will be making the concept of distance "relevant" for the k-NN follow up. In some cases (e.g. education), I think it will be best to map education (discrete variable) into a continuous variable, such as years of education. So you'll need to write a function which maps e.g. "HS-grad" to 12, "Bachelors" to 16, something like that.

Beyond that, using k-NN directly isn't going to work because the idea of "distance" among multiple dis-similar dimensions isn't well defined. I think you'll be better off throwing some of these dimensions away or weighting them differently. I don't know what the third number in your dataset (e.g. 98092) means, but if you use naive Euclidean distance this would be extremely overweighted compared to other dimensions such as age.

I'm not a machine learning expert, but I would personally be tempted to start k-NN on a reduced dimensionality dataset where you just pick some broad demographics (e.g. age, education, marital status) and ignore the trickier/"noisier" categories.

Yeah, I've definitely thought about throwing that category away, because the problem you noticed with the third number is the same one I did. I was just wondering if there was a way to transform it or other categories into a meaningful but not overly impactful value. I've considered reducing dimensions but I'll probably have to experiment to see which ones are meaningful and which are not. — Baldier, Nov 25 '13 at 21:10
+1 for this : "using k-NN directly isn't going to work because the idea of "distance" among multiple dis-similar dimensions isn't well defined" :) — bendaizer, Nov 26 '13 at 08:57

score 0 · Answer 2 · answered Nov 26 '13 at 10:01

0

You need to code your categorical variables as 1-of-n binary variables (n choices for the variable, and of those variables one and only one is active). Then standardise your features---for each feature, subtract its mean and divide by standard deviation. Or normalise into the range 0-1. It's not perfect, but this will at least make dimensions comparable.

answered Nov 26 '13 at 10:01

Ben Allison

7,244
1
15
24

I hadn't heard about the standard deviation process, I'll definitely take a look at that. Thanks. – Baldier Nov 26 '13 at 13:40

score 0 · Answer 3 · answered Jul 21 '16 at 07:01

Create individual Maps for each data points and use the map to convert to a double value.

def createMap(data: RDD[String]) : Map[String,Double] = {  
 var mapData:Map[String,Double] = Map()
 var counter = 0.0
 data.collect().foreach{ item => 
  counter = counter +1
  mapData += (item -> counter)
 }
 mapData
}

def getLablelValue(input: String): Int = input match {
 case "<=50K" => 0
 case ">50K" => 1
}


val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd  = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct

val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)


val featureVector = census.map{line => 
  val fields = line.split(", ")
 LabeledPoint(getLablelValue(fields(14).toString) , Vectors.dense(fields(0).toDouble,  orgTypeMap(fields(1).toString) , fields(2).toDouble , gradeTypeMap(fields(3).toString) , fields(4).toDouble , marStatusMap(fields(5).toString), jobTypeMap(fields(6).toString), familyStatusMap(fields(7).toString),raceTypeMap(fields(8).toString),genderTypeMap (fields(9).toString), fields(10).toDouble , fields(11).toDouble , fields(12).toDouble,countryMap(fields(13).toString) , salaryRangeMap(fields(14).toString)))
}

Calculating vector distance for classification with mixed features

3 Answers3