I'm calculating the cosine similarity between feature vectors and wondering if someone might have a neat solution to the problem below, which involves categorical features.

Currently I have (example):

```r
# define the similarity function (rows of x are the feature vectors)
cosineSim <- function(x){
  as.matrix(x %*% t(x) / (sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}

# define some feature vectors
A <- c(1,1,0,0.5)
B <- c(1,1,0,0.5)
C <- c(1,1,0,1.2)
D <- c(1,0,0,0.7)

dataTest <- data.frame(A,B,C,D)
dataTest <- data.frame(t(dataTest))
dataMatrix <- as.matrix(dataTest)

# get similarity matrix
cosineSim(dataMatrix)
```

which works fine.

But say I want to add a categorical variable such as city, to generate a feature that is 1 when two cities are equal and 0 otherwise.

In this case, example feature vectors would be:

```r
A <- c(1,1,0,0.5,"Dublin")
B <- c(1,1,0,0.5,"London")
C <- c(1,1,0,1.2,"Dublin")
D <- c(1,0,0,0.7,"New York")
```

I'm wondering whether there is a neat way to generate the pairwise equality of the last feature on the fly within the function, in a way that keeps the implementation vectorised.

I have tried pre-processing to make binary flags for each category, so that the above example would become something like:

```r
A <- c(1,1,0,0.5,1,0,0)
B <- c(1,1,0,0.5,0,1,0)
C <- c(1,1,0,1.2,1,0,0)
D <- c(1,0,0,0.7,0,0,1)
```
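
For reference, binary flags like these do not have to be typed by hand. A minimal sketch, assuming the records sit in a data frame (the column names `f1` to `f4` and `city` are made up for illustration), is to let `model.matrix()` expand the factor:

```r
# Hypothetical data frame mirroring the example vectors above
df <- data.frame(
  f1 = c(1, 1, 1, 1),
  f2 = c(1, 1, 1, 0),
  f3 = c(0, 0, 0, 0),
  f4 = c(0.5, 0.5, 1.2, 0.7),
  city = factor(c("Dublin", "London", "Dublin", "New York")),
  row.names = c("A", "B", "C", "D")
)

# "- 1" drops the intercept, so every city level gets its own 0/1 column
cityFlags <- model.matrix(~ city - 1, data = df)

# numeric features followed by the city flags (4 + 3 = 7 columns)
dataMatrix <- cbind(as.matrix(df[, c("f1", "f2", "f3", "f4")]), cityFlags)
dataMatrix
```

This still materialises one column per category, so it automates the pre-processing rather than avoiding it.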

This works, but it means I have to pre-process each variable, and in some cases I can see the number of categories becoming quite large. This seems expensive and inefficient when all I want is a feature that returns 1 for equality and 0 otherwise (granted, there is some complexity here, in that the feature depends on two records and is shared between them).
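
One observation that may help here: a full set of 0/1 flags for a single categorical variable adds exactly 1 to the dot product of two rows when their categories match and 0 otherwise, and adds exactly 1 to every row's squared norm. Under that assumption, the same similarity matrix can be sketched without building any dummy columns (`cosineSimMixed` is a hypothetical helper name, not an existing function):

```r
# Cosine similarity over numeric features plus one categorical feature,
# equivalent to one-hot encoding the category but without the extra columns.
# `x` is a numeric matrix (one row per record), `category` a character vector.
cosineSimMixed <- function(x, category) {
  same  <- outer(category, category, "==") * 1  # 1 where categories match, else 0
  dots  <- tcrossprod(x) + same                 # dot products including the one-hot part
  norms <- sqrt(rowSums(x^2) + 1)               # one-hot flags add 1 to each squared norm
  dots / tcrossprod(norms)                      # divide by pairwise products of norms
}

x <- rbind(A = c(1, 1, 0, 0.5),
           B = c(1, 1, 0, 0.5),
           C = c(1, 1, 0, 1.2),
           D = c(1, 0, 0, 0.7))
city <- c("Dublin", "London", "Dublin", "New York")

cosineSimMixed(x, city)
```

With several categorical variables, the same idea would presumably extend by adding one equality matrix per variable and adding the number of such variables to each squared norm. The n x n equality matrix is the same size as the similarity matrix itself, so the extra memory cost does not depend on the number of categories.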

One solution I can see is to write a loop that builds each pair of feature vectors (adding a feature such as [is_same_city], set to 1 in both vectors when the cities are equal and 0 otherwise) and then computes the distance, but this approach will kill me when I try to scale.

I am hoping it is just that my R skills are not well enough developed and that there is a neat solution that ticks most of the boxes...

Any suggestions at all are very welcome. Thanks!

andrewm4894
  • I believe what you are looking for is [Gower's distance](http://www.clustan.com/gower_similarity.html). There is a function [?daisy](http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html) in the `cluster` package that will calculate it. – gung - Reinstate Monica Nov 04 '13 at 22:17
  • Thanks, will check it out. Do you know if that will accept factor variables, or would I need to pre-process the categoricals in some way? – andrewm4894 Nov 05 '13 at 10:43
  • In the documentation to the function `daisy()` at the link I provided above, it says: "x numeric matrix or data frame, of dimension n x p, say. Dissimilarities will be computed between the rows of x. Columns of mode numeric (i.e. all columns when x is a matrix) will be recognized as interval scaled variables, columns of class factor will be recognized as nominal variables...", so I'm guessing it will accept factor variables. – gung - Reinstate Monica Nov 05 '13 at 14:41
  • @andrewm4894 I was wondering if you have solved this problem. I am facing the same thing. Thank you. – Rotail Jun 30 '17 at 20:24
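
The `daisy()` suggestion from the comments can be tried on a data frame that stores the city as a factor. The following is a rough sketch only: the column names are made up, the two constant numeric features from the example are left out because they carry no information for a range-scaled measure, and Gower's coefficient is a different measure from cosine similarity, so the numbers will not match `cosineSim()`.

```r
library(cluster)  # provides daisy()

# Hypothetical data frame; city is a factor so daisy() treats it as nominal
df <- data.frame(
  f2 = c(1, 1, 1, 0),
  f4 = c(0.5, 0.5, 1.2, 0.7),
  city = factor(c("Dublin", "London", "Dublin", "New York")),
  row.names = c("A", "B", "C", "D")
)

# Gower dissimilarities between rows; no dummy columns are needed
gowerDist <- daisy(df, metric = "gower")

# Convert to a full similarity matrix (1 = identical, 0 = maximally different)
gowerSim <- 1 - as.matrix(gowerDist)
gowerSim
```

`daisy()` may warn that a 0/1 numeric column such as `f2` is being treated as interval scaled; its `type` argument can be used to declare such columns as binary instead.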