I'm calculating the cosine similarity between two feature vectors and wondering if someone might have a neat solution to the below problem around categorical features.
Currently I have (example):
# define the similarity function
cosineSim <- function(x) {
  # pairwise dot products divided by the product of the row norms
  as.matrix(x %*% t(x) / sqrt(rowSums(x^2) %*% t(rowSums(x^2))))
}
# define some feature vectors
A <- c(1,1,0,0.5)
B <- c(1,1,0,0.5)
C <- c(1,1,0,1.2)
D <- c(1,0,0,0.7)
dataTest <- data.frame(A,B,C,D)
dataTest <- data.frame(t(dataTest))
dataMatrix <- as.matrix(dataTest)
# get similarity matrix
cosineSim(dataMatrix)
which works fine.
But say I want to add a categorical variable such as city, to generate a feature that is 1 when two cities are equal and 0 otherwise.
In this case, example feature vectors would be (conceptually, since c() coerces mixed numbers and strings to character):
A <- c(1,1,0,0.5,"Dublin")
B <- c(1,1,0,0.5,"London")
C <- c(1,1,0,1.2,"Dublin")
D <- c(1,0,0,0.7,"New York")
I'm wondering whether there is a neat way to generate the pairwise equality of the last feature on the fly within the function, in a way that keeps the implementation vectorised.
I have tried pre-processing to make binary flags for each category, so that the above example would become something like:
A <- c(1,1,0,0.5,1,0,0)
B <- c(1,1,0,0.5,0,1,0)
C <- c(1,1,0,1.2,1,0,0)
D <- c(1,0,0,0.7,0,0,1)
This works, but it means I have to pre-process each categorical variable, and in some cases I can see the number of categories becoming quite large. That seems expensive and inefficient when all I want is a feature that returns 1 for equality and 0 otherwise (granted, there is complexity here in that it is essentially a feature that depends on two records and is shared between them).
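For concreteness, here is roughly how that pre-processing can be done with model.matrix. This is only a minimal sketch: the data frame `records` and the column names `f1` to `f4` are made up for illustration.

```r
# Hypothetical data frame holding the example records (names are illustrative)
records <- data.frame(
  f1 = c(1, 1, 1, 1),
  f2 = c(1, 1, 1, 0),
  f3 = c(0, 0, 0, 0),
  f4 = c(0.5, 0.5, 1.2, 0.7),
  city = factor(c("Dublin", "London", "Dublin", "New York")),
  row.names = c("A", "B", "C", "D")
)

# "- 1" drops the intercept so that every city level gets its own 0/1 column
cityFlags <- model.matrix(~ city - 1, data = records)

# numeric features followed by the binary city flags
dataMatrix <- cbind(as.matrix(records[, c("f1", "f2", "f3", "f4")]), cityFlags)
```

The result has one extra column per distinct city, which is exactly the growth I am worried about.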
One solution I can see is to write a loop that builds each pair of feature vectors (where I can add a feature such as [is_same_city]=1/0, set to 1 in both vectors when the cities are equal and 0 otherwise) and then computes the similarity for that pair, but this approach will kill me when I try to scale.
I am hoping that I have simply missed something because my R skills are not well enough developed, and that there is a neat solution that ticks most of the boxes...
Any suggestions at all are very welcome. Thanks.
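To make the [is_same_city] idea concrete: for a single categorical column, the 1/0 flag for every pair can be produced in one call to outer() rather than in a loop. The sketch below is an assumption on my part and untested at scale; `numFeatures`, `sameCity` and `cosineSimWithCategory` are hypothetical names. It relies on the fact that the dot product of two rows of 0/1 city flags is 1 when the cities match and 0 otherwise, and that each such row adds exactly 1 to the squared norm.

```r
# Numeric features only (values from the example above)
numFeatures <- rbind(
  A = c(1, 1, 0, 0.5),
  B = c(1, 1, 0, 0.5),
  C = c(1, 1, 0, 1.2),
  D = c(1, 0, 0, 0.7)
)
city <- c(A = "Dublin", B = "London", C = "Dublin", D = "New York")

# n x n matrix: 1 where two records share a city, 0 otherwise
sameCity <- outer(city, city, "==") * 1

# Hypothetical variant of cosineSim: adds the equality matrix to the dot
# products and 1 to each squared norm, which is what one 0/1 flag per
# city would contribute
cosineSimWithCategory <- function(x, category) {
  same <- outer(category, category, "==") * 1
  sqNorm <- rowSums(x^2) + 1
  (x %*% t(x) + same) / sqrt(sqNorm %*% t(sqNorm))
}

simMatrix <- cosineSimWithCategory(numFeatures, city)
```

This still materialises an n x n equality matrix, but the similarity matrix itself is already that size, and no per-category columns are needed. Whether this is the neatest way to do it is exactly what I am unsure about.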