
I have thousands of vectors of about 20 features each.

Given one query vector, and a set of potential matches, I would like to be able to select the best N matches.

I have spent a couple of days trying out regression (using SVM), training my model with a data set I created myself: each vector is the concatenation of the query vector and a result vector, and I give it a (subjectively evaluated) score between 0 and 1, where 0 is a perfect match and 1 the worst match.

I haven't had great results, and I believe one reason could be that it is very hard to subjectively assign these scores. What would be easier on the other hand is to subjectively rank results (score being an unknown function):

score(query, resultA) > score(query, resultB) > score(query, resultC)

So I believe this is more a problem of Learning to rank and I have found various links for Python:

but I haven't really been able to understand how it works. I am confused by all the terminology (pairwise ranking, etc.), and since I know nothing about machine learning I feel a bit lost, so I don't see how to apply this to my problem.

Could someone please help me clarify things, point me to the exact category of problem I am trying to solve, and, even better, show how I could implement this in Python (scikit-learn)?

sebpiq
  • Looks like you want an unsupervised learning approach. Look into Nearest Neighbors http://scikit-learn.org/stable/modules/neighbors.html – Ryan Oct 04 '15 at 16:37
  • Correct me if I'm wrong, but I believe if I want to use nearest neighbors, I need to be able to order my whole dataset ... which I cannot do. In my case, the order of the results depend on the query. – sebpiq Oct 04 '15 at 18:27
  • This might be better asked on cross validated: https://stats.stackexchange.com/ – Charlie Haley Oct 04 '15 at 18:44
  • @CharlieHaley yes you're probably right... though apparently there is no clear consensus on where machine learning questions belong. I'll ask for a moderator to move it. – sebpiq Oct 04 '15 at 18:54
  • @sebpiq That is incorrect about ordering the data set. What you will need is the distance (Euclidean, Minkowski, etc) between each sample in your data, and every other sample in your data. In the Euclidean case this is simply the distance between two vectors; for a simple scenario, consider the distance on a Cartesian plane between [0, 0] and [5, 5]. If you start with an N X M matrix, your distance matrix will be N X M as well. Then, if you're interested in finding the closest neighbors to an arbitrary sample, you simple look up in your distance matrix the indexes of the minimum values. – Ryan Oct 05 '15 at 06:31
  • Correction to my last comment: If you start with an N X M matrix, your distance matrix will be N X N. – Ryan Oct 05 '15 at 07:04

1 Answer


It seems to me that what you are trying to do is to simply compute the distances between the query and the rest of your data, then return the closest N vectors to your query. This is a search problem. There is no ordering, you simply measure the distance between your query and "thousands of vectors". Finally, you sort the distances and take the smallest N values. These correspond to the most similar N vectors to your query.

For increased efficiency at making comparisons, you can use KD-Trees or other efficient search structures: http://scikit-learn.org/stable/modules/neighbors.html#kd-tree
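As a minimal sketch of the query-then-sort approach above, using scikit-learn's `KDTree` (the random data here is just a placeholder for your real vectors):

```python
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical stand-in data: a few thousand vectors of ~20 features each.
rng = np.random.RandomState(0)
data = rng.rand(2000, 20)
query = rng.rand(1, 20)

tree = KDTree(data)                 # build the tree once, query many times
dist, idx = tree.query(query, k=5)  # distances and indices of the 5 nearest vectors

best_matches = data[idx[0]]         # the 5 most similar vectors, closest first
```

`query` returns the distances already sorted in ascending order, so the first index is your best match.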

Then, take a look at the Wikipedia page on Lp space. Before picking an appropriate metric, you need to think about the data and its representation:

  1. What kind of data are you working with? Where does it come from and what does it represent? Is the feature space comprised of only real numbers, or does it contain binary, categorical, or mixed values? See the Wikipedia articles on homogeneous vs. heterogeneous data.

For a real-valued feature space, the Euclidean distance (L2) is usually the metric of choice, and with 20 features you should be fine; start with this one. Otherwise you might have to consider the cityblock distance (L1) or other measures such as Pearson's correlation, cosine distance, etc. You might have to do some engineering on the data before you can do anything else.
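To make the metric choice concrete, here is a small sketch comparing a few of the metrics mentioned above with SciPy's `cdist` (the toy vectors are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

query = np.array([[1.0, 2.0, 3.0]])
candidates = np.array([[1.0, 2.0, 3.5],    # close in every coordinate
                       [10.0, 2.0, 3.0]])  # far off in one coordinate

d_l2 = cdist(query, candidates, metric='euclidean')   # L2
d_l1 = cdist(query, candidates, metric='cityblock')   # L1
d_cos = cdist(query, candidates, metric='cosine')     # 1 - cosine similarity

order = d_l2[0].argsort()   # candidate indices, best match first
```

Whichever metric you pick, the ranking step is the same: `argsort` the row of distances and take the first N indices.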

  2. Are the features on the same scale? e.g. x1 = [0, 1], x2 = [0, 100]

If not, then try scaling your features. This is usually a matter of trial and error, since some features might be noisy, in which case scaling might not help. To see why it matters, think of a data set with two features: height and weight. If height is in centimeters (values around 10^2) and weight is in kilograms (around 10^1), then height will dominate any distance computation; rescale the features (e.g. convert centimeters to meters, or standardize both columns) so they contribute equally. This is generally a good idea for feature spaces with a wide range of values, meaning you have a large sample of values for both features. Ideally you'd like all your features to be approximately normally distributed, with only a bit of noise - see the central limit theorem.
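Both common rescaling options are one-liners in scikit-learn; the height/weight matrix below is a made-up example matching the scenario above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical height (cm) / weight (kg) samples: the columns live on very
# different scales, so height would dominate a Euclidean distance.
X = np.array([[170.0, 70.0],
              [160.0, 60.0],
              [180.0, 80.0]])

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
X_mm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]
```

Fit the scaler on your data set once, then apply the same transform to each query vector before measuring distances.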

  3. Are all of the features relevant?

If you are working with real-valued data, you can use Principal Component Analysis (PCA) to project the data onto its principal components and keep only the most informative ones. Otherwise, you can try feature selection: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection Reducing the dimensionality of the space improves efficiency, although it is not critical in your case.
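A quick sketch of the PCA step with scikit-learn (the random matrix is a placeholder for your real 20-feature data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for the real feature matrix.
rng = np.random.RandomState(0)
X = rng.rand(500, 20)

# Passing a float keeps however many components are needed
# to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

You would then build the KD-Tree and run queries in the reduced space (remembering to apply the same `pca.transform` to each query vector).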


If your data consists of continuous, categorical and binary values, then aim to scale or standardize the data. Use your knowledge about the data to come up with an appropriate representation. This is the bulk of the work and is more or less a black art. Trial and error.

As a side note, metric based methods such as knn and kmeans simply store data. Learning begins where memory ends.

user91213
  • thanks! Those are good points ... 1) my features are all real numbers 2) I have standardized my data using http://scikit-learn.org/stable/modules/preprocessing.html 3) is especially true. I have started to have a closer look at my features, and there was a lot of junk indeed. I'll clean it and see if I can get better results. Sad there is no magic formula to make it all work quick and easily :) – sebpiq Oct 05 '15 at 16:51
  • you're in luck then. I would try both standardizing (z-score) and scaling [0,1]. Another trick is to make vectors unit length (divide each by its L2 or L1 norm). – user91213 Oct 06 '15 at 08:04
  • If you want to look into other ways of doing dimensionality reduction, have a look at autoencoders, the nonlinear version of pca. You might also want to look into metric learning. Good luck and have fun! – user91213 Oct 06 '15 at 08:11
  • Thanks! I did a lot of cleaning, and I divided by 2 the number of features. It is starting to work I think... – sebpiq Oct 08 '15 at 10:12