Here is a simple and straightforward approach (a linear model) that should work.
If you are working with documents and queries, the features you are using are probably tokens (words), n-grams, or topics. Let's assume the features are words for simplicity.
Suppose you have a query document:
apple iphone6
and you have a set of documents and their corresponding match scores against the above query:
(Assume each document is the content of the corresponding URL.)
www.apple.com (Apple - iPhone 6), score: 0.8
www.cnet.com/products/apple-iphone-6 (Apple iPhone 6 review), score: 0.75
www.stuff.tv/apple/apple-iphone-6/review (Apple iPhone 6 review), score: 0.7
....
Per-query model
First you need to extract word features from the matching URLs. Suppose we get words and their L1-normalized TF-IDF scores:
www.apple.com
apple 0.5
iphone 0.4
ios8 0.1
www.cnet.com/products/apple-iphone-6
apple 0.4
iphone 0.2
review 0.2
cnet 0.2
www.stuff.tv/apple/apple-iphone-6/review
apple 0.4
iphone 0.4
review 0.1
cnet 0.05
stuff 0.05
Second you can combine feature scores and match scores and aggregate on a per-feature basis:
w(apple) = 0.5 * 0.8 + 0.4 * 0.75 + 0.4 * 0.7 = 0.98
w(iphone) = 0.4 * 0.8 + 0.2 * 0.75 + 0.4 * 0.7 = 0.75
w(ios8) = 0.1 * 0.8 = 0.08
w(review) = 0.2 * 0.75 + 0.1 * 0.7 = 0.22
w(cnet) = 0.2 * 0.75 + 0.05 * 0.7 = 0.185
w(stuff) = 0.05 * 0.7 = 0.035
You might want to add a normalization step that divides each w by the number of matched documents. You now get the features below, ordered by relevance in descending order (a code sketch of both steps follows the list):
w(apple) = 0.98 / 3
w(iphone) = 0.75 / 3
w(review) = 0.22 / 3
w(cnet) = 0.185 / 3
w(ios8) = 0.08 / 3
w(stuff) = 0.035 / 3
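Here is a minimal Python sketch of those two steps on the running example; build_query_model is my own name for the routine, not something from the reference:

from collections import defaultdict

# L1-normalized TF-IDF features of each matched document (the tables above).
doc_features = {
    "www.apple.com": {"apple": 0.5, "iphone": 0.4, "ios8": 0.1},
    "www.cnet.com/products/apple-iphone-6":
        {"apple": 0.4, "iphone": 0.2, "review": 0.2, "cnet": 0.2},
    "www.stuff.tv/apple/apple-iphone-6/review":
        {"apple": 0.4, "iphone": 0.4, "review": 0.1, "cnet": 0.05, "stuff": 0.05},
}

# Match score of each document against the query "apple iphone6".
match_scores = {
    "www.apple.com": 0.8,
    "www.cnet.com/products/apple-iphone-6": 0.75,
    "www.stuff.tv/apple/apple-iphone-6/review": 0.7,
}

def build_query_model(doc_features, match_scores):
    # Step 1: aggregate w(f) = sum over docs of tfidf(f, doc) * score(doc).
    weights = defaultdict(float)
    for doc, features in doc_features.items():
        for feature, tfidf in features.items():
            weights[feature] += tfidf * match_scores[doc]
    # Step 2: normalize by the number of matched documents.
    n_docs = len(doc_features)
    return {f: w / n_docs for f, w in weights.items()}

model = build_query_model(doc_features, match_scores)
# model["apple"] == 0.98 / 3, model["iphone"] == 0.75 / 3, and so on.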
You even get a linear scoring model by using those weights:
score = w(apple) * tf-idf(apple) + w(iphone) * tf-idf(iphone) + ... + w(stuff) * tf-idf(stuff)
Suppose now you have a new URL with the following features detected:
ios8: 0.5
cnet: 0.3
iphone: 0.2
You can then calculate its match score against the query "apple iphone6":
score = w(ios8)*0.5 + w(cnet)*0.3 + w(iphone)*0.2
      = (0.08*0.5 + 0.185*0.3 + 0.75*0.2) / 3 ≈ 0.082
This match score can then be used to rank documents by their relevance to the same query.
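A small sketch of this scoring step, reusing the model dict built above (score_document is again my own name for it):

def score_document(model, doc_tfidf):
    # Dot product of the learned weights with the document's TF-IDF vector;
    # features absent from the model contribute nothing.
    return sum(model.get(f, 0.0) * v for f, v in doc_tfidf.items())

new_doc = {"ios8": 0.5, "cnet": 0.3, "iphone": 0.2}
print(score_document(model, new_doc))  # ~0.082, as computed above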
Any-query model
You perform the same procedure to construct a linear model for each query. Suppose you have k such queries with matching documents in your training data; you will end up with k such models, each constructed from one query:
model(apple iphone6) = (0.98*apple + 0.75*iphone + 0.22*review + ...) / 3
model(android apps) = (0.77*google + 0.5*android + ...) / 5
model(samsung phone) = (0.5*samsung + 0.2*galaxy + ...) / 10
Note that in the example models above, 3, 5, and 10 are the normalizers (the total number of documents matched to each query).
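In code, this is just the per-query builder applied once per training query; a sketch, where matched_docs(q) and scores(q) are hypothetical lookups into your training data:

# matched_docs(q) and scores(q) are hypothetical helpers returning the
# per-document TF-IDF features and match scores recorded for query q.
models = {q: build_query_model(matched_docs(q), scores(q))
          for q in training_queries}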
Now a new query comes in; suppose it is:
samsung android release
The remaining tasks are to:
- find relevant queries q1, q2, ..., qm, and
- use the query models to score new documents and aggregate.
You first need to extract features from this query; suppose also that you have already cached the features of every query you have learned. Using any nearest-neighbor approach (e.g., locality-sensitive hashing), you can find the top-k queries most similar to "samsung android release". They would probably be (a brute-force similarity sketch follows the list):
similarity(samsung phone, samsung android release) = 0.2
similarity(android apps, samsung android release) = 0.2
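As a stand-in for a real nearest-neighbor index, here is a brute-force cosine-similarity sketch; cached_query_features is an assumed dict mapping each learned query to its sparse feature vector:

import math

def cosine(a, b):
    # Cosine similarity between two sparse feature dicts.
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_similar(new_query_features, cached_query_features, k=2):
    # Scan every learned query; an LSH index would replace this scan at scale.
    sims = [(q, cosine(new_query_features, feats))
            for q, feats in cached_query_features.items()]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:k]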
Overall Ranker
Thus we get our final ranker (a blending sketch in code follows):
0.2*model(samsung phone) + 0.2*model(android apps) =
0.2 * (0.5*samsung + 0.2*galaxy + ...) / 10 +
0.2 * (0.77*google + 0.5*android + ...) / 5
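A sketch of this blending step, reusing score_document from above; it folds the similarity-weighted query models into a single feature-weight dict that can then score any candidate document:

from collections import defaultdict

def blend_models(similar_queries, models):
    # similar_queries: (query, similarity) pairs from the neighbor search.
    blended = defaultdict(float)
    for query, sim in similar_queries:
        for feature, weight in models[query].items():
            blended[feature] += sim * weight
    return blended

ranker = blend_models([("samsung phone", 0.2), ("android apps", 0.2)], models)
# score_document(ranker, doc_tfidf) now ranks documents for the new query.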
Usually in such information-retrieval applications you will have built an inverted index from features (words) to documents, so the final ranker can be evaluated very efficiently over the top documents.
Reference
For details, please refer to the IND algorithm in Omid Madani et al., "Learning When Concepts Abound".