
I have the following query which works great:

string[] Words = {"search","query","example"};

... Snip ...

var Results = (
    from a in q
    from w in Words
    where
        (
        a.Title.ToLower().Contains(w)
        || a.Body.ToLower().Contains(w)
        )
    select new
    {
        a,
        Count = 0
    }).OrderByDescending(x=> x.Count)
    .Distinct()
    .Take(Settings.ArticlesPerPage);

What I need it to do is return Count, the total number of occurrences of the words. I'm also going to weight it in favour of the title, for example:

Count = (OccurrencesInTitle * 5) + (OccurrencesInBody)

I'm assuming I need to use LINQ's Count, but I'm not sure how to apply it in this instance.

Cœur
Tom Gullen

1 Answer


This is what I came up with:

var query =
    from a in q
    from w in Words
    let title = a.Title.ToLower()
    let body = a.Body.ToLower()
    // Strip whole-word matches of w (assumes w is non-empty, lowercase and free of regex metacharacters)
    let replTitle = Regex.Replace(title, string.Format("\\b{0}\\b", w), string.Empty)
    let replBody = Regex.Replace(body, string.Format("\\b{0}\\b", w), string.Empty)
    // Each removed match shortens the text by w.Length characters
    let titleOccurrences = (title.Length - replTitle.Length) / w.Length
    let bodyOccurrences = (body.Length - replBody.Length) / w.Length
    // Title matches count five times as much as body matches
    let score = titleOccurrences * 5 + bodyOccurrences
    where score > 0
    select new { Article = a, Score = score };

var results = query.GroupBy(r => r.Article)
                   .OrderByDescending(g => g.Sum(r => r.Score))
                   .Take(Settings.ArticlesPerPage);

Counting occurrences is done with the (surprisingly) quick and dirty method of replacing each occurrence with string.Empty and dividing the resulting drop in string length by the word's length. Once a score has been calculated for each article and word pair, I group by article, order by the sum of each article's scores across all the words, and take a chunk out of the results.
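To make the counting and grouping steps concrete, here is a minimal, self-contained sketch that runs against in-memory objects. The `Article` class, `CountByReplace` and `Rank` are hypothetical names made up for this illustration, it uses plain `string.Replace` for brevity, and the empty-word guard is an addition, so treat it as one possible reading of the approach rather than the exact code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for whatever element type `q` holds.
public sealed class Article
{
    public string Title { get; set; } = "";
    public string Body { get; set; } = "";
}

public static class ReplaceCounting
{
    // Counts occurrences of `word` in `text` by removing them and
    // dividing the drop in length by the word's length.
    public static int CountByReplace(string text, string word)
    {
        // Added guard (an assumption): an empty word would divide by zero.
        if (string.IsNullOrEmpty(word)) return 0;
        string stripped = text.Replace(word, string.Empty);
        return (text.Length - stripped.Length) / word.Length;
    }

    // Scores every (article, word) pair, then sums the scores per article
    // and returns the `take` highest-scoring articles with their totals.
    public static List<KeyValuePair<Article, int>> Rank(
        IEnumerable<Article> articles, IEnumerable<string> words, int take)
    {
        var scored =
            from a in articles
            from w in words
            let score = CountByReplace(a.Title.ToLower(), w) * 5
                      + CountByReplace(a.Body.ToLower(), w)
            where score > 0
            select new { Article = a, Score = score };

        return scored
            .GroupBy(r => r.Article)
            .Select(g => new KeyValuePair<Article, int>(g.Key, g.Sum(r => r.Score)))
            .OrderByDescending(p => p.Value)
            .Take(take)
            .ToList();
    }
}
```

Projecting each group to its summed score before ordering keeps the total available to the caller, which is handy if the score should also be displayed next to each result.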

I didn't fire up the compiler, so please excuse any obvious mistakes.

Update: This version uses regexes as in

Regex.Replace(title, string.Format("\\b{0}\\b", w), string.Empty)

instead of the original version's

title.Replace(w, string.Empty)

so that it now matches only whole words (the string.Replace version would also match word fragments).
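The difference between the two versions can be seen in a small side-by-side sketch. `CountFragments` and `CountWholeWords` are hypothetical helper names, and the `Regex.Escape` call is an addition not present in the query above; it is there on the assumption that search words might contain regex metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordCounting
{
    // Fragment-matching count, as in the string.Replace version.
    public static int CountFragments(string text, string word)
    {
        if (string.IsNullOrEmpty(word)) return 0; // added guard against dividing by zero
        return (text.Length - text.Replace(word, string.Empty).Length) / word.Length;
    }

    // Whole-word count, as in the Regex.Replace version with \b boundaries.
    public static int CountWholeWords(string text, string word)
    {
        if (string.IsNullOrEmpty(word)) return 0; // added guard against dividing by zero
        // Regex.Escape (an addition) makes the word match literally.
        string pattern = string.Format("\\b{0}\\b", Regex.Escape(word));
        string stripped = Regex.Replace(text, pattern, string.Empty);
        return (text.Length - stripped.Length) / word.Length;
    }
}
```

With the lowercased title "the boring bookkeepers" and the word "book", `CountFragments` finds one match inside "bookkeepers", while `CountWholeWords` finds none because "book" never appears as a standalone word.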

Jon
  • Oh nice that's clever! That even naturally weights longer words more, I like it! Just to check though, title.Replace(w, string.Empty) will work with an array of words? – Tom Gullen Sep 12 '11 at 18:27
  • This is dangerous. It counts "Book" in the title "The Boring Bookkeepers". And similarly "a" is way overcounted in "The Aardvarks of Armadillo, Texas". – jason Sep 12 '11 at 18:27
  • @Jason: Good catch. That can be fixed by using `Regex.Replace` instead of `string.Replace` -- I'll get round to it. – Jon Sep 12 '11 at 18:29
  • @TomGullen: Actually it doesn't weigh longer words more because it divides the difference by `w.Length` as it stands (you need to scroll right to see it). – Jon Sep 12 '11 at 18:30
  • I've already filtered the words by removing stop words (so searching for "a file" would search for "file"), also "Book" matching in "Bookkeepers" is fine for my application. – Tom Gullen Sep 12 '11 at 18:33
  • @TomGullen: In the meantime I switched to `Regex.Replace` using "\b" to target word boundaries, so that now it only matches whole words :) – Jon Sep 12 '11 at 18:35
  • @Jon awesome that's so cool thanks :D The search before worked but was pretty random in ordering this should be miles better, will test it soon! – Tom Gullen Sep 12 '11 at 18:37
  • @Jon I get a couple of queries giving a divide by 0 error, do you know of any way to tackle this in LINQ? – Tom Gullen Sep 12 '11 at 18:55
  • @TomGullen: Well the only division here is by `w.Length`, so maybe some empty strings slipped in there? – Jon Sep 12 '11 at 20:50
  • @Jon here it is in action by the way http://www.scirra.com/tutorials/search/1/beginners works a charm! – Tom Gullen Sep 12 '11 at 23:30