I run a search engine that specialises in clinical documents. Most are abstracts of perhaps 250-350 words. One issue, which is a real pain, is searching over guidelines. They are typically long (1000+ words) and have short titles. So the title might be 'Diagnosis of prostate cancer and subsequent management', and within the guideline there might be many sections, including one called, say, 'Screening for prostate cancer'.
Now, if someone searches for 'screening and prostate cancer', that guideline will not feature highly in the results, for two reasons:
- 'screening' isn't mentioned in the title (title words score more highly)
- the 'screening' section might be really pertinent, but it might make up only 10% of the whole guideline, so term density is really low.
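To make the second point concrete, here is a toy sketch with made-up numbers. It is only an illustration of the dilution effect, not a real ranking function: the `term_density` helper and the filler documents are invented for this example, and the 300-word abstract and 1000-word guideline sizes are just assumptions in line with the figures above.

```python
import re

def term_density(text: str, term: str) -> float:
    """Fraction of tokens in `text` that equal `term` (case-insensitive)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)

# Hypothetical 300-word abstract that mentions 'screening' 6 times.
abstract = " ".join(["screening"] * 6 + ["word"] * 294)

# Hypothetical 1000-word guideline: a 100-word screening section
# (6 mentions) plus 900 words that never mention screening.
section = " ".join(["screening"] * 6 + ["word"] * 94)
guideline = section + " " + " ".join(["word"] * 900)

print(f"abstract:        {term_density(abstract, 'screening'):.3f}")   # 0.020
print(f"whole guideline: {term_density(guideline, 'screening'):.3f}")  # 0.006
print(f"section alone:   {term_density(section, 'screening'):.3f}")    # 0.060
```

Scored as a whole, the guideline looks much less relevant than the abstract, even though its screening section on its own is denser in the term than the abstract is.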
These guidelines are in both HTML and PDF and come from a lot of different publishers, so it's not feasible (as far as I can tell) to create specific rules for each.
In the above example (a search for 'screening and prostate cancer'), how can I boost such documents so that the guidelines appear higher up the results? I guess I could weight guidelines more highly, but that seems like it lacks finesse!