
I run a search engine that specialises in clinical documents. Most are abstracts of perhaps 250-350 words. One real pain point is searching over guidelines. They are typically long (1000+ words) and have short titles. For example, the title might be 'Diagnosis of prostate cancer and subsequent management', and within it there might be many sections, including one called, say, 'Screening for prostate cancer'.

Now, if someone searches for 'screening and prostate cancer' that guideline will not feature highly in the search for two reasons:

  1. 'screening' isn't mentioned in the title (title words score more highly)
  2. the 'screening' section might be really pertinent, but it might make up only 10% of the whole guideline, so term density is really low.
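To make the second point concrete, here is a rough sketch with made-up numbers. It assumes naive whitespace tokenisation and a hypothetical `term_density` helper; it is not how Lucene actually scores documents, only an illustration of how a relevant section gets diluted by the rest of a long guideline.

```python
def term_density(text, term):
    """Fraction of whitespace-separated tokens equal to `term` (case-insensitive)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


# Hypothetical guideline: a 100-word screening section inside a 1000-word document.
screening_section = "screening " * 10 + "filler " * 90
rest_of_guideline = "filler " * 900
whole_guideline = screening_section + rest_of_guideline

section_density = term_density(screening_section, "screening")
document_density = term_density(whole_guideline, "screening")
print(section_density, document_density)
```

Scored on its own, the section looks ten times denser for 'screening' than the guideline does as a whole.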

These guidelines come in both HTML and PDF, from a lot of different publishers, so it's not feasible (as far as I can tell) to create specific rules for each.

In the above example - a search for 'screening and prostate cancer' - how can I boost the document(s) so that the guidelines appear higher up the results? I guess I could weight guidelines more highly, but that seems like it lacks finesse!

JRBTrip
  • Is there any way you can determine that something is a subheader from these documents? (for HTML versions - are they possibly marked up with h1-h6 tags?) Is there any way to programmatically determine that these words are more important? How would you decide that out of all the text, these titles are important -- if someone gave you just the content you've extracted? Start by trying to solve the problem, then find a way to express that in some manner that works with Solr or Lucene. – MatsLindh Sep 25 '19 at 19:21
  • Thank you. I think the issue is that - as there are lots of publishers it'd take forever to ascertain all the 'markup' rules. Having said that I could start with a few bigger publishers and just see how it goes. – JRBTrip Sep 26 '19 at 17:53
  • I'm not sure how you're going to weigh something higher without having some way to determine _what_ should have increased weight. – MatsLindh Sep 26 '19 at 20:13
  • Currently we weight publications as a core part of our algorithm - the better quality publishers get a higher weighting. So, we could simply increase the weight of guidelines - but that lacks subtlety and will have knock-on effects! – JRBTrip Sep 28 '19 at 08:13
  • Yes, my comment was in relation to the sub sections you want to increase the weight of. You'll have to have some way to determine that they actually are sub sections before you can increase their weight. If you can't determine what content should receive extra weight, boosting will be impossible. – MatsLindh Sep 28 '19 at 09:24
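
For the HTML versions, the h1-h6 idea from the comments could be sketched roughly as below. This is a minimal sketch, assuming publishers actually mark section titles up as heading tags (which is not guaranteed, and does nothing for PDFs). `HeadingExtractor` and `extract_headings` are hypothetical names; only the standard library's `html.parser` is used.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text of every h1-h6 element in an HTML document."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # text chunks of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._current is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._current).split())
            if text:
                self.headings.append(text)
            self._current = None


def extract_headings(html):
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = (
    "<h1>Diagnosis of prostate cancer and subsequent management</h1>"
    "<p>Introductory text.</p>"
    "<h2>Screening for <em>prostate</em> cancer</h2>"
    "<p>Section text.</p>"
)
headings = extract_headings(sample)
print(headings)
```

The extracted headings could then be indexed in their own field (say, a hypothetical `section_headings` field) and given a weight between title and body at query time, for example through the `qf` parameter of Solr's eDisMax parser. Whether that helps would depend on how consistently publishers use heading tags.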

0 Answers