
Suppose a user enters this search string into a news search engine:

"Oops, Donald Trump Jr. Did It Again (Wikileaks Edition) :: Politics - Paste"

Imagine we have a database of news titles and a database of "important people". The goal is: if a search string contains an important person, return results that contain this "substring" with a higher ranking than results that do NOT contain it.

Using the Yahoo Vespa engine, how can I match a database full of people's names against long news title strings?

I hope that makes sense. Sorry, everyone, my English is not very good :( Thank you!

Gotys

1 Answer


During document processing/indexing of news titles, you could extract named entities from the input text using the "important people" database. This can be implemented in a custom document processor (see http://docs.vespa.ai/documentation/document-processing-overview.html).
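To make the extraction step concrete, here is a minimal sketch of the matching logic in Python. It is not Vespa API code (a real document processor would be a Java component); the `IMPORTANT_PEOPLE` list and the `extract_entities` helper are hypothetical stand-ins for the "important people" database and the shared extraction component. It assumes simple case-insensitive whole-name matching, which a real named entity extractor would likely improve on.

```python
import re

# Hypothetical stand-in for the "important people" database.
IMPORTANT_PEOPLE = ["Donald Trump Jr.", "Donald Trump", "Angela Merkel"]


def extract_entities(text, people=IMPORTANT_PEOPLE):
    """Return the known names that occur in text, longest names first."""
    found = []
    remaining = text
    # Try longer names first so "Donald Trump Jr." wins over "Donald Trump".
    for name in sorted(people, key=len, reverse=True):
        # Require non-word characters (or string edges) around the name.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, remaining, flags=re.IGNORECASE):
            found.append(name)
            # Blank out the match so shorter names nested in it are not reported too.
            remaining = re.sub(pattern, " ", remaining, flags=re.IGNORECASE)
    return found


title = "Oops, Donald Trump Jr. Did It Again (Wikileaks Edition) :: Politics - Paste"
entities = extract_entities(title)
```

The resulting list is what the document processor would write into the entities field, and the same function could be reused on the query string at search time.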

A document definition for the news search, with a custom rank profile, could look something like this. The document processor reads the input title and populates the entities array.

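search news {
  document news {
    field title type string {
      indexing: summary | index
    }
    # Populated by the document processor with the names found in the title
    field entities type array<string> {
      indexing: summary | index
      match: word
    }
  }
  rank-profile entity-ranking {
    first-phase {
      # matches(entities) is 1 when the entities field matched the query, else 0
      expression: nativeRank(title) + matches(entities)
    }
  }
}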

At query time you'll need to do the same named entity extraction on the query input and build a Vespa query tree that searches the title (e.g. using OR or WeakAnd) and also searches the entities field for the possible named entities using the Vespa rank operator. The first argument of rank decides which documents match; the remaining arguments only contribute to ranking. For example, given your query, the actual query could look something like:

select * from sources * where rank(
  title contains "oops" or title contains "donald" or title contains "trump",
  entities contains "Donald Trump Jr."
);

You can build the query tree in a custom searcher (http://docs.vespa.ai/documentation/searcher-development.html) that uses a shared named entity extraction component.
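As a rough illustration of how such a query is assembled, the sketch below builds the YQL string from a list of title terms and a list of extracted entities. It is plain Python, not the searcher API: inside a real searcher you would construct the query tree programmatically instead of concatenating strings. The helper names `quote` and `build_yql` are hypothetical, and the field names are assumed to match the document definition above.

```python
def quote(value):
    """Quote a string for YQL, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_yql(title_terms, entities):
    """Build a YQL query that matches on title terms and boosts entity hits."""
    if not title_terms:
        raise ValueError("need at least one title term")
    # Any title term may match; this part decides which documents are returned.
    title_part = " or ".join(f"title contains {quote(t)}" for t in title_terms)
    if not entities:
        # No entities found in the query: fall back to a plain title search.
        return f"select * from sources * where {title_part};"
    # Entity operands only influence ranking, not which documents match.
    entity_part = ", ".join(f"entities contains {quote(e)}" for e in entities)
    return f"select * from sources * where rank({title_part}, {entity_part});"


yql = build_yql(["oops", "donald", "trump"], ["Donald Trump Jr."])
```

Falling back to a plain title search when no entity is found keeps the query valid for ordinary searches that mention no important person.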

Some resources

Jo Kristian Bergum