In reading about search engines, the top two areas that come up are (A) PageRank, which ranks a set of pages by their eigenvector centrality in the hyperlink graph, and (B) keyword/semantic-meaning encodings such as TF-IDF or word2vec.
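For reference, here is a minimal sketch of the two ideas I mean, with a toy corpus and made-up link graph purely for illustration (not any real engine's implementation):

```python
import math
from collections import Counter


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph given as {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly across all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


def tf_idf(docs):
    """Return a list of {term: tf-idf weight} dicts for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
print(tf_idf([["running", "shoes"], ["running", "tips"], ["shoes", "sale"]]))
```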
I'm familiar with both of these areas, but I'm curious: when a search engine is queried with "running shoes", I can't imagine that the encodings of billions of webpages are retrieved and scored before ranking and presenting results to the user. Is there some process that maps the query onto a semi-refined structure of web pages in order to limit the set of candidate pages retrieved?
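To make the concern concrete, the naive process I have in mind (and am skeptical actually happens at web scale) looks like the sketch below; the vectors and corpus size are hypothetical placeholders:

```python
import numpy as np


def brute_force_search(query_vec, page_vecs, top_k=10):
    """Cosine-similarity scan over every page encoding -- O(number of pages) per query."""
    sims = page_vecs @ query_vec / (
        np.linalg.norm(page_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
    )
    return np.argsort(-sims)[:top_k]


# e.g. 1,000 hypothetical 300-dimensional page encodings (word2vec-sized);
# doing this over billions of pages per query is what seems implausible to me.
pages = np.random.rand(1000, 300)
query = np.random.rand(300)
print(brute_force_search(query, pages))
```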