How does a browser 'find' something on a webpage?

Question

This may be a really trivial question, but how do modern browsers handle the find (Ctrl + F) operation on webpages?

Do they convert it to some plain text representation by regex'ing out all HTML/CSS/JS from the webpage and then running a recursive find?

I am simply curious; I couldn't find a resource that details how browsers run a find. If you could point me to one, that would be great. — Raghav Kukreti, Dec 23 '19 at 06:49
I don't understand why this is getting downvoted, maybe because it's too broad. But I'll do some research, read the code for small open source browsers and answer this. — Raghav Kukreti, Dec 23 '19 at 07:08

Euler · Accepted Answer · 2019-12-23T07:51:20.757

This is a rather broad question, because browsers can behave slightly differently from one another depending on the purpose and design of each browser's engine. However, they should typically produce a similar look and feel, since they follow the W3C standards. The best way to find out how a particular browser works is to go to that browser vendor's website and research how its engine represents and traverses a page.

An HTML document is, by default, a tree of nodes in which any node can branch off into subtrees. One path language that can be used to address nodes in such a tree is XPath. Below are some links covering, respectively, how browsers work, W3Schools, and XPath. Hopefully these will help you at least understand the concepts behind how browsers work. I would start with the rendering engine link.

https://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#The_rendering_engine

https://www.w3schools.com

https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
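
To make the tree idea a bit more concrete, here is a minimal Python sketch that parses a small, well-formed snippet into a tree and selects nodes with the limited XPath subset that the standard library's xml.etree.ElementTree supports. The sample markup `PAGE` and the helper name `texts_at` are made up for illustration. Real browsers build their DOM with a far more tolerant HTML parser, so treat this as a picture of "document as a tree", not as how any browser implements find.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = (
    "<html><body>"
    "<h1>Title</h1>"
    "<div>"
    "<p>First paragraph</p>"
    "<p class='note'>Second <b>bold</b> paragraph</p>"
    "</div>"
    "</body></html>"
)


def texts_at(markup: str, path: str) -> List[str]:
    """Return the concatenated text of every node matched by an XPath-like path."""
    root = ET.fromstring(markup)
    # itertext() walks the subtree and yields the text of the node and its children.
    return ["".join(node.itertext()) for node in root.findall(path)]


if __name__ == "__main__":
    print(texts_at(PAGE, ".//p"))                 # every <p> anywhere in the tree
    print(texts_at(PAGE, ".//p[@class='note']"))  # only <p class="note">
```

Note that the second paragraph's text is assembled from several text nodes ("Second ", "bold", " paragraph"), which is one reason a find feature cannot simply search each node in isolation.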

score 0 · Answer 2 · answered Dec 23 '19 at 07:19

I do not know about other browsers, but Google Chrome uses the Boyer-Moore search algorithm to find words on a webpage. In this algorithm, the word you have entered is compared against the text from right to left, starting with its last character.

The string to be searched for is called P, the "Pattern".

The string we are searching within is called T, the "Text".

The lengths of T and P are generally represented by m and n respectively. The advantage of this algorithm is that instead of using brute force for searching (which would have to try all m - n + 1 possible alignments of P against T), it preprocesses P and uses the result to skip as many alignments as possible.
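
For comparison, the brute-force baseline can be sketched in a few lines of Python. This is only an illustration: the function name `naive_search` is hypothetical, and it follows the naming above (m is the length of T, n is the length of P). It tries every alignment of P against T, and there are m - n + 1 of them when n <= m.

```python
from typing import List, Tuple


def naive_search(text: str, pattern: str) -> Tuple[List[int], int]:
    """Try every alignment of pattern against text.

    Returns the list of match positions and the number of alignments tried.
    """
    m, n = len(text), len(pattern)
    matches = []
    alignments = 0
    # There are m - n + 1 positions where the pattern fits inside the text.
    for shift in range(m - n + 1):
        alignments += 1
        if text[shift:shift + n] == pattern:
            matches.append(shift)
    return matches, alignments


if __name__ == "__main__":
    found, tried = naive_search("abracadabra", "abra")
    print(found, tried)  # [0, 7] 8
```

Each of those alignments can cost up to n character comparisons, which is the work Boyer-Moore tries to avoid.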

According to Wikipedia:

The key insight in this algorithm is that if the end of the pattern is compared to the text, then jumps along the text can be made rather than checking every character of the text. The reason that this works is that in lining up the pattern against the text, the last character of the pattern is compared to the character in the text. If the characters do not match, there is no need to continue searching backwards along the text. If the character in the text does not match any of the characters in the pattern, then the next character in the text to check is located n characters farther along the text, where n is the length of the pattern. If the character in the text is in the pattern, then a partial shift of the pattern along the text is done to line up along the matching character and the process is repeated. Jumping along the text to make comparisons rather than checking every character in the text decreases the number of comparisons that have to be made, which is the key to the efficiency of the algorithm.

The Boyer-Moore algorithm employs two heuristics:

Bad Character Heuristic
Good Suffix Heuristic

P is preprocessed, and a separate shift table is built for each heuristic.

The character of T which doesn’t match with the current character (of P) is called the Bad Character.

A good suffix occurs when a suffix of P has been successfully matched against a substring of T before a mismatch is found.

Both heuristics follow several rules, which you can read in detail here and here. There is no point in copy-pasting articles from different websites.
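
As a rough illustration of the right-to-left comparison and the bad character heuristic described above, here is a minimal Python sketch. It is deliberately simplified: it implements only the bad character rule and leaves out the good suffix rule, so it shifts less aggressively than full Boyer-Moore can. The function names are hypothetical and this is not Chrome's code. As above, m is the length of T and n is the length of P.

```python
from typing import Dict, List


def build_bad_char_table(pattern: str) -> Dict[str, int]:
    """Preprocess P: map each character to the index of its last occurrence."""
    return {ch: i for i, ch in enumerate(pattern)}


def boyer_moore_bad_char(text: str, pattern: str) -> List[int]:
    """Find all occurrences of pattern in text using only the bad character rule."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    last = build_bad_char_table(pattern)
    matches = []
    shift = 0  # current alignment of P against T
    while shift <= m - n:
        j = n - 1
        # Compare right to left, starting at the last character of P.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            matches.append(shift)
            shift += 1  # conservative step; the good suffix rule could jump further
        else:
            bad = text[shift + j]  # the "bad character" in T
            # Line up the last occurrence of the bad character in P with it,
            # or jump past it entirely if it does not occur in P at all.
            shift += max(1, j - last.get(bad, -1))
    return matches


if __name__ == "__main__":
    print(boyer_moore_bad_char("here is a simple example", "example"))  # [17]
```

In the sample run, the very first comparison hits an "s" that does not occur in "example", so the pattern jumps a full 7 characters instead of sliding by one.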

Thanks for the detailed answer, the algorithm works, yes, but how does the browser parse the webpage to apply the algorithm? — Raghav Kukreti, Dec 23 '19 at 07:31
@RaghavKukreti That has to do with HTML. The browser goes through every displayed word in the HTML to check whether the word matches or not. Keyword "displayed". The browser does not parse through the hidden elements AFAIK. — Aditya Prakash, Dec 23 '19 at 07:42
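
The "displayed text only" idea from the comment above can be sketched with Python's standard html.parser. This is a toy model under stated assumptions, not how any browser does it: the class name `VisibleTextExtractor` and the helper `visible_text` are made up, only script/style-like tags and elements carrying the `hidden` attribute are treated as not displayed (CSS such as display: none is ignored), and the input is assumed to close every non-void element explicitly.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside a skipped or hidden subtree."""

    SKIPPED = {"script", "style", "template", "noscript", "title"}
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside a subtree that is not displayed

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return  # void elements have no end tag, so they never change depth
        is_hidden = tag in self.SKIPPED or any(name == "hidden" for name, _ in attrs)
        if self.hidden_depth or is_hidden:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(markup: str) -> str:
    """Return the displayed text as one string (whitespace handling is simplified)."""
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title><style>p{color:red}</style></head>"
        "<body><p>Hello <b>world</b></p>"
        "<div hidden><p>secret</p></div>"
        "<script>var x = 'find me';</script>"
        "<p>Bye</p></body></html>"
    )
    print(visible_text(page))  # Hello world Bye
```

The resulting string is what a search algorithm such as Boyer-Moore would then be run over. A real implementation also has to map each match back to the nodes it came from so that it can be highlighted.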
