5

Is there a name for this technique that consists in exploring a page open in the browser to find specific content and modify it?

Some examples:

  • Skype finds phone numbers on a page, and attaches a call menu
  • a script finds percentages in a page and replaces them with a small pie
  • an advertising engine finds keywords in the page and converts them into hyperlinks
  • add an icon next to all the hyperlinks on the page that point to another domain
  • etc.

I understand that it is a kind of progressive enhancement. But I am specifically interested in the first step, the content discovery process. I'd be interested in articles that offer best practices, or explain the shortcomings of this technique.

Edit: I added an example to show that this technique is not just for text nodes, but can apply to any kind of html content.

Christophe
  • Just walk the DOM and process all text nodes... – Šime Vidas Dec 07 '11 at 19:47
  • web/html/content/text parsing/scraping – Cory Danielson Dec 07 '11 at 19:48
  • Sime Vidas: sure, I do this all the time. But this doesn't tell me much about best practices and shortcomings! – Christophe Dec 07 '11 at 19:51
  • @Christophe The DOM traversal API is implemented in all browsers. It's fast and straightforward. This also goes for string manipulation. I can't think of any shortcomings. – Šime Vidas Dec 07 '11 at 20:00
  • An example of an issue I'm facing: content that is added asynchronously to the DOM. – Christophe Dec 07 '11 at 20:10
  • @Christophe The last example: 1. get all anchors on the page, 2. for each anchor, analyze its `href` property, 3. add a CSS class to those anchors that have a foreign domain. This is pretty straightforward, I don't see what sort of best practices you're after.. – Šime Vidas Dec 07 '11 at 20:40
  • @Christophe So you would like to be notified whenever a new anchor is added to the DOM, so that you can conditionally show an icon next to it? – Šime Vidas Dec 07 '11 at 20:45
  • Sime Vidas: that's the idea. But it could also be that an anchor is removed, or its href is modified dynamically, etc. Or, in my pie example, it could be a value that is updated every 30 seconds. I am trying to understand the pattern, not solve a specific issue. – Christophe Dec 07 '11 at 21:31
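
The three steps listed in the comments for the external-link example (get the anchors, analyze each `href`, add a CSS class) could be sketched roughly as follows. This is only an illustration: the function names `isExternalLink` and `markExternalLinks` and the class name `external-link` are made up for this sketch, and it assumes a current browser where the standard `URL` constructor is available.

```javascript
// Hypothetical helper: true when `href` points to a different host than `pageUrl`.
function isExternalLink(href, pageUrl) {
    try {
        const target = new URL(href, pageUrl); // resolves relative hrefs against the page
        // Ignore mailto:, javascript:, tel: and similar non-web schemes.
        if (target.protocol !== 'http:' && target.protocol !== 'https:') {
            return false;
        }
        return target.hostname !== new URL(pageUrl).hostname;
    } catch (e) {
        return false; // unparsable href: leave the link alone
    }
}

// Steps 1 to 3 from the comment. "external-link" is an assumed class name.
function markExternalLinks(root, pageUrl) {
    const anchors = root.querySelectorAll('a[href]');
    for (let i = 0; i < anchors.length; i++) {
        if (isExternalLink(anchors[i].getAttribute('href'), pageUrl)) {
            anchors[i].classList.add('external-link');
        }
    }
}

// Only runs in a browser; the pure helper above can be tested anywhere.
if (typeof document !== 'undefined') {
    markExternalLinks(document, location.href);
}
```

Keeping the discovery test (`isExternalLink`) separate from the modification (`classList.add`) makes it easier to re-run the check when an `href` changes later. Note that this sketch compares host names only, so subdomains of the same site count as external.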

5 Answers

5

For example, run this code on this web page (from the console), and every digit in the page's text will be replaced with "X":

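// Visit `node` and all of its descendants in document order,
// calling `func` on each one.
function walkTheDOM( node, func ) {
    func( node );
    node = node.firstChild;
    while ( node ) {
        walkTheDOM( node, func );
        node = node.nextSibling;
    }
}

// Replace every digit in every text node under <body> with "X".
walkTheDOM( document.body, function ( node ) {
    if ( node.nodeType === 3 ) { // 3 === Node.TEXT_NODE
        node.data = node.data.replace( /\d/g, 'X' );
    }
});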
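
Since the question is mostly about the discovery step, here is a rough sketch that only finds content and leaves the modification to the caller, using the percentage example from the question. It swaps the recursive walk above for the standard `document.createTreeWalker` API, which is a different technique with the same effect. The names `findPercentages` and `discoverPercentages` and the regular expression are assumptions made for this sketch.

```javascript
// Hypothetical helper: list the percentages found in a string,
// with the position of each match.
function findPercentages(text) {
    const pattern = /(\d+(?:\.\d+)?)\s*%/g;
    const found = [];
    let match;
    while ((match = pattern.exec(text)) !== null) {
        found.push({ index: match.index, value: parseFloat(match[1]) });
    }
    return found;
}

// Discovery only: collect the text nodes under `root` that contain
// at least one percentage. Nothing in the page is modified here.
// Browser only, since it relies on `document` and `NodeFilter`.
function discoverPercentages(root) {
    const hits = [];
    const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    let node;
    while ((node = walker.nextNode())) {
        const found = findPercentages(node.data);
        if (found.length > 0) {
            hits.push({ node: node, percentages: found });
        }
    }
    return hits;
}
```

In a browser console, `discoverPercentages(document.body)` would return the matching text nodes, and a separate step could then replace each match with a pie chart or any other widget.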


Šime Vidas
  • Thanks for example. I realized that my initial examples were too specific and edited the question. – Christophe Dec 07 '11 at 20:04
  • @EdS. Of course you can. SHIFT + ENTER. You can also write your code somewhere else and then just copy-paste it into the console... – Šime Vidas Dec 07 '11 at 22:21
  • @ŠimeVidas: Thank you. I tried Ctrl+Enter and Shift+Enter... then I googled it... found nothing... moved on. I'm not a web dev, I just dabble. Thanks again. – Ed S. Dec 07 '11 at 22:22
  • I call that greasemonkey scripts ^^ – Guillaume86 Dec 15 '11 at 22:29
0

This functionality is provided by add-ons, and the technique they use is DOM traversal.

The cases you describe are not specific to one site but appear on every site you visit, so some extra functionality must have been added to your browser. This often happens when you accept the option to install toolbars and the like while installing new software such as Skype.

The technique can be called recognition (as in PNR, Skype Phone Number Recognition), and what these add-ons do is traverse the DOM of the site.

The add-ons described above probably run only on page load, so content added later via Ajax will not be affected.
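
For content that arrives after page load, one option in current browsers is the standard `MutationObserver` API. The following is a minimal sketch under that assumption, not a description of how Skype or any other add-on actually works. The names `collectAddedElements` and `watchForNewContent` are made up for the example.

```javascript
// Hypothetical helper: flatten mutation records into the element nodes
// that were added to the document.
function collectAddedElements(records) {
    const added = [];
    for (const record of records) {
        for (const node of record.addedNodes) {
            if (node.nodeType === 1) { // 1 === Node.ELEMENT_NODE
                added.push(node);
            }
        }
    }
    return added;
}

// Re-run an enhancement function on every subtree inserted under `root`.
function watchForNewContent(root, enhance) {
    const observer = new MutationObserver(function (records) {
        collectAddedElements(records).forEach(enhance);
    });
    observer.observe(root, { childList: true, subtree: true });
    return observer; // call observer.disconnect() to stop watching
}

// Only runs where the DOM and MutationObserver are available.
if (typeof document !== 'undefined' && typeof MutationObserver !== 'undefined') {
    watchForNewContent(document.body, function (element) {
        console.log('new content to scan:', element.tagName);
    });
}
```

If the `enhance` callback itself inserts nodes, the observer will be notified again, so a real script would need to mark or skip content it has already processed. Watching for changed text or a changed `href` would also require the `characterData` or `attributes` options of `observe`.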

If it's your own add-on, there is a way to access it from JavaScript, as described here: how to call a function in Firefox extension from a html button.

Also take a look at GreaseMonkey and jQuery traversing.

Community
  • Try triggering a hashchange() after the DOM content is loaded with Ajax, and see if the add-on runs and appends its stuff again. –  Dec 07 '11 at 20:35
0

So the conclusion for now is that there doesn't seem to be a name or established practices for this technique.

Thanks to those who have mentioned search engines, it makes sense to see it as a local search, with an effort to interpret the content and structure.

Christophe
-1

Summarization

It is the technique used by all web crawlers. Have a look at Yioop!, an open-source, well-documented web crawler and search engine.

hrishikeshp19
  • I don't know, but maybe your answer would need to be more detailed? I have looked up definitions of summarization, but didn't find anything directly related to the question. Also, I followed your link to Yioop, but didn't see any documentation. – Christophe Dec 15 '11 at 21:22
  • Would anyone mind giving a reason for the downvote? Please see when to downvote: http://stackoverflow.com/privileges/vote-down – hrishikeshp19 Feb 02 '12 at 18:54
-1

As already said, it is called summarization, but you can find out more by searching for the term "web crawling bot/technique/robot". Here is a starting document you might find useful:

Crawling the Web

kapex
Siblja