
I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index their frontpage or any of their main pages, then within hours my index for that page will be out of date.

Does a large search engine such as Google have an algorithm to re-crawl frequently updated pages very frequently, hourly even? Or does it just score frequently updated pages very low so they don't get returned?

How can I handle this in my own application?

OdieO

4 Answers


Good question. This is actually an active topic in the WWW research community. The technique involved is called a re-crawl strategy or page refresh policy.

As far as I know, three different factors are considered in the literature:

  • Change frequency (how often the content of a web page is updated)
    • [1]: Formalizes the notion of "freshness" of data and uses a Poisson process to model the changes of web pages.
    • [2]: Frequency estimator
    • [3]: More of a scheduling policy
  • Relevance (how much influence the updated page content has on search results)
    • [4]: Maximizes the quality of the user experience for those who query the search engine
    • [5]: Determines the (nearly) optimal crawling frequencies
  • Information longevity (the lifetimes of content fragments that appear on and disappear from web pages over time, which is shown to be not strongly correlated with change frequency)
    • [6]: Distinguishes between ephemeral and persistent content

You might want to decide which factor is more important for your application and users. Then you can check the references below for more details.


Edit: I briefly discuss the frequency estimator mentioned in [2] to get you started. Based on this, you should be able to figure out what might be useful to you in the other papers. :)

Please follow the order I point out below when reading the paper. It should not be too hard to understand as long as you know some probability and stats 101 (maybe much less if you just take the estimator formula):

Step 1. Please go to Section 6.4 -- Application to a Web crawler. Here Cho lists three approaches to estimating web page change frequency.

  • Uniform policy: A crawler revisits all pages at the frequency of once every week.
  • Naive policy: In the first 5 visits, a crawler visits each page at the frequency of once every week. After the 5 visits, the crawler estimates the change frequencies of the pages using the naive estimator (Section 4.1).
  • Our policy: The crawler uses the proposed estimator (Section 4.2) to estimate change frequency.

Step 2. The naive policy. Please go to Section 4. You will read:

Intuitively, we may use X/T (X: the number of detected changes, T: the monitoring period) as the estimated frequency of change.

The subsequent Section 4.1 proves that this estimator is biased, inconsistent, and inefficient.

Step 3. The improved estimator. Please go to Section 4.2. The new estimator is:

\hat{r} = -\log\left(\frac{\bar{X} + 0.5}{n + 0.5}\right)

where \bar{X} = n - X (the number of accesses on which the element did not change) and n is the total number of accesses. So just take this formula and estimate the change frequency; you don't need to understand the proof in the rest of the sub-section.
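To make this concrete, here is a minimal Python sketch of both estimators (variable names are mine, not from the paper). As I read it, \hat{r} estimates the number of changes per access, so dividing by the access interval converts it to changes per unit time:

```python
import math

def naive_change_rate(num_changes, monitoring_period):
    """Naive estimator (Section 4.1): X / T.
    Biased, because several changes between two visits are counted as one."""
    return num_changes / monitoring_period

def improved_change_rate(num_changes, num_accesses, access_interval):
    """Improved estimator (Section 4.2): -log((X_bar + 0.5) / (n + 0.5)),
    where X_bar = n - X is the number of accesses with no detected change.
    Dividing by the access interval converts it to changes per unit time."""
    x_bar = num_accesses - num_changes
    r_hat = -math.log((x_bar + 0.5) / (num_accesses + 0.5))
    return r_hat / access_interval

# Example: a page checked once a day for 30 days changed on 10 of those checks.
print(naive_change_rate(10, 30))          # ~0.33 changes/day
print(improved_change_rate(10, 30, 1.0))  # ~0.40 changes/day
```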

Step 4. There are some tricks and useful techniques discussed in Section 4.3 and Section 5 that might be helpful to you. Section 4.3 discusses how to deal with irregular intervals. Section 5 addresses the question: when the last-modification date of an element is available, how can we use it to estimate the change frequency? The proposed estimator using the last-modification date is shown below:

[image: the estimator based on last-modification dates; see Fig. 10 in the paper]

The explanation of the above algorithm following Fig. 10 in the paper is very clear.

Step 5. Now, if you are interested, you can take a look at the experiment setup and results in Section 6.

So that's it. If you feel more confident now, go ahead and try the freshness paper in [1].


References

[1] http://oak.cs.ucla.edu/~cho/papers/cho-tods03.pdf

[2] http://oak.cs.ucla.edu/~cho/papers/cho-freq.pdf

[3] http://hal.inria.fr/docs/00/07/33/72/PDF/RR-3317.pdf

[4] http://wwwconference.org/proceedings/www2005/docs/p401.pdf

[5] http://www.columbia.edu/~js1353/pubs/wolf-www02.pdf

[6] http://infolab.stanford.edu/~olston/publications/www08.pdf

greeness
    Quite advanced stuff, my head hurts a bit when reading it. Thanks. – Swader Oct 25 '12 at 10:31
  • @Swader : What is the value of "fresh information" for the end-users? Is it strictly negative exponential in time? Are all users the same in the form and scale of this function; are all sites the same for all users? This does require a bit of optimization number-crunching. – Deer Hunter Oct 25 '12 at 20:33
  • All users and sites are the same in form and scale. In other words, the final goal is to simply have a searchable directory of data crawled elsewhere. – Swader Oct 25 '12 at 20:39
  • @Swader, when you read those papers, you don't need to understand the proof. Just check out their models and some tricks they use. I will add more intro to my post to get you started. – greeness Oct 25 '12 at 21:12
  • Brilliant stuff, there's actual science in this. Thanks! – Swader Oct 29 '12 at 09:24

Google's algorithms are mostly closed; they won't tell you how they do it.

I built a crawler using the concept of a directed graph and based the re-crawl rate on pages' degree centrality. You could consider a website to be a directed graph with pages as nodes and hyperlinks as edges. A node with high centrality will probably be a page that is updated more often. At least, that is the assumption.

This can be implemented by storing URLs and the links between them. If you crawl and don't throw away any links, the graph per site will grow. Calculating the (normalised) in- and out-degree for every node of a site then gives you a measure of which pages are most interesting to re-crawl more often.
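As a rough illustration of this idea (a sketch, not my actual crawler), here is how you might store one site's link graph and rank its pages by normalised degree; all URLs below are made up:

```python
from collections import defaultdict

class SiteGraph:
    """Directed link graph of a single site: pages are nodes, hyperlinks are edges."""

    def __init__(self):
        self.out_links = defaultdict(set)   # page -> pages it links to
        self.in_links = defaultdict(set)    # page -> pages linking to it

    def add_link(self, src, dst):
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)
        # make sure both endpoints exist as nodes
        self.in_links.setdefault(src, set())
        self.out_links.setdefault(dst, set())

    def recrawl_priority(self):
        """Normalised degree (in-degree + out-degree, divided by n - 1),
        used as a proxy for 'probably updated more often'."""
        nodes = set(self.out_links) | set(self.in_links)
        n = len(nodes)
        if n <= 1:
            return {page: 0.0 for page in nodes}
        return {
            page: (len(self.in_links[page]) + len(self.out_links[page])) / (n - 1)
            for page in nodes
        }

# Hypothetical example: the front page links to two articles, one links back.
g = SiteGraph()
g.add_link("http://example.com/", "http://example.com/article-1")
g.add_link("http://example.com/", "http://example.com/article-2")
g.add_link("http://example.com/article-1", "http://example.com/")

# Re-crawl the highest-centrality pages first.
for page, score in sorted(g.recrawl_priority().items(), key=lambda kv: -kv[1]):
    print(score, page)
```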

TTT
  • A solid theory, but how would this apply to my original problem of having a directory of people who are spread out across 2300 pages, any of which can be updated at any given moment (thus also changing all the others as the change cascades into all later pages)? – Swader Oct 25 '12 at 18:49
  • If any page can be updated at any time with the same probability, and that's all we know, there is no way of telling which page will be updated next. If that is the case, this concept won't work. The idea I gave considers every page in relation to the other pages of a site. You might then be looking for a method that predicts the need to re-crawl based *only* on the page itself. – TTT Oct 25 '12 at 18:57
  • In that case, greeness's answer may help more, especially **relevance** and **change frequency**. – TTT Oct 25 '12 at 19:05

Try to keep some per-frontpage stats on update frequency. Detecting an update is easy: just store the ETag/Last-Modified values and send If-None-Match/If-Modified-Since headers with your next request. Keeping a running average of the update frequency (say, over the last 24 crawls) allows you to determine the update frequency of the front pages fairly accurately.
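A minimal sketch of that bookkeeping in Python, using the requests library and an in-memory dict as the store (a real crawler would persist this):

```python
import time
import requests

# page_state[url] = {"etag": ..., "last_modified": ..., "change_times": [...]}
page_state = {}

def check_page(url):
    """Conditional GET: only counts a change when the server says the page changed."""
    state = page_state.setdefault(
        url, {"etag": None, "last_modified": None, "change_times": []}
    )
    headers = {}
    if state["etag"]:
        headers["If-None-Match"] = state["etag"]
    if state["last_modified"]:
        headers["If-Modified-Since"] = state["last_modified"]

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:          # Not Modified
        return False
    state["etag"] = resp.headers.get("ETag")
    state["last_modified"] = resp.headers.get("Last-Modified")
    state["change_times"].append(time.time())
    return True

def average_update_interval(url, window=24):
    """Running average of the interval between detected changes (last `window` crawls)."""
    times = page_state.get(url, {}).get("change_times", [])[-window:]
    if len(times) < 2:
        return None                      # not enough data yet
    return (times[-1] - times[0]) / (len(times) - 1)
```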

After having crawled a front page you would determine when the next update is expected and put a new crawl job in a bucket right around that time (buckets of one hour are typically a good balance between fast and polite). Every hour you would simply take the corresponding bucket and add the jobs to your job queue. This way you can have any number of crawlers and still have a lot of control over the scheduling of the individual crawls.
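And a sketch of the hourly bucketing, reusing the hypothetical average_update_interval() helper from the previous snippet:

```python
from collections import defaultdict
import time

HOUR = 3600
buckets = defaultdict(list)   # hour index -> list of URLs due in that hour

def schedule_next_crawl(url, now=None):
    """Put the next crawl job into the bucket around the expected update time."""
    now = now or time.time()
    interval = average_update_interval(url) or 24 * HOUR   # fall back to daily
    due_hour = int((now + interval) // HOUR)
    buckets[due_hour].append(url)

def pop_due_jobs(now=None):
    """Called once an hour: drain the current bucket into the crawl queue."""
    now = now or time.time()
    return buckets.pop(int(now // HOUR), [])
```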

simonmenke
  • Thanks. Allow me to ask about something more specific, though: what about the case of crawling various directories? For instance, a page that has a directory of people who are searchable, but can be browsed alphabetically without filters? Or a page that collects articles and posts them in the order of their online publication date? How would one detect that there was a new entry injected on, say, page 34? Would I have to re-crawl all available pages? – Swader Oct 20 '12 at 09:52
  • The listing pages would obviously have new ETag headers (but not necessarily new Last-Modified headers). In most cases you would have to re-crawl the listing pages. But when you are also following the links to the individual article pages, you would only need to crawl the new posts. – simonmenke Oct 21 '12 at 09:27
  • ETag/Last-Modified are not trustworthy indicators of page modification, especially for dynamically generated content. In many cases these values are generated inaccurately by the language interpreter. – AMIB Oct 22 '12 at 07:47
  • You should keep a copy of the pages and adjust the crawl rate so that the modification rate of the pages can be determined reasonably well. Look at the "Google Webmasters" options; you will find some useful tips there. – AMIB Oct 22 '12 at 07:54
  • @Swader Correct me if I'm wrong, but if you index search results within the crawled website, you will end up with duplicate content. So my guess is that you will need to ignore the website's search and the different (order) filters. You could scan the webpage for "form" tags, detect "input" names, and ignore any URI that contains those names. That way you could avoid crawling search results within the website. – Alexandru Guzinschi Oct 27 '12 at 21:35

I'm not an expert in this topic by any stretch of the imagination, but Sitemaps are one way to alleviate this problem.

In its simplest terms, a XML Sitemap—usually called Sitemap, with a capital S—is a list of the pages on your website. Creating and submitting a Sitemap helps make sure that Google knows about all the pages on your site, including URLs that may not be discoverable by Google's normal crawling process. In addition, you can also use Sitemaps to provide Google with metadata about specific types of content on your site, including video, images, mobile, and News.

Google uses this specifically to help them crawl news sites. You can find more info here on Sitemaps and info about Google News and Sitemaps here.

Usually, you can find the sitemap.xml referenced in a website's robots.txt. For example, TechCrunch's Sitemap is just

http://techcrunch.com/sitemap.xml

which turns this problem into parsing XML on a regular basis. If you can't find it in robots.txt, you can always contact the webmaster and see if they'll provide it to you.
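A minimal sketch of that regular parsing in Python, assuming the sitemap follows the standard sitemaps.org format (a large site may instead serve a sitemap index that points to several sub-sitemaps):

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_sitemap_entries(sitemap_url):
    """Return (url, lastmod) pairs from a standard sitemap.xml."""
    resp = requests.get(sitemap_url, timeout=10)
    root = ET.fromstring(resp.content)
    entries = []
    for url_elem in root.findall("sm:url", SITEMAP_NS):
        loc = url_elem.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url_elem.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries

# Queue any page whose <lastmod> is newer than what we crawled last time.
for url, lastmod in fetch_sitemap_entries("http://techcrunch.com/sitemap.xml"):
    print(url, lastmod)
```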

UPDATE 1 (Oct 24 2012, 10:45 AM):

I spoke with one of my team members and he gave me some additional insight into how we handle this problem. I really want to reiterate that this isn't a simple issue and requires a lot of partial solutions.

Another thing we do is monitor several "index pages" for changes on a given domain. Take the New York Times for example. We create one index page for the top-level domain at:

http://www.nytimes.com/

If you take a look at the page, you will notice additional sub-areas like World, US, Politics, Business, etc. We create additional index pages for all of them. Business has additional nested index pages like Global, DealBook, Markets, Economy, etc. It isn't uncommon for a site to have 20-plus index pages. If we notice any additional URLs added to an index page, we add them to the crawl queue.
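A rough sketch of that monitoring step (not our actual pipeline): fetch an index page, extract its links, and queue anything not seen before. The regex-based link extraction is only for illustration; a real crawler would use a proper HTML parser:

```python
import re
import requests

seen_urls = set()      # URLs already known (simplified: one global set)
crawl_queue = []

def monitor_index_page(index_url):
    """Fetch an index page and queue any newly appearing URLs."""
    html = requests.get(index_url, timeout=10).text
    links = set(re.findall(r'href="(https?://[^"]+)"', html))
    new_links = links - seen_urls
    seen_urls.update(new_links)
    crawl_queue.extend(new_links)
    return new_links

monitor_index_page("http://www.nytimes.com/")
print(len(crawl_queue), "new URLs queued")
```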

Obviously this is very frustrating because you may have to do this by hand for every website you want to crawl. You may want to consider paying for a solution. We use SuprFeedr and are quite happy with it.

Also, many websites still offer RSS, which is an effective way of crawling pages. I would still recommend contacting a webmaster to see if they have any simple solution to help you out.

sunnyrjuneja
  • Good advice for websites that offer sitemaps. I am, unfortunately, dealing with some that don't keep their sitemaps up to date, or don't have them at all. – Swader Oct 24 '12 at 10:16