
I look after a large site and have been studying other similar sites. In particular, I have had a look at flickr and deviantart. I have noticed that although they say they have a whole lot of data, they only ever display a portion of it.

I presume this is for performance reasons, but does anyone have an idea how they decide what to show and what not to show? Classic example: go to flickr and search a tag. Note the number of results stated just under the page links. Now calculate which page the last result would be on, and go to that page. You will find there is no data on that page. In fact, in my test, flickr said there were 5,500,000 results, but only displayed 4,000. What is this all about?

Do larger sites get so big that they have to start bringing old data offline? Deviantart has a wayback function, but I'm not quite sure what that does.

Any input would be great!

David

2 Answers


This is a type of performance optimisation. You don't need to scan the full table if you already have 4,000 results. A user will not go to page 3,897. When flickr runs a search query, it finds the first 4,000 results and then stops, rather than spending CPU and IO time finding additional results nobody will ever look at.

Andrey Frolov
  • Okay, I understand that. So, now that I know why they do it. How do we think they do it? They use pagination so to you reckon that after page 115 they just have code that says, stop serving stuff and running queries? – David Nov 10 '10 at 10:18
  • You can set limitation directly in SQL query. Like "select * from posts order by date limit 1000". This trick is described in every good book about SQL optimisation. – Andrey Frolov Nov 10 '10 at 10:53
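The early-stop trick described above can be sketched with a capped, paginated query. Here is a minimal illustration in Python with sqlite3; the table, column names, and the 4,000-result cap are hypothetical stand-ins for flickr's actual schema and limits:

```python
import sqlite3

# In-memory toy database standing in for a huge photo table
# (table and column names are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tag TEXT)")
conn.executemany("INSERT INTO posts (tag) VALUES (?)", [("sunset",)] * 10_000)

RESULT_CAP = 4_000   # stop serving results past this point, as flickr appears to
PAGE_SIZE = 10

def search(tag, page):
    """Serve one page of results, never looking past RESULT_CAP rows."""
    offset = (page - 1) * PAGE_SIZE
    if offset >= RESULT_CAP:
        return []  # beyond the cap: empty page, no query is even run
    rows = conn.execute(
        "SELECT id FROM posts WHERE tag = ? LIMIT ? OFFSET ?",
        (tag, min(PAGE_SIZE, RESULT_CAP - offset), offset),
    ).fetchall()
    return [r[0] for r in rows]

print(len(search("sunset", 1)))    # 10 — a normal page of results
print(len(search("sunset", 500)))  # 0 — page 500 is past the 4,000-result cap
```

The total match count shown to the user can come from a cheaper estimate (or a cached counter), which is why the advertised 5,500,000 and the 4,000 actually served can differ.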

I guess in a way it makes sense. If, after a search, the user does not click on any link until page 400 (assuming each page has 10 results), then either the user is a moron or a crawler is involved in some way.

Seriously speaking, if no favourable result turns up by page 40, the company concerned might need to fire their entire search team and adopt Lucene or Sphinx :)

What I mean is that they are better off improving their search accuracy than battling infrastructure problems trying to show more than 4,000 search results.

Srikar Appalaraju