
These requirements are sketchy at the moment, but we would appreciate any insights. We are exploring what would be required to build a system that can handle 50 million database queries a day - specifically from the standpoint of programming language and database choice.

It's not a typical website, but an API / database accessed over the internet. Speed is critical. The application will primarily receive these inputs (a few kB each) and will have to resolve each of them via a database lookup. Only a few kB will be returned.

The server will be run over HTTPS/SSL.


Added:
* yes, there will be a couple of thousand inserts as well. I don't have insight into that yet, but let's say 10-50,000 / day.
* There may be updates also, but let's not complicate the problem yet.
* no, it won't be evenly spread during the day. As is typical, load will be higher during office / waking hours - perhaps following a normal curve. I don't know yet.
* database size will be 1.5 billion entries.
* the clients won't send SQL queries, just a number identifying the database entry to retrieve.

user36981

8 Answers


OK, let's see.

  • Language: Irrelevant. Really. You talk of clustered front ends anyway, and if you build them right, you can scale them pretty much as far as you want. That said, obviously stay away from interpreted languages (like "stock" PHP) and go with those that are at least just-in-time compiled (is there a JIT for PHP? Not sure). If you want the API to adhere to standards, that pretty much means a SOAP / REST based front end - ASP.NET / C# may be a good choice here, because that platform has very strong support for hosting web services, not only consuming them. You may also want to look at OData (http://www.odata.org/) for some of this. I am not sure how good web service hosting support is on other platforms, but you may want some of the finer points there, and MS is pushing web services hard. (A minimal sketch of the endpoint shape follows this list.)

  • Database: Looks like you are read-heavy. This is good, because it means you can work in a hub/spoke setup with one database centrally taking all the writes and replicating changes out to other computers, with reads distributed among those. That being said, you are talking about a massive setup here.
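To make the Language point above concrete: the lookup endpoint the question describes (client sends a number, server returns the matching entry) has a very simple shape regardless of language. The sketch below is Python purely for brevity, not a recommendation over ASP.NET / C#, and the store dict is a hypothetical stand-in for the database tier:

    from flask import Flask, abort, jsonify

    app = Flask(__name__)
    store = {}  # hypothetical stand-in for the clustered database tier

    @app.route("/entry/<int:entry_id>")
    def get_entry(entry_id):
        # Client sends a number; we return the few-kB entry for it.
        payload = store.get(entry_id)
        if payload is None:
            abort(404)
        return jsonify(id=entry_id, payload=payload)

    if __name__ == "__main__":
        app.run()  # production would sit behind HTTPS and a load balancer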

Now on to the load. You talk of a peak of possibly 100,000 to 250,000 queries per minute (depending on how high your peaks are - if many people use it at the start of the workday, it will be quite massive). 250,000 per minute is about 4,166 queries per second.

I personally think you are in the space for SQL Server / Oracle clustering. Anyhow, on SQL Server you would possibly go with:

  • A central database cluster to hold the master copy and take the writes: two mirrored instances (Enterprise Edition, possibly Standard, but that needs more details - under an SPLA license agreement) plus one small instance as witness. In a master/slave setup, one instance should actually be free of charge licensing-wise. If you manage to live with Standard Edition it is not so expensive, but you then need a maintenance window for index reorganization should the need arise. The small witness database can run on one of the web servers; it is merely used as a "third vote" on which database server to use in case of doubt (like parts of the network going down), deciding which of the servers will shut down.

That may well be enough to handle the load, if you do the DB design right and get some higher-end systems (dual six-core Opterons). You can possibly fit the whole hardware for one unit into a 2-rack-unit-high cage - Supermicro has some out with space for 24 2.5" hard discs. No need to go SAS - WD VelociRaptors should be more efficient, and if you add a fast SSD and an Adaptec RAID controller, you get the SSD as a read buffer ;) That should be more than enough to handle your load.

If that does not work out well enough performance-wise, you basically need more bang:

  • A cluster of replicated copies of the database. You could possibly use Web Edition here, which accepts being a replication target and is pretty cheap to use. Those would NOT get ANY updates / inserts directly - they are read-only copies. You can easily put a load balancer in front of them (which Windows has out of the box).

Similar setups should be possible with - well... not sure. Oracle - yes. MySQL - someone else can step in and answer.

TomTom
  • just a note re the Oracle cluster... I presume you're referring to RAC?... If so, then I feel obliged to warn you and advise you to think twice before going down that path... In short - avoid RAC... – rytis Mar 27 '10 at 20:17
  • I know. I've heard that more than once ;) My personal esteem of Oracle is very low... but then, I avoid them, so they may have changed... well, hope dies last ;) – TomTom Mar 27 '10 at 20:25

Your mean rate is about 600 queries per second. What do you know of the traffic pattern? Are all the queries going to come in at lunch time, only during office hours in a specific timezone, or what? Assuming all the queries are evenly spread over an 8h working day, you'll be planning for a peak of 2k queries per second.
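For reference, the arithmetic behind those figures (the 50M/day input is from the question; the rest is just division):

    QUERIES_PER_DAY = 50_000_000

    mean_qps = QUERIES_PER_DAY / (24 * 3600)        # ~579/s, i.e. roughly 600
    eight_hour_qps = QUERIES_PER_DAY / (8 * 3600)   # ~1736/s, i.e. roughly 2k

    print(f"mean: {mean_qps:.0f}/s, concentrated in 8h: {eight_hour_qps:.0f}/s")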

Database? Only if you have to - a simple key/value store would give higher performance. 1.5B records of (say) 4 kB each is 6 TB. Try this architecture:

5 frontends talking to a duplicated set of data stores. Maybe use 40 servers for that, storing 300 GB each. This means you can lose any one host and still continue to serve. If you're going to serve a novel result most of the time, I'd double that to 80 servers: you'll be incurring at least one disk seek for every query, and I wouldn't be so optimistic as to hope for a sustained 50 seeks a second.
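A minimal sketch of that routing idea, assuming numeric keys, 40 shards, and two replicas per shard; the host names and the fetch() helper are hypothetical stand-ins:

    import hashlib

    SHARDS = 40            # ~300 GB per shard for the 6 TB data set
    REPLICAS = ("a", "b")  # each shard lives on two independent hosts

    def fetch(host, key):
        # Stand-in for the real network call to a storage host.
        raise OSError(f"demo stub: no connection to {host}")

    def shard_hosts(key):
        # Hash the key so non-uniform key ranges still spread evenly.
        h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return [f"store-{h % SHARDS:02d}{r}.internal" for r in REPLICAS]

    def lookup(key):
        err = None
        for host in shard_hosts(key):
            try:
                return fetch(host, key)  # first try the primary replica
            except OSError as e:
                err = e                  # fall through to the other replica
        raise RuntimeError("all replicas for this shard are down") from err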

The programming language is irrelevant.

stribb

Building a database system that can handle 50 million queries a day is not a difficult task. With a large Cassandra server we are able to get ~100 reads per second per core, and ~25 writes per second per core. Based on your 50M number, I would suggest two 8-core systems. To hit those performance numbers you will need to tweak the OS, the disk setup and the memory specs.

Memory preloading per column family and layout (schema-less) tuning will need to be done.
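For concreteness, the read path with the DataStax Python driver might look like the sketch below; the contact points, keyspace, table and column names are assumptions, not something from this answer:

    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1", "10.0.0.2"])  # hypothetical contact points
    session = cluster.connect("lookup")          # hypothetical keyspace

    # Prepare once: avoids re-parsing the CQL on every one of ~600 reads/s.
    select_entry = session.prepare("SELECT payload FROM entries WHERE id = ?")

    def get_entry(entry_id):
        row = session.execute(select_entry, (entry_id,)).one()
        return row.payload if row else None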

Other options in the large relational-cluster arena are not as scalable, and the costs will be outrageous.

user67823

You're looking at about 35,000 requests a minute, assuming the load is spread evenly throughout the day (unlikely). That is a lot no matter what the system design is.

However, you've underspecified the problem. We don't know how big the database itself is, or how amenable the queries are to caching. We don't know the interface you're giving people: are you required to accept SQL, or is the query language more constrained? We also don't know how often the database is going to be updated, and how critical those updates are to subsequent queries.

The more ways you can constrain the problem, the better off you'll be.

jldugger

As you didn't provide a lot of specifics, I'm going to keep this brief as well. The language is really up to you, although C# / ASP.NET would fit well here. I'd go with a NoSQL database such as Cassandra: http://en.wikipedia.org/wiki/Cassandra_%28database%29

Lastly, with that ratio of reads to writes, be sure to plan your hardware (specifically your drive speed) accordingly.

Dave Holland

The issue that hasn't been discussed is the query mix going on here. Often you can solve half the problem by application redesign.

For instance, if a page hit produces 100 queries, optimise the code to only do 20 queries; then, if the page is hit a lot, also precompute the page contents and only recompute as needed (recomputing even once a minute would be far more efficient). This can produce a 100- to 1000-fold efficiency gain.

In an application of this size, you MUST put as much work into the application's data access patterns as into the actual implementation, or it will be many times more expensive for the organisation. Also, if performance is bad at the time of implementation, you are asking for trouble as the application grows. I've literally seen database runs reduced from 6 hours to 3 minutes by applying mature application and database design principles - and not just once, either.
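A minimal sketch of that precompute-and-reuse idea; compute_page() here is a hypothetical stand-in for the query-heavy rendering path:

    import time

    CACHE_TTL = 60.0  # seconds - "even every minute" per the text above
    _cache = {}       # page_id -> (built_at, rendered_page)

    def compute_page(page_id):
        return f"rendered {page_id}"  # stand-in for the 100-query version

    def get_page(page_id):
        now = time.monotonic()
        hit = _cache.get(page_id)
        if hit and now - hit[0] < CACHE_TTL:
            return hit[1]                # serve the precomputed copy
        page = compute_page(page_id)     # rebuild at most once per interval
        _cache[page_id] = (now, page)
        return page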

There are seasoned experts available in the field; it's simply a matter of knowing who to talk to. People who work in organizations dealing with applications of this size normally have access to those experts, hence the trolling comment above.


MySQL can handle thousands of queries per second on decent hardware, and if the app is coded to segregate read queries from update queries, it's really easy to set up read-only slaves for scaling reads. Whatever the language, make sure the app supports persistent connections and/or connection pooling.
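A sketch of that read/write segregation plus pooling, shown with SQLAlchemy; the hostnames, credentials and table are placeholders:

    from sqlalchemy import create_engine, text

    # Pooled connections: writes go to the master, reads to a read-only slave.
    primary = create_engine("mysql+pymysql://app:pw@db-master/lookup",
                            pool_size=20, pool_recycle=3600)
    replica = create_engine("mysql+pymysql://app:pw@db-slave/lookup",
                            pool_size=50, pool_recycle=3600)

    def read_entry(entry_id):
        with replica.connect() as conn:  # reads scale out across slaves
            return conn.execute(
                text("SELECT payload FROM entries WHERE id = :id"),
                {"id": entry_id}).fetchone()

    def insert_entry(entry_id, payload):
        with primary.begin() as conn:    # writes always hit the master
            conn.execute(
                text("INSERT INTO entries (id, payload) VALUES (:id, :p)"),
                {"id": entry_id, "p": payload})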

ggiroux

What is the total size of your database, and what are your hardware specs? These two points are the most crucial for answering your question, because on low-grade hardware with a poor setup you will not hit the performance mark you are looking for.

Rick James