I am considering writing a Cascading application that issues SELECT statements to MYSQL databases where each query can return millions of rows.
Each database exists on N slaves and one master, as shown here: http://dev.mysql.com/doc/refman/5.0/en/replication-solutions-scaleout.html#figure_replication-scaleout
In JDBCTap, I see that we can only pass one hostname: this means that all my mappers would try and connect to a single slave (while the other (N-1) slaves are idle).
Assuming there is no load balancer in front of the slaves, is there a JAVA package/module/proxy (or better, a customized JDBCTap implementation) I can use that can spread out the connection of the mappers across the N slaves?
A sample script using a hardware proxy that seems to do this is detailed here: http://blog.netoearth.com/html/201006/building-a-mysql-load-balancing-proxy-with-trafficscript.htm
If such a thing does not exist, I would have to create my own JDBCTap (that spreads out the connection from the mappers across the N slaves) using the following algorithm:
- Keep a list of the N slaves on an external database (anything that supports concurrent writes/updates well)
- Have a counter of mappers that have connected against each slave
- Every-time a mapper needs to connect to a slave, it queries this database to find out the slave that is least loaded(has min count of mappers connected to it). Random tie-break on collision.
Does this sound like a good idea? What could be the external database "that supports concurrent writes/updates well" that I referred to above? (Cassandra, VoltDB, etc)