0

I am considering writing a Cascading application that issues SELECT statements to MYSQL databases where each query can return millions of rows.

Each database exists on N slaves and one master, as shown here: http://dev.mysql.com/doc/refman/5.0/en/replication-solutions-scaleout.html#figure_replication-scaleout

In JDBCTap, I see that we can only pass one hostname: this means that all my mappers would try and connect to a single slave (while the other (N-1) slaves are idle).

Assuming there is no load balancer in front of the slaves, is there a JAVA package/module/proxy (or better, a customized JDBCTap implementation) I can use that can spread out the connection of the mappers across the N slaves?

A sample script using a hardware proxy that seems to do this is detailed here: http://blog.netoearth.com/html/201006/building-a-mysql-load-balancing-proxy-with-trafficscript.htm

If such a thing does not exist, I would have to create my own JDBCTap (that spreads out the connection from the mappers across the N slaves) using the following algorithm:

  1. Keep a list of the N slaves on an external database (anything that supports concurrent writes/updates well)
  2. Have a counter of mappers that have connected against each slave
  3. Every-time a mapper needs to connect to a slave, it queries this database to find out the slave that is least loaded(has min count of mappers connected to it). Random tie-break on collision.

Does this sound like a good idea? What could be the external database "that supports concurrent writes/updates well" that I referred to above? (Cassandra, VoltDB, etc)

newToFlume
  • 51
  • 1
  • 8

1 Answers1

0

JDBC in general will not support parallelism, so that presents a bottleneck for large-scale loads.

In that kind of primary/standby architecture, how about pulling data from the standbys via MySQL backups to TSV, then loading those in parallel into HDFS?

Paco
  • 602
  • 1
  • 9
  • 19