
I'm working on a database system where there is one master db & table and many slaves, and the slaves need to synchronize their data from the master.

Some points about this arrangement:

1. The slave DBs will be offline from time to time, so when they come up again, the service will "sync" them. (So no replication.)
2. I can't change the master db, so no triggers. (Imagine the master db does not belong to me.)
3. The sync will happen in a service, and speed is not important.
4. There is not a massive amount of data - so I could even walk through each row.
5. The data in the rows varies wildly, from varchar(3) to varchar(500).

I was looking around and found MD5() - this I thought was great, because then I could just CONCAT() the row values and compare row-for-row. The problem I found was that CONCAT() (and GROUP_CONCAT()) is a little limiting, in that it has a character limit.
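
For illustration, a minimal sketch of that naive approach (col1/col2 are hypothetical placeholders; CONCAT_WS adds a separator so adjacent values can't blur together, and COALESCE maps NULL to '' - a bare CONCAT() returns NULL if any argument is NULL):

```sql
-- Naive per-row hash: concatenate the column values and hash them.
-- COALESCE keeps a single NULL column from nulling the whole row;
-- it does mean NULL and '' hash identically.
SELECT id,
       MD5(CONCAT_WS('|', COALESCE(col1, ''), COALESCE(col2, ''))) AS rowHash
FROM company_table;
```

This works until the concatenated row text gets long, which is exactly the limit described above.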

What's the best way to get a "row" CHECKSUM or HASH?

CHECKSUM ROW WHERE id = 1 would be great... what's the closest to this?
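
For what it's worth, the closest built-in seems to be table-level only, not per-row:

```sql
-- MySQL's CHECKSUM TABLE hashes a whole table, not individual rows,
-- so it can tell you *that* tables differ but not *which* rows do.
CHECKSUM TABLE company_table;
```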

UPDATE: I have managed to verify that "SELECT @@group_concat_max_len;" returns the maximum (4294967295). But this code still does not work:

SELECT id, MD5(GROUP_CONCAT(MD5(`id`), ...many columns here... ,MD5(`col30`))) AS 'md5Hash' FROM company_table GROUP BY id;

It still only works on a handful of columns.
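
One thing worth trying (an assumption on my side: that the server permits session-level changes, which would not require modifying the master itself): group_concat_max_len can be raised per connection, though the result is still capped by max_allowed_packet:

```sql
-- Raise the GROUP_CONCAT truncation limit for this session only.
-- The effective maximum is still bounded by max_allowed_packet.
SET SESSION group_concat_max_len = 4294967295;
SELECT @@SESSION.group_concat_max_len, @@max_allowed_packet;
```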

Conrad de Wet
  • You can increase the character limit for `group_concat` with the `group_concat_max_len` variable, effectively making it as big as your system maximally supports. – bishop Feb 22 '16 at 19:56
  • If it is a service, you could process everything inside the service and run a hash on the returned results that you concatenate in code if you can't set `group_concat_max_len` as suggested by bishop. As you said, speed is of no concern, and that will get you your desired results. – Frank J Feb 22 '16 at 19:59
  • Given your requirements and constraints, hashing with `MD5(GROUP_CONCAT(..)))` seems reasonable to me. In particular, see [this answer](http://stackoverflow.com/a/26733784/2908724). A code level solution as @FrankJ suggested might ultimately be the best approach, as it's going to be more flexible than a database level one liner. – bishop Feb 22 '16 at 20:05
  • 1
    Take a look at the percona toolkit. With the command pt_table sync can you sync tables or schemas. the only thing is that the table structure must be the same. But you can also use the Master / Slave Replication. its starts everytime when the DBs are online. – Bernd Buffen Feb 22 '16 at 20:20
  • Thanks bishop, I forgot to mention that I tried `group_concat_max_len`, but this didn't work either. I'm not sure these settings are manipulable from my connection - imagine that I cannot change, or rely on any changes to, the remote system at all. – Conrad de Wet Feb 23 '16 at 06:59
  • I think if I can MD5(GROUP_CONCAT(MD5(col1), ..., MD5(colN))), I might get a consistent length to work with? (See the sketch after these comments.) – Conrad de Wet Feb 23 '16 at 07:00
  • I still don't understand why you can't use standard asynchronous MySQL replication. The master saves all the modification queries and syncs them to the slaves. Slaves being offline from time to time is no problem. – PerroVerd Feb 23 '16 at 16:00
  • The table(s) in question are one or two out of many, and some of the other tables are truly massive, with billions of rows. I didn't think replication could be selective... can it? Also, I need some data in the slaves to become master data, to be added to the master... ha... look, just trust me, replication is not an option. – Conrad de Wet Feb 23 '16 at 16:10
  • OK, I'll trust your word :) Just FYI, you can have selective replication (replicate-ignore-table and others) and also multi-master options with GTID, but as you describe it, your case is probably too complex for generic solutions. – PerroVerd Feb 23 '16 at 17:16
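
A hedged sketch of the idea floated in the comments above - hash each column first, so the outer function only ever sees fixed-length 32-character hex strings, and the total length is a predictable 32 × column count (column names are hypothetical):

```sql
-- Per-column MD5 first: every MD5() result is exactly 32 hex chars.
-- IFNULL matters because MD5(NULL) is NULL, which would null out
-- the surrounding CONCAT().
SELECT id,
       MD5(CONCAT(IFNULL(MD5(col1), ''),
                  IFNULL(MD5(col2), ''))) AS md5Hash
FROM company_table;
```

Note that if `id` is unique, a plain CONCAT() per row sidesteps GROUP_CONCAT and its `group_concat_max_len` truncation entirely.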

1 Answer


OK, I'm going to answer this because I think I have gotten far enough.

So it turns out something like this:

SELECT id, MD5(CONCAT(IFNULL(CRC32(col1), CRC32('')), ... , IFNULL(CRC32(col20), CRC32('')))) AS md5Hash FROM company_table;

This resulted in the least string for CONCAT to work with, because CRC32 produces at most a 10-digit value (4,294,967,295). So even if the CONCAT or GROUP_CONCAT limit is 1024 characters, at 10 digits per CRC we can hash a row of up to 102 columns in one pass.

For tables with more than 102 columns, the best option was a multi-pass system: hash the columns in chunks, then combine the partial hashes.
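
A hedged sketch of that multi-pass idea (hypothetical column names, only two columns per pass shown; LPAD pads each CRC to exactly 10 digits, which keeps the arithmetic above exact and avoids boundary ambiguity between adjacent values):

```sql
-- Each pass collapses a chunk of columns to one 32-char MD5,
-- so combining the passes stays tiny no matter how wide the table.
-- LPAD gives every CRC a fixed width of 10; IFNULL handles NULLs,
-- since CRC32(NULL) is NULL (note CRC32('') is 0, so NULL and ''
-- still collide, as in the one-pass query above).
SELECT id,
       MD5(CONCAT(pass1, pass2)) AS md5Hash
FROM (
    SELECT id,
           MD5(CONCAT(LPAD(IFNULL(CRC32(col1), 0), 10, '0'),
                      LPAD(IFNULL(CRC32(col2), 0), 10, '0'))) AS pass1,
           MD5(CONCAT(LPAD(IFNULL(CRC32(col3), 0), 10, '0'),
                      LPAD(IFNULL(CRC32(col4), 0), 10, '0'))) AS pass2
    FROM company_table
) AS passes;
```

Each inner pass could just as well run as its own query, with the service combining the 32-character pass hashes in code.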

A little wasteful, but still better than hashing client side... the I/O saving is still massive, and besides, who makes tables with 102 columns?!

Conrad de Wet