
I want to detect whether a user has modified some data from its previous value. This data is stored in multiple fields across different database tables.

Let us say that the data on 1st run was:

'Roger Federer', 'Ferrari', 'Keyboard', 'Romeo'

and on 2nd run be:

'Roger Federer', 'Ferrari', 'Keyboard', 'Juliet'

I need to check whether the data on the 2nd run was modified. To achieve this, I created a key in which I stored the concatenation of all the fields. This results in something like 'RogerFedererFerrariKeyboardRomeo' for the first run and 'RogerFedererFerrariKeyboardJuliet' for the second. I then compare these keys. The problem with this approach is that I have more than 50 fields, so the key will be huge, which will impact my database.

As I am not concerned about the actual data values and only want to see if the data was modified, can I create POJO/JSON for my data and compute CRC32/MD5 checksum as keys and compare them? What could go wrong in this approach? Is there a better alternative solution?
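For reference, the checksum idea I have in mind looks roughly like this (field values are hard-coded for illustration; the separator guards against two different field splits producing the same concatenation):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowChecksum {

    // Join the fields with a separator so ("ab", "c") and ("a", "bc")
    // do not hash to the same value, then MD5 the result.
    static String checksum(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(String.join("\u0000", fields).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        String run1 = checksum("Roger Federer", "Ferrari", "Keyboard", "Romeo");
        String run2 = checksum("Roger Federer", "Ferrari", "Keyboard", "Juliet");
        System.out.println(run1.equals(run2) ? "unchanged" : "modified"); // prints "modified"
    }
}
```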

Federico klez Culloca

3 Answers


You may consider using a 'revision' or 'lastModified' attribute on your User model.

Depending on your framework and needs, you can prevent concurrent modification as well.

Regards.

  • The problem with revision and lastModified is that they keep updating even when the actual content is not modified but the update is still allowed. – silentsudo Jun 08 '20 at 12:11
  • For example, a user keeps updating his name, and on the DB side we only have a null check, so the update is allowed, the revision is incremented, and lastModified is updated, even though the content stayed the same. This implementation is good for knowing when data was last modified, but I have doubts about it for data integrity. – silentsudo Jun 08 '20 at 12:39
  • My opinion is to let DB Framework (ORM) to deal with data integrity, revision etc... If you have a specific need in detecting data modification, you may consider it as an application feature and code it in this way (in a Service class etc...) – Nicolas COSME Jun 08 '20 at 13:01
  • Yes, I agree with this, but to avoid complication on the code side we are just moving the complexity into DB queries. – silentsudo Jun 08 '20 at 13:02
  • Will not work for me as this will return false positives in cases when a field is updated and reverted back in the next update. – Apoorva G Jun 08 '20 at 13:40
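The revision idea from this answer could look like this with JPA/Hibernate (entity and field names are illustrative; note that `@Version` increments on every write, which is exactly the caveat raised in the comments):

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class User {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // Incremented by the JPA provider on every flush that writes the row,
    // even if the new values equal the old ones. It also gives optimistic
    // locking (concurrent-modification detection) for free.
    @Version
    private long revision;
}
```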

If you have access to the database, then I'd suggest creating a new hash column.

For example, follow along with this script:

drop database if exists test_md5;
create database test_md5;

use test_md5;

CREATE TABLE if not exists `test_users` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `username` varchar(255) not null,
  PRIMARY KEY (`id`)
);

CREATE TABLE if not exists `test_posts` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` text NOT NULL, 
  `user_id` bigint(20) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `fk_post_user` (`user_id`),
  CONSTRAINT `fk_post_user` FOREIGN KEY (`user_id`) REFERENCES `test_users` (`id`)
);

-- Users 
insert into test_users(username) values('john');
insert into test_users(username) values('jane');

-- John's test_posts Content
insert into test_posts(title, user_id) values('hey i am john, this is my very first post', 1);
insert into test_posts(title, user_id) values('hey i am john, this is my very first post', 1);
insert into test_posts(title, user_id) values('hey i am john again, i\'ll be going off from website for couple of days', 1);

-- Jane's test_posts Content
insert into test_posts(title, user_id) values('hey i am jane, this website looks a biut dry', 2);

Please note that the second row has the same content as row 1. Now, when I query the rows, I add an extra column computed from the fields that make up the data integrity of either a single row or of rows joined from multiple tables.

If we also added the posts.id column to the concat function, it would make even the duplicate rows distinguishable.


SELECT posts.id, posts.title, users.username,
       md5(concat(posts.title, users.username)) as hash
FROM test_posts as posts
INNER JOIN test_users as users ON posts.user_id = users.id;

In the first iteration, you would return the hash along with the data.

In the second iteration, you would have the hash from the request and the hash from the SQL response (or from the DB layer); you can compare these two values and decide what needs to be done next.

Another approach is to implement meaningful equals and hashCode for every model representing data in the database.
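The equals/hashCode approach could be sketched like this (a hypothetical Post model mirroring the tables above; only the tracked fields participate in the comparison):

```java
import java.util.Objects;

public class Post {
    private final long id;
    private final String title;
    private final String username;

    public Post(long id, String title, String username) {
        this.id = id;
        this.title = title;
        this.username = username;
    }

    // Two Post instances are "unmodified" when every tracked field matches.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Post)) return false;
        Post p = (Post) o;
        return id == p.id
                && Objects.equals(title, p.title)
                && Objects.equals(username, p.username);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, title, username);
    }
}
```

You would then load the stored copy, build a Post from the incoming data, and call equals on the pair.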

Another solution would be to convert the object to JSON, compute a checksum of it, and compare, but that is less efficient since it requires adding an extra JSON library, and the conversion to JSON itself is costly.

Reference from here:

https://ckhconsulting.com/row-or-rows-md5-hash-with-mysql/

silentsudo

A good way to do it without consuming many resources is to compare each field's value one by one; the comparison can stop the moment you find a difference.

Example :

Roger Federer | Ferrari | Keyboard | Romeo
Roger Federer | Ferrari | Keyboard2 | Juliet

Comparing field1 (same) --> field2 (same) --> field3 (different): stop comparison
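The short-circuit comparison above can be sketched like this (field lists stand in for the values read from the database):

```java
import java.util.List;
import java.util.Objects;

public class FieldCompare {

    // Returns true at the first differing field, so only the shared
    // prefix of identical fields is ever examined.
    static boolean isModified(List<String> oldFields, List<String> newFields) {
        if (oldFields.size() != newFields.size()) return true;
        for (int i = 0; i < oldFields.size(); i++) {
            if (!Objects.equals(oldFields.get(i), newFields.get(i))) {
                return true; // stop comparison here
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> run1 = List.of("Roger Federer", "Ferrari", "Keyboard", "Romeo");
        List<String> run2 = List.of("Roger Federer", "Ferrari", "Keyboard2", "Juliet");
        System.out.println(isModified(run1, run2)); // prints true (stops at field 3)
    }
}
```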

Karam Mohamed