
I'm working on a system that mirrors remote datasets using initials and deltas. When an initial comes in, it mass-deletes anything preexisting and mass-inserts the fresh data. When a delta comes in, the system does a fair amount of work to translate it into updates, inserts, and deletes. Initials and deltas are processed inside long transactions to maintain data integrity.

Unfortunately, the current solution isn't scaling very well. The transactions are so large and long-running that our RDBMS bogs down with various contention problems. Also, there isn't a good audit trail of how the deltas are applied, which makes it difficult to troubleshoot issues that cause the local and remote versions of the dataset to drift out of sync.

One idea is to not run the initials and deltas in transactions at all, and instead to attach a version number to each record indicating which delta or initial it came from. Once an initial or delta is successfully loaded, the application can be alerted that a new version of the dataset is available.
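Roughly what I have in mind, as a sketch (all table and column names here are just illustrative, not something we've built):

    -- Version registry: one row per initial or delta load.
    CREATE TABLE dataset_version (
        dataset_version_id INT AUTO_INCREMENT PRIMARY KEY,
        kind               ENUM('initial', 'delta') NOT NULL,
        loaded_at          DATETIME NULL  -- set once the load finishes
    );

    -- Example data table; every row is stamped with the version it came from.
    CREATE TABLE widget (
        widget_id          INT NOT NULL,
        payload            VARCHAR(255),
        dataset_version_id INT NOT NULL,
        PRIMARY KEY (widget_id, dataset_version_id)
    );

    -- Loader: open a version, bulk-insert rows stamped with it, then mark it
    -- loaded so the application knows a new version is available.
    INSERT INTO dataset_version (kind) VALUES ('delta');
    SET @v = LAST_INSERT_ID();
    -- ... insert the delta's rows with dataset_version_id = @v ...
    UPDATE dataset_version SET loaded_at = NOW()
     WHERE dataset_version_id = @v;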

This just leaves the issue of how exactly to compose a view of the dataset up to a given version from the initial and subsequent deltas. (Apple's Time Machine does something similar, using hard links on the file system to create a "view" of a certain point in time.)

Does anyone have experience solving this kind of problem or implementing this particular solution?

Thanks!

spieden
  • Is there any reason you aren't using replication? – Thomas May 10 '11 at 00:45
  • @Thomas: my guess is he probably needs it to be highly asynchronous, e.g. mobile phones connecting to a main db server. – Denis de Bernardy May 11 '11 at 16:07
  • Hmm, how would replication be related? I did some more research, and what we're looking for is similar to what Oracle Workspace Manager does. Basically we're trying to work around seemingly weak optimistic concurrency control in MySQL. – spieden May 11 '11 at 23:27
  • @dacc - The tools and devices you are using are important. For example, Microsoft has something called the Sync Framework for synchronizing two sources, including mobile devices with a database. This is effectively a form of replication between two sources. If the two sources are databases, you may be able to use the built-in replication features of each database to replicate between the two. Replication would only send changes to the target database instead of the entire set. – Thomas May 12 '11 at 02:08
  • The remote datasets are not stored in RDBMS, and the initials and deltas come in as XML. I guess we are implementing a kind of replication, now that you mention it. – spieden May 17 '11 at 19:10

2 Answers


Have one writer database and several reader databases. You send writes to the one database and have it propagate the exact same changes to all the other databases. The reader databases will be eventually consistent, and the time to update is very fast. I have seen this done in environments that get upwards of 1M page views per day, and it is very scalable. You can even put a hardware router in front of all the read databases to load-balance them.
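For example, on MySQL a read replica can be pointed at the writer with something along these lines (a rough sketch of classic binlog replication; the host, credentials, and log coordinates are placeholders):

    -- Run on each read replica; all values here are placeholders.
    CHANGE MASTER TO
        MASTER_HOST     = 'writer.example.com',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS  = 4;
    START SLAVE;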

Charles Lambert

Thanks to those who tried.

For anyone else who ends up here: I'm benchmarking a solution that adds "dataset_version_id" and "dataset_version_verb" columns to each table in question. A correlated subquery inside a stored procedure is then used to retrieve the current dataset_version_id when fetching specific records. If the latest version of a record has a dataset_version_verb of "delete", it's filtered out of the results by a WHERE clause.
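As a sketch, the read side looks roughly like this (assuming a hypothetical widget table keyed on widget_id and carrying the two columns above; all names are illustrative):

    DELIMITER //
    CREATE PROCEDURE get_widgets_as_of (IN v INT)
    BEGIN
        SELECT w.*
          FROM widget w
         WHERE w.dataset_version_id = (
                   -- correlated subquery: the record's latest version
                   -- at or below the requested dataset version
                   SELECT MAX(w2.dataset_version_id)
                     FROM widget w2
                    WHERE w2.widget_id = w.widget_id
                      AND w2.dataset_version_id <= v)
           AND w.dataset_version_verb <> 'delete';  -- filter out tombstones
    END //
    DELIMITER ;

CALL get_widgets_as_of(42) then returns the dataset as it stood at version 42.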

So far this approach carries an average performance hit of roughly 80%, which may be acceptable for our purposes.

spieden