
We have a rather antiquated database that contains a large number of individuals along with a number of achievements that they have completed. Historically, little has been done to discourage duplicated individual data, so we've ended up in a situation where our data is quite dirty. A massively simplified version of this can be found below.

We're now redesigning the schema and user interface. We will be providing the user with a tool to merge their individuals together. In the provided example, Dave and David are clearly the same person, who has achieved 4 achievements in total.

Given that users make mistakes and that there are a lot more tables involved than in the example, I'm looking for a schema design that facilitates easy merging of data, and notably, un-merging of data if (when!) the user inevitably makes a mistake.

Linked lists of some sort appear to be a solution but aren't exactly efficient for this use case. Are there any other concepts that might lend themselves to this situation? Any specific design patterns that might be appropriate?

Edit: As SQLFiddle is being rather flaky today, here's the create/insert/select SQL that was on SQLFiddle:

CREATE TABLE individual
    (`individual_id` int, `forename` varchar(50), `surname` varchar(50))
;

CREATE TABLE achievement
    (`achievement_id` int, `name` varchar(50), `description` varchar(50))
;

CREATE TABLE individual_achievement
    (`individual_id` int,`achievement_id` int)
;

INSERT INTO individual
    (`individual_id`, `forename`, `surname`)
VALUES
    (1, 'Dave', 'Deane'),
    (2, 'David', 'Deane')
;

INSERT INTO achievement
    (`achievement_id`, `name`, `description`)
VALUES
    (1, 'unit_1', 'Unit 1'),
    (2, 'unit_2', 'Unit 2'),
    (3, 'unit_3', 'Unit 3'),
    (4, 'unit_4', 'Unit 4')
;

INSERT INTO individual_achievement
    (`individual_id`,`achievement_id`)
VALUES
    (1, 1),
    (1, 3),
    (2, 2),
    (2, 4)
;

SELECT * FROM individual i
JOIN individual_achievement ai USING (individual_id)
JOIN achievement a USING (achievement_id)
;

Edit 2: Just found this very similar question; hoping that in 4 years there might be other solutions too.

Rob Forrest
  • Couldn't see the SQLFiddle, but why aren't you merging achievements automatically rather than letting users do it? – plalx Aug 07 '15 at 13:00
  • Reasonable question @plalx. We don't have enough information available to us to be able to reliably detect duplicates. There are a lot of cases where we just have forename and surname to go on, which is definitely not enough. In many cases we do have date of birth, but even then there's a possibility of that being wrong. Our plan is to highlight to the user those that we think might be duplicates, but only the user will know for certain if their individuals are duplicates or not. – Rob Forrest Aug 07 '15 at 13:13

1 Answer


Here’s one tactic you could use.

First, create a new table, for now call it “Individual_v2”, with the exact same columns as original table Individual. (Ideally, you will eventually replace Individual with this table; realistically, people might still enter data into Individual, and you’ll have to “clean” the data by moving or merging it into Individual_v2.) Configure this table with links to Achievement. (For now, I’m assuming Achievement is clean.)
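In terms of the question's schema, that could start as a straight structural copy (a sketch; `individual_v2` is just a working name):

```sql
CREATE TABLE individual_v2
    (`individual_id` int, `forename` varchar(50), `surname` varchar(50))
;
```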

Then, create a “Mapping” table like so:

IndividualMapping

OldIndividual_Id
NewIndividual_Id
CreatedAt
CreatedBy
ApprovedAt  --  Nullable!
ApprovedBy  --  Nullable!

The “Created” columns are used to determine when and by whom (or what) the mapping was created.

The “Approved” columns are used to determine if the data has been migrated to the new tables.
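As MySQL DDL that might come out roughly as follows (snake_cased to match the question's schema; the column types are assumptions):

```sql
CREATE TABLE individual_mapping
    (`old_individual_id` int,
     `new_individual_id` int,
     `created_at`  datetime,
     `created_by`  varchar(50),
     `approved_at` datetime,     -- nullable: NULL = not yet reviewed/migrated
     `approved_by` varchar(50)   -- nullable
    )
;
```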

For every “old” item, you determine where it might map in the “new” table; if it maps to no existing item, create one for it in the new table.

Then, add an entry in the mapping table. If a new item was created, mark it as approved; if confidence is high, mark it as approved; otherwise, leave it “unapproved” and waiting for review. In due time a reviewer will look things over and approve the mapping, change the mapping to a different existing new item, or create another new item and map to it.
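With the sample data, merging Dave (1) and David (2) might be recorded like this (a sketch only: the `individual_v2` and `individual_mapping` names, the new id, and the `dedupe_job` actor are assumptions for illustration):

```sql
-- Create the single "clean" individual in the v2 table.
INSERT INTO individual_v2 (`individual_id`, `forename`, `surname`)
VALUES (10, 'David', 'Deane');

-- Record both mappings. The first is marked approved (high confidence);
-- the second is left unapproved (NULLs) and waits for a reviewer.
INSERT INTO individual_mapping
    (`old_individual_id`, `new_individual_id`, `created_at`, `created_by`,
     `approved_at`, `approved_by`)
VALUES
    (1, 10, NOW(), 'dedupe_job', NOW(), 'dedupe_job'),
    (2, 10, NOW(), 'dedupe_job', NULL,  NULL)
;
```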

Once completed, the “real” work is done against the new table. The old table and the mapping table can be used to identify where the new data came from and, if necessary, to undo/change mappings.
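Because the old rows are never modified, un-merging is then mostly a matter of re-pointing the mapping (again a sketch; table and column names are the assumed ones from above):

```sql
-- David (old id 2) was wrongly merged: rebuild his own v2 row
-- from the untouched v1 data...
INSERT INTO individual_v2 (`individual_id`, `forename`, `surname`)
SELECT 20, `forename`, `surname`
FROM individual
WHERE `individual_id` = 2;

-- ...and re-point his mapping for review. His achievement links in the
-- v2 relation table would need the same treatment.
UPDATE individual_mapping
SET `new_individual_id` = 20, `approved_at` = NULL, `approved_by` = NULL
WHERE `old_individual_id` = 2;
```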

There are a lot of unanswered implementation and support issues here, and overall it seems awkward as heck. Long-term, once you’ve resolved the problem of duplicate data you can drop the old (and the mapping) table, but until then you’ll have a fussy system.


addenda

I’m kind of talking through things here, without going into exhaustive analysis. The system I think you’re describing is going to be fussy and conceptually complex to work with, even if the tables are relatively simple, and the final details are outside the scope of an SO question. Also, a lot depends on what the overall goals and objectives of the system and its redesign are. I’m going to make some assumptions here:

  • The “existing” system will stay in place

  • Individuals (and their awards) so entered must be made immediately available, as they always have been.

  • Duplicates will continue to be entered; if, when, and as feasible, they are to be “consolidated” with pre-existing entries

Done this way, the system would work something as follows:

  • There’s a separate relation table between the Individuals_v2 and Achievement (Individual_Achievement_v2 for now, though there must be a better name).

  • The data in the “v2” tables is correct, good, and right. The “v1” tables are staging, history, log data.

  • Prepare an initial release, where all entries in the v1 tables are configured in the v2 tables. If rows can be consolidated during this step, so much the better. Everything gets logged in the “map” table, so that mappings can be cleanly rolled back and redone if necessary.

  • Going forward from this release, new data is entered in the v1 tables, and simultaneously/immediately entered in the v2 tables as well. If a mapping to an existing item can be made, do so, otherwise create a new entry in the v2 tables. Always log activity in the “map” table.

  • Going forward, all “Live” queries hit the v2 tables. The v1 tables are (again) history, log, audit trail. Once populated, they are never modified, while the v2 tables (including the mapping table) can and will be.

  • As determined by the business, periodic reviews/checks are made on the data, to look for and fix duplicate entries that have appeared over time, as well as “invalid duplicates” (incorrect mappings). This is when you do the rollback/redo work, as tracked in the mapping and v1 tables.
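Under those assumptions, the day-to-day query and the occasional lineage check might look like this (table names as sketched earlier in this answer; `individual_achievement_v2` and the ids are placeholders):

```sql
-- Live query: hits the v2 tables only.
SELECT i.`forename`, i.`surname`, a.`name`
FROM individual_v2 i
JOIN individual_achievement_v2 ia USING (individual_id)
JOIN achievement a USING (achievement_id);

-- Lineage/audit: which v1 rows fed a given v2 individual?
SELECT m.`old_individual_id`, v1.`forename`, v1.`surname`
FROM individual_mapping m
JOIN individual v1 ON v1.`individual_id` = m.`old_individual_id`
WHERE m.`new_individual_id` = 10;
```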

You’ll maybe want some extra logging tables, to track things like “all data entered up through xx/xx/xxxx is valid, data entered since then must be reviewed”. I’m sure there are other issues and subtleties that will crop up—they always do…

Philip Kelley
  • I see where you are going with that, I guess the biggest assumption here is that a user is never again going to create a duplicated individual. If they did then we'd have to start the process all over again with individual_v3. For the individual_achievement table, would you add a column for individual_v2_id or simply update the individual_id column? (And then create another mapping table for that too?) – Rob Forrest Aug 07 '15 at 14:20
  • Replied to this in the "addenda" section of the answer. – Philip Kelley Aug 07 '15 at 18:18
  • Thanks Philip, that's now a much more rounded solution. Pros: Can rollback a single individual merge rather than point in time (command pattern). Cons: For every table that needs to be merged, 2 additional tables are required. – Rob Forrest Aug 10 '15 at 10:08
  • Yeah, it's the many-to-many relationship that makes it tricky. Without that, it should be possible to cram versioning into a single table. – Philip Kelley Aug 10 '15 at 13:31