0

I'm working on a project where users could upload excel files into a MySQL database. Those files are the main source of our data as they come directly from the contractors working with the company. They contain a large number of rows (23000 on average for each file) and 100 columns for each row!

The problem I am facing currently is that the same file could be changed by someone (either the contractor or the company) and when re-uploading it, my system should detect changes, update the actual data, and save the action (The fact that the cell went from a value to another value :: oldValue -> newValue) so we can go back and run a versions comparison (e.g 3 re-uploads === 3 versions). (oldValue Version1 VS newValue Version5)

I developed a tiny mechanism for saving the changes => I have a table to save Imports data (each time a user import a file a new row will be inserted in this table) and another table for saving the actual changes

Versioning data

I save the id of the row that have some changes, as well as the id and the table where the actual data was modified (Uploading a file results in a insertion in multiple tables, so whenever a change occurs, I need to know in which table that happened). I also save the new value and the old value which is gonna help me with restoring the "archives data".

  • To restore a version : SELECT * FROM 'Archive' WHERE idImport = ${versionNumber}
  • To restore a version for one row : SELECT * FROM 'Archive' WHERE idImport = ${versionNumber} and rowId = ${rowId}
  • To restore all version for one row : SELECT * FROM 'Archive' WHERE rowId = ${rowId}
  • To restore version for one table : SELECT * FROM 'Archine' WHERE tableName = ${table}
  • Etc.

Now with this structure, I'm struggling to restore a version or to run a comparaison between two versions, which makes think that I've came up with a wrong approach since it makes it hard to do the job! I am trying to know if anyone had done this before or what a good approach would look like?

Cases when things get really messy :

  • The rows that have changed in a version might not have changed in the other version (I am working on a time machine to search in other versions when this happens)
  • The rows have changed in both versions but not the same fields. (Say we have a user table, the data of the user with id 15 have changed in 2nd and 5th upload, great! Now for the second version only the name was changed, but for the fifth version his address was changed! When comparing these two versions, we will run into a problem constrcuting our data array. name went from "some"-> NULL (Name was never null. No name changes in 5th version) and address went from NULL -> "some' is which obviously wrong).

My actual approach (php)

<?php
//Join records sets and Compare them
foreach ($firstRecord as $frecord) {

  //Retrieve first record fields that have changed
  $fFields = $frecord->fieldName;
  
  //Check if the same record have changed in the second version as well
  $sId = array_search($frecord->idRecord, $secondRecord);
  if($sId) {
      $srecord = $secondRecord[$sId];
      //Retrieve straversee fields that have changed
      $sFields = $srecord->fieldName;
      //Compare the two records fields
      foreach ($fFields as $fField) {
          $sfId = array_search($fField, $sFields);
          //The same field for the same record was changed in both version (perfect case)
          if($sfId) {
              $sField = $sFields[$sfId];
              $deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
              $deltaRow[$fField]["newValue"] = $srecord->deltaValue;
              //Delete the checked field from the second version traversee to avoid re-checking
              unset($sField[$sfId]);
          }
          //The changed field in V1 was not found in V2 -> Lookup for a value
          else {
              $deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
              $deltaRow[$fField]["newValue"] = $this->valueLookUp();
          }
      }
      $dataArray[] = $deltaRow;
      //Delete the checked record from the second version set to avoid re-checking
      unset($secondRecord[$srecord]);
  }

I don't know how to deal with that, as I said I m working on a value lookup algorithm so when no data found in a version I will try to find it in the versions between theses two so I can construct my data array. I would be very happy if anyone could give some hints, ideas, improvements so I can go futher with that.

Thank you!

Aybrl
  • 1
  • 1
  • If you allow me to comment on this, I'd say the job is not trivial. Versioning and change detection and then it often goes hand in hand with conflict resolution. What I like with your concept is that you're trying to keep it simple and what you could consider (if possible) is to define a direction of change. E.g. only a newer version can change an older one, but the older one can not change another older one (e.g. version 3 can change version 2 but not version 1. So that version 1 can not change version 3 and version 2 can not change version 3 either). – hakre Jul 16 '21 at 09:46
  • Direction of change, that's interesting. If I get it, versions won't be incremented automatically with each upload, instead, first I check which version is on the play then I decide whether this version is allowed to change the other version or not? In that way, I will have a direction of changes, I think! Thank you for your help I really appreciate it!! @hakre – Aybrl Jul 16 '21 at 10:02
  • Yes, that is the part about "it goes hand-in-hand with conflict resolution": You try to keep those conflicts out of the data first of all so you actually know what a (new) version is (technically a true conflict in the data can not be resolved by a computer program alone). E.g. a user when uploading a file is effectively changing the last version. Sometimes it is already enough to visualize changes with the upload and let the user approve these changes so that it is clear which effect the upload will have on the data or more correctly on the version of the data. – hakre Jul 16 '21 at 10:07
  • It's a feature in MariaDB. Switch to that fork of MySQL. – Rick James Jul 16 '21 at 20:07

1 Answers1

0

Is there a way to store database modifications with a versioning feature (for eventual versions comparaison [sic!])?

What constitutes versioning depends on the database itself and how you make use of it.

As far as a relational database is concerned (e.g. MariaDB), this boils down to the so called Normal Form which is in numbers.

On Database Normalization: 5th Normal Form and Beyond you can find the following guidance:

Beyond 5th normal form you enter the heady realms of domain key normal form, a kind of theoretical ideal. Its practical use to a database designer os [sic!] similar to that of infinity to a bookkeeper - i.e. it exists in theory but is not going to be used in practice. Even the most demanding owner is not going to expect that of the bookkeeper!

One strategy to step into these realms is to reach the 5th normal form first (do this just in theory, by going through all the normal forms, and study database normalization).

Additionally you can construe versioning outside and additional to the database itself, e.g. by creating your own versioning system. Reading about what you can do with normalization will help you to find better ways to decide on how to structure and handle the database data for your versioning needs.

However, as written it depends on what you want and need. So no straight forward "code" answer can be given to such a general question.

hakre
  • 193,403
  • 52
  • 435
  • 836