
I want to design a primary key for my table with row versioning. My table contains two main fields, ID and Timestamp, plus a bunch of other fields. For a unique ID, I want to store previous versions of a record, so I am making the primary key a combination of the ID and Timestamp fields. To see all the versions of a particular ID, I can use:

Select * from table_name where ID=<ID_value>

To return the most recent version of an ID, I can use:

Select * from table_name where ID=<ID_value> ORDER BY timestamp desc

and take the first row. My question: will this query be efficient and run in O(1), instead of scanning the entire table for all entries matching the same ID, given that the ID field is part of the primary key? Ideally, to get a result in O(1), I should have provided the entire primary key. If it does need to do a full table scan, how else can I design my primary key so that this request runs in O(1)?
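A minimal sketch of the schema described above (table and column names are assumed; LIMIT is MySQL/PostgreSQL syntax, SQL Server would use TOP 1):

```sql
create table record_versions (
  id  int       not null,
  ts  timestamp not null,
  -- ...other fields...
  primary key (id, ts)
);

-- Most recent version of one ID: with (id, ts) as the primary key,
-- this is an index seek on the leading column plus a backward scan
-- of that ID's entries, not a full table scan.
select *
from record_versions
where id = 42
order by ts desc
limit 1;
```

Strictly speaking, a B-tree lookup is O(log n) rather than O(1), but as long as ID is the leading column of the index, the query never degrades to scanning the whole table.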

Thanks, Sriram

Gordon Linoff
sriram
  • I would put a clustered index on another unique/auto-incrementing key, with a non-clustered index on ID and timestamp. If you have any relationships outside of this table then this PK would be a pain as a FK (with the timestamp idea); however, if you are only interested in raw speed, a clustered index on ID and timestamp would yield the fastest result. The composite key gain would not be noticeable unless you are partitioning. – Ross Bush Jul 21 '15 at 00:39
  • First, you should decide what database you are using: MySQL or SQL Server. I am removing both tags and replacing them with "SQL", but you should still provide the right tag. – Gordon Linoff Jul 21 '15 at 00:41
  • Sorry, I meant the gain of a clustered PK index of ID and timestamp would be noticeable if you are partitioning or have a very, very large dataset. – Ross Bush Jul 21 '15 at 00:50
  • Thanks Gordon and lrb for your suggestion. – sriram Jul 26 '15 at 23:56
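Ross Bush's suggestion can be sketched as follows (SQL Server syntax; table and column names are assumptions):

```sql
-- Surrogate auto-incrementing clustered key; the (ID, time) lookups are
-- served by a separate nonclustered index instead of the clustered PK,
-- which keeps the key small for any foreign keys pointing here.
create table RecordVersions (
  RowId      int identity(1,1) not null primary key clustered,
  ID         int not null,
  RecordTime datetime2 not null
  -- ...other fields...
);

create nonclustered index IX_RecordVersions_ID_Time
  on RecordVersions (ID, RecordTime desc);
```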

2 Answers


The canonical reference on this subject is Effective Timestamping in Databases: https://www.cs.arizona.edu/~rts/pubs/VLDBJ99.pdf

I usually design with a subset of this paper's recommendations: a table containing only the primary key, plus a referencing table that carries that key as well as change_user, valid_from, and valid_until columns with appropriate defaults. This makes referential integrity easy, as well as future-value insertion and history retention. Index as appropriate, and consider check constraints or triggers to prevent overlaps and gaps if you expose these fields to the application for direct modification. These have an obvious performance overhead.
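A sketch of that two-table layout (table and column names are assumptions; the 'infinity' default for an open-ended valid_until is PostgreSQL syntax, other engines typically use a far-future date):

```sql
-- Key-only anchor table: foreign keys elsewhere reference this,
-- not the versioned table, so referential integrity stays simple.
create table item (
  item_id int primary key
);

-- One row per version, with its validity interval.
create table item_version (
  item_id     int not null references item (item_id),
  change_user text not null,
  valid_from  timestamp not null default now(),
  valid_until timestamp not null default 'infinity',
  -- ...the actual attributes of the record...
  primary key (item_id, valid_from)
);
```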

We then make a "current values view" which is exposed to developers, and is also insertable via an "instead of" trigger.
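Continuing the sketch above, such a view only exposes rows whose validity interval covers the current instant (names again assumed):

```sql
-- Developers query this as if it were the plain table; historical and
-- future-dated rows are filtered out.
create view item_current as
select item_id, change_user, valid_from, valid_until
from item_version
where now() >= valid_from
  and now() <  valid_until;
```

An "instead of" insert trigger on the view can then close the previous version's valid_until and insert the new row in a single statement.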

rmalayter
  • Thanks rmalayter for responding with the paper link. I am still going through the paper to understand why we require a separate table to track only the change_user, valid_from, valid_until columns. I will respond back with questions after I go through it. – sriram Jul 27 '15 at 22:38
  • The complexity is required if you want to track and enable *all* the temporal possibilities. That is, who made a transaction and exactly when, along with when you want that data to be visible or "valid", even if that is sometime in the *future*. If you don't need all of those features your schema and timestamp handling code can be significantly simpler. – rmalayter Jul 28 '15 at 12:48
  • The separate "key only" table is required because no popular SQL database supports referential integrity based on one piece of a multi-column primary key of the form (ID,valid_from,valid_until). Unless you want to rely on application code or triggers to maintain referential integrity, a practical implementation of the paper's concepts needs that "key only" table and a lot of joins. – rmalayter Jul 28 '15 at 13:00
  • Thanks a lot for explaining rmalayter – sriram Jul 29 '15 at 17:11

It's far easier and better to use the History Table pattern for this.

create table foo (
  foo_id int primary key,
  name text
);

create table foo_history (
  foo_id int,
  version int,
  name text,
  operation char(1) check ( operation in ('u','d') ),
  modified_at timestamp,
  modified_by text,
  primary key (foo_id, version)
);

Create a trigger to copy a foo row to foo_history on update or delete.
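A sketch of such a trigger in PL/pgSQL (the function and trigger names are assumptions; versions are numbered per foo_id):

```sql
create or replace function foo_archive() returns trigger as $$
begin
  -- Copy the pre-change row into the history table with the next
  -- version number for this foo_id.
  insert into foo_history
    (foo_id, version, name, operation, modified_at, modified_by)
  select old.foo_id,
         coalesce(max(h.version), 0) + 1,
         old.name,
         case tg_op when 'UPDATE' then 'u' else 'd' end,
         now(),
         current_user
  from foo_history h
  where h.foo_id = old.foo_id;
  return coalesce(new, old);
end;
$$ language plpgsql;

create trigger foo_audit
  after update or delete on foo
  for each row execute procedure foo_archive();
```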

See https://wiki.postgresql.org/wiki/Audit_trigger_91plus for a full example with Postgres.

Neil McGuigan
  • Thanks Neil for your suggestion. Definitely helpful. – sriram Jul 27 '15 at 04:29
  • The history table is a common pattern, and the history table and triggers can easily be generated from the "base tables" via code. But this pattern doesn't allow for "future" row values to be inserted or updated into a table so they become "effective" at a specific date and time. Not as many apps have this requirement, but there are lots of systems that do billing, subscriptions, etc. where data needs to be recorded now and become visible later. – rmalayter Jul 31 '15 at 18:23