Database structure for storing historical data

Question

Preface: I was thinking the other day about a new database structure for a new application and realized that we needed a way to store historical data in an efficient way. I was wanting someone else to take a look and see if there are any problems with this structure. I realize that this method of storing data may very well have been invented before (I am almost certain it has) but I have no idea if it has a name and some google searches that I tried didn't yield anything.

Problem: Lets say you have a table for orders, and orders are related to a customer table for the customer that placed the order. In a normal database structure you might expect something like this:

orders
------
orderID
customerID


customers
---------
customerID
address
address2
city
state
zip

Pretty straightforward, orderID has a foreign key of customerID which is the primary key of the customer table. But if we were to go and run a report over the order table, we are going to join the customers table to the orders table, which will bring back the current record for that customer ID. What if when the order was placed, the customers address was different and it has been subsequently changed. Now our order no longer reflects the history of that customers address, at the time the order was placed. Basically, by changing the customer record, we just changed all history for that customer.

Now there are several ways around this, one of which would be to copy the record when an order was created. What I have come up with though is what I think would be an easier way to do this that is perhaps a little more elegant, and has the added bonus of logging anytime a change is made.

What if I did a structure like this instead:

orders
------
orderID
customerID
customerHistoryID


customers
---------
customerID
customerHistoryID


customerHistory
--------
customerHistoryID
customerID
address
address2
city
state
zip
updatedBy
updatedOn

please forgive the formatting, but I think you can see the idea. Basically, the idea is that anytime a customer is changed, insert or update, the customerHistoryID is incremented and the customers table is updated with the latest customerHistoryID. The order table now not only points to the customerID (which allows you to see all revisions of the customer record), but also to the customerHistoryID, which points to a specific revision of the record. Now the order reflects the state of data at the time the order was created.

By adding an updatedby and updatedon column to the customerHistory table, you can also see an "audit log" of the data, so you could see who made the changes and when.

One potential downside could be deletes, but I am not really worried about that for this need as nothing should ever be deleted. But even still, the same effect could be achieved by using an activeFlag or something like it depending on the domain of the data.

My thought is that all tables would use this structure. Anytime historical data is being retrieved, it would be joined against the history table using the customerHistoryID to show the state of data for that particular order.

Retrieving a list of customers is easy, it just takes a join to the customer table on the customerHistoryID.

Can anyone see any problems with this approach, either from a design standpoint, or performance reasons why this is bad. Remember, no matter what I do I need to make sure that the historical data is preserved so that subsequent updates to records do not change history. Is there a better way? Is this a known idea that has a name, or any documentation on it?

Thanks for any help.

Update: This is a very simple example of what I am really going to have. My real application will have "orders" with several foreign keys to other tables. Origin/destination location information, customer information, facility information, user information, etc. It has been suggested a couple of times that I could copy the information into the order record at that point, and I have seen it done this way many times, but this would result in a record with hundreds of columns, which really isn't feasible in this case.

So basically what you are saying is this: "I have too many columns in the order table. Therefore I'd like to put the order address in the customer table. To support this, I'd like to compromise the customer data with a complex history tracking scheme." Sounds like a bad idea to me. — Jeffrey L Whitledge, Aug 20 '10 at 18:46
No... not at all. What I am saying is that I need to be able to keep track of addresses, when they change, and be able to tie an order to a specific state (revision) of an address. Orders might not be the only table that ties to an address, not to mention we want to know when and who changed an address. — Ryan Guill, Aug 20 '10 at 19:02
BTW never assume nothing will ever be deleted. Plan for deletes which will ineveitably happen or create a trigger which will not allow deletes. — HLGEM, Aug 20 '10 at 20:46
I have a similar solution but Instead of using CustomerHistoryId I am using a version column in all look up tables called "ver". So In all lookup tables there is an id and a ver column which make a composite key. As mentioned above nothing is ever deleted but an isActive flag can be set to false . Two columns in the master table for each look up table is much better than duplicating all of the data. — Thomas Fonseca, Dec 07 '15 at 23:38

Conrad Frix · Accepted Answer · 2010-08-20T19:31:41.910

10

When I've encountered such problems one alternative is to make the order the history table. Its functions the same but its a little easier to follow

orders
------
orderID
customerID
address
City
state
zip



customers
---------
customerID
address
City
state
zip

EDIT: if the number of columns gets to high for your liking you can separate it out however you like.

If you do go with the other option and using history tables you should consider using bitemporal data since you may have to deal with the possibility that historical data needs to be corrected. For example Customer Changed his current address From A to B but you also have to correct address on an existing order that is currently be fulfilled.

Also if you are using MS SQL Server you might want to consider using indexed views. That will allow you to trade a small incremental insert/update perf decrease for a large select perf increase. If you're not using MS SQL server you can replicate this using triggers and tables.

edited Aug 20 '10 at 19:31

answered Aug 20 '10 at 17:54

Conrad Frix

51,984
12
96
155

1

yeah, I have seen it done this way before as well. But this is a very paired down example, in the real application I am thinking about this for, an "order" will have a bunch of foreign keys with a lot of data in the other tables. In the end I would have an "order" record with hundreds of columns. – Ryan Guill Aug 20 '10 at 18:07
7

This is the correct approach, because the address is a function of the order, not the customer, at the time the order is placed. If you want to simplify the order table, I suggest a foreign key to an address table. In fact, customers and orders can both store their addresses in the same address table without difficulty. This will also make it easy to include separate shipping and billing addresses, etc. – Jeffrey L Whitledge Aug 20 '10 at 18:25
1

@Jeffrey L Whitledge is making a critically important point here, the addres, the customer name, the price etc are now a function of the order not the customer or price tables, that is why there is no other good solution except to put them in order-related tables. – HLGEM Aug 20 '10 at 20:48

score 6 · Answer 2 · answered Aug 20 '10 at 18:35

6

When you are designing your data structures, be very carful to store the correct relationships, not something that is similar to the correct relationships. If the address for an order needs to be maintained, then that is because the address is part of the order, not the customer. Also, unit prices are part of the order, not the product, etc.

Try an arrangement like this:

Customer
--------
CustomerId (PK)
Name
AddressId (FK)
PhoneNumber
Email

Order
-----
OrderId (PK)
CustomerId (FK)
ShippingAddressId (FK)
BillingAddressId (FK)
TotalAmount

Address
-------
AddressId (PK)
AddressLine1
AddressLine2
City
Region
Country
PostalCode

OrderLineItem
-------------
OrderId (PK) (FK)
OrderItemSequence (PK)
ProductId (FK)
UnitPrice
Quantity

Product
-------
ProductId (PK)
Price

etc.

If you truly need to store history for something, like tracking changes to an order over time, then you should do that with a log or audit table, not with your transaction tables.

answered Aug 20 '10 at 18:35

Jeffrey L Whitledge

58,241
9
71
99

the tables I used were just for illustration. We are actually going to do exactly what you describe, separating the address from the customer (we are calling it location). My question is just about the scheme of storing the data only. – Ryan Guill Aug 20 '10 at 18:41
1

This answer does not help with the original question. He wants help with point in time, this solution continues to use an Address table and if an address is updated it will update any order as well as you have a foreign key relationship. – maguy Jul 17 '13 at 20:33
1

@maguy - It is not clear from my answer, but my intention was that the address data would never be updated. Instead, if a customer address changes, then a new address would be inserted, and the customer would receive a new address ID, leaving the existing order address unchanged. Similarly, if the address on the order needs to change. – Jeffrey L Whitledge Jul 22 '13 at 15:34

HLGEM · Answer 3 · 2010-08-20T20:45:01.257

Normally orders simply store the information as it is at the time of the order. This is especially true of things like part numbers, part names and prices as well as customer address and name. Then you don;t have to join to 5 or six tables to get teh information that can be stored in one. This is not denormalization as you actually need to have the innformation as it existed at the time of the order. I think is is less likely that having this information in the order and order detail (stores the individual items ordered) tables is less risky in terms of accidental change to the data as well.

Your order table would not have hundreds of columns. You would have an order table and an order detail table due to one to many relationships. Order table would include order no. customer id 9so you can search for everything this customer has ever ordered even if the name changed), customer name, customer address (note you don't need city state zip etc, put the address in one field), order date and possibly a few other fields that relate directly to the order at a top level. Then you have an order detail table that has order number, detail_id, part number, part description (this can be a consolidation of a bunch of fields like size, color etc. or you can separate out the most common), No of items, unit type, price per unit, taxes, total price, ship date, status. You put one entry in for each item ordered.

I see what you are saying, but like I said on Conrad Frix's answer, in the end my "order" table would have hundreds of columns, which really isn't feasible. I probably should add this to the question. — Ryan Guill, Aug 20 '10 at 18:09

score 2 · Answer 4 · answered Aug 20 '10 at 20:47

If you are genuinely interested in such problems, I can only suggest you take a serious look at "Temporal Data and the Relational Model".

Warning1 : there is no SQL in there and almost anything you think you know about the relational model will be claimed a falsehood. With good reason.

Warning2 : you are expected to think, and think hard.

Warning3 : the book is about what the solution for this particular family of problems ought to look like, but as the introduction says, it is not about any technology available today.

That said, the book is genuine enlightenment. At the very least, it helps to make it clear that the solution for such problems will not be found in SQl as it stands today, or in ORMs as those stand today, for that matter.

score 0 · Answer 5 · edited Oct 01 '21 at 21:29

0

I myself like to keep it simple. I would use two tables: a customer table and a customer history table. If you have the key (e.g. CustomerID) in the history table there is no reason to make a joining table, a select on that key will give you all records.

You also don't have audit information (e.g. date modified, who modified etc) in the history table as you show it, I expect you want this.

So mine would look something like this:

CustomerTable (this contains current customer information)
CustomerID (distinct non null)
...all customer information fields
    
CustomerHistoryTable
CustomerID (not distinct non null)
...all customer information fields
DateOfChange 
WhoChanged

The DateOfChange field is the date the customer table was changed (from the values in this record) to the values in a more recent record of the values in the CustomerTable.

You orders table just needs a CustomerID if you need to find the customer information at the time of the order it is a simple select.

edited Oct 01 '21 at 21:29

TurboLion

75
1
7

answered Aug 20 '10 at 17:54

Hogan

69,564
10
76
117

I don't know that this has any advantages over the OP's suggested schema - and in some ways this is a little more awkward (if you want to find historical customer info., as you probably will, this is that little bit more complex) - that said, I've also used this approach previously, and have seen it used by several other coders. – Will A Aug 20 '10 at 17:59
You are right, I would want date modified, who modified. (I mentioned it in the writing, but not in the table, I will edit to make that more clear). I actually designed this exactly this way in the beginning, but realized it is really an uncessary duplication of data. With a simple join (which should be very fast with appropriate indexes) I can do the same thing and save having to write all of that information twice each time. But I had the same thought at first. – Ryan Guill Aug 20 '10 at 18:11
What would Order reference? The customer ID? If so does changing the customer address automatically affect order info? – Conrad Frix Aug 20 '10 at 18:12
yes, order would reference customerID. And keeping an update to the customer from affecting the order information is exactly what I am after. I do not want a change to the customer's address, after the order is completed to affect the information about the order. – Ryan Guill Aug 20 '10 at 18:18
@Will : I was making an assumption -- current data is accessed more often than historic data. If this is the case (which is true for most cases) then this design is better -- for the obvious reasons. If you access historical data more often of as often as current than the linked table is better. – Hogan Aug 21 '10 at 13:40
@Ryan : This way should not contain duplicate data -- you don't keep the most recent version -- that is reserved for the table you are keeping a history of. You can make this more efficient by only storing the field(s) that change -- but that requires more coding -- it is more elegant if you have the time/inclination. This is the quick and easy solution and many DBAs just use a trigger to do it -- I hate triggers and don't recommend that. – Hogan Aug 21 '10 at 13:43
@Conrad : As I said in a not understandable way in the comment, it just stores the customerID, but if you need to get the information to match the data at the time of the order then you need to change your select statement and joins to account for the time the order was placed (I assume the order table will have a time placed field.) I would be not expect most systems to want this since it is an advantage to have the most recent contact information (for example you can get a phone number or email address that actually connects to someone.) – Hogan Aug 21 '10 at 13:45

Ben · Answer 6 · 2010-08-20T18:18:17.293

0

What you want is called a datawarehouse. Since datawarehouses are OLAP and not OLTP, it is recommended to have as many columns as you need in order to achieve your goals. In your case the orders table in the datawarehouse will have 11 fields as having a 'snapshot' of orders as they come, regardless of users accounts updates.

Wiley -The Data Warehouse Toolkit, Second Edition

It's a good start.

edited Aug 20 '10 at 18:18

answered Aug 20 '10 at 18:12

Ben

16,275
9
45
63

Datawarehouses/Datamarts are not necessarily OLAP. Datawarehouses and datamarts could be the source of OLAP Cubes but you could have a datawarehouse and no OLAP. – jasonco Aug 20 '10 at 18:49
@jasonco Datawarehouses feeds are generally OLTP, that's right but that's it. However, Datawarehouses do not calculate and in order to do that they need to forget about normalization hence having enormous tables as well as response times, that is OK since they're OLAP and not OLTP. @OP needs a Datawarehouse, according to his description of the problem, not just another DB. – Ben Aug 20 '10 at 19:34

score 0 · Answer 7 · answered Aug 20 '10 at 18:40

Our payroll system uses effective dates in many tables. The ADDRESSES table is keyed on EMPLID and EFFDT. This allows us to track every time an employee's address changes. You could use the same logic to track historical addresses for customers. Your queries would simply need to include a clause that compares the order date to the customer address date that was in effect at the time of the order. For example

select o.orderID, c.customerID, c.address, c.city, c.state, c.zip
from orders o, customers c
where c.customerID = o.customerID
and c.effdt = (
   select max(c1.effdt) from customers c1
   where c1.customerID = c.customerID and c1.effdt <= o.orderdt
)

The objective is to select the most recent row in customers having an effective date that is on or before the date of the order. This same strategy could be used to keep historical information on product prices.

Database structure for storing historical data

7 Answers7

Linked