Temporal Data in a Relational Database
Lately, I’ve been spending a great deal of time in the DB world. My most recent challenge has been in dealing with time-based, or temporal, data. In several real-world scenarios, we find that timing matters, particularly when dealing with transaction-based systems, such as banking, payroll, or HR.
Most relational databases with referential integrity work really well with current state data. But a system may need to keep up with the current, past, and future states as well. For example, if I’m going to create a paycheck for an employee, I may need to know something about his future pay-rate as in “On such-and-such date, Joe Blow will receive a paycheck and be paid x dollars.” His current state may be some amount less or more than that particular pay-rate on that date. To further complicate the matter, I will also want to know what we knew when we paid him on a particular date in the past. We may even back-date a pay-raise, creating a future state with an effective date in the past.
Whoa… Where’s the Doctor with the TARDIS when you need him? [Inside Geek Joke]
These problems aren’t new. Many DB gurus far more able than I am have been digging around these issues for the past 30+ years and nobody has come up with a perfect solution. Some, such as Richard Snodgrass have written entire academic books on the problem (Morgan-Kaufmann, ISBN: 1558604367 (1999)). My favorite discussion, though (primarily because of its simplicity), comes from the always ingenious Martin Fowler: http://martinfowler.com/eaaDev/timeNarrative.html. I highly recommend reading this article for a solid understanding of temporal data patterns.
For our project, we had a DB in place that handled temporal data by never performing an actual UPDATE against a record. This way, a row of data in a table represents a single point in time which can be given a Created Date and an Effective Date. When you need to UPDATE, you simply INSERT a new record with some identifying data, such as an Employee Number for example, to tie everything together. Pretty smart.
But one problem in dealing with relational data in a temporal model this way is in maintaining referential integrity as the data changes. When the updated record is inserted, how can you effectively maintain the relationship between, say, an Employee record, and time-sensitive Compensation data? As you can see, the scenario gets pretty messy quickly, so we need a more elegant solution than INSERTS with effective dating.
One solution that was tried was to create a single “Base” table, with immutable records that contained an ID that was used to tie everything together.
The revised “Base Table” model looked similar to this:
Although this model may work for a while, referential integrity is eventually lost because all entities, regardless of their relationship to each other, were tied back to only the single record in the base table. This had the effect of flattening the data model and forcing the maintenance of referential integrity into the application layer. Over time new applications would come on board and developers that either didn’t understand the entity relationships (or didn’t care) ignore integrity rules.
The model we eventually decided on was inspired by what Martin Fowler referred to as a “Temporal Object” pattern. Basically, we constructed the entire model to maintain referential integrity between all entities through foreign key relationships. Temporal data, such as pay rate, was extracted into separate entities with a foreign key relationship created between them and the table from which they were extracted.
The model we ended up with was similar to this:
Naturally a major concern was that adding these extra tables would complicate JOINs in our queries which, of course, it did. However, most queries added only a single extra JOIN, but there was definitely the potential for many more, depending on the specific query needs. Given the need for RI + temporality, the trade-off was accepted.
With the exception of auditing, most of our needs required only data from a single point in time. We managed this by using an Effective Date field in our temporal tables and passing in a date as a parameter for all our queries. We could retrieve a single employee record for a given point in time like this:
FROM Employee e
JOIN (SELECT * FROM EmployeeData WHERE EffectiveDate =
(SELECT MAX(EffectiveDate) FROM EmployeeData WHERE
((EmployeeId = @EmployeeId) AND (EffectiveDate <= @EffectiveDate)))) d
ON e.EmployeeId = d.EmployeeId
e.EmployeeId = @EmployeeId
Note that rather than joining directly on the temporal data table, we join on a query that retrieves the MAX effective date starting at the date we pass in as a parameter. This would allow us to retrieve the most recent data as of EffectiveDate we passed in.
There are many other approaches to this problem, of course, but this approach provides the benefit of supporting both temporal record maintenance as well as referential integrity while being only slightly more complex.