Passion & Opportunity ? continue : break

How to track data changes and keep history data?
Written: 2019-01-14 22:27:40 Last update: 2019-05-20 09:15:25

Preface

These days most people involved or work in IT companies needs to manage some large data and sometimes have a very strict requirements to data operation, these days many projects have to fulfill these 2 requirements, the ability to track data changes and to keep history data, naturally due to these requirements (and possibly other security issues) we are not allow to use delete operation (in fact many projects that I have involved have no delete feature, for security and data integrity reason).

Creating a database is easy but implementing a proper architecture and mechanism to fulfill these requirements are a bit tricky. There could be so many tables in one database but the requirements for tracking data changes and keep history data normally only applied for a very important and frequent update such as customer detail (email, address, phone number, etc.) and product detail (price, description, etc.).

This article only describing 3 common ways that I have found and experienced, I want to share my thoughts and recommendation about how to track data changes and to keep history data and also to provide answers for 2 most commonly asked questions when data has changed and which data was/were changed.

For the sake of full and easy-to-follow demonstration, I use MySQL for database here but the same logic may be used for other kind of databases (Oracle DB, MS SQL server, etc.)

Solution 1: using 2 separate tables for Master table 'product' and History table 'product_history', this is the most common solution which I have found, to describe the working logic for this exercise we will only use a very simple tables schemas

CREATE TABLE `product` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `productNo` int(11) NOT NULL DEFAULT '0',
  `title` varchar(45) DEFAULT NULL,
  `price` float NOT NULL DEFAULT '0',
  `insertDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updateDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_productNo` (`productNo`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
  • 'productNo' is important for comparison and to avoid duplicate value, so we set it to have attribute UNIQUE KEY `unique_productNo`.
  • 'insertDT' and 'updateDT' are to answer when the data was first inserted and when data was last changed respectively, these 2 columns should not be set/updated by manually, the value should be set automatically by system.
  • 'updateDT' will use attribute datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, so 'updateDT' value will be automatically updated even only do update on other column(s). NOTE: to support this feature we must use at least MySQL version 5.6.5 (lower versions do not support this feature).
  • To answer which data changed, we need to create a History table schema like below
CREATE TABLE `product_history` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `productNo` int(11) NOT NULL DEFAULT '0' COMMENT 'Inventory number',
  `title` varchar(45) DEFAULT NULL,
  `price` float NOT NULL DEFAULT '0',
  `insertDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updateDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
  • 'productNo' is a no longer a UNIQUE KEY because in this History table we allow many records with the same values.
  • 'updateDT' does not use ON UPDATE CURRENT_TIMESTAMP anymore because this is History table so it will not be updated.

These 'product' and 'product_history' tables are only for logic exercise, in a real project, these tables could have more columns such as quantity, weight, images, discount attributes, shipping attributes, etc. Both tables must have the same exact count of columns and exact names, if Master table has 20 columns then History table will also has 20 columns. To support tracking data changes feature, so every update operation to previous data in 'product' will need to create a backup history record into 'product_history'.

The next step is to create an automation to copy record from 'product' to 'product_history', for this automation we create MySQL Trigger with attribute BEFORE INSERT to 'product' (Master) table, for this demo we name it as 'trigger_before_insert_product', this Trigger will be executed automatically before each 'insert' to 'product' table, if 'productNo' value is already existed then it will copy the record from 'product' to 'product_history', using Trigger is useful to avoid writing this logic by code and to avoid the need to execute separate query manually.

DELIMITER $$
DROP TRIGGER IF EXISTS `trigger_before_insert_product`;
$$
CREATE TRIGGER `trigger_before_insert_product`
BEFORE INSERT ON `product` FOR EACH ROW
BEGIN
  IF EXISTS(SELECT * FROM `product` WHERE productNo = NEW.productNo)
  THEN
    -- copy (insert) existing data from Master to History, without 'id' column (auto_increment)
    INSERT INTO `product_history` (productNo
      , title
      , price
      , insertDT
      , updateDT
      )
    SELECT productNo
      , title
      , price
      , insertDT
      , updateDT
    FROM `product`
    WHERE productNo = NEW.productNo;
  END IF;
END;

Next step is we create a function to insert record to 'product' table, a function in MySQL called 'Stored Procedure' (SP), this SP will check if 'productNo' exist or not, if not existed then do insert else do update.

CREATE PROCEDURE `insertNewProduct`(IN inProductNo INT, IN inTitle VARCHAR(45), IN inPrice FLOAT)
BEGIN
  INSERT INTO product (productNo, title, price)
  VALUES (inProductNo, inTitle, inPrice)
  ON DUPLICATE KEY UPDATE
  title = VALUES(title)
  , price = VALUES(price);
END
To insert a new product we simply call the SP like these 4 samples below:
CALL insertNewProduct(111, 'Apples', 4.49);
CALL insertNewProduct(222, 'Bananas', 4.49);
CALL insertNewProduct(333, 'Cherries', 5.49);
CALL insertNewProduct(444, 'Dates', 6.49);
The example result after these 4 insert queries executed:
SELECT * FROM product;

--------------------------------------------------------------------------------
id | productNo | title     | price | insertDT            | updateDT            |
--------------------------------------------------------------------------------
1  | 111       | Apples    | 4.49  | 2019-01-01 15:51:14 | 2019-01-01 15:51:14 |
2  | 222       | Bananas   | 4.49  | 2019-01-01 15:51:17 | 2019-01-01 15:51:17 |
3  | 333       | Cherries  | 5.49  | 2019-01-01 15:51:24 | 2019-01-01 15:51:24 |
4  | 444       | Dates     | 6.49  | 2019-01-01 15:51:26 | 2019-01-01 15:51:26 |
If we executed those 4 queries slowly with a few seconds in between then we can see different value for 'insertDT'. At this point the table 'product_history' should not have any record because all the products are using new and unique 'productNo'.

Next step is to try to insert the same 'productNo' (222) to trigger update for both 'title' (Bananas B2) and 'price' (4.99), such as:
CALL insertNewProduct(222, 'Bananas B2', 4.99);
The result will be similar to:
SELECT * FROM product;

--------------------------------------------------------------------------------
id | productNo | title     | price | insertDT            | updateDT            |
--------------------------------------------------------------------------------
1  | 111       | Apples    | 4.49  | 2019-01-01 15:51:14 | 2019-01-01 15:51:14 |
2  | 222       | Bananas B2| 4.99  | 2019-01-01 15:51:17 | 2019-01-01 15:52:56 |
3  | 333       | Cherries  | 5.49  | 2019-01-01 15:51:24 | 2019-01-01 15:51:24 |
4  | 444       | Dates     | 6.49  | 2019-01-01 15:51:26 | 2019-01-01 15:51:26 |
And if our Trigger is working properly as expected then there will be a new record created in 'product_history' table
SELECT * FROM product_history;

--------------------------------------------------------------------------------
id | productNo | title     | price | insertDT            | updateDT            |
--------------------------------------------------------------------------------
1  | 222       | Bananas   | 4.49  | 2019-01-01 15:51:17 | 2019-01-01 15:51:17 |
This solution is very simple and very easy to understand, the main logic is before we update/change data in 'product' table the existing record will be copied and inserted into 'product_history'.

pro
  • Simplest logic (easy to implement)
  • The Master 'product' table will have fast performance for reading, writing and update
con
  • Large storage requirement because of 2 separate tables with the same column count (too many redundant if actual total column is 20 or more but the the change was only 1 or 2 columns).
  • Very hard to find which column(s) is/are changed, in this sample there are only 2 non-unique fields 'title' and 'price' but in real project the total fields maybe a lot more, there is no easy/short query because need to compare ALL columns, but we can do a simple query to do visual comparison:
    SELECT * FROM product WHERE productNo = 222
    UNION
    SELECT * FROM product_history WHERE productNo = 222 ORDER BY updateDT DESC;
    
    --------------------------------------------------------------------------------
    id | productNo | title     | price | insertDT            | updateDT            |
    --------------------------------------------------------------------------------
    2  | 222       | Bananas B2| 4.99  | 2019-01-01 15:51:17 | 2019-01-01 15:52:56 |
    1  | 222       | Bananas   | 4.49  | 2019-01-01 15:51:17 | 2019-01-01 15:51:17 |
    


Solution 2: using 1 table to store both Master data and History data. This solution is trying to simplify the complex requirement of using 2 separate tables in 'Solution 1' and also avoid the need to create MySQL Trigger.

  • In this example, we will create a new table and name it as 'product_joined_master_and_history', this table similar to 'product' table in 'Solution 1' and it has a new column 'isMaster' as a flag indicate whether the row is Master or History record (i.e.: 'isMaster' data type is bit(1) [true/false]).
  • 'updateDT' will not have auto updated constraint, so the table schema is a little bit different:
    CREATE TABLE `product_joined_master_and_history` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `productNo` int(11) NOT NULL DEFAULT '0',
      `title` varchar(45) DEFAULT NULL,
      `price` float NOT NULL DEFAULT '0',
      `insertDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
      `updateDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
      `isMaster` bit(1) NOT NULL DEFAULT b'0',
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
    
  • Create a Stored Procedure (SP) as a function to insert data and to set proper 'isMaster' value:
    CREATE PROCEDURE `insertNewProductAsMaster`(IN inProductNo INT, IN inTitle VARCHAR(45), IN inPrice FLOAT)
    BEGIN
      -- get the 'insertDT' if the same 'productNo' is existed
      SET @PRODUCT_DATE_INSERT = (SELECT insertDT FROM product_joined_master_and_history WHERE productNo = inProductNo LIMIT 1);
    
      -- set all existing records (same productNo) as History
      UPDATE product_joined_master_and_history
      SET isMaster = false
      WHERE isMaster = true
      AND productNo = inProductNo;
    
      -- insert a new master, use IFNULL to avoid NULL value if not exist
      INSERT INTO product_joined_master_and_history
      (productNo, title, price, insertDT, isMaster)
      VALUES (inProductNo
    	, inTitle
    	, inPrice
    	, IFNULL(@PRODUCT_DATE_INSERT, NOW())
    	, true);
    
    END
    
    To insert new products, we simply call the SP like below:
    CALL insertNewProductAsMaster(111, 'Apples', 4.99);
    CALL insertNewProductAsMaster(222, 'Bananas', 4.99);
    CALL insertNewProductAsMaster(333, 'Cherries', 5.99);
    CALL insertNewProductAsMaster(444, 'Dates', 6.99);
    
    The result (sorted by updateDT):
    SELECT * FROM product_joined_master_and_history ORDER BY updateDT DESC, id DESC;
    
    -------------------------------------------------------------------------------------------
    id | productNo | title    | price | insertDT            | updateDT            | isMaster |
    -------------------------------------------------------------------------------------------
    4  | 444       | Dates    | 6.99  | 2019-01-01 16:57:20 | 2019-01-01 16:57:45 | 1        |
    3  | 333       | Cherries | 5.99  | 2019-01-01 16:57:20 | 2019-01-01 16:57:38 | 1        |
    2  | 222       | Bananas  | 4.99  | 2019-01-01 16:57:20 | 2019-01-01 16:57:27 | 1        |
    1  | 111       | Apples   | 4.99  | 2019-01-01 16:57:20 | 2019-01-01 16:57:20 | 1        |
    
    Since all these 4 products are new (no duplicate), so their's flag 'isMaster' are all '1' (true), next we will try to insert a product with an existed 'productNo' (444).
    CALL insertNewProductAsMaster(444, 'Dates D2', 7.49);
    
    The result:
    SELECT * FROM product_joined_master_and_history ORDER BY updateDT DESC, id DESC;
    
    -------------------------------------------------------------------------------------------
    id | productNo | title    | price | insertDT            | updateDT            | isMaster |
    -------------------------------------------------------------------------------------------
    5  | 444       | Dates D2 | 7.49  | 2019-01-01 16:57:23 | 2019-01-01 16:57:45 | 1        |
    4  | 444       | Dates    | 6.99  | 2019-01-01 16:57:23 | 2019-01-01 16:57:23 | 0        |
    3  | 333       | Cherries | 5.99  | 2019-01-01 16:57:22 | 2019-01-01 16:57:22 | 1        |
    2  | 222       | Bananas  | 4.99  | 2019-01-01 16:57:21 | 2019-01-01 16:57:21 | 1        |
    1  | 111       | Apples   | 4.99  | 2019-01-01 16:57:20 | 2019-01-01 16:57:20 | 1        |
    
    See the record 'id' = 4 has field 'isMaster' changed from 1 (true) to 0 (false), there is a new record 'id' = 5 has the same 'productNo' = 444 and with 'isMaster' = 1 (true)

pro
  • Easier to maintain because everything is stored in only 1 table.
  • There is no need to use 'Trigger', all logics are inside an SP (function)
  • This is my subjective opinion, this structure is more suitable for data which is require infrequent update, in other word it is perfect for frequent read and less update/change, in my experience using middle spec PC as server this logic able to maintain a good performance if the ratio of 'Master' records are at least 90% from total records ('History' records are 10% or less).

con
  • Error prone, if for unknown reason server is crashed (eg: maybe shutdown during operation) during SP operation to create new record operation then maybe a 'productNo' has no Master ('isMaster' = true) record or even worse that we may have more than 1 Master records. NOTE: I got these error a few times, not sure why but maybe this happen during development stage which has program bug.
  • Slow query performance because all History data are inside the same Master table, this table is heavy for both read and write operations, in my previous eCommerce project for only 6 months of operations, we got only a few thousands of Master products but we got more than a million of History records, hence slower for longer business operation.
  • Need more frequent to backup data (as archive) before delete old history data, (delete policy example: if record is History ('isMaster' = false) and 'updateDT' has passed 6 months).
  • Like solution 1 above, it is hard to create long query to find which column(s) is/are changed for a particular 'productNo', but the query to do visual comparison is easier than solution 1 because only query against in 1 table, below is a query to get 1 'productNo' (444) history:
    SELECT * FROM product_joined_master_and_history WHERE productNo = 444 ORDER BY updateDT DESC, id DESC;
    
    -------------------------------------------------------------------------------------------
    id | productNo | title    | price | insertDT            | updateDT            | isMaster |
    -------------------------------------------------------------------------------------------
    5  | 444       | Dates D2 | 7.49  | 2019-01-01 16:57:23 | 2019-01-01 16:57:45 | 1        |
    4  | 444       | Dates    | 6.99  | 2019-01-01 16:57:23 | 2019-01-01 16:57:23 | 0        |
    

Solution 3: using a small and fixed columns for History table (2 separate tables like 'Solution 1').

CREATE TABLE `product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`productNo` int(11) NOT NULL DEFAULT '0',
`title` varchar(45) DEFAULT NULL,
`price` float NOT NULL DEFAULT '0',
`insertDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updateDT` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_productNo` (`productNo`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
The Master 'product' table schema above is the same exact schema like solution 1.

For keeping history data, we create a new table named 'product_track_history', this table schema is designed specifically to track changes of each columns and keep history data of Master 'product' table.
CREATE TABLE `product_track_history` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `productNo` int(11) NOT NULL,
  `columnName` varchar(45) NOT NULL,
  `oldValue` varchar(200) DEFAULT NULL,
  `newValue` varchar(200) DEFAULT NULL,
  `changeDate` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  • This structure is proposed to solve a much easier way to track data changes while also able to keep history data.
  • The 'product_track_history' columns are fixed and independent of 'product' table column count, ie: 'product' table may grow very large to 40 or 50 columns but 'product_track_history' will be the same.
  • 'columnName' is to indicate which column is changed.
  • 'oldValue' and 'newValue' in this sample exercise only store MAX 200 characters (change if necessary), store all non-text data type column (int, float, double, bigint, datetime, etc.) as text.
  • 'changeDT' is for tracking when the data is changed/updated.

Create an MySQL Trigger BEFORE UPDATE to 'product' (Master) table, this Trigger will be executed automatically to create a new record in 'product_track_history'.
DELIMITER $$
DROP TRIGGER IF EXISTS `product_before_update`;
$$
CREATE TRIGGER `product_before_update`
BEFORE UPDATE ON `product` FOR EACH ROW
BEGIN
  -- we compare ONLY the columns which we want to track

  -- compare title
  IF OLD.title != NEW.title
  THEN
    INSERT INTO `product_track_history`
    (productNo, columnName, oldValue, newValue)
    VALUES (OLD.productNo, 'title', OLD.title, NEW.title);
  END IF;

  -- compare price
  IF OLD.price != NEW.price
  THEN
    INSERT INTO `product_track_history`
    (productNo, columnName, oldValue, newValue)
    VALUES (OLD.productNo, 'price', CAST(OLD.price AS CHAR), CAST(NEW.price AS CHAR));
  END IF;

  -- if have more columns to compare then continue below with the same logic above
END;
Create a Stored Procedure (SP) to insert record to 'product', this SP will check if 'productNo' exist or not, if existed then do update instead of insert. (copy the exact SP from 'Solution 1' above).
CREATE PROCEDURE `insertNewProduct`(IN inProductNo INT, IN inTitle VARCHAR(45), IN inPrice FLOAT)
BEGIN
  INSERT INTO product (productNo, title, price)
  VALUES (inProductNo, inTitle, inPrice)
  ON DUPLICATE KEY UPDATE
  title = VALUES(title)
  , price = VALUES(price);
END
For testing and avoid confusion with previous data in 'product' table, we need to truncate (delete all data) 'product' table then we insert the first unique record.
CALL insertNewProduct(888, 'Mango', 6.49);
The result:
SELECT * FROM product;

------------------------------------------------------------------------------------
id | productNo | title        | price  | insertDT            | updateDT            |
------------------------------------------------------------------------------------
1  | 888       | Mango        | 6.49   | 2019-01-01 16:07:14 | 2019-01-01 16:07:14 |


SELECT * FROM product_track_history;
--------------------------------------------------------------------------------------
id | productNo | columnName     | oldvalue     | newValue      | changeDT            |
--------------------------------------------------------------------------------------
   |           |                |              |               |                     |
There is 1 record in 'product' table and there is no record in 'product_track_history', next we will try to insert duplicated (existed) 'productNo' (888) and we only change the 'title' column from 'Mango' to 'Mango Green' like:
CALL insertNewProduct(888, 'Mango Green', 6.49);
The result:
SELECT * FROM product;

------------------------------------------------------------------------------------
id | productNo | title        | price  | insertDT            | updateDT            |
------------------------------------------------------------------------------------
1  | 888       | Mango Green  | 6.49   | 2019-01-01 16:07:14 | 2019-01-01 16:09:17 |


SELECT * FROM product_track_history;
--------------------------------------------------------------------------------------
id | productNo | columnName     | oldvalue     | newValue      | changeDT            |
--------------------------------------------------------------------------------------
1  | 888       | title          | Mango        | Mango Green   | 2019-01-01 16:09:17 |
We can see in the 'product' table has an updated 'Mango Green' record and in the 'product_track_history' there is a new record showing the 'productNo' (888) has 'columnName' value as 'title' changed from 'Mango' to 'Mango Green'.

Next we will try to update it more with only change 1 column 'price', such as
CALL insertNewProduct(888, 'Mango Green', 5.49);
Then we update it again to change 2 columns 'title' and 'price', such as
CALL insertNewProduct(888, 'Mango Yellow', 10.49);
The final result
SELECT * FROM product;

------------------------------------------------------------------------------------
id | productNo | title        | price  | insertDT            | updateDT            |
------------------------------------------------------------------------------------
1  | 888       | Mango Yellow  | 10.49 | 2019-01-01 16:07:14 | 2019-01-01 16:11:27 |


SELECT * FROM product_track_history;
--------------------------------------------------------------------------------------
id | productNo | columnName     | oldvalue     | newValue      | changeDT            |
--------------------------------------------------------------------------------------
1  | 888       | title          | Mango        | Mango Green   | 2019-01-01 16:09:17 |
2  | 888       | price          | 6.49         | 5.49          | 2019-01-01 16:11:20 |
3  | 888       | title          | Mango Green  | Mango Yellow  | 2019-01-01 16:11:27 |
4  | 888       | price          | 5.49         | 10.49         | 2019-01-01 16:11:27 |
We see that each column changes will have record in 'product_track_history' table, it is very simple and easy to understand.

The main feature of this structure is how easy it is to get the history of changes for 1 'productNo' (888), we can use the following query to see what was changed and when was changed (sorted by time):
SELECT * 
FROM product_track_history
WHERE productNo = 888
ORDER BY changeDT DESC, id DESC

--------------------------------------------------------------------------------------
id | productNo | columnName     | oldvalue     | newValue      | changeDT            |
--------------------------------------------------------------------------------------
4  | 888       | price          | 5.49         | 10.49         | 2019-01-01 16:11:27 |
3  | 888       | title          | Mango Green  | Mango Yellow  | 2019-01-01 16:11:27 |
2  | 888       | price          | 6.49         | 5.49          | 2019-01-01 16:11:20 |
1  | 888       | title          | Mango        | Mango Green   | 2019-01-01 16:09:17 |
pro
  • History table is very simple and fixed column, it is independent of Master table column count.
  • Very easy to find which and when data is changed.
  • 'product_track_history' table storage size maybe smaller than 'product_history' table storage size from 'Solution 1', it is because 'product_track_history' only store truncated text (limited length) in 'oldValue' and 'newValue'. But if absolute security and data integrity are required then we need to use BLOB data type to store 100% of the values for both 'oldValue' and 'newValue'.

con
  • Imagine if a product record is updated for 10 columns, 'Solution 1' will only create 1 record in 'product_history' but in this 'Solution 3' will have 10 new additional records in 'product_track_history', so 'product_track_history' will have more records than 'product_history'.
  • The Trigger Before Update may get too complicated if there are so many columns to check, also may impact query performance.

Summary

In normal business operation (when there is no problem), we don't actually care about which data is changed and when it is changed, the need to track data history is important only when there is problem or emergency. Creating the right database schema is not only very important but also to help us to solve the problem easily and prevent us from being too panic to face a sudden problem, especially during off-work time.

Even though this article using MySQL for demo, but the similar logic can be use for other databases.

My personal recommendation is to use Solution 3, as an added bonus now we can develop a simple and beautiful UI page to showcase this functionality. Note that Solution 3 using the same 'product' table as in solution 1 and also using the same Trigger as in solution 1, so to upgrade solution 1 to solution 3 may not be very difficult. Hopefully, the logic described in this article can help other database developers not to face the problem I had .. LOL

Please feel free to share your comments or critics or maybe suggest me for another solution ^.^