It is a common task to record data coming from multiple sources and keep it for a long time so that it can be analysed in the future. The traditional storage for such data is a relational database such as Oracle or MS SQL Server. However, the cost of long-term storage in an RDBMS, per unit of data volume, is relatively high.

Relational databases are usually optimized for querying: they aim to execute arbitrary, potentially complex SQL queries efficiently, so all of the data is kept ready for retrieval. However, most of the data that gets captured is rarely accessed, and at any given time a query typically touches only a small part of it, while the total volume may grow very large. Demand for the data is also rarely homogeneous: certain portions tend to be analyzed far more often than the rest (for example, data for the last month may be accessed much more often than data that is ten years old). Keeping all of the data in the same state of readiness is therefore usually excessive, and most of the resources spent on RDBMS storage in this scenario remain unproductive.

Optimizing relational database storage to address these issues is likely to be technically complex, and an expensive edition of the RDBMS may be required. Without such optimization, the maintenance cost rises with the size of the database and scalability becomes a problem: index and table defragmentation, as well as deletion of old data, may be difficult and expensive.

The repository provides an alternative solution. It specializes in storing timestamped data and spends no resources on facilitating data analysis: it cannot execute generic SQL queries, but it reads and writes timestamped data sequentially with good performance. Data is written to disk in a compressed format and structured as an arbitrary tree of data streams. Reading selected streams over a selected time range is therefore fast, as opposed to reading in random order. Importantly, performance depends on the size of the repository only to the extent that the filesystem's performance does: as long as the filesystem can efficiently retrieve a file's content by its path, the repository performs well. Achieving the same performance and scalability in an RDBMS requires far more resources and expertise.
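The stream-tree layout can be sketched on top of an ordinary filesystem. The sketch below is illustrative only: the class name, hourly bucket files, and JSON-lines payloads are assumptions for the example, not the repository's actual on-disk format. It shows the key property described above: a time-range read opens only the buckets overlapping the range, so cost does not grow with total repository size.

```python
import gzip
import json
import os

class StreamRepository:
    """Hypothetical sketch: each stream is a directory path under a root;
    records are appended to hourly gzip files named by bucket start time."""

    BUCKET = 3600  # one file per hour of data

    def __init__(self, root):
        self.root = root

    def _stream_dir(self, stream_path):
        return os.path.join(self.root, *stream_path.split("/"))

    def append(self, stream_path, ts, record):
        # Sequential, append-only, compressed writes keep ingestion cheap.
        d = self._stream_dir(stream_path)
        os.makedirs(d, exist_ok=True)
        bucket = int(ts) // self.BUCKET * self.BUCKET
        path = os.path.join(d, "%d.jsonl.gz" % bucket)
        with gzip.open(path, "at") as f:
            f.write(json.dumps({"ts": ts, "data": record}) + "\n")

    def read_range(self, stream_path, ts_from, ts_to):
        # Only buckets overlapping [ts_from, ts_to] are opened and read
        # sequentially; the rest of the repository is never touched.
        d = self._stream_dir(stream_path)
        names = sorted(os.listdir(d), key=lambda n: int(n.split(".")[0]))
        for name in names:
            bucket = int(name.split(".")[0])
            if bucket + self.BUCKET <= ts_from or bucket > ts_to:
                continue
            with gzip.open(os.path.join(d, name), "rt") as f:
                for line in f:
                    rec = json.loads(line)
                    if ts_from <= rec["ts"] <= ts_to:
                        yield rec
```

A real implementation would choose bucket granularity and a binary record format to suit the data rate; the point here is only the path-plus-time-range addressing scheme.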

The overall architecture of a hybrid system using the repository and a relational database is as follows.
Incoming data is stored in the repository. When the data needs to be analyzed, the required portion is read from the repository and loaded into the RDBMS, the necessary reports are produced, and the data is then truncated, i.e. dropped from the relational database. The size of the relational database is thus kept to a minimum.
Active (new) data does not interfere with idle data: there are no maintenance or defragmentation issues, no expensive transactional deletion of old data (which is costly when old data is physically mixed with active data), and no contention between reporting and storing new incoming data. With the database kept small, it may become possible to use a less expensive or free edition of the RDBMS (MS SQL Server Express 2008 R2 has a 10 GB size limit, which should be enough in most cases if it is not used to store history data permanently).
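The load-report-truncate cycle can be sketched as follows, with Python's built-in sqlite3 standing in for the RDBMS (the function name and the staging-table schema are assumptions for the example). Only the requested slice is ever loaded, arbitrary SQL runs against it, and the table is dropped afterwards so the database never grows beyond the working set.

```python
import sqlite3

def report_for_range(records):
    """Hypothetical sketch of the hybrid flow: 'records' is an iterable of
    {"ts": ..., "data": ...} dicts read from the repository for one
    time range; sqlite3 stands in for the relational database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE staging (ts REAL, value REAL)")
    # Load only the slice needed for this report.
    db.executemany("INSERT INTO staging VALUES (?, ?)",
                   [(r["ts"], r["data"]) for r in records])
    # Arbitrary SQL is available while the slice is loaded.
    (avg,) = db.execute("SELECT AVG(value) FROM staging").fetchone()
    # Truncate: drop the slice so the RDBMS stays small.
    db.execute("DROP TABLE staging")
    db.close()
    return avg
```

Because the staging data lives in the RDBMS only for the duration of the report, deleting it is a cheap metadata operation rather than a transactional delete mixed into live tables.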

Last edited Nov 16, 2011 at 7:42 AM by vasokolbaso, version 3

