Persistent Database Storage

Persistent database storage refers to a mechanism for storing data in a database system in a way that ensures durability and reliability. It is designed to retain data even in the face of system failures, power outages, or other disruptions. Persistent storage is a critical component of a database system, as it allows data to be stored and retrieved reliably over a long period. In a persistent database storage system, data is typically stored on non-volatile storage devices such as hard disk drives (HDDs) or solid-state drives (SSDs). These devices retain data even when the power is turned off, ensuring that the data remains accessible after a system restart.

The following is a general overview of the persistent data storage design for RaimaDB. The information stored in the persistent storage files is not directly readable or writable outside of the provided RaimaDB APIs.

General Design Overview

Balancing performance and data safety is the primary goal of the RaimaDB persistent data storage design. The design needed to fulfill the ACID requirements for databases: atomicity, consistency, isolation, and durability, the reliability properties of database transactions that a database management system must be able to guarantee. To accomplish this, the following characteristics are built into the design:

  • The public API of RaimaDB requires that all updates to the database be performed within a transaction. This lets the system know which updates belong together for an insert or update operation. If the physical write of data to the database is interrupted, the system can always distinguish the incomplete operation, which is then ignored.
  • All updates to the database are written to the end of the database image. Writing to the end of the file takes advantage of operating system file handling and, more importantly, allows the write to proceed without destroying the previous data content. When the new data has been successfully written to the database image, the new data will be referenced by all future read operations.
  • The "file sync" operation is expensive on persistent data storage devices. However, it is necessary to ensure that the data can be recovered successfully in case of an abnormal shutdown of the system. Confining updates to the end of the database file allows the operating system to process the sync operation efficiently, allows RaimaDB to detect an incomplete transaction write efficiently, and limits the number of sync operations to just one.
  • Navigation through the database image is managed using tree nodes which contain pointers to the actual location of the data in the database image.
  • The database image is segmented into "pack" files. The files are named sequentially, and their size is limited to 2 GiB; a smaller size can be configured if necessary. The "pack" file segmentation allows the system to return persistent storage space to the file system once a file is no longer referenced.
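
To make the segmentation idea concrete, here is a minimal, illustrative Python sketch (not RaimaDB code) of an append-only writer that rolls over to a new, sequentially named pack file once a size limit is reached. The `PackWriter` class, the 100-byte demo limit, and the `(pack number, offset)` address tuple are all assumptions made for this example; only the 2 GiB default and the p<number>.pack naming come from the text.

```python
import os
import tempfile

PACK_LIMIT = 2 * 1024 ** 3  # 2 GiB default; configurable to a smaller size

class PackWriter:
    """Hypothetical append-only writer segmented into sequential pack files."""
    def __init__(self, directory, limit=PACK_LIMIT):
        self.directory = directory
        self.limit = limit
        self.seq = 0
        self._open_next()

    def _open_next(self):
        self.seq += 1
        path = os.path.join(self.directory, f"p{self.seq:08d}.pack")
        self.file = open(path, "ab")

    def append(self, data: bytes) -> tuple[int, int]:
        """Append data, rolling to a new pack file if the limit would be
        exceeded. Returns (pack number, offset): the record's address."""
        if self.file.tell() + len(data) > self.limit:
            self.file.close()
            self._open_next()
        offset = self.file.tell()
        self.file.write(data)
        return self.seq, offset

with tempfile.TemporaryDirectory() as d:
    w = PackWriter(d, limit=100)          # tiny limit for demonstration
    addrs = [w.append(b"x" * 40) for _ in range(3)]
    w.file.close()
    print(addrs)   # [(1, 0), (1, 40), (2, 0)] -- third record rolls to pack 2
```

Because records are only ever appended, an address handed out once stays valid until the pack file itself is retired.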

Storage Principles

The first principle we rely on is full recursive “copy on write.” The persistent data structure used in RaimaDB 16.0 is a tree (or a tree of trees, depending on the level of abstraction). Two tree structures are used: B-trees and R-trees. The database will have one tree for each key defined in the schema, but other trees are needed as well; the details are beyond the scope of this article. Finally, to tie all of these trees together, there is also a main tree that holds references to the root nodes of the other trees.

The second principle we rely on is always appending to the end of the last pack file. Therefore, pack files are never updated in the middle. This means that expensive updates in the middle where only a small fraction of a page is updated are avoided, and there is also no need to manage free space. Space that is no longer referenced simply ends up not being accessed.
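
The copy-on-write principle can be illustrated with a small Python sketch (not RaimaDB internals; the `Node` class and functions below are invented for this example). Updating one leaf allocates a new leaf plus new copies of every node on the path up to the root; the old root is untouched and still reaches the original data, while unmodified subtrees are shared between both roots.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    """Immutable binary-tree node; immutability enforces copy-on-write."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def cow_insert(node, key, value):
    """Return a NEW root; only the path to the change is copied,
    and unmodified subtrees are reused as-is."""
    if node is None:
        return Node(key, value)
    if key == node.key:
        return Node(key, value, node.left, node.right)
    if key < node.key:
        return Node(node.key, node.value, cow_insert(node.left, key, value), node.right)
    return Node(node.key, node.value, node.left, cow_insert(node.right, key, value))

def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

old_root = cow_insert(cow_insert(cow_insert(None, 2, "b"), 1, "a"), 3, "c")
new_root = cow_insert(old_root, 3, "C")            # rewrites only the path 2 -> 3
print(lookup(old_root, 3), lookup(new_root, 3))    # c C -- old view unchanged
print(new_root.left is old_root.left)              # True -- untouched subtree shared
```

Nothing is overwritten in place, which is exactly what makes appending the copied path to the end of a pack file sufficient to commit a change.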

Appending and Writing Changes “Bottom Up”

There are quite a few implications that follow from these two principles. One implication that naturally follows is that all updates must be done by appending them to the end of the pack. RaimaDB commits a transaction by first writing new and updated items that do not have direct or indirect references to other updated items, followed by tree nodes that reference them, their parent tree nodes, their grandparent tree nodes, and so on until the root node of the main B-tree has been written.

This design means there will be additional nodes that must be written even though there were otherwise no changes to them. This is typically not a concern for large transactions, where multiple updates are more likely to occur on the same B-tree node. On the other hand, the extra overhead may be substantial for small transactions with just a few changes.
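
The bottom-up commit order described above can be modeled with a short Python sketch (illustrative only, not RaimaDB code): nodes are appended to a log in post-order, so every child is written before any node that references it, every reference points to an earlier offset, and the main root is always the last record of a complete commit.

```python
def commit(node, log):
    """Append `node` and its children to `log` bottom-up;
    return the node's offset in the log."""
    if node is None:
        return None
    left_off = commit(node.get("left"), log)       # children first...
    right_off = commit(node.get("right"), log)
    offset = len(log)
    log.append({"key": node["key"], "left": left_off, "right": right_off})
    return offset                                   # ...parent last

tree = {"key": 2,
        "left": {"key": 1},
        "right": {"key": 3}}
log = []
root_offset = commit(tree, log)
print([rec["key"] for rec in log], root_offset)  # [1, 3, 2] 2 -- root written last
```

Because the root is written last, finding a valid root record during recovery implies that everything it references was already written successfully.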

The best performance can be expected for use cases that involve time series or circular tables.

Snapshot

With recursive “copy on write,” snapshots can be implemented very efficiently. When a snapshot is requested, the current root node of the main B-tree can be used by the client requesting the snapshot and its view of the data is derived from what can be reached from this node. The only difference between a snapshot and a read using read locks is that a client requesting a read lock will block updates to the tables on which the read locks are requested, while a snapshot will not. No other additional expensive internal data structures are needed to handle the snapshot. The client requesting the snapshot can simply handle requests for content from the Transactional File Server (TFS) exactly how it would handle a normal read with locks. It gets a main B-tree root node and based on the references therein, it requests other nodes from the TFS.
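
A toy Python model (not the actual TFS protocol; the `store` dict and `put` helper are invented here) shows why snapshots are so cheap under copy-on-write: a snapshot is nothing more than remembering the current main root reference. Reads through that reference see a frozen view, while writers keep installing new roots.

```python
store = {}          # append-only node store: offset -> record
def put(rec):
    """Append a record and return its offset (records are never overwritten)."""
    store[len(store)] = rec
    return len(store) - 1

root = put({"value": "v1", "prev": None})       # commit 1 installs a root
snapshot = root                                  # snapshot: just keep the reference
root = put({"value": "v2", "prev": root})        # commit 2 installs a new root

print(store[snapshot]["value"], store[root]["value"])  # v1 v2
```

No locks and no extra bookkeeping structures are needed: the snapshot holder simply keeps reading from its saved root while new commits append elsewhere.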

As a result, the pack file format utilizing copy-on-write is well suited for snapshots. When a snapshot is requested, older versions of data that have since been updated remain referenceable in the pack. From the pack file implementation standpoint, we just need to ensure that pack files containing data visible to a snapshot are not deleted.

Durability

Since data is only appended to the end of the last pack file, whose file format is carefully designed, durability can be achieved in most cases simply by syncing the last pack file. This has some performance benefits compared to other databases that need to sync several files, including the older versions of RaimaDB.

It turns out that most syncs can be omitted if durability is not required. A system crash may cause some data loss in this case. However, since we are only appending to a file, if a main B-tree root node is found, all the data referenced from this root node should be valid. This assumes a file system implementation that does not lose blocks in the middle of a file when only appending occurs to that file. There is one precaution RaimaDB must take for this to work: any time a new pack file is requested, the previous pack file has to be synced (and on Linux and Unix, we also have to sync the directory). This ensures that no reference from one pack file to a previous pack file is lost.
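
The precaution described above can be sketched in Python (illustrative only; the `roll_to_next_pack` function is an assumption for this example, not a RaimaDB API). Before a new pack file is created, the previous pack is fsync'd, and on Linux/Unix the containing directory is fsync'd as well, so that references from the new pack into the old one can never point at unsynced data.

```python
import os
import tempfile

def roll_to_next_pack(directory, current, seq):
    """Close out the current pack durably, then open the next one."""
    current.flush()
    os.fsync(current.fileno())            # make the old pack's data durable first
    current.close()
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                  # make the old pack's directory entry durable
    finally:
        os.close(dir_fd)
    seq += 1
    path = os.path.join(directory, f"p{seq:08d}.pack")
    return open(path, "ab"), seq

with tempfile.TemporaryDirectory() as d:
    f, seq = open(os.path.join(d, "p00000001.pack"), "ab"), 1
    f.write(b"data referenced from the next pack")
    f, seq = roll_to_next_pack(d, f, seq)
    names = sorted(os.listdir(d))
    f.close()
print(names, seq)   # ['p00000001.pack', 'p00000002.pack'] 2
```

The ordering matters: syncing the old pack before the first write to the new one guarantees that nothing in the new pack can refer to data that could still vanish in a crash.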

We have also implemented an unsafe mode, where we do not sync pack files at all. This mode should only be used if losing data in the database is not a concern or it can be guaranteed that the operating system does not shut down or crash without syncing files. It is OK for the database to crash since the content written will be in the file system cache. Note that this may not be the case with some embedded systems, where the file system is implemented in user space.

For recovery, the pack file format has been crafted to make it possible to read the pack file both forward and backward. For every 64KiB, there is special recovery information that makes it possible to scan the file forward and backward without having to read the whole file from the beginning. This property is important for efficient and reliable recovery. Recovery does not write anything to the pack but instead finds the last main B-tree root node and verifies the end of the pack. If a database is found to be in a certain incomplete state, a new pack file will be created. This simply means that crashes may leave contents in the pack files that are not directly or indirectly referenced by the main B-tree root node.
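
The recovery idea of scanning backward for the last intact root can be illustrated with a simplified Python sketch. This is not the actual pack format: the record layout here (a CRC32 header plus a trailing length footer that enables backward scanning) and the `kind == 1` root marker are assumptions invented for this example.

```python
import struct
import zlib

def append_record(buf, kind, payload):
    """Append one record: crc32(4B) + kind(1B) + payload + body length(4B)."""
    body = struct.pack("B", kind) + payload
    buf.extend(struct.pack("<I", zlib.crc32(body)) + body + struct.pack("<I", len(body)))

def recover(buf):
    """Scan backward; return payload of the last valid root record (kind 1)."""
    end = len(buf)
    while end >= 9:                                  # 4B crc + 1B kind + 4B length
        (body_len,) = struct.unpack("<I", buf[end - 4:end])
        start = end - 4 - body_len - 4
        if start < 0:
            end -= 1                                 # torn tail: back up and retry
            continue
        (crc,) = struct.unpack("<I", buf[start:start + 4])
        body = buf[start + 4:end - 4]
        if zlib.crc32(body) != crc:
            end -= 1                                 # corrupt candidate: keep backing up
            continue
        if body[0] == 1:                             # found an intact root record
            return bytes(body[1:])
        end = start                                  # valid non-root: keep scanning back
    return None

buf = bytearray()
append_record(buf, 0, b"node A")
append_record(buf, 1, b"root #1")
append_record(buf, 0, b"node B")
append_record(buf, 1, b"root #2")
buf += b"\x07partial write lost in a crash"          # torn tail from a crash
print(recover(buf))   # b'root #2'
```

As in the real design, recovery writes nothing: it simply discards the unreferenced tail and resumes from the last complete root.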

The pack file format allows multiple update transactions to be active in parallel, but when they commit, appending to the pack file will alternate among those that are currently committing. A small transaction may be able to append all its data in one chunk while larger transactions will need to take turns writing their data. The main B-tree needs to be written exclusively and each time that is done, changes from other update transactions or the vacuumer (a garbage collector, we cover next) have to be integrated with the changes for the transaction at hand.

Vacuuming

The vacuuming process is a key component of the copy-on-write technique. Once the original copy of data is no longer necessary for recovery or for multi-version reading (snapshots), the old data needs to be discarded and given back to the persistent storage file system. The vacuum process moves current data to the end of the current pack file, and when the oldest pack file no longer has any references to current data (or data not yet committed, or data referenced by a snapshot), the pack file can be deleted and the space freed for the persistent storage file system. See the Database Vacuuming section for more details.
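
A highly simplified Python model of this process (not RaimaDB's vacuumer; packs are modeled as lists of record ids and the `vacuum_oldest` function is invented for this example): the vacuumer re-appends the oldest pack's still-referenced records to the newest pack, after which the oldest pack holds nothing live and can be deleted.

```python
def vacuum_oldest(packs, live):
    """packs: dict pack_no -> list of record ids; live: set of referenced ids.
    Copy live records from the oldest pack to the end of the newest pack,
    then drop the oldest pack entirely."""
    oldest, newest = min(packs), max(packs)
    survivors = [rec for rec in packs[oldest] if rec in live]
    packs[newest].extend(survivors)       # re-append live data at the end
    del packs[oldest]                     # oldest pack file can now be deleted
    return survivors

packs = {10: ["a", "b", "c"], 11: ["d"], 12: ["e"]}
live = {"b", "d", "e"}                    # "a" and "c" are unreferenced garbage
moved = vacuum_oldest(packs, live)
print(moved, sorted(packs))               # ['b'] [11, 12]
```

Note the asymmetry with a traditional free-space manager: space is never reclaimed in the middle of a file, only whole pack files are returned to the file system.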

Vacuuming becomes important with the copy-on-write implementation to control the total database size on disk.

By employing this technique, persistent database storage ensures that data remains safe, durable, and available for retrieval even in the face of unexpected failures or disruptions. It forms the foundation for reliable and consistent data management in various applications ranging from embedded devices to enterprise systems and beyond.

Organization

The persistent database image will be stored in a sub-directory within the Document Root (docroot) directory. For example, the HELLO_WORLD database image will be stored in a sub-directory of the docroot named HELLO_WORLD.rdm

Pack Files

The database directory will store the pack files (*.pack). The pack files serve the purpose of a database transaction log for recovery purposes but are also the object store for the database. The pack files are completely self-contained and portable (platform agnostic). This means, for example, that pack files created on a PowerPC or ARM architecture can be copied to and used by a computer with an x86 architecture.

The pack files are sequentially named and the database image, when large enough, will require multiple pack files to store the complete database image.

A valid database CANNOT have any breaks in the naming sequence. In other words, if the first pack file is p00000010.pack and the last pack file is p00000013.pack, the intermediate pack files p00000011.pack and p00000012.pack MUST exist in the directory.
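
A check for this rule can be sketched in a few lines of Python (illustrative only; the `check_pack_sequence` function is an assumption for this example): parse the number out of each p<8-digit>.pack name and verify the numbers form one contiguous run.

```python
import re

def check_pack_sequence(names):
    """Return True if the pack file names form an unbroken numeric sequence."""
    nums = sorted(int(re.fullmatch(r"p(\d{8})\.pack", n).group(1)) for n in names)
    return nums == list(range(nums[0], nums[0] + len(nums)))

print(check_pack_sequence(["p00000010.pack", "p00000011.pack",
                           "p00000012.pack", "p00000013.pack"]))  # True
print(check_pack_sequence(["p00000010.pack", "p00000013.pack"]))  # False
```

Any hole in the sequence means part of the database image is missing, so the database as a whole is invalid.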