Database Vacuuming

A new era of mass storage devices

One of the main reasons for database vacuuming is storage: not only to reclaim capacity but also to extend the lifetime of the devices.

Hard disk drives (HDDs) have ruled the market for many years. More recently, a new technology has emerged in solid-state drives (SSDs). Both technologies have seen tremendous development, with improvements in speed, reliability, capacity, and price. As SSDs become cheaper, they have been replacing HDDs in many applications.

How HDDs work on a conceptual level is well understood by most developers, but for SSDs there are many misconceptions. We will therefore give a brief introduction to both technologies to set the scene for the following discussion.

Hard Disk Drives (HDDs)

HDDs store data on rotating magnetic platters using read/write heads mounted on actuator arms. Because of these moving parts, reads and writes on HDDs have high latency compared to integrated circuits (ICs). Depending on the actual HDD model, the data may not be written immediately but instead held in the controller's cache to be written later. This approach improves write performance at the cost of durability. To minimize the impact on durability, such HDDs are often equipped with a large capacitor or a battery backup, making it possible to write out the cached data or keep it alive in the battery-backed cache in the event of a power outage.

Because of how HDDs work, data is always read and written in units of pages. Modern HDDs have a page size of 4,096 bytes (or 4 KiB) and the actual layout of the data on the platter is not directly visible to the host.

Solid-State Drives (SSDs)

SSDs store data on special ICs without any moving parts, providing low read and write latency compared to HDDs. These ICs are assembled in a particular way for SSDs to be cost effective; they will be referred to as flash memory from here on. The description here is simplified and not entirely accurate, as the details also depend on the actual manufacturer and model of the SSD.

SSDs use the same page size as modern HDDs, even though the overhead of accessing a page is almost negligible compared to that of an HDD. Reading a small number of bytes from an SSD is much more efficient than from an HDD. The discussion here assumes the page size presented to the host to be the same as the page size of the flash memory. As with HDDs, the actual layout of the data in the flash memory is not directly visible to the host.

The flash memory is organized as a number of sectors, and each sector has a number of pages. Each page can be written a limited number of times, and before a page can be updated, the whole sector where the page resides has to be erased. This scheme has some ramifications.

When data is updated, SSDs write it to a “fresh” page (copy-on-write), or a page that has not yet been written to since the last time the sector was erased. This write makes the page “in use.” The page that holds the old data is said to be “stale.”

In addition to normal writes, pages in use are regularly copied from sectors with a lot of stale pages to sectors with fresh pages. A sector that has all its pages marked as stale is erased, making all the pages on that sector fresh. The net effect is that more fresh pages become available for normal writes. Erasing a sector can be an expensive operation and is often the bottleneck for write performance. This process is referred to as garbage collection.
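
To make these mechanics concrete, the following is a minimal sketch of a hypothetical flash translation layer in C. The geometry, the half-stale relocation threshold, and all names are illustrative assumptions, not a description of any real SSD controller.

```c
#include <stdio.h>

/* Simplified flash geometry; the numbers are illustrative only. */
#define SECTORS       4
#define PAGES_PER_SEC 8

typedef enum { FRESH, IN_USE, STALE } page_state;

static page_state flash[SECTORS][PAGES_PER_SEC];

/* Claim a fresh page (optionally avoiding one sector); -1 if none left. */
static int claim_fresh(int avoid, int *sec, int *pg)
{
    for (int s = 0; s < SECTORS; s++) {
        if (s == avoid)
            continue;
        for (int p = 0; p < PAGES_PER_SEC; p++) {
            if (flash[s][p] == FRESH) {
                flash[s][p] = IN_USE;
                *sec = s;
                *pg = p;
                return 0;
            }
        }
    }
    return -1; /* out of fresh pages: garbage collection must run */
}

/* Update = copy-on-write: new copy on a fresh page, old copy goes stale. */
static int update_page(int old_sec, int old_pg, int *sec, int *pg)
{
    if (claim_fresh(-1, sec, pg) != 0)
        return -1;
    flash[old_sec][old_pg] = STALE;
    return 0;
}

/* GC: move live pages out of mostly-stale sectors, then erase them. */
static void garbage_collect(void)
{
    for (int s = 0; s < SECTORS; s++) {
        int stale = 0;
        for (int p = 0; p < PAGES_PER_SEC; p++)
            stale += (flash[s][p] == STALE);
        if (stale <= PAGES_PER_SEC / 2)
            continue; /* not enough stale pages to be worth an erase */

        int ok = 1;
        for (int p = 0; p < PAGES_PER_SEC && ok; p++) {
            if (flash[s][p] == IN_USE) {
                int ns, np;
                ok = (claim_fresh(s, &ns, &np) == 0);
                if (ok)
                    flash[s][p] = STALE; /* the relocated copy lives on */
            }
        }
        if (!ok)
            continue; /* nowhere to relocate; leave this sector alone */
        for (int p = 0; p < PAGES_PER_SEC; p++)
            flash[s][p] = FRESH; /* the expensive erase: all pages fresh */
    }
}

int main(void)
{
    int s[4], p[4];
    for (int i = 0; i < 4; i++)
        claim_fresh(-1, &s[i], &p[i]); /* initial writes */
    for (int r = 0; r < 2; r++) /* two rounds of copy-on-write updates */
        for (int i = 0; i < 4; i++)
            update_page(s[i], p[i], &s[i], &p[i]);
    garbage_collect(); /* sector 0 is now all stale and gets erased */

    int fresh = 0;
    for (int i = 0; i < SECTORS; i++)
        for (int j = 0; j < PAGES_PER_SEC; j++)
            fresh += (flash[i][j] == FRESH);
    printf("fresh pages after GC: %d of %d\n", fresh, SECTORS * PAGES_PER_SEC);
    return 0;
}
```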

Garbage collection is more efficient when there are many sectors with a lot of stale pages. There are in principle three types of data changes: creations, updates, and deletions. Traditional hard drives do not need to deal with data deletion, as deleted pages are simply ignored by the file system. A solid-state drive, however, can perform garbage collection more efficiently if it knows about these ignored pages. This is implemented by the ATA command TRIM or the SCSI command UNMAP.
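
On Linux, for example, such ranges can be handed to a block device with the BLKDISCARD ioctl, which is one way TRIM/UNMAP requests are issued below the file system. A minimal sketch follows; the device path is a placeholder, and note that discarding a range destroys its contents.

```c
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKDISCARD */

int main(void)
{
    /* Placeholder device node; run against a disposable device only. */
    int fd = open("/dev/sdX", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Tell the device that this byte range no longer holds live data,
     * so its flash pages can be treated as stale by garbage collection. */
    uint64_t range[2] = { 0, 1 << 20 };   /* { offset, length }: first 1 MiB */
    if (ioctl(fd, BLKDISCARD, range) != 0)
        perror("ioctl(BLKDISCARD)");

    close(fd);
    return 0;
}
```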

There are many other aspects of SSDs that need not be covered for the following discussion. Please see Wikipedia for further details.

Database Technologies

Traditional databases use data structures that minimize the number of pages read. This approach is crucial when traditional HDDs are used, as the time it takes to retrieve the data is proportional to the number of pages read. To some extent this also holds for SSDs, since the I/O channels are optimized for HDDs, but the data can be retrieved orders of magnitude faster from an SSD than from an HDD.

Databases normally implement durability (the "D" in ACID: atomicity, consistency, isolation, and durability; reliability properties of database transactions that a database management system must be able to guarantee) using a transaction log, which may or may not include the actual user data. The pages written for the transaction log and the pages written for the updated database image are likely to end up on the same sector, since they are written around the same time. This is especially true for small transactions. The transaction log pages will normally become stale before the recently written pages of the database image. Those sectors may therefore soon end up with some stale pages that take up space, or they may soon reach the threshold for garbage collection. Performance may decrease in both cases.
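
As a rough sketch of this traditional scheme (not of RaimaDB), a write-ahead log record is appended and synced before the corresponding page of the database image is overwritten in place, so the two write streams hit the device close together in time. The file names and record layout here are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define PAGE_SIZE 4096

/* Classic write-ahead logging: the log record must reach stable storage
 * before the page is updated in place in the database image. */
static int commit_update(int log_fd, int db_fd,
                         off_t page_no, const char page[PAGE_SIZE])
{
    /* 1. Append the log record (here simply the page image) and sync it. */
    if (write(log_fd, page, PAGE_SIZE) != PAGE_SIZE || fsync(log_fd) != 0)
        return -1;

    /* 2. Only now is it safe to overwrite the page in the database image. */
    if (pwrite(db_fd, page, PAGE_SIZE, page_no * PAGE_SIZE) != PAGE_SIZE)
        return -1;
    return fsync(db_fd);
}

int main(void)
{
    char page[PAGE_SIZE] = "example page image";
    int log_fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int db_fd  = open("db.img", O_RDWR | O_CREAT, 0644);
    if (log_fd < 0 || db_fd < 0)
        return 1;
    int rc = commit_update(log_fd, db_fd, 0, page);
    close(log_fd);
    close(db_fd);
    return rc == 0 ? 0 : 1;
}
```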

RaimaDB 16.0 takes a different approach. On the file system level, the database consists of a number of pack files that hold user data, indexes, and metadata needed for recovery and vacuuming. These pack files are collectively referred to as a "pack."

The File System Perspective of the Pack

This section describes some characteristics of the pack as observed by the file system and how they affect the I/O performance of the SSD.

A pack consists of a number of pack files, each with a maximum size of 2 GiB. Pack files are written sequentially by appending to the end. The total size of the pack files is held below a threshold. When pack files are deleted, they are always deleted in the order they were created.
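
The following sketch illustrates this file-system behavior. The file naming, bookkeeping, and API are hypothetical; only the 2 GiB cap, the append-only writes, and the oldest-first deletion come from the description above.

```c
#include <stdio.h>

#define PACK_FILE_MAX ((long long)2 << 30)   /* 2 GiB per pack file */

/* Illustrative pack writer: append-only, roll over at the size cap,
 * and delete whole files oldest-first. */
typedef struct {
    FILE      *tail;        /* current (newest) pack file, append mode */
    long long  tail_size;
    int        oldest_id;   /* files oldest_id..newest_id make up the pack */
    int        newest_id;
} pack;

static void open_tail(pack *pk)
{
    char name[64];
    snprintf(name, sizeof name, "pack-%06d.dat", pk->newest_id);
    pk->tail = fopen(name, "ab");   /* writes only ever append to the end */
    pk->tail_size = 0;
}

static void pack_append(pack *pk, const void *data, size_t len)
{
    if (pk->tail_size + (long long)len > PACK_FILE_MAX) {
        fclose(pk->tail);           /* cap reached: start the next file */
        pk->newest_id++;
        open_tail(pk);
    }
    fwrite(data, 1, len, pk->tail);
    pk->tail_size += (long long)len;
}

/* Pack files are only ever deleted in the order they were created. */
static void pack_drop_oldest(pack *pk)
{
    char name[64];
    snprintf(name, sizeof name, "pack-%06d.dat", pk->oldest_id);
    remove(name);
    pk->oldest_id++;
}

int main(void)
{
    pack pk = { 0 };
    open_tail(&pk);
    pack_append(&pk, "record", 6);
    fclose(pk.tail);
    pack_drop_oldest(&pk);   /* deletion always starts at the oldest file */
    return 0;
}
```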

Since pack files are written sequentially, the pages of a given pack are likely to be clustered together on the device, especially under a large transaction load, compared to other I/O loads. When such clustering holds, those sectors are unlikely to be subject to garbage collection until the pack file is deleted.

If this assumption does not hold, the database load may have to be kept on a separate partition or a separate SSD to achieve the performance needed. These details are beyond our control, but from a database design perspective, this is the best we can do to achieve I/O performance on an SSD.

Vacuuming in Raima Database Manager (RaimaDB)

RaimaDB 16.0 performs copy-on-write and never reuses the unused space in pack files; it always writes to the end of the last pack file. Unused space in pack files is reclaimed with a process called "vacuuming." Vacuuming is similar to the garbage collection discussed earlier, except that vacuuming operates on pack files instead of sectors on an SSD. Done right, vacuuming will improve the SSD's garbage collection considerably.
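
Continuing the hypothetical pack-writer sketch from earlier, vacuuming can be pictured as copying whatever is still live out of an old pack file to the tail and then deleting the old file. The fixed-size record format and the live-ness callback are assumptions made for brevity.

```c
/* Illustrative vacuum pass over one pack file: live content is appended
 * to the tail (copy-on-write), stale content is simply not copied, and
 * the emptied file is then deleted. */
static void vacuum_one(pack *pk, int id,
                       int (*is_live)(const void *rec, size_t len))
{
    char name[64], rec[4096];   /* hypothetical fixed 4 KiB records */
    snprintf(name, sizeof name, "pack-%06d.dat", id);
    FILE *f = fopen(name, "rb");
    if (f == NULL)
        return;
    while (fread(rec, 1, sizeof rec, f) == sizeof rec) {
        if (is_live(rec, sizeof rec))
            pack_append(pk, rec, sizeof rec);   /* moved to the tail */
    }
    fclose(f);
    remove(name);   /* every live record now lives in a newer pack file */
}
```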

Reads vs Writes

One aspect that should not be overlooked is reads versus writes. As we have pointed out above, aggressive vacuuming limits the overall size of the database. Often more importantly, aggressive vacuuming clusters the data that is in use, with less wasted space in between. This is significant for what is actually located on each page: keeping data clustered means that more in-use data may fit in fewer pages and thus stay in the working set of the file system cache, potentially reducing or eliminating costly I/O to the underlying block device.

Trade-off between writes and total size

There is also a trade-off between the amount of data written and the total size of the pack. Depending on how the application is designed, or how the administrator or user configures the system, it may be hard to know exactly how to strike the right balance between the two. That is probably the main challenge with this approach. However, keep in mind that the alternatives are not very efficient: historically, we have seen how inefficient it can be to reuse space within pack files or to rely on a transaction log.
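
A back-of-the-envelope model makes the trade-off visible. Assume, as a deliberate simplification, uniform updates and a steady state in which a pack file is vacuumed once the fraction of its content still live drops to a given threshold: that fraction must be rewritten to reclaim the rest, so each byte of user data ends up written roughly 1/(1 - live) times, while a lazier threshold instead lets stale space accumulate and grows the total pack size.

```c
#include <stdio.h>

/* Simplified steady-state model: vacuuming a pack file while a fraction
 * `live` of it is still in use rewrites that fraction, so each byte of
 * user data is written about 1 / (1 - live) times overall. */
static double write_amplification(double live)
{
    return 1.0 / (1.0 - live);
}

int main(void)
{
    /* 0.20 mirrors the default VACUUM_PERCENTAGE discussed later. */
    const double live[] = { 0.20, 0.50, 0.67 };
    for (int i = 0; i < 3; i++)
        printf("live fraction %.2f -> ~%.2fx data written\n",
               live[i], write_amplification(live[i]));
    return 0;
}
```

Under this model, the last two rows come out at roughly two and three times the data written, matching the factor of two or three mentioned in the locking discussion below.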

The net result of this approach is that data may be written several times. The difference is that writes in RaimaDB 16.0 are clustered and, with a proper vacuuming threshold, will yield excellent I/O performance and make garbage collection in the SSD very efficient, as discussed previously.

Pack File Format

The pack file format is important for efficient vacuuming. The approach RaimaDB 16.0 takes is to traverse the data structures and move anything that currently resides in pack files subject to vacuuming by appending it to the last pack file. Data structures that contain references to the data being copied must also be moved. The data structures in RaimaDB 16.0 therefore carry information about the oldest pack file that can be reached directly or indirectly. This makes it possible to vacuum content efficiently without traversing more data than necessary. This approach also has the benefit that content gets clustered based on its keys, making subsequent lookups more efficient.
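
The pruning idea can be sketched as follows; the node layout and names are hypothetical, not the RaimaDB 16.0 on-disk format. Each node records the oldest pack file reachable in its subtree, so whole subtrees that reference nothing old enough can be skipped, and a node is rewritten when it either lives in an old pack file itself or references a child that was just moved.

```c
#include <stddef.h>

/* Hypothetical in-memory view of an index node. */
typedef struct node {
    int           pack_id;        /* pack file holding this node's image */
    int           min_reachable;  /* oldest pack id reachable below here */
    size_t        nchildren;
    struct node **children;
} node;

/* Vacuum everything reachable that resides in pack files <= vacuum_id.
 * Returns nonzero if this node had to be rewritten. */
static int vacuum_traverse(node *n, int vacuum_id, void (*relocate)(node *))
{
    if (n == NULL || n->min_reachable > vacuum_id)
        return 0;                     /* nothing old below: prune subtree */

    int child_moved = 0;
    for (size_t i = 0; i < n->nchildren; i++)
        child_moved |= vacuum_traverse(n->children[i], vacuum_id, relocate);

    /* A node moves if its own image is in an old pack file, or if it
     * holds references to children that were just copied. */
    if (n->pack_id <= vacuum_id || child_moved) {
        relocate(n);                  /* append a fresh copy to the tail */
        return 1;
    }
    return 0;
}
```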

Need for Locks

Furthermore, vacuuming can be done without acquiring write locks, which means that readers are not particularly affected by it. Updates are affected, since vacuuming for a given table cannot run in parallel with an ongoing update of that table. However, the amount of data to be vacuumed is expected to be up to an order of magnitude less than the amount of data written for an update, though often it may just be a factor of two or three, depending on how aggressively we vacuum.

Conditions Triggering Vacuuming

The vacuuming process executes when RaimaDB detects that the condition for vacuuming has been met. When vacuuming begins, it starts with the lowest-numbered pack file that meets the condition; this is not necessarily the first pack file in the sequence. The vacuuming process stops when the condition that triggered it has been addressed. The condition is as follows: when the space still in use in the pack files drops to the currently defined VACUUM_PERCENTAGE, the vacuuming process starts moving that in-use content into the current pack file. The default VACUUM_PERCENTAGE is 20%. See vacuum_percentage for information about this database option.
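
The trigger can be illustrated with a small check; the accounting arrays and function names here are hypothetical, and the real knob is the vacuum_percentage option mentioned above.

```c
/* Illustrative trigger: a pack file qualifies for vacuuming once the
 * share of its content still in use drops to the threshold (default 20%). */
#define VACUUM_PERCENTAGE 20

static int needs_vacuum(long long in_use_bytes, long long file_bytes)
{
    return file_bytes > 0 &&
           in_use_bytes * 100 <= file_bytes * VACUUM_PERCENTAGE;
}

/* Vacuuming starts at the lowest-numbered pack file that qualifies,
 * which is not necessarily the first file in the sequence. */
static int first_to_vacuum(const long long in_use[], const long long size[],
                           int oldest_id, int newest_id)
{
    for (int id = oldest_id; id < newest_id; id++)   /* skip the tail file */
        if (needs_vacuum(in_use[id], size[id]))
            return id;
    return -1;   /* nothing to vacuum right now */
}
```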

Manual Triggering

The vacuuming process can also be triggered manually using the rdm-vacuum tool or through an API call. The tool is typically used to remove all of the unused space from a database image, creating the smallest possible image for inclusion in an archive or installation package.