Risks of Unexpected Power Loss on Solid State Drives (SSDs)

Shutting Down Correctly

During a normal “clean” system shutdown, the host signals to the storage device that a shutdown is occurring and that power will soon be removed. Upon receiving this signal, the storage device is expected to flush any data from volatile storage to non-volatile storage and then send an acknowledgement to the host stating that the drive is ready for power removal. The host is expected to wait for this acknowledgement before proceeding with the shutdown. This process ensures that power is removed only after data is stored safely on the device’s non-volatile storage.

Risks of Unexpected Power Loss

Sangoma FreePBX/PBXact systems and NSG and SBC appliances use standard industrial-grade SSDs for data storage. The purpose of this document is to inform customers about the risks that unexpected power loss poses to the data integrity of solid state drives (SSDs). These risks range from severe, in the case of the loss of an entire drive, to mild, in the case of a few kilobytes of lost data, to none, in the case of a drive with power loss protection circuitry. An overview of the common SSD memory hierarchy and the typical data path for a write operation is given, with areas of risk highlighted. Modern industry solutions currently used to eliminate the effects of unexpected power loss are also discussed.

Memory Hierarchy

An SSD uses integrated circuit assemblies as memory to store data persistently. Its memory hierarchy comprises both volatile and non-volatile storage media. Volatile storage, such as DRAM, offers very fast access times but requires a constant source of power to retain data; if power is removed, anything stored in volatile memory is lost. Non-volatile storage, such as NAND, does not require a constant source of power to retain data but has much slower access times. SSDs store the bulk of their data in non-volatile storage to maintain data integrity, and keep a small amount of the most frequently or most recently used data in volatile storage. This approach offers a compromise between access times and data retention.

 

Write Operation

The accompanying diagram shows a typical write operation from host to SSD. Data starts on the host and is sent to the SSD’s volatile cache. The SSD storage controller monitors this operation and is responsible for directing the flow of data, among various other tasks. Once the data is transmitted, the host waits for the SSD to acknowledge that the data was received before proceeding. Under most configurations the SSD acknowledges receipt as soon as the data reaches its volatile cache. This improves performance and helps to lower the number of writes that need to go to non-volatile storage, but it also puts the data at risk if power is unexpectedly removed. After the data is in the volatile cache and the host has received the acknowledgement, the host continues processing, and the drive firmware later decides when the data should be copied to the drive’s non-volatile storage.
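The write path described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class ToySSD and its method names are hypothetical and do not correspond to any real drive firmware or host API, and the cache and NAND are modelled as plain dictionaries.

```python
class ToySSD:
    """Toy model of an SSD with a volatile write-back cache (illustrative only)."""

    def __init__(self):
        self.cache = {}  # volatile: LBA -> data, lost when power is removed
        self.nand = {}   # non-volatile: LBA -> data, survives power loss

    def write(self, lba, data):
        # The data lands in the volatile cache and is acknowledged immediately.
        self.cache[lba] = data
        return "ACK"

    def flush(self):
        # The firmware copies cached data to non-volatile storage.
        self.nand.update(self.cache)
        self.cache.clear()

    def clean_shutdown(self):
        # The host announced the shutdown: flush first, then acknowledge.
        self.flush()
        return "READY_FOR_POWER_OFF"

    def power_loss(self):
        # Dirty shutdown: whatever is still in the cache is gone.
        self.cache.clear()

    def read(self, lba):
        # Serve from the cache if present, otherwise from NAND.
        return self.cache.get(lba, self.nand.get(lba))


drive = ToySSD()
drive.write(0, "config-v1")
drive.flush()                      # v1 has reached NAND
ack = drive.write(0, "config-v2")  # acknowledged, but only held in the cache
drive.power_loss()
after_dirty = drive.read(0)        # the acknowledged v2 write did not survive

drive.write(0, "config-v2")
drive.clean_shutdown()
after_clean = drive.read(0)        # after a clean shutdown v2 is on NAND
print(ack, after_dirty, after_clean)
```

The point of the sketch is that the host sees the same acknowledgement in both cases; only the clean shutdown guarantees that the acknowledged data has reached non-volatile storage.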

Unexpected Power Loss (Dirty Shutdown)

 

A “dirty shutdown” occurs any time power is removed from a drive without the clean shutdown process described above. Common examples are unexpected power outages, accidental power removal from a computer, users unplugging a storage device from a host system while power is on, and battery power loss. When power is removed from a drive without power loss protection and without the clean shutdown process, the data stored in the drive’s volatile cache is lost. Because the volatile cache holds the data that was used most recently or is used most often, that data could be very important user data, data used by the host operating system, or even data that the SSD itself needs for normal operation. Power loss while the drive is idle or servicing a read generally does not cause major issues, but if a write operation is in progress there is potential for some data loss, or worse. Power loss during a write is also known as a Write Abort, since the write operation is aborted prior to completion.

Metadata loss

In addition to losing data written to the drive by the host, a drive can also lose important metadata describing how data is mapped on the drive. Each time data is accessed or modified on the SSD, metadata about the state of that data is updated. This information allows the storage controller to choose what data should be in the drive’s volatile cache at any given moment and to implement techniques that increase performance and endurance. A loss or corruption of metadata can result in seemingly lost data, where the drive no longer has a pointer to the location of the actual data. More dramatically, a loss of metadata can result in an SSD that can no longer perform read or write operations [1].

Wear leveling

One of the most common storage improvement techniques is wear leveling. This technique allows the drive to age evenly by mapping new writes to the physical NAND cells that have had the fewest total writes. This is necessary because NAND cells have a finite life determined by the number of writes performed on them. The technique is transparent to the host, which uses a Logical Block Address (LBA) when executing read/write operations on the drive. The drive then translates the LBA to a Physical Block Address (PBA) by way of a mapping table maintained by the flash translation layer (FTL). The FTL table is held in cache, and it is the responsibility of the controller to flush this table to non-volatile memory at times it deems appropriate or during a clean shutdown. The metadata associated with each page of data written also contains information that can be used to rebuild the FTL table if it is lost, but this rebuild takes time at the next power-on. If both the FTL table and the metadata are corrupted by a sudden power loss event, the data stored on the SSD can become lost or, even worse, the SSD as a whole can become inaccessible. SSD vendors go to great lengths to prevent inaccessibility even in the event of a sudden power loss.
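The following Python sketch illustrates the two ideas in this section: choosing the least-written free location for each new write, and rebuilding the LBA-to-PBA table from per-page metadata after the cached copy is lost. It is a deliberately simplified model, not any vendor's algorithm. The class ToyFTL and its fields are hypothetical, each physical block holds a single page, and a stale copy is freed immediately instead of waiting for garbage collection.

```python
class ToyFTL:
    """Toy LBA -> PBA translation with wear leveling (illustrative only)."""

    def __init__(self, num_blocks):
        self.write_counts = [0] * num_blocks  # wear per physical block
        self.pages = [None] * num_blocks      # PBA -> (lba, data), or None when free
        self.table = {}                       # LBA -> PBA, held in volatile cache

    def write(self, lba, data):
        # Assumes at least one free block exists (min() raises ValueError otherwise).
        free = [pba for pba, page in enumerate(self.pages) if page is None]
        # Wear leveling: choose the free block with the fewest writes so far.
        pba = min(free, key=lambda b: self.write_counts[b])
        self.pages[pba] = (lba, data)  # the LBA is stored alongside the data
        self.write_counts[pba] += 1
        old = self.table.get(lba)
        self.table[lba] = pba
        if old is not None:
            self.pages[old] = None     # toy simplification: stale copy freed at once
        return pba

    def read(self, lba):
        return self.pages[self.table[lba]][1]

    def rebuild_table(self):
        # Recover LBA -> PBA by scanning the metadata stored with every page.
        self.table = {page[0]: pba
                      for pba, page in enumerate(self.pages) if page is not None}


ftl = ToyFTL(num_blocks=4)
ftl.write(7, "other")
for i in range(6):
    ftl.write(0, f"v{i}")  # the host rewrites the same LBA six times

saved = dict(ftl.table)
ftl.table = {}             # simulate losing the cached table in a power loss
ftl.rebuild_table()
print(ftl.table == saved, ftl.read(0), ftl.write_counts)
```

Although the host writes to LBA 0 repeatedly, the writes are spread across the physical blocks rather than landing on one location. The rebuild step has to scan every page, which is why recovery at the next power-on takes time on a real drive.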

Background Processes (Garbage collection)

Another area of concern is unexpected power removal during the execution of background processes such as garbage collection. Due to technical limitations, overwriting existing data in NAND memory requires an erase of the old data. Additionally, NAND memory can only be erased in large collections of bits referred to as blocks, and written in smaller collections of bits called pages (see the accompanying diagram). When a command is sent to overwrite existing data in NAND, instead of overwriting the data in place, which would require a copy, erase, and write cycle, the drive can write the new data to unused pages and remap the LBA to the new PBA. The old data is then flagged as invalid. Later, while the drive is not busy servicing the host, the valid pages in a block that also holds invalid pages can be copied to free pages so that the block can be erased and reused. While this happens, the LBA-to-PBA mapping data is being updated. For this reason, if power is unexpectedly removed while garbage collection is taking place, faults can occur in the translation capabilities of the drive.
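The copy, remap, and erase sequence can be sketched in Python as follows. This is a minimal illustration under assumptions: the function garbage_collect, the constant PAGES_PER_BLOCK, and the data layout are hypothetical, a page counts as valid only when the mapping table still points at it, and the spare block is assumed to be erased and large enough.

```python
PAGES_PER_BLOCK = 4  # hypothetical block size for the example


def garbage_collect(blocks, table, victim, spare):
    """Copy valid pages from blocks[victim] into blocks[spare], then erase the victim.

    blocks: list of blocks, each a list of pages; a page is (lba, data) or None.
    table:  LBA -> (block index, page index) mapping.
    """
    dst = 0
    for page_no, page in enumerate(blocks[victim]):
        if page is None:
            continue
        lba, data = page
        if table.get(lba) == (victim, page_no):  # page is still valid
            blocks[spare][dst] = (lba, data)     # copy to a free page
            table[lba] = (spare, dst)            # mapping changes mid-operation
            dst += 1
    # NAND can only be erased a whole block at a time.
    blocks[victim] = [None] * PAGES_PER_BLOCK


blocks = [
    [(0, "a-old"), (1, "b"), (0, "a-new"), (2, "c-old")],  # mix of valid and invalid
    [(2, "c-new"), None, None, None],
    [None] * PAGES_PER_BLOCK,                              # erased spare block
]
table = {0: (0, 2), 1: (0, 1), 2: (1, 0)}

garbage_collect(blocks, table, victim=0, spare=2)
print(table)
print(blocks[2])
```

Every iteration of the loop leaves the mapping table in an intermediate state, which is the window in which an unexpected power loss can leave the drive with inconsistent translation data.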

Prevention of Unexpected Power Loss

Uninterruptible Power Supply (UPS)

A UPS is typically used to protect hardware such as computers, telecommunication equipment, data centers, or other electrical equipment where an unexpected power disruption could cause injuries, fatalities, serious business disruption, or data loss. An external battery backup power supply can be used to power the host system, which in turn powers the SSD. This method protects the drive from unexpected power loss caused by power outages, but not from the host system’s internal power loss events, such as unexpected removal or system failures. UPS systems tend to be very costly but can offer protection from the most common forms of unexpected power loss.

 

Pros: Protects the entire system from unexpected power loss (as long as the system is shut down cleanly before the battery supply is depleted).

Cons: Expensive. Does not protect the drive against host system power failures or unexpected removal.
