This is the second entry in the Pavilion blog series on three important design areas of storage products: Bandwidth, Latency, and Density. This entry focuses on Latency and Density.
The access latency from the host to the media is composed of host storage stack latency, network stack latency, IO controller latency and media access latency. The first three components are fairly standard; we minimize the number of memory copies and limit data touches to keep their latency to a minimum. The media access latency is largely governed by the type of media (NVMe NAND in our case) and the associated drive controller overhead.
Figure 1 – NAND Access Latency Tail
Further, normal writes take significantly longer than reads. For example, at the media level, typical read operations are in the 50-60 us range, write operations are in the 600-800 us range, and erase operations take 3 ms or more. These numbers reflect the technology available today, and each of them continues to improve. The net result is that read latency can become inconsistent depending on the temporal state of operations (a write or erase in progress) at the block in question, and the user can experience significant outliers.
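To make the size of these outliers concrete, the sketch below computes the worst-case read latency when a read has to wait for a full write or erase on the same block. The constants are midpoints or lower bounds of the ranges quoted above, and the assumption that the read waits for the entire blocking operation is a simplification for illustration, not a model of any particular drive.

```python
# Illustrative media-level timings in microseconds (assumed values taken
# from the ranges quoted in the text, not measurements).
READ_US = 55      # midpoint of 50-60 us
WRITE_US = 700    # midpoint of 600-800 us
ERASE_US = 3000   # lower bound of "3 ms or more"


def worst_case_read_us(blocking_op_us: int) -> int:
    """Read latency if the read must wait for a full blocking operation."""
    return blocking_op_us + READ_US


scenarios = [
    ("idle block", 0),
    ("write in progress", WRITE_US),
    ("erase in progress", ERASE_US),
]

for name, op_us in scenarios:
    latency = worst_case_read_us(op_us)
    ratio = latency / READ_US
    print(f"{name:>18}: {latency:>5} us ({ratio:.1f}x an unblocked read)")
```

Even with these rough numbers, a read stuck behind an erase is more than fifty times slower than an unblocked read, which is what produces the long tail.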
The Pavilion Flash Array uses two methods to mitigate the effect of these outliers. The first builds on proposals in the NVMe standards body to control the scheduling of garbage collection and other media-related management operations, under the general categories of Advanced Background Operations (ABO) and IO determinism. These allow the IO Controller to pre-emptively (and periodically) schedule the ABOs to clean house and thus reduce the average latency at any given time. Some SSD vendors have started implementing these features in their new offerings. The Pavilion Flash Array combines this knowledge with an additional mechanism to mitigate the latency outliers.
As part of its enterprise reliability feature set, the Pavilion Flash Array can handle up to two drive failures in an 18-drive group, and up to 8 failures system-wide across 72 drives. The IO Controller that schedules ABOs tracks which drives and blocks have operations in progress. When a new read arrives from the host that targets a drive or block where an ABO is in progress, the IO Controller treats the drive under maintenance as a temporary failure and recovers the data from the complementary set of drives. While this slightly increases the cost of drive access and adds the cost of recomputing the data, it is significantly cheaper than waiting for the ABO to complete. The mechanism also uses heuristics and timers to decide when to access the complementary set and when to wait for the ABO.
A second reason that read access outliers occur is a write followed by a read. The write can take a long time to complete, and the read can get stuck behind it. The Pavilion Flash Array keeps a cache of recent writes and avoids the media read access if there is a hit. The drives also employ this mechanism, but they are limited by the amount of memory available in the drive controller.
Thus, the Pavilion Flash Array not only provides the lowest latency but also mitigates the latency tail (shifting it left) and provides consistent latency. This design advantage translates into a benefit for the end customer, who can now reduce the size of the buffers on the host (and thus reduce cost or improve effectiveness) and get better application throughput and operational predictability.
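The read-around idea can be sketched in a few lines. The example below is a minimal illustration, not the array's implementation: it swaps in single XOR parity for simplicity (the array described here tolerates two failures per group, which requires a stronger code), and the `Drive` class, the `read_chunk` function, and the timing constants are all hypothetical. The heuristic is reduced to one comparison: wait if the remaining ABO time plus a read is cheaper than reconstruction, otherwise rebuild from the peers.

```python
from dataclasses import dataclass
from functools import reduce

READ_US = 55          # assumed media read time
RECONSTRUCT_US = 150  # assumed cost of reading peers and recomputing


@dataclass
class Drive:
    chunk: bytes                   # this drive's chunk of the stripe
    abo_remaining_us: float = 0.0  # time left on a background op (0 = idle)


def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def read_chunk(drives, target):
    """Return (data, path) for the chunk stored on drives[target]."""
    drive = drives[target]
    wait_cost_us = drive.abo_remaining_us + READ_US
    if wait_cost_us <= RECONSTRUCT_US:
        return drive.chunk, "direct"
    peers = [d for i, d in enumerate(drives) if i != target]
    if any(d.abo_remaining_us > 0 for d in peers):
        # Single parity cannot cover a second busy drive, so wait instead.
        return drive.chunk, "direct"
    return xor_chunks([d.chunk for d in peers]), "reconstructed"


# Three data chunks plus one XOR parity chunk.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
stripe = [Drive(c) for c in data] + [Drive(xor_chunks(data))]

stripe[1].abo_remaining_us = 3000  # drive 1 is in the middle of an erase
print(read_chunk(stripe, 1))
```

With drive 1 busy for 3 ms, the read is served from the other three drives and returns the same bytes that drive 1 holds. A real controller would also account for queue depth and load on the peer drives when making this decision.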
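A recent-write cache of this kind can be sketched as a bounded map from block address to the last data written. The class and function names below are hypothetical, and the eviction policy (drop the oldest write once capacity is reached) is an assumption chosen for brevity; the source does not describe the actual policy.

```python
from collections import OrderedDict


class RecentWriteCache:
    """Bounded map from logical block address to the most recent write."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def record_write(self, lba, data):
        self._entries[lba] = data
        self._entries.move_to_end(lba)         # newest write goes last
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the oldest write

    def lookup(self, lba):
        """Return the cached data, or None on a miss."""
        return self._entries.get(lba)


def read(lba, cache, read_from_drive):
    """Serve a read from the cache when possible, else go to the media."""
    cached = cache.lookup(lba)
    if cached is not None:
        return cached
    return read_from_drive(lba)


cache = RecentWriteCache(capacity=2)
cache.record_write(100, b"new data")
# A hit returns immediately and never queues behind the in-flight write.
print(read(100, cache, read_from_drive=lambda lba: b"from media"))
```

The capacity bound is the point of the comparison in the text: a cache at the array level can be sized much larger than one that has to fit in a drive controller's memory.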
There are three dimensions to density as seen from a customer’s point of view:
- Drive density or total amount of media available per RU
- Compute density or total CPU processing capability available per RU
- Network Bandwidth or total number of Network ports (xBW) available per RU
The best drive density available in the market today among designs based on standard U.2 NVMe SSDs is about 24 drives per 2RU. At 6.4 TB per drive, that is about 77 TB/RU. The Pavilion Flash Array can house 72 SSDs in 4RU, or about 115 TB/RU using the same 6.4 TB drives.
Typical 2RU storage arrays based on dual-socket x86 systems are built with two Xeon E5-266x class CPUs from Intel, each with 14 or more cores at 2 GHz+. The total compute capability is thus limited to these two CPUs, which perform both control plane and data plane functions. The Pavilion Flash Array in a 4RU form factor has 20 Broadwell-DE 1548 class CPUs, each with 8 cores at 2 GHz, dedicated to data plane processing.
The maximum number of network ports in a typical 2RU server is either 8x40G or 8x100G (depending on the NIC silicon choice). Even though this is the nominal number of network interfaces, it is important to note that they cannot all run concurrently at full bandwidth because of the inherent limit on the number of PCIe lanes available. The Pavilion Flash Array can support up to 40x100G in a 4RU box.
These significant differences in each dimension of density can be achieved because the system is built with a fabric as the centerpiece instead of the CPU. Once the scalable fabric is realized, the density tradeoffs are bounded only by the constraints of power, cost and mechanical design. We will discuss fabric design and scalability in a separate article.
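The drive and compute comparisons above reduce to simple per-RU arithmetic, worked through below. The `Chassis` class is a hypothetical helper; the drive counts, CPU counts and core counts come from the text, with 14 cores per CPU taken as the lower bound quoted for the typical server. Core count is only a rough proxy for compute capability, since the two CPU families differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chassis:
    name: str
    rack_units: int
    drives: int
    cpus: int
    cores_per_cpu: int

    def tb_per_ru(self, tb_per_drive=6.4):
        """Raw media capacity per rack unit, in TB."""
        return self.drives * tb_per_drive / self.rack_units

    def cores_per_ru(self):
        """CPU cores per rack unit."""
        return self.cpus * self.cores_per_cpu / self.rack_units


typical = Chassis("Typical 2RU server", rack_units=2, drives=24,
                  cpus=2, cores_per_cpu=14)
pavilion = Chassis("Pavilion 4RU array", rack_units=4, drives=72,
                   cpus=20, cores_per_cpu=8)

for c in (typical, pavilion):
    print(f"{c.name}: {c.tb_per_ru():.1f} TB/RU, "
          f"{c.cores_per_ru():.0f} cores/RU")
```

Under these assumptions the array offers roughly 1.5 times the media and close to three times the cores per rack unit, and its cores are dedicated to the data plane rather than shared with control plane work.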
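A back-of-the-envelope lane budget shows why the nominal port count cannot be driven at full rate. The sketch assumes PCIe 3.0 (8 GT/s per lane with 128b/130b encoding), 40 lanes per CPU socket, and x4 links for U.2 drives; these figures are assumptions about a typical dual-socket server of this class rather than numbers from the text, and protocol overhead is ignored.

```python
import math

# Assumptions (not from the text): PCIe 3.0 signalling and lane counts
# for a typical dual-socket server of this generation.
PCIE3_GBPS_PER_LANE = 8.0 * 128 / 130  # about 7.88 Gb/s per direction
LANES_PER_SOCKET = 40
SOCKETS = 2
LANES_PER_U2_DRIVE = 4


def lanes_needed(ports, gbps_per_port):
    """Minimum PCIe lanes required to carry the ports at line rate."""
    return math.ceil(ports * gbps_per_port / PCIE3_GBPS_PER_LANE)


available = LANES_PER_SOCKET * SOCKETS
nic_lanes = lanes_needed(ports=8, gbps_per_port=100)
drive_lanes = 24 * LANES_PER_U2_DRIVE

print(f"lanes available from CPUs: {available}")
print(f"lanes for 8x100G at line rate: {nic_lanes}")
print(f"lanes for 24 U.2 drives:    {drive_lanes}")
print(f"shortfall: {nic_lanes + drive_lanes - available}")
```

Under these assumptions the NICs alone need more lanes than the two CPUs expose, before any lanes are allocated to the drives, so either the links are oversubscribed through PCIe switches or the ports run below line rate.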
Besides the standard servers with U.2 SSDs, there are other storage arrays that have custom SSDs that can achieve good density. However, this path leads to a business model where SSD designs have to keep up with flash silicon vendor technology, and might not be as economically feasible.
The end-customer benefits from a dense design because in addition to housing a large amount of media, the Pavilion Flash Array is sharable, manageable and provides ample network connectivity.
In summary, the Pavilion Flash Array can provide performance advantages in multiple dimensions: Throughput, Latency and Density. It can sustain this edge because of the structural advantage of being a fabric-centric array, as opposed to a traditional CPU-centric array. The figure below contrasts the design concept of a traditional array with that of the Pavilion Flash Array.
Figure 2 – Traditional Flash Array vs. Pavilion Flash Array
A traditional storage array is designed around the CPU as the central element. This CPU acts as both the data plane (I/O) and control plane processor. The scalability and performance of this type of design are inherently dependent on the I/O and compute processing capabilities of the CPU. The Pavilion Flash Array is a fabric-centric design. Its scalability and performance are limited by the radix of the fabric and by thermal and mechanical considerations. The fabric provides separate I/O channels for each CPU in the data plane, so performance can scale with the aggregate I/O and compute processing capabilities of all the CPUs. A separate management CPU controls the platform resources, services and storage management.