This is the fifth part of a six-part article about best practices for designing a highly available storage system, looked at from different perspectives. Today I am going to talk about what it takes to design storage LUNs for highly available storage.
RAID‐level data protection
If a pool RAID group fails completely, all the LUNs bound within that Virtual Provisioning pool will suffer data loss.
The larger the number of private RAID groups within the pool, the greater the exposure to such a failure.
It is important to choose a level of protection for the pool in line with the value of the pool's contents. Three levels of data protection are available for pools:
1. RAID 5 has good data protection capability. If one drive of a private RAID group fails, no data is lost. RAID 5 is appropriate for small to moderate sized pools. It may also be used in small to large pools provisioned exclusively with SAS and flash drives, which have a high availability factor.
2. RAID 6 provides the highest data availability. With RAID 6, up to two drives may fail in a private RAID group without any data loss. Note that this is true double-disk failure protection. RAID 6 is appropriate for any size pool, including the largest possible.
3. RAID 1/0 has high data availability. A single disk failure in a private RAID group results in no data loss. Multiple disk failures within a RAID group may be survived, as long as a primary drive and its mirror do not fail together. If both fail, data will be lost. Note that this is not double-disk failure protection. RAID 1/0 is appropriate for small to moderate sized pools.
The trade-off factors in choosing among these are availability, performance, and capacity utilization.
If the priority is absolutely on availability, then RAID 6 is the recommendation.
If the priority is capacity utilization or performance, and we have a solid data protection design in place (backups, replication, hot spares, etc.), then using RAID 5 on a FAST pool is likewise a sound decision.
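To make the capacity side of this trade-off concrete, here is a minimal sketch that computes the usable fraction of raw capacity for one private RAID group. The group sizes used (4+1, 6+2, 4+4) are assumptions chosen for illustration, not values taken from this article, and the function and variable names are hypothetical.

```python
def usable_fraction(data_drives: int, protection_drives: int) -> float:
    """Return the share of raw capacity left for data in one RAID group.

    protection_drives counts parity drives (RAID 5/6) or mirror drives (RAID 1/0).
    """
    if data_drives <= 0 or protection_drives < 0:
        raise ValueError("data_drives must be > 0 and protection_drives >= 0")
    return data_drives / (data_drives + protection_drives)


# Assumed example layouts: (data drives, parity or mirror drives).
LAYOUTS = {
    "RAID 5 (4+1)": (4, 1),
    "RAID 6 (6+2)": (6, 2),
    "RAID 1/0 (4+4)": (4, 4),
}


def summarize(layouts=LAYOUTS):
    """Map each layout name to its usable capacity fraction."""
    return {name: usable_fraction(*shape) for name, shape in layouts.items()}


if __name__ == "__main__":
    for name, fraction in summarize().items():
        print(f"{name}: {fraction:.0%} of raw capacity is usable")
```

With these assumed group sizes, RAID 5 leaves the most usable capacity and RAID 1/0 the least, which is why capacity utilization tends to pull the decision toward RAID 5 when availability is covered by other means.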
Number of RAID groups
A fault domain describes the scope of data that becomes unavailable when a component fails. A Virtual Provisioning pool is made up of one or more private RAID groups, and a pool's fault domain is a single pool private RAID group. That is, the availability of a pool is the availability of any single private RAID group: losing one group affects the whole pool, so every additional group adds exposure. Unless RAID 6 is the pool's level of protection, we should avoid creating pools with a very large number of RAID groups.
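The effect of the RAID group count can be sketched with a simple probability model. This is only an illustration under stated assumptions: each private RAID group is assumed to fail independently with the same probability over some period, and the probability values below are made up rather than measured. The function name is hypothetical.

```python
def pool_loss_probability(group_loss_probability: float, raid_groups: int) -> float:
    """Probability that at least one private RAID group in the pool is lost.

    Assumes every group fails independently with the same probability,
    and that losing any single group means data loss for the pool.
    """
    if not 0.0 <= group_loss_probability <= 1.0:
        raise ValueError("group_loss_probability must be between 0 and 1")
    if raid_groups < 1:
        raise ValueError("a pool needs at least one RAID group")
    # The pool survives only if every group survives.
    return 1.0 - (1.0 - group_loss_probability) ** raid_groups


if __name__ == "__main__":
    # Illustrative per-group probability, not a vendor figure.
    p = 0.01
    for groups in (1, 5, 10, 20):
        print(f"{groups:>2} RAID groups: {pool_loss_probability(p, groups):.4f}")
```

Under this model the risk grows with every group added to the pool, which is the reasoning behind keeping non-RAID 6 pools to a moderate number of RAID groups.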
Rebuild Time and other MTTR functions
A failure in a pool-based architecture may affect a greater number of LUNs than in a traditional LUN architecture, so the fault domain is the key factor to consider here.
Quickly bringing RAID groups back to normal operation becomes important for the overall work recovery time (WRT) of the storage system.
We should always keep hot spares of the appropriate type available. Proactive hot sparing reduces the adverse effect a rebuild would have on back-end performance. In addition, always replace failed drives as quickly as possible to maintain the number of available hot spares.
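As a rough planning aid, the number of hot spares to keep per drive type can be sketched as below. The ratio of one spare per 30 drives is an assumption used here as a default parameter, so check the vendor's current guidance for your array. The drive counts and the function name are hypothetical.

```python
import math
from typing import Dict


def hot_spares_needed(drive_counts: Dict[str, int], drives_per_spare: int = 30) -> Dict[str, int]:
    """Return the number of hot spares to keep for each drive type.

    drives_per_spare is an assumed ratio (one spare per N drives of a type).
    Drive types with no drives installed need no spare and are skipped.
    """
    if drives_per_spare < 1:
        raise ValueError("drives_per_spare must be at least 1")
    return {
        drive_type: math.ceil(count / drives_per_spare)
        for drive_type, count in drive_counts.items()
        if count > 0
    }


if __name__ == "__main__":
    # Hypothetical inventory of installed drives by type.
    inventory = {"SAS": 45, "NL-SAS": 30, "Flash": 5}
    for drive_type, spares in hot_spares_needed(inventory).items():
        print(f"{drive_type}: keep {spares} hot spare(s)")
```

Counting spares per drive type reflects the point above that spares must be of the appropriate type for the drives they protect.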
A typical EMC VNX storage system with a hot spare looks like this.