Perhaps this is one of those topics that has been covered and revised many times by industry experts like Duncan Epping and Scott Lowe. First, let me revisit the articles they have already written:
- vSphere 5.0 HA and Metro / Stretched Cluster Solutions
- Some questions about stretched clusters with regards to power outage
- Distance vMotion = Stretched Cluster ?
Now once you have read these articles (if you haven't already), you may ask me: what is new in my article? A valid question, and I did not want to reinvent the wheel, so I took a conscious decision in how to approach this topic.
In this article I am only going to cover the considerations, as the solutions are already available and well explained by our industry experts.
Stretched Cluster Considerations:
- For compute, you need hypervisors at both locations, with sufficient capacity to host the migrated or restarted (vSphere HA) virtual machines.
- You should have mirrored storage. But do you know what other storage considerations there are for stretched clusters?
- You should have Layer 2 network adjacency, and you need to consider bandwidth and latency. Do you know how much latency is supported for long-distance vMotion?
Now let me show you a simple diagram of a stretched cluster.
For stretched clusters, you should have mirrored storage at both sites, and that storage must be completely synchronized before a VM can be migrated from one location to another. Whether the synchronization is synchronous or asynchronous is a separate business decision, and the SLA is the business driver for that.
If you are mirroring data synchronously, there should be no issue with the data being synchronized, assuming there is sufficient bandwidth and minimal latency. If you are mirroring asynchronously, any outstanding writes must be completed before the VM can be moved.
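The difference between the two modes can be sketched in a few lines of Python. This is a conceptual model only, not a real replication stack; the class and method names are my own invention:

```python
# Conceptual sketch of synchronous vs. asynchronous mirroring.
# All names here are illustrative; this is not a real storage API.
from collections import deque

class MirroredVolume:
    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = []            # blocks committed at the local site
        self.remote = []           # blocks committed at the remote site
        self.pending = deque()     # writes not yet replicated (async only)

    def write(self, block):
        self.local.append(block)
        if self.synchronous:
            # Sync: the write is not acknowledged until the remote
            # site has it, so both copies are always identical.
            self.remote.append(block)
        else:
            # Async: acknowledge immediately, replicate later.
            self.pending.append(block)

    def drain(self):
        # Before a VM can move, outstanding writes must be flushed.
        while self.pending:
            self.remote.append(self.pending.popleft())

    def in_sync(self):
        return self.local == self.remote
```

In the synchronous case `in_sync()` is true after every write; in the asynchronous case it only becomes true after `drain()` completes, which is exactly the flush that has to happen before the VM can be moved.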
But wait, this is not as easy as it may sound. It can be problematic for high-I/O applications if the bandwidth cannot accommodate the amount of data being written (churn, a.k.a. change rate). Let us assume you have an Oracle VM and you try to move that VM, while it is still actively processing transactions, to a remote node. Consider what will happen in this condition.
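To make the bandwidth-versus-churn point concrete, here is a back-of-the-envelope check in Python. The figures are made up purely for illustration:

```python
# Back-of-the-envelope check: can the replication link keep up with churn?
# All figures below are illustrative, not from any real environment.

def replication_backlog_mb(churn_mb_s, link_mb_s, seconds):
    """Backlog of un-replicated data after `seconds` of steady churn."""
    return max(0.0, (churn_mb_s - link_mb_s) * seconds)

# A busy Oracle VM writing 80 MB/s over a link that carries 50 MB/s:
backlog = replication_backlog_mb(churn_mb_s=80, link_mb_s=50, seconds=3600)
print(f"Backlog after one hour: {backlog / 1024:.1f} GB")
# The backlog grows without bound (~105 GB per hour here), so the sites
# can never converge and the VM cannot be safely moved.
```

If the churn rate exceeds the link rate, the pending-write backlog grows indefinitely and the storage never reaches the fully synchronized state that the migration requires.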
Where the data is being mirrored is also a consideration. When data is written to two locations from the hypervisor, that can significantly complicate your storage and network configurations. If data is being mirrored at the storage system, you need to carefully consider:
- How does that impact array performance? Or does it?
- If there is an appliance that is mirroring the data, what protocols does it support?
- What about the scalability of this solution?
In any of these situations, you also have to consider the conditions below:
- How does the mirroring solution handle a communication failure? Or is it going to handle it at all?
- If it is synchronous, does the write fail, or does the local system cache the information to write it at a later point in time?
- If the local system does cache writes, how much can be cached? Is there a potential budget constraint for storing the cached data?
- Does asynchronous mode do anything to accelerate synchronization after a failure?
- Does the cluster enter a split brain scenario if the two sites are disconnected for a period of time? (Read Duncan’s article)
- Consider what happens if a hypervisor fails and VMs need to be restarted? (Read Duncan’s article)
- Do you restart them locally or on a remote node? If you are using vSphere 5.x HA, do you let the hypervisor decide this?
- Can you set an affinity for one side or another?
- Can you keep related services together, so that the app is not at one site and the database at the other? Let DRS affinity/anti-affinity rules play this role. (Read vSphere Clustering Deepdive)
- What if there is not sufficient capacity to keep both at one site?