Lessons Learned about Multisite Clustering using HP StorageWorks P4000 SAN Solutions

I work with HP StorageWorks P4000 SAN Solutions every day in many different customer scenarios. This has given me plenty of opportunity to explore the product line in great detail, so in this post I am going to share some detailed information about the P4000 Multi-Site cluster. Here you go.

 

First, here is how I set up my multi-site cluster. To get proper replication across the sites, I split the odd- and even-numbered hosts into separate groups: hosts 1 and 3 went to site A, and hosts 2 and 4 went to site B. Look at the diagram below.

The order in which blocks are written to the LUN is determined by the order in which the nodes are added to the cluster. With 2-way replication, each block is written to two consecutive nodes. The placement order of the nodes is therefore extremely important when designing a cluster if the SAN will be placed in two separate racks or, even better, span two locations.

Because 2-way replication writes blocks to two consecutive nodes, adding the storage nodes to the cluster in alternating order ensures that a copy of the data is written to each rack or site. If nodes were added in the wrong order, or if a node has been replaced, the general settings tab of the cluster properties lets you “promote” or “demote” a storage node in the logical order. This list determines the “write” order of the nodes.
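To make the ordering rule a bit more concrete, here is a small Python sketch. It is only a simplified model of "each block goes to two consecutive nodes in the cluster list", not how SAN/iQ actually lays out data, and the node names and site labels are hypothetical. It checks whether every block ends up with a copy in both sites for a given node order.

```python
from typing import Dict, List, Tuple

# Hypothetical node names and site labels, for illustration only.
SITE_OF: Dict[str, str] = {
    "Node1": "Production", "Node3": "Production",
    "Node2": "DR", "Node4": "DR",
}


def replica_nodes(block: int, order: List[str], copies: int = 2) -> Tuple[str, ...]:
    """Return the consecutive nodes (wrapping around the list) holding one block."""
    start = block % len(order)
    return tuple(order[(start + i) % len(order)] for i in range(copies))


def every_block_spans_sites(order: List[str]) -> bool:
    """True if the copies of every block land in more than one site."""
    return all(
        len({SITE_OF[node] for node in replica_nodes(block, order)}) > 1
        for block in range(len(order))
    )


# Alternating order: Production, DR, Production, DR.
alternating = ["Node1", "Node2", "Node3", "Node4"]
# Grouped order: both Production nodes first, then both DR nodes.
grouped = ["Node1", "Node3", "Node2", "Node4"]

print(every_block_spans_sites(alternating))  # True
print(every_block_spans_sites(grouped))      # False
```

In this toy model the grouped order fails because the first block lands on Node1 and Node3, which both sit in the Production site. That is the situation the promote/demote buttons are there to fix.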

Drawn as a diagram, it looks something like the picture below. I did not create a separate physical site for the Failover Manager (FOM); instead I am running the FOM locally on a server, which creates a logically separate site.

At this point I have created a VDI volume and presented it to my VDI cluster servers (VMware ESXi 4.1 U1). I selected NR 10 (two-way replication), so to my understanding the data will be written to two consecutive nodes in two different sites, because I ordered the NSMs that way. The volume was first picked up by Node 2 in the DR Site (that is OK for my testing). The pictorial representation is below:

When setting up a P4000 cluster, a Virtual IP (VIP) needs to be configured. The VIP is required for iSCSI load balancing and fault tolerance. One NSM node holds the VIP for the cluster; if this node fails, the VIP automatically fails over to another node in the cluster.

The VIP functions as the iSCSI portal: ESX servers use the VIP for discovery and to log in to the volumes. ESX servers can connect to volumes in two ways: using the VIP alone, or using the VIP with the load balancing option (VIPLB) enabled. When VIPLB is enabled on LUNs, the SAN/iQ software balances connections across the different nodes of the cluster.

 

Configure the ESX iSCSI initiator with the VIP as the destination address. The VIP supplies the ESX servers with a target address for each LUN, and VIPLB hands the initial communication over to the gateway connection of the LUN. Running the vmkiscsi-tool command shows the VIP as the portal and another IP address as the target address of the LUN. In the output below, the Initial Remote Address is the VIP and the Current Remote Address is the actual Gateway Connection for the LUN.
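The portal/gateway split can be pictured with a toy Python model. This is only a sketch of the idea that the initiator always logs in through the VIP and is then pointed at a gateway node. The class, the IP addresses (taken from the documentation range) and the "fewest sessions wins" rule are all my assumptions for illustration, not the real SAN/iQ balancing logic.

```python
from typing import Dict, List


class Cluster:
    """Toy model of a VIP portal that redirects iSCSI logins to a gateway node."""

    def __init__(self, vip: str, nodes: List[str]) -> None:
        self.vip = vip
        self.online: Dict[str, bool] = {node: True for node in nodes}
        self.sessions: Dict[str, int] = {node: 0 for node in nodes}

    def login(self) -> Dict[str, str]:
        """Log in through the VIP and return the addresses the initiator would see."""
        candidates = [node for node, up in self.online.items() if up]
        if not candidates:
            raise RuntimeError("no storage node is reachable")
        # Assumption: the online node with the fewest sessions becomes the gateway.
        gateway = min(candidates, key=lambda node: self.sessions[node])
        self.sessions[gateway] += 1
        return {
            "initial_remote_address": self.vip,
            "current_remote_address": gateway,
        }

    def fail(self, node: str) -> None:
        """Mark a node as down and drop its sessions."""
        self.online[node] = False
        self.sessions[node] = 0


# Hypothetical addresses: one VIP and two storage nodes.
cluster = Cluster("192.0.2.10", ["192.0.2.11", "192.0.2.12"])
first = cluster.login()
cluster.fail(first["current_remote_address"])
second = cluster.login()
print(first)
print(second)
```

The point of the sketch is that the initiator never needs to know the node addresses up front. After a gateway node fails, logging in through the same VIP again simply lands on a surviving node.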

 

To test the failover I powered off Node 2 and, voila, after 5 seconds the cluster detected the Gateway Connection failure (Node 2 -> DR Site) and switched to Production Site -> Node 1. Look at the picture below: the LUN is now hosted by Production Site -> Node 1.

Well, that is good. Now, the Availability tab for NSM Node 1 shows that if I lose this node I will also lose connectivity to this volume. This is perfectly OK: it is only two-way replication, so each data block is written to just two nodes, and the other copy is currently offline. An interesting point here is that the Availability tab updates every 5 seconds. Those 5 seconds are only the refresh interval; they are not the threshold for checking the Gateway Connection and triggering a failover. So whose job is the failover? Is it the FOM, or does the NSM node itself call this routine? The answer is that it is the coordinator and the VIP holder.

So what happens when there is no FOM in a logically separate site? In that case you should have a Virtual Manager set up at the management group level. The Virtual Manager is a special type of manager whose job is to check the integrity of the quorum data and present it at the time of failover. The Virtual Manager does not run all the time, so you need to start it manually after a site crash or node crash. Until you start the Virtual Manager you won’t have access to the volume. In a nutshell, it is better to have a FOM in a logically separate site to provide automated failover.
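The difference between the FOM and the Virtual Manager is easier to see with a bit of arithmetic. The Python sketch below assumes that quorum means a strict majority of the managers configured in the management group must be running, and that a Virtual Manager counts as configured even before it is started. The manager counts (two per site plus one tie-breaker) are hypothetical and are not taken from my lab.

```python
def has_quorum(running: int, configured: int) -> bool:
    """Assumed rule: a strict majority of configured managers must be running."""
    return 2 * running > configured


# Hypothetical layout: two managers per site plus one tie-breaker
# (a FOM or a Virtual Manager), so five configured managers in total.
CONFIGURED = 5
SURVIVING_SITE_MANAGERS = 2  # one whole site has gone down

# A FOM in a third (logical) site keeps running when a site dies.
with_fom = has_quorum(SURVIVING_SITE_MANAGERS + 1, CONFIGURED)

# A Virtual Manager is configured but does not run until someone starts it.
vm_not_started = has_quorum(SURVIVING_SITE_MANAGERS, CONFIGURED)
vm_started = has_quorum(SURVIVING_SITE_MANAGERS + 1, CONFIGURED)

print(with_fom, vm_not_started, vm_started)  # True False True
```

Under these assumptions the FOM keeps the majority alive on its own, while the Virtual Manager only restores it once an administrator has started it by hand. That is why the volume stays offline until then.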

Next I powered off the other node in the DR site, Node 4 (DR Site -> Node 4), which amounts to a total DR site disaster. As expected it did no harm; it only changed the Availability status for Node 1. My volume is now running entirely from the Production Site and has a total quorum dependency on the NSM node in the Production site. This is also expected behavior.

Then I brought NSM Node 4 back up, and the Availability status for Node 1 became stable, so the total quorum dependency was released. For the final test I powered off NSM Node 1, which was hosting the Gateway Connection role at that point. As expected, I lost the connection to my volume and lost access to the datastore from the ESX side.

Here I noted an interesting point. After losing the connection to my volume, I started NSM Node 2. Because the volume had been replicated to two different sites and two different nodes (Production Site -> Node 1 and DR Site -> Node 2), I was hoping that bringing the second site back would let it take over the LUN and make it available again. It did start that routine, but it sat there for 10-15 minutes for a 50GB LUN and I lost my patience. In the end Node 2 did not resync the volume. Is this expected behavior?

Yes, it is expected behavior. Node 2, the node that originally went offline, needs to resync from the node that holds the latest data (Node 1), and that node is down, so the volume stays unavailable.
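The sequence of my test can be replayed in a small Python model. This is just a sketch of the reasoning above, assuming a volume is only served when at least one online copy holds the latest data. The classes and the "generation" counter are hypothetical and say nothing about how SAN/iQ tracks this internally.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Replica:
    online: bool = True
    generation: int = 0  # last write generation this copy has seen


@dataclass
class Volume:
    replicas: Dict[str, Replica] = field(default_factory=dict)
    latest: int = 0  # generation of the most recent write

    def _has_current_copy(self) -> bool:
        return any(r.online and r.generation == self.latest
                   for r in self.replicas.values())

    def available(self) -> bool:
        """The volume is served only if an online copy holds the latest data."""
        return self._has_current_copy()

    def write(self) -> None:
        """Apply one write to every online copy."""
        if not self.available():
            raise RuntimeError("volume is offline")
        self.latest += 1
        for r in self.replicas.values():
            if r.online:
                r.generation = self.latest

    def power_off(self, node: str) -> None:
        self.replicas[node].online = False

    def power_on(self, node: str) -> None:
        self.replicas[node].online = True
        # A returning copy can only resync from an online, up-to-date copy.
        if self._has_current_copy():
            for r in self.replicas.values():
                if r.online:
                    r.generation = self.latest


vol = Volume({"Node1": Replica(), "Node2": Replica()})
vol.write()               # both copies are current
vol.power_off("Node2")
vol.write()               # only Node1 sees this write
vol.power_off("Node1")
vol.power_on("Node2")     # Node2 is back, but its copy is stale
print(vol.available())    # False
vol.power_on("Node1")     # the latest copy returns and Node2 can resync
print(vol.available())    # True
```

In this model Node 2 coming back on its own changes nothing, because the only up-to-date copy is still offline. The volume returns as soon as Node 1 does.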

At this point another interesting question came to mind: assume the other site is gone and unrecoverable. Can we then manually force a resync from the older copy on the remaining node? I know we would lose the most recent changes, but that is better than losing all the data. The answer is yes, we can. That is normally done by Engineering, and the procedure is undocumented and under NDA, so I can’t share it.

 

During this whole exercise a couple of questions came up, and I will try to answer them here.

1. Can I rebalance the Gateway Connection role manually, or call a routine that will do it at random for me?

Answer: No, you can’t. However, you can set up a dual-VIP (dual-subnet) multi-site configuration and set a preferred site.

2. I have virtual RAID set up in all of my VSAs (simulating RAID 5 inside the physical node), and I am using NR 10 in this 4-node multi-site cluster. When I create a volume with NR 10, where does it actually reside? Which site? Which node takes ownership of the volume?

Answer: As to how a stripe of data is written across the cluster: one stripe goes to one VSA and a copy of that stripe goes to another VSA. How that data is written to the VSA drives may be covered by an NDA and unavailable outside of Engineering. You can tell which node, and therefore which site, hosts the authoritative/latest copy by checking the gateway connection IP on the volume. This is the node to which the iSCSI session initially writes the data. A server accessing the volume from a particular site should be accessing it through the node at the same site.

3. When I look at the volume I presented to my two servers, I can see that the gateway connection is set up through the first server (P4000 node) I added to this multi-site cluster. Can I change this?

Answer: If that server/P4000 node fails, the gateway connection will move to the other node. In 8.x there is no automatic VIP balancing; rebalancing is only available through CLIQ. In 9.0 automatic VIP balancing is in place, and sessions will be rebalanced if necessary. I do not believe that in this situation it would send the iSCSI session back to the other node, as there would be no balancing need to do so.

4. Can we have multiple gateway connections for a particular volume, or can I assign a specific gateway connection server to a particular volume?

Answer: No. The gateway connection is determined when the iSCSI session logs on. Load balancing, if enabled, also takes place at that time.

 

Suggested Reading: HP StorageWorks P4000 Multi-Site HA/DR Solution Pack user guide

Lefthand SAN – Lessons learned
