Re: Restart vCenter results in DRS load balancing

 

Two weeks back I was reading an awesome article from Frank Denneman that holds some real insider information. Frankly, I had never encountered a situation where DRS did not migrate any VMs back after a host exited Maintenance Mode in a cluster.

I am not doubting any of it, but just to satisfy myself I started looking at it very closely and decided to run a quick test. Let us first look at what the real problem is, and then I will share my test scenarios and results.

The real issue is this: in a DRS cluster, every time a host was brought out of maintenance mode, DRS did not migrate VMs back to it. A restart of vCenter, however, played the role of catalyst, and DRS immediately kicked off the migration routine. So what is the root cause? Frank described it very well in his article Restart vCenter results in DRS load balancing. Let us take a sneak peek at the explanation:

 

Restarting vCenter removes the cached historical information of the vMotion impact. vMotion impact information is a part of the Cost-Benefit Risk analysis. DRS uses this Cost-Benefit Metric to determine the return on investment of a migration. By comparing the cost, benefit and risks of each migration, DRS tries to avoid migrations with insufficient improvement on the load balance of the cluster.
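
To make the idea in the quote concrete, here is a minimal sketch of a cost-benefit filter. It is only a toy model under my own assumptions: the `Migration` class, the `worth_it` helper, the `min_gain` parameter and all the numbers are hypothetical, and the simple "benefit minus cost minus risk" rule is not VMware's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    vm: str
    benefit: float  # expected improvement in load balance (made-up unit)
    cost: float     # estimated overhead of the vMotion itself
    risk: float     # penalty for an unstable workload


def worth_it(m: Migration, min_gain: float = 0.0) -> bool:
    """Accept a move only if its net gain exceeds min_gain (hypothetical rule)."""
    return m.benefit - m.cost - m.risk > min_gain


candidates = [
    Migration("vm01", benefit=0.30, cost=0.10, risk=0.05),
    Migration("vm02", benefit=0.12, cost=0.10, risk=0.05),
]

# Only moves with a positive net gain survive the filter.
accepted = [m.vm for m in candidates if worth_it(m)]
print(accepted)
```

In this toy model, a higher remembered vMotion cost pushes more candidates below the cut, which is the kind of effect the cached historical information could have.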

 

As I said earlier, I have never encountered this problem myself, so I decided to run some tests on my own cluster. Here are the scenarios and their specifications, followed by the results.

Scenario 1

  1. One single cluster; DRS enabled but no HA.
  2. Four HP ProLiant DL380 G6 servers added to this cluster.
  3. Each server equipped with 6GB of RAM and two dual-core processors.
  4. DRS cluster set up as Fully Automated and Fully Aggressive (5 stars).
  5. 30 Windows 2008 R2 VMs running, each with two vCPUs and 2GB of RAM.
  6. My cluster’s ideal standard deviation value is 0.114, whereas my actual standard deviation value is 0.967.
  7. My cluster is totally imbalanced.
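
The balance check behind items 6 and 7 can be sketched roughly as follows. This is a minimal sketch, assuming each host's load is its VM demand divided by its capacity and that the spread is a population standard deviation. The function names and the per-host load figures are hypothetical; only 0.114 and 0.967 come from my cluster.

```python
from statistics import pstdev


def host_load(vm_demands_mhz, host_capacity_mhz):
    """Normalized load of one host: total VM demand over host capacity."""
    return sum(vm_demands_mhz) / host_capacity_mhz


def is_imbalanced(current_std, target_std):
    """The cluster counts as imbalanced when the current deviation exceeds the target."""
    return current_std > target_std


# Hypothetical normalized loads: three busy hosts and one nearly idle host,
# similar to a host that has just left maintenance mode.
hypothetical_loads = [0.9, 0.9, 0.9, 0.1]
current = pstdev(hypothetical_loads)

print(is_imbalanced(current, 0.114))  # made-up loads against my target value
print(is_imbalanced(0.967, 0.114))    # the two values reported for my cluster
```

The further the current value sits above the target, the more reason DRS has to look for migrations that would bring it down.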

I then placed my fourth node in maintenance mode. It entered maintenance mode without trouble and all of its VMs were migrated to the other hosts. I looked at the load distribution chart for the cluster at this point and it was a mess. After some time I took the host out of maintenance mode to see what would happen next. One point worth noting: more than 200 migrations had already taken place before this. I brought the host back and, voila, within 5 seconds DRS started pushing VMs back to it. So I decided to go to the other extreme of the migration threshold. Here is Scenario 2.

 

Scenario 2

  1. One single cluster; DRS enabled but no HA.
  2. Four HP ProLiant DL380 G6 servers added to this cluster.
  3. Each server equipped with 6GB of RAM and two dual-core processors.
  4. DRS cluster set up as Fully Automated but in fully Conservative mode (just 1 star).
  5. 30 Windows 2008 R2 VMs running, each with two vCPUs and 2GB of RAM.
  6. My cluster’s ideal standard deviation value is 0.114, whereas my actual standard deviation value is 0.967.
  7. My cluster is totally imbalanced.

This time, with the cluster in conservative mode, I tweaked my VMs a bit. I started the cpubusy script on all of them, and in no time every VM was putting 100% load on the host CPU. Immediately after that I placed the same node into maintenance mode again. I suspected that time might play a crucial role in how long the cached information is held inside vCenter Server, so I let that situation run for more than 48 hours, during which several vMotions happened inside the cluster. After that period I took the host out of maintenance mode to see what would happen next. I saw the same result that Frank saw and described in his article (linked earlier): my VMs did not come back to the host. So I tried a trick and changed the cluster configuration to 3 stars. Voila, within a minute DRS started pushing VMs back to this node.
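
One possible reading of this result, sketched under assumptions: as I understand the vSphere documentation, every DRS recommendation carries a priority from 1 to 5, and the migration threshold decides which priorities get applied, with the most conservative setting applying only priority 1 (mandatory moves such as evacuating a host entering maintenance mode). The recommendation list and the helper name below are hypothetical, and this is not DRS code.

```python
def applied_recommendations(recommendations, threshold):
    """Return the VMs whose recommendation priority falls within the threshold (1..5)."""
    if not 1 <= threshold <= 5:
        raise ValueError("threshold must be between 1 and 5")
    return [r["vm"] for r in recommendations if r["priority"] <= threshold]


# Hypothetical recommendations produced by one DRS pass.
recommendations = [
    {"vm": "vm07", "priority": 1},  # mandatory move
    {"vm": "vm12", "priority": 2},
    {"vm": "vm21", "priority": 3},
    {"vm": "vm30", "priority": 5},  # marginal load-balancing gain
]

conservative = applied_recommendations(recommendations, 1)  # 1 star
moderate = applied_recommendations(recommendations, 3)      # 3 stars
aggressive = applied_recommendations(recommendations, 5)    # 5 stars

print(conservative, moderate, aggressive)
```

If this reading is right, moving from 1 star to 3 stars simply lets lower-priority load-balancing moves through, which would match what I saw within a minute of changing the setting.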

 

This made me even more curious about what to look at next. I am left with a few open questions that I could not answer on my own. They are listed below:

  1. Looking at the results of both scenarios, I am forced to think that the Cost-Benefit Risk analysis is driven, or overridden, by the DRS cluster settings.
  2. Is the Cost-Benefit Risk analysis data time-dependent?
  3. By comparing the cost, benefit and risk of each migration, DRS tries to avoid migrations with insufficient improvement on the load balance of the cluster. But what happens if you put DRS in Fully Aggressive mode and keep putting more and more load on the cluster until it is in a totally unstable state?
  4. How far back does this cached historical vMotion migration data reach?

 

I am still looking for this information, but it is difficult to get clarification as someone outside VMware.

 

 

About Prasenjit Sarkar

Prasenjit Sarkar is a Product Manager at Oracle for their Public Cloud with primary focus on Cloud Strategy, Oracle Openstack, PaaS, Cloud Native Applications and API Platform. His primary focus is driving Oracle’s Cloud Computing business with commercial and public sector customers; helping to shape and deliver on a strategy to build broad use of Oracle’s Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings such as Compute, Storage, Java as a Service, and Database as a Service. He is also responsible for developing public/private cloud integration strategies, customer’s Cloud Computing architecture vision, future state architectures, and implementable architecture roadmaps in the context of the public, private, and hybrid cloud computing solutions Oracle can offer.