As we inch towards a hybrid cloud model, or even a fully public cloud model, we are approaching the point where a solid BCDR (business continuity and disaster recovery) plan must be in place. However, most backup solutions built today do not address the main pain point of a Service Provider: the SLA. Before I move forward, let me go over the basics of the SLA model.
Recovery point objective
An RPO is the targeted maximum interval that can be tolerated between successive mirrors of your data, and therefore the maximum amount of data you can afford to lose. Some enterprises may require an RPO of zero, meaning they cannot lose any data. This requires continuous, synchronous replication. Other enterprises may be able to tolerate data gaps of seconds, minutes, hours or even days if they have to revert to a secondary site.
Recovery time objective
An RTO is the target interval between when an application outage occurs and when the application must be back up and running. This includes the time it takes to detect the failure, prepare the backup site, initialize the failed application and perform any network configuration to reroute requests to the backup site. The lower the RTO, the shorter the time between disaster and recovery.
If you look at most backup vendors, you will see that during backup all VMs get an equal share of physical resources (CPU, memory and network). So there is no way a Service Provider can give more resources to a customer’s critical VM and get its backup done quickly in line with the SLA. However, in many customer environments, especially Service Provider environments, the SLA is the foremost concern. So what you see today is as follows.
What you see today
But what you would like to see is as follows:
What we like to see
In an SP environment, customers want to preserve their data. But not all of their VMs are precious enough to need a strict backup SLA. That means some VMs can wait to be backed up. However, customers want their VIP VMs to be backed up faster than their non-VIP VMs. The problem is that there is no way an SP can segregate different classes of VMs and prioritize the VIP VMs over the normal ones.
Now I am presenting a solution to this issue: one that throttles the backup threads based on SLA level.
A Solution from 20K Feet: Class-based Backup Profiles for Thread Throttling
Using a Backup Service Class we can define the priority of some VMs. This priority will be driven by SLA. This solution will provide a VM Backup Class Profile to the customer through VMware vCenter Server.
These profiles will hold the class information. Customers just need to attach a VM to a Backup Class Profile. The Backup Controller will read the class and throttle that VM’s backup thread accordingly, so that a VIP VM with a Gold Class profile gets more CPU, memory and network resources and its backup completes quickly.
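To make the idea concrete, here is a minimal sketch of how a controller might split a fixed pool of backup threads by class. The weights (4/2/1), the `VM` type and the `allocate_threads` function are all hypothetical names chosen for illustration; they are not part of any vendor product or vCenter API.

```python
from dataclasses import dataclass

# Assumed weights: the class names come from the post, the ratios do not.
CLASS_WEIGHTS = {"GOLD": 4, "SILVER": 2, "BRONZE": 1}


@dataclass
class VM:
    name: str
    backup_class: str  # "GOLD", "SILVER" or "BRONZE"


def allocate_threads(vms, total_threads):
    """Split total_threads across VMs in proportion to their class weight."""
    if not vms:
        return {}
    total_weight = sum(CLASS_WEIGHTS[vm.backup_class] for vm in vms)
    # Integer share per VM, rounded down.
    shares = {
        vm.name: total_threads * CLASS_WEIGHTS[vm.backup_class] // total_weight
        for vm in vms
    }
    # Hand threads lost to rounding to the highest classes first.
    leftover = total_threads - sum(shares.values())
    by_weight = sorted(vms, key=lambda v: CLASS_WEIGHTS[v.backup_class], reverse=True)
    for vm in by_weight:
        if leftover == 0:
            break
        shares[vm.name] += 1
        leftover -= 1
    return shares


vms = [VM("crm-db", "GOLD"), VM("web", "SILVER"), VM("test", "BRONZE")]
print(allocate_threads(vms, 14))
```

With these assumed weights the Gold VM ends up with four times as many threads as the Bronze VM. A real controller would also have to throttle CPU, memory and network, which this sketch leaves out.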
Today there is no way to solve this problem, as no SLA-driven, class-based backup profile concept is available. The most others can do is give more resources to the backup controller so that it backs up all of their VMs. However, they can’t assign more resources to a particular VM, either manually or automatically.
A typical workflow of the solution is as follows.
Future work
We would also like to shed some light on future work on this model, something a backup vendor should also do, keeping SPs in mind.
Along with a priority tag of GOLD/SILVER/BRONZE, VMs will also have a secondary tag to specify a Master-Slave relationship. This Master-Slave tag will help the backup scheduler back up the VMs marked as Master first and then the related Slave VMs.
Scenario: There are multiple VMs that need to be backed up, among which there are 3 VMs, namely DB-Master, DB-SLAVE1 and DB-SLAVE2. All 3 VMs are of GOLD class, i.e. have the highest priority; however, when backing up, the scheduler will back up DB-Master first and then the Slaves. This gives the Slave VMs time to finish any data replication that is still in progress, so that the Master and Slave backups capture the same state.
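The ordering in this scenario can be sketched as a sort on three keys: class first, then replication group, then role. The `TaggedVM` type, the `group` field and the `backup_order` function are hypothetical, introduced only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional

CLASS_RANK = {"GOLD": 0, "SILVER": 1, "BRONZE": 2}


@dataclass
class TaggedVM:
    name: str
    backup_class: str
    role: Optional[str] = None   # "MASTER", "SLAVE", or None if untagged
    group: Optional[str] = None  # assumed tag linking a Master to its Slaves


def backup_order(vms):
    """Return VM names in backup order: class, then group, Master before Slaves."""
    def key(vm):
        role_rank = 1 if vm.role == "SLAVE" else 0
        # Untagged VMs form a group of their own, keyed by name.
        return (CLASS_RANK[vm.backup_class], vm.group or vm.name, role_rank)
    return [vm.name for vm in sorted(vms, key=key)]


queue = [
    TaggedVM("DB-SLAVE1", "GOLD", "SLAVE", "db"),
    TaggedVM("web", "BRONZE"),
    TaggedVM("DB-Master", "GOLD", "MASTER", "db"),
    TaggedVM("DB-SLAVE2", "GOLD", "SLAVE", "db"),
]
print(backup_order(queue))
```

Because Python's sort is stable, Slaves in the same group keep their arrival order. The sketch only orders the queue; it does not model waiting for replication to finish.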
Smaller Payload gets processed first
This covers the case where two VMs have the same backup priority but different expected times to completion. In such a case the scheduler will give more resources to the VM whose expected time to completion is shorter, so as to reduce the backup queue.
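One simple way to picture this is a priority queue keyed on priority first and expected completion time second. This is a sketch under assumptions: the MEDIUM and LOW levels, the minute estimates and the `drain_order` function are made up for illustration, and the sketch orders jobs rather than shifting resources between running ones.

```python
import heapq

# Assumed priority levels; the post only mentions HIGH explicitly.
PRIORITY_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def drain_order(jobs):
    """jobs: iterable of (vm_name, priority, expected_minutes).

    Returns VM names in the order the scheduler would serve them:
    higher priority first, and within a priority the shorter job first.
    """
    heap = [(PRIORITY_RANK[prio], minutes, name) for name, prio, minutes in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


jobs = [("VM-A", "HIGH", 90), ("VM-B", "HIGH", 20), ("VM-C", "LOW", 5)]
print(drain_order(jobs))
```

Note that VM-C, although it is the smallest job, still waits behind both HIGH jobs: the payload size only breaks ties within the same priority.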
Scenario: VM-A and VM-B have HIGH backup priority, and their backup jobs started at 11:00 AM. At 11:30 AM a new VM-C (also HIGH priority) joins the queue. In such a case, before adding VM-C to the run queue, the scheduler will check whether VM-A and VM-B are nearing completion, so as to make sure that the SLAs for the VM-A and VM-B backup jobs remain intact.
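The admission check in this scenario might look like the sketch below. It rests on a strong simplifying assumption that is mine, not the post's: resources are shared equally, so admitting one more job to n running jobs stretches each job's remaining time by a factor of (n + 1) / n. The `RunningJob` type and `can_admit` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    name: str
    remaining_minutes: float    # estimated time left at the current speed
    minutes_to_deadline: float  # time left before the SLA deadline


def can_admit(running):
    """Return True if one more job can join without breaking any running SLA."""
    n = len(running)
    if n == 0:
        return True
    # Assumed equal sharing: n jobs become n + 1, so each one slows down.
    slowdown = (n + 1) / n
    return all(
        job.remaining_minutes * slowdown <= job.minutes_to_deadline
        for job in running
    )


running = [RunningJob("VM-A", 20, 40), RunningJob("VM-B", 30, 50)]
print(can_admit(running))
```

If the check fails, VM-C would stay in the wait queue until VM-A or VM-B finishes. A class-aware scheduler would replace the equal-sharing assumption with its own resource model.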