Thursday, December 1, 2022

What I've Learned From Using Instant Clones in vSphere

Instant clone is a technology that creates a powered-on VM from another running VM. An instant clone VM shares memory and disk state with its source VM. Once it is powered on, the instant clone is a fully manageable, independent vCenter Server object. The clones can be customized and have unique MAC addresses and UUIDs. This makes the technology very appealing for use cases where a large number of VMs need to be created in a short time from a controlled point in time - think about VDI.

My use case was on-demand labs generated from the same lab template(s). A lab template is made of 3 to 6 VMs of different sizes running interdependent applications. Users log in to a web app and then request one or more new labs from the available templates. The web app then starts lab provisioning in the background for all the requests via vCenter Server.


Using full clones would have meant a higher load on the systems and also a longer wait for a lab to be ready - the boot time of all the VMs in the cloned lab plus the time for services to start in the guest OS of each VM. Additionally, there was no information on how many labs would be requested at a time. There were also multiple source lab templates, giving a worst-case scenario of tens to hundreds of VMs being requested within a minute. I chose instant clones as the way forward.

When using instant clone there are 2 provisioning workflows: running source VM and frozen source VM, as illustrated in the Understanding Clones in vSphere 7 performance study published by VMware.

In the running source VM workflow, a temporary stun is initiated to checkpoint the VM and create the delta disks. Then the source is back to its running state. Each new instant clone depends on the shared delta disks, so the chain can potentially hit the vSphere limit of 255. These delta disks are redo logs and are not tied to the snapshot chain, hence they are not visible in the UI. The limit for a supported snapshot chain in vSphere is still 32. If the limit is hit, cloning will fail as described in KB article 67186. To avoid this limitation, you could use the frozen source VM provisioning workflow, in which the source is frozen and no longer running, and delta disks are only created for the child VMs.
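One way to keep an eye on that limit is to count how many parent backings sit underneath each virtual disk of a source VM. The sketch below is an illustration, not something taken from the performance study. `Get-BackingChainDepth` is a hypothetical helper that walks the `Parent` property of a disk backing object, on the assumption that the instant clone delta disks show up in that chain. The warning threshold of 200 and the VM name pattern in the commented example are arbitrary.

```powershell
# Hypothetical helper: counts how many parent backings sit below a disk backing.
function Get-BackingChainDepth {
    param([Parameter(Mandatory = $true)] $Backing)

    $depth = 0
    $current = $Backing
    # Follow the Parent references until the base disk (no parent) is reached
    while ($null -ne $current.Parent) {
        $depth++
        $current = $current.Parent
    }
    return $depth
}

# Example against a live vCenter (requires Connect-VIServer; name pattern is hypothetical):
# foreach ($disk in Get-VM -Name "lab-1-*" | Get-HardDisk) {
#     $depth = Get-BackingChainDepth -Backing $disk.ExtensionData.Backing
#     if ($depth -gt 200) { Write-Warning "$($disk.Filename): $depth delta disks in the chain" }
# }
```

Keeping the chain walk in a small function that only needs an object with a `Parent` property makes it easy to try out with mock objects before pointing it at real source VMs.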

Since the lab templates were actually running different services that did not cope very well with being frozen for longer periods of time, I used the running source VM workflow. To create the clones I borrowed and adapted the code from William Lam's instant clone PowerCLI module (thank you!). He also has some very good articles on the technology.
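For context, the running source VM workflow boils down to a single vSphere API call made against the running source. The following is a minimal sketch of that call from PowerCLI, not the module's actual code. It assumes an existing Connect-VIServer session; the VM names `lab-1-web` and `lab-1-web-clone01` and the `guestinfo.lab.hostname` key are hypothetical.

```powershell
# Assumes an active Connect-VIServer session; all names below are hypothetical.
$sourceVM = Get-VM -Name "lab-1-web"

# Optional key/value pair that the guest can read to customize itself after cloning
$hostnameOption = New-Object VMware.Vim.OptionValue
$hostnameOption.Key = "guestinfo.lab.hostname"
$hostnameOption.Value = "lab-1-web-clone01"

# Relocate spec left empty here, on the assumption that the clone then
# stays on the same host and datastore as the source
$spec = New-Object VMware.Vim.VirtualMachineInstantCloneSpec
$spec.Name = "lab-1-web-clone01"
$spec.Location = New-Object VMware.Vim.VirtualMachineRelocateSpec
$spec.Config = @($hostnameOption)

# Start the instant clone and wait for the vCenter task to finish
$taskRef = $sourceVM.ExtensionData.InstantClone_Task($spec)
Get-Task -Id "Task-$($taskRef.Value)" | Wait-Task | Out-Null
```

For a lab made of several VMs, the same call would be repeated per source VM, each with its own clone name and guest settings.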

What I did not realize at the time is that performance of the labs would suffer once the number of delta disks increased. The cloned labs were temporary by nature and were removed after a specific run time. However, the delta disks on the source VMs were not cleaned up and just kept increasing, which in the end impacted user experience. So I needed to introduce a cleanup mechanism.

The simplest way to clean up a source VM came from an idea that I got from Veeam Snapshot Hunter: create a snapshot of the lab template VMs (source VMs) and then immediately initiate a delete all command. This cleans up all the delta disks from the source VMs. The PowerCLI script below runs nightly as a scheduled job.

# Source (template) VMs of the lab, matched by name prefix
$labPrefix = "lab-1-*"
$vms = Get-VM -Name $labPrefix
foreach ($vm in $vms) {
    $snapTime = Get-Date -Format "MM/dd/yyyy HH:mm"
    $description = $vm.Name + " " + $snapTime
    # Take a snapshot (including memory) of the running source VM...
    $snap = New-Snapshot -VM $vm -Name "delta disk cleanup" -Description $description -Memory:$true -Confirm:$false
    # ...and remove that same snapshot right away, which cleans up the delta disks
    Remove-Snapshot -Snapshot $snap -RemoveChildren -Confirm:$false
}

The plan is to test the vim.VirtualMachine.PromoteDisks(unlink=True) method in the future.
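As a starting point for that test, the call could look roughly like the sketch below. It is untested. The VM name is hypothetical and an existing Connect-VIServer session is assumed. Whether the call should target the clones or the source VMs, and whether it actually reduces the number of delta disks, is exactly what the test would need to show. In the PowerCLI .NET bindings the method is exposed as `PromoteDisks_Task`.

```powershell
# Untested sketch; assumes an active Connect-VIServer session and a hypothetical VM name.
$vm = Get-VM -Name "lab-1-web-clone01"

# First argument: unlink = $true.
# Second argument: the disks to promote ($null is assumed to mean all disks of the VM).
$taskRef = $vm.ExtensionData.PromoteDisks_Task($true, $null)
Get-Task -Id "Task-$($taskRef.Value)" | Wait-Task | Out-Null
```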

A few takeaways:
- instant clone is a very fast cloning technology and it also optimizes resource usage (memory, disk)
- if the number of cloned VMs from the same source is very large (> 200), use the frozen source VM workflow
- when using the running source VM workflow, make sure to include a cleanup mechanism for the delta disks
- time synchronization in the source VMs is very important (as always)
- if you need full performance, use full clones
