Tuesday, November 3, 2020

About vSphere Cluster sizing, vMotions and DRS

This post is for those who are (still) running older versions of vSphere. In my case, it is vSphere 6.5.

We recently came across a very peculiar situation in a 16-host cluster. Out of the 400 VMs in the cluster, 200 were hosted on 2 hosts and the remaining 200 on the other 14 hosts. We were in the process of migrating VMs to this cluster from another one, but we were expecting DRS to distribute the VMs more evenly across the hosts. The VMs did not compete for memory or CPU; however, having 100 VMs on the same host while other hosts are running fewer than 20 can cause issues in the event of a host failure.
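To put a number on the host failure concern, here is a minimal sketch in Python. It uses the figures from this cluster (400 VMs, 16 hosts, 100 VMs on each of the two busy hosts). The function name `failure_impact` is just something I made up for illustration, and the even spread of 25 VMs per host is an idealized assumption.

```python
def failure_impact(vms_on_host, total_vms):
    """Fraction of the cluster's VMs that go down if this one host fails."""
    return vms_on_host / total_vms

total_vms = 400
hosts = 16

# Observed: 200 VMs on 2 hosts, so 100 VMs on each of the busy hosts.
skewed = failure_impact(100, total_vms)

# Idealized: the same 400 VMs spread evenly over all 16 hosts.
even = failure_impact(total_vms / hosts, total_vms)

print(f"Busy host fails:    {skewed:.2%} of all VMs need a restart")
print(f"Even spread, any host: {even:.2%} of all VMs need a restart")
```

Losing one of the busy hosts takes down a quarter of the cluster's VMs at once, four times as many as with an even spread.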

We knew that DRS is not aware of vSphere HA, but that was not the cause here, since the VMs were live migrated rather than restarted by HA. The VMs did not compete for memory or CPU because the hosts had sufficient resources and the average VM memory size in this case was small, so from a resource point of view DRS had no reason to move them.

The issue was fixed by manually redistributing the VMs across the cluster and selecting the destination hosts by hand during vMotion. One of the lessons we learnt is to design the host size to match the average VM size and to have a pretty good idea of how many VMs we want to run on a host. If 100 is not acceptable, then make the hosts smaller (or the VMs bigger :-) ).

Let's do a bit of math too:

We have 300 VMs with an average memory size of 12 GB and an average CPU size of 3 vCPUs, for a total of 3600 GB of RAM and 900 vCPUs. For a 1:3 physical core to vCPU oversubscription we need 300 cores. A 24-core CPU with HT enabled provides 48 logical processors (threads), so on a dual-socket server we get 96 threads. Counting threads as cores, we could fit the 300 VMs on 3 servers (288 threads) with a 1:3.125 oversubscription ratio. Add 1.5 TB of RAM to each ESXi host and you have your 100 VMs per host. But this is exactly the case we wanted to avoid.

The alternative is to downsize to smaller CPUs, less RAM and more physical ESXi hosts. Let's aim for 60 VMs per host. We will need 5 hosts to accommodate the load, each ESXi host with 60 cores (180 vCPUs at 1:3) and 720 GB of RAM. Between the two, I would choose the second one. I think I should right-size the host capacity to fit the workload rather than put a lot of resources into a few hosts. And don't forget about the N+1 failure tolerance of the cluster.
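The math above can be sketched as a small Python helper. This is only a rough sizing aid under the same assumptions as in the text (average VM size, a flat 1:3 core to vCPU ratio, no hypervisor overhead). The function name `size_cluster` and its parameters are my own, and the `spare_hosts=1` default is how I chose to express N+1.

```python
import math

def size_cluster(vm_count, avg_mem_gb, avg_vcpu, vms_per_host,
                 vcpu_per_core=3, spare_hosts=1):
    """Rough host sizing from average VM size and a target VM count per host."""
    hosts = math.ceil(vm_count / vms_per_host)
    return {
        "total_mem_gb": vm_count * avg_mem_gb,
        "total_vcpu": vm_count * avg_vcpu,
        "hosts": hosts,
        # Cores each host needs to carry its share at the given ratio.
        "cores_per_host": math.ceil(vms_per_host * avg_vcpu / vcpu_per_core),
        # RAM each host needs for its VMs (no overhead or headroom included).
        "mem_per_host_gb": vms_per_host * avg_mem_gb,
        # Extra host(s) so the cluster survives a host failure (N+1).
        "hosts_with_spare": hosts + spare_hosts,
    }

# The two options discussed above: 100 VMs per host vs. 60 VMs per host.
for target in (100, 60):
    print(target, size_cluster(300, 12, 3, vms_per_host=target))
```

For 100 VMs per host at a strict 1:3 ratio the helper asks for 100 cores per host, which is why the 96-thread servers above end up at 1:3.125 instead. With the spare host added, the 60 VMs per host option becomes a 6 host cluster.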