Thursday, December 10, 2020

NSX Load Balancer - Redirecting Traffic to Maintenance Page

In this post we'll look at two situations in which we need to redirect vRealize Automation traffic to a maintenance page. The type of traffic doesn't really matter as long as it goes through the load balancer, but to keep the post less abstract we'll use vRA 7.x. The use cases are: 

  • vRA services are down (for example, the IaaS Manager pool is gone) - in this case it helps to redirect traffic from the vRA login portal to a "sorry server"
  • scheduled maintenance window (for patching) - vRA is working normally, but you don't want anyone else to log in and start playing around 

For both cases we'll be using simple application rules in the NSX load balancer (assuming, of course, that the services actually are behind an NSX load balancer). In a highly available architecture, every service in vRA sits behind a load balancer. For simplicity we'll look only at the vRA appliances; the approach is easily extrapolated to the rest. 

When a user tries to connect to the vRA portal, the request goes to the virtual IP assigned to the load balancer virtual server. The virtual server (VRA Appliance Virtual Server) has a pool of servers (VRA Appliance pool) associated with it, to which it can direct the traffic. The blue path represents the normal situation, in which the user reaches the portal on the vRA appliances. The green path does not actually exist yet and is the subject of this post: when all servers in the VRA Appliance pool are down, we want to redirect the user to another page. For this we need a few additional elements.

First we need a VM that runs an HTTP server and is able to serve a simple HTML page, called in the diagram above the "Sorry Server". We installed Apache, enabled SSL and created in the document root path a structure similar to the vRA login URL (below, the document root is /var/www/html) to serve a custom index.html page.


At NSX level we add the sorry server to a new pool called "vra-maintenance-pool". We also create application rules to check the availability of the vRA appliances. Application rules are written using HAProxy syntax and are used to manipulate traffic at the load balancer side. It's a simple rule: we first check whether there are any servers up and running in the vRA appliance pool using an access control list (acl). If the pool is down, the acl becomes true and we switch to another backend pool - the maintenance one:

# detect if vra appliance is still up
acl vra-appliance-down nbsrv(vra-appliance-pool) eq 0

# use pool "vra-maintenance-pool" if app is dead
use_backend vra-maintenance-pool if vra-appliance-down

The rule is then linked to the virtual server of the vRA appliances. Whenever a request comes to the virtual server, the rule is evaluated and, if vra-appliance-pool is down, users are redirected to the maintenance page. You can extend the rules to redirect users to the maintenance pool in other situations that may render vRA useless, such as the IaaS Manager servers or other IaaS services being down. 
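As a sketch, such an extension could look like the fragment below; the IaaS Manager pool name is a hypothetical placeholder - use the name of your actual pool:

```
# hypothetical: also fail over when the IaaS Manager pool has no live servers
acl iaas-manager-down nbsrv(vra-iaas-manager-pool) eq 0
use_backend vra-maintenance-pool if vra-appliance-down or iaas-manager-down
```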

Another use for application rules is restricting access to vRA during a scheduled maintenance window. In this case the rule uses an ACL to restrict the IPs accessing the vRA virtual servers by matching the source IP of the request.

# allow only vra components and management server
acl allowed-servers src

# send everything else to maintenance page
use_backend vra-maintenance-pool if !allowed-servers

Traffic is redirected to the maintenance pool when it comes from a source different from vRA itself or the management server. Happy patching! 
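Put together, the complete maintenance-window rule might look like the sketch below; the source addresses are hypothetical placeholders for your vRA nodes and management server:

```
# allow only vra components and the management server (example addresses)
acl allowed-servers src 10.10.10.11 10.10.10.12 10.10.10.50

# send everything else to the maintenance page
use_backend vra-maintenance-pool if !allowed-servers
```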

Wednesday, December 2, 2020

Using NSX load balancer as a monitoring tool for RESTful APIs

We are going to look at a different use case for the NSX load balancer - as a monitoring tool for external APIs. 

Our core platform integrates with other systems using RESTful APIs. Even though these systems are built with high availability in mind, they are sometimes highly unavailable. They are also part of the critical path for our core platform. Not being able to reach them creates trouble in the form of incident tickets, because we fail to deliver services to our customers. So we needed a way to monitor those APIs.

We know that the systems are monitored, but we don't have access to those tools. We have our own tools, but they do not offer a simple and efficient way to check the status of a RESTful API. Ideally we don't want to introduce yet another monitoring tool. However, the core platform runs on top of NSX and already uses NSX load balancers for its internal services. So why not use the load balancers to monitor the external services? 

We created a service monitor and a pool in the load balancer for each of the external systems. This way NSX monitors the status of each system's RESTful API and generates alerts whenever it is down. The status of the pool is then checked by the core platform. All communication between the core platform and the APIs still goes directly; it does not pass through the load balancer.

The pool contains the RESTful API endpoint of the system that we use to connect directly from the core platform. 

The service monitor uses GET requests to check the availability of the RESTful API. 

Nothing fancy - basic configuration for a load balancer. Half a configuration, actually, because we stop here: no traffic goes through the load balancer to these pools. But whenever the external system is not reachable, the load balancer knows it, because the external system is now a member in one of its pools: 

The status of the member in the pool is accessible through the RESTful API of NSX Manager. 

GET /api/4.0/edges/{edge-id}/loadbalancer/statistics

<failureCause>layer 7 response error, code:400 Bad Request</failureCause>
<lastStateChangeTime>2020-12-02 18:20:43</lastStateChangeTime>
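As a sketch of how the core platform could consume this, the snippet below extracts a member's status from a response body. The XML is an abridged, hypothetical sample of what the statistics call returns; in a real script you would fetch the full body with curl from NSX Manager:

```shell
# Abridged, hypothetical sample of the statistics response for one pool member
response='<poolMember><name>ext-api-01</name><status>DOWN</status><failureCause>layer 7 response error, code:400 Bad Request</failureCause></poolMember>'

# Pull out the <status> value; DOWN means the external API is unreachable
status=$(printf '%s' "$response" | sed -n 's:.*<status>\([^<]*\)</status>.*:\1:p')
echo "external API member status: $status"
```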

This way the core platform knows the status of its external systems before doing anything. More importantly, the core platform can now act on that status. In this case it waits a specific period of time before trying to use the system again. 

It is a pretty simple solution. It is also pretty obvious that the APIs should have been monitored all along. We relied too much on the availability of those APIs and used a fire-and-forget approach. That approach was far from optimal: it impacted our KPIs and created additional operational workload. 

Tuesday, November 3, 2020

About vSphere Cluster sizing, vMotions and DRS

This post applies to those who are (still) running older versions of vSphere. In my case it is vSphere 6.5.

We recently came across a 16-host cluster in a very peculiar situation. Out of the 400 VMs in the cluster, 200 were hosted on 2 hosts and the other 200 on the remaining 14 hosts. We were in the process of migrating VMs to this cluster from another one, but we were expecting DRS to distribute the VMs more evenly across the hosts. No, the VMs did not compete for memory or CPU; however, having 100 VMs on the same host while other hosts are running fewer than 20 can cause issues in the case of a host failure event. 

We were aware that DRS does not take vSphere HA into account, but that was not the case here since the VMs were live migrated. The VMs did not compete for memory or CPU because the hosts had sufficient resources and the average VM memory size was small. 

The issue was fixed with manual redistribution across the cluster and human selection of the target hosts during vMotion. One of the lessons we learnt is to design the host size to match the average VM size and to have a pretty good idea of how many VMs we want to run on a host. If 100 is not acceptable, then make the hosts smaller (or the VMs bigger :-) )

Let's do a bit of math too:

We have 300 VMs with an average memory size of 12 GB and an average CPU size of 3 vCPUs, for a total of 3600 GB of RAM and 900 vCPUs. For a 1:3 physical-core-to-vCPU oversubscription we need 300 physical cores - a 24-core CPU with HT enabled provides 48 logical processors. On a dual-socket server we get 96 threads, so we could fit the 300 VMs on 3 servers with a 1:3.125 oversubscription ratio. Add 1.5 TB of RAM on each ESXi host and you have your 100 VMs per host. But this is exactly the case we wanted to avoid. The alternative is to downsize to smaller CPUs, less RAM and more physical ESXi hosts. Let's aim for 60 VMs per host. We then need 5 hosts to accommodate the load, with 60 cores and 720 GB of RAM per ESXi host. Between the two, I would choose the second one. I think I should right-size the host capacity to fit the workload rather than putting a lot of resources in there. And don't forget about the N+1 failure tolerance of the cluster. 
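The arithmetic above can be sketched in a few lines of shell (numbers taken straight from the paragraph):

```shell
# 300 VMs at 3 vCPUs / 12 GB RAM each, spread over 5 smaller hosts
vms=300; vcpu_per_vm=3; gb_per_vm=12; hosts=5

total_vcpu=$((vms * vcpu_per_vm))   # 900 vCPUs in total
total_ram=$((vms * gb_per_vm))      # 3600 GB RAM in total

# at 1:3 oversubscription, 180 vCPUs per host correspond to 60 cores
echo "per host: $((vms / hosts)) VMs, $((total_vcpu / hosts)) vCPUs, $((total_ram / hosts)) GB RAM"
```

Running it prints "per host: 60 VMs, 180 vCPUs, 720 GB RAM", matching the sizing above.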

Tuesday, October 27, 2020

PowerCLI - Optimizing Scripts

Not sure when and how this drive to optimize appeared in me, but I've seen it taking control in different situations and driving other people crazy. So I thought why not give it a try on the blog also. Maybe something good comes out of it. 

I will go over a couple of simple concepts that will help with the execution time of PowerCLI scripts. The first one is API calls - more specifically, the number of calls. 

Let's use the following example: 

  • we have a list of VMs and we need to get the total used space by those VMs

The use case can be approached in two ways:
  • get the size of each VM from the list and sum them up 

$totalUsedSpace = 0
foreach ($vmName in $vmList) {
	$totalUsedSpace += (Get-VM -Name $vmName).UsedSpaceGB
}
In this case we use the Get-VM cmdlet for each VM in the list. Every Get-VM call is an API call to vCenter Server, so for a list of 10 VMs we make 10 API calls; for 100 VMs, 100 calls. The immediate effect is that the script takes a long time to execute, and we increase the load on vCenter Server with each call. 

But we can also get all VMs from vCenter Server in one call and then go through them:
  • get all VMs in vCenter Server and then check each VM against the list 

$allVms = Get-VM

$totalUsedSpace = 0
foreach ($vm in $allVms) {
	foreach ($vmName in $vmList) {
		if ($vm.Name -eq $vmName) {
			$totalUsedSpace += $vm.UsedSpaceGB
		}
	}
}

The advantage of this approach is that we make only one API call. The disadvantage is that we retrieve all objects and then run nested "for" loops, and this can take a long time, especially when there are a lot of objects in vCenter Server. Let's take a look at some execution times.

The data set consists of more than 6000 VMs in vCenter Server ($allVms), and we are looking for the total size of a list of 100 VMs ($vmList).

For 100 VMs, the first script (one API call for each VM in the list) takes around 275 seconds. The second script takes 14 seconds. Even with the nested loops, it takes far less time than doing hundreds of API calls. If we increase the $vmList size to 300 VMs, the first script takes almost 3 times as long, while the second only increases by a few seconds, to 16 seconds. That increase comes from the added complexity of running the nested loops. The more we grow $vmList, the more time it takes: at 600 VMs in $vmList, it takes 21 seconds to run. At this point we'd like to see if there is a way to reduce the complexity of the script and make it faster. 

Let's see how to get rid of the nested loops. For this to work, we'll build a hash table (key-value pairs) mapping VM names to their used space. Once the hash table is created, we look up only the VMs in $vmList:

$allVms = Get-VM
$hashVmSize = @{}
foreach ($vm in $allVms) {
	$hashVmSize[$vm.Name] = $vm.UsedSpaceGB
}
$totalUsedSpace = 0
foreach ($vmName in $vmList) {
	$totalUsedSpace += $hashVmSize[$vmName]
}

With the new script, the execution time for the 600 VMs in $vmList is less than 12 seconds (almost half the time of the nested-loops script). The complexity reduction comes from the number of operations executed. If we count how many times the loop bodies run, the hash table script runs its loops almost 7,000 times, while the nested-loops script runs its inner loop more than 38 million times (yes, millions). 
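The iteration counts can be illustrated with a toy-sized sketch using a bash associative array as the hash table; the sizes below are made up and much smaller than the vCenter data set above:

```shell
# toy inventory of 1000 "VMs" and a list of 100 we care about
declare -A used_space
inventory=1000; wanted=100

# build phase: one pass over the inventory (1000 iterations)
for ((i = 1; i <= inventory; i++)); do used_space[vm$i]=1; done

# lookup phase: one hash probe per wanted VM (100 iterations)
total=0
for ((i = 1; i <= wanted; i++)); do total=$((total + used_space[vm$i])); done

# nested loops would have done inventory * wanted iterations instead
echo "hash: $((inventory + wanted)) iterations, nested loops: $((inventory * wanted))"
```

The same build-once-then-probe shape is what the PowerShell hash table version does.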

What we've seen so far:
  • calling the API is expensive - 1 call is better than 10 calls 
  • nested loops are not scalable (not even nice) 
  • hash tables can help
This doesn't mean that if you want the size of 2 VMs you should bring 6000 VMs from vCenter Server. It means that next time your script takes 20 minutes, maybe there is a faster way to do it. 

Tuesday, September 29, 2020

VMware Cloud - What's New

As part of this year's VMworld news, we'll take a quick look at what's new in VMware Cloud. In the past 5 years, VMware Cloud has expanded from its beginnings with IBM Cloud to all major cloud providers (AWS, Azure, GCP and so on). VMware also acquired new technologies to help develop its any-cloud, any-app strategy. It's only fair to say that the following picture is impressive and has to be put here to give an understanding of what VMware Cloud has become. 

This is a dear picture to me, since I had my (small) part in the IBM team that made the IBM Cloud partnership possible (yeah, a bit of bragging never hurts). However, in this post we'll look at three other topics:

  • VMware Cloud on Dell EMC
  • VMware Cloud on AWS
  • VMware Cloud DRaaS

VMware Cloud on Dell EMC

VMware Cloud on Dell EMC is a fully managed on-premises cloud solution. The client buys a rack preinstalled with servers and preconfigured with the VMware Software-Defined Datacenter (SDDC). VMware manages the whole infrastructure, while the client's only worry is to find room for the rack in the datacenter - their own or a co-located one. So far this offering is available only in the US, but it will soon come to other parts of the world, so it's time to take a closer look. 

The solution comes with regulatory compliance and certifications - ISO 27001, CSA, SOC 2, CCPA, GDPR.

For the hardware nodes, a t-shirt-size approach is available, with the latest addition being the X1d.xlarge node type, which scales to 1.5 TB of memory and almost 62 TB of NVMe storage. You can put up to 24 such nodes in a rack, which adds up to some pretty impressive totals.  

A new feature allows segmenting the 24 nodes in a rack into different workload clusters, for use cases such as special licensing requirements. It allows up to 8 clusters (of at least 3 nodes each) in a rack. To get your workloads in and out, you can now use VMware HCX-based migrations. 

This is a VMware-managed service which includes end-to-end lifecycle management of the solution, 24x7 support and a 4-hour break-fix SLA. 

VMware Cloud on AWS

Since its release in 2017, it has become one of the most talked about and actively developed integrations in the portfolio. It not only lets you run native VMware workloads on AWS bare metal across a global footprint of datacenters, but also provides direct access between VMware VMs and AWS services. 

When we talk about what's new in this area we can refer to:

  • core SDDC 
    • i3en.metal instances with vSAN compression enabled
    • HCX enhancements to avoid hairpinning / tromboning and to ease migration by logically grouping and migrating applications 
    • multi-edge SDDC for improved North-South network bandwidth 
    • multi SDDC connection with VMware Transit Connect 
    • vCenter Linking within an SDDC group which will bring the inventory of multiple vCenter Servers under the same pane of glass 
  • operations and automation 
    • vRealize Operations Cloud enhancements 
    • vRealize Log Insight Cloud enhancements
    • vRealize Network Insight Cloud for network visibility 
    • vRealize Cloud Automation enhancements for IaC, Terraform service
    • vRealize Orchestrator support for workflow automation
  • workloads 
    • Tanzu support for K8S runtime and management
  • disaster recovery 
    • improvements to the hot DRaaS solution with VMware Site Recovery Manager
    • new on-demand DRaaS with VMware Cloud Disaster Recovery 

VMware Cloud Disaster Recovery 

The newest offering in VMware's cloud portfolio is based on the recent Datrium acquisition and provides on-demand Disaster Recovery as a Service. Even if it can be seen as overlapping with the traditional Site Recovery Manager and vSphere Replication based solution, the main difference is that with the new offering you pay for what you use. Instead of having to build up your whole DR site somewhere in the cloud, on-demand DRaaS allows you to keep a minimal number of hosts by replicating to cloud-based storage. When required, VMs are powered on on-demand on VMC SDDC capacity. 

The following table summarizes the differences between on-demand DRaaS and hot DRaaS. 

If you want to optimize costs ahead of RPO, this is the way to go. 

Monday, June 15, 2020

Being a vExpert

I applied to the vExpert program for the first time 2 years ago. I was kind of pushed by one of my VMUG co-leaders, who had been accepted earlier that year. So I applied too. I still remember how it felt when I got the acceptance letter. It gave me joy and a peculiar sense of pride. All of a sudden there were 2 vExperts in Romania. I didn't realize its true potential until I had access to this beautiful community of people driven by their passion, and to the resources made available to me:
  • access to a global network of techies through dedicated Slack channels 
  • VMware licenses for my lab for 1 year 
  • dedicated webinars for vExperts
  • parties at VMworld (this will wait a bit for now) 
  • more people and knowledge through the subprogram 
  • increased visibility on social media: Twitter, LinkedIn
More goodies: I got licenses and complimentary subscriptions from partners like HyTrust, Runecast, Veeam, Pluralsight and others.

Having the tools I was able to rebuild my own lab and test new products and features. I was able to expand my knowledge with new technologies. This boosted my confidence, but it also made me think how I can give back more. 

Giving back, for me, is mostly time: the blog posts, the VMUG meetings I help organize (and sometimes even speak at) and the chats I have with my peers. But you can take other paths: be a public speaker, write a book or an article, be a customer evangelizing within your organization or at public events, be a passionate member of a partner organization. Many activities qualify - you just need to apply (here).   

Sunday, June 14, 2020

Veeam NAS Backup - File Restore Options

We are going to look at restore options available for files backed up using NAS backup job. Once you have successfully completed a file share backup, you get the following restore options:

  • restore entire file share 
  • rollback to a point in time - it is actually a fast entire file share restore 
  • files and folders - allows you to pick individual files and folders

Let's take a look at each of them.

Restore file share 

Once started, the restore wizard will ask you to select a specific restore point.

The restore location can be the original server or another server.

Next you choose how to process the restore when files already exist in the destination: keep existing files, replace older files, replace newer files, or overwrite. In the same tab you can choose to keep the security attributes and permissions of the restored files.

Press Finish on the Summary page and wait for the restore process to finish. In my case 51 files were restored out of 750 on the share:

Rollback to a point in time
In case you don't want to go through the whole process above, select the second restore option, pick the point in time, and the file share will be reverted to it:

Files and folders
This option lets you restore specific files or folders. It opens a searchable file explorer. Three different views are available in the file explorer:

  • latest - presents a list with the latest versions of the files 

  • all time - displays all versions of the file on the share 

  • selected - presents the version of a file existing in specific restore point 

Looking at the different views above, you will notice that not only the file size differs, but also the number of objects displayed in each view. This reflects the actual state of the file share when each backup was taken. 

After a file has been selected, you can restore it to the same location or copy it locally. 

In case you do not overwrite the original file, you will see both versions of the file in the share.

Tuesday, June 2, 2020

Backup vSAN 7 File Share with Veeam Backup & Replication 10

This is a blog post about two of the new features that were released this year in vSAN 7 and in Veeam Backup & Replication v10:
  • vSAN File Service
  • Veeam NAS Backup 

As you've already guessed, we're going to create a file share on vSAN, put files on it and back it up with Veeam Backup & Replication (VBR) using the native NAS support.

At a high level, the environment looks as follows:

A 2-node vSAN 7 cluster runs in a nested environment. Each node has 4 vCPUs, 32 GB of RAM, a 20 GB cache SSD and a 200 GB data SSD. A witness appliance has also been deployed. VBR v10 is configured with the minimum hardware: 2 vCPUs and 8 GB of RAM. Both the proxy and repository roles are installed on the backup server VM, along with the embedded MS SQL Express DB. Network connectivity between vSAN and VBR is 1 Gbit.

The configuration of the vSAN cluster and the installation of VBR are not in the scope of this post, as there are many resources out there. I've used William Lam's nested ESXi library for the vSAN nodes and downloaded the witness appliance directly from the VMware site.

The prerequisites for the following steps are a running vSAN cluster and a VBR installation. You will also need 2 IPs reserved for the File Server virtual appliances (one for each node) and a working DNS server with records added for the 2 IPs.

Part 1 - vSAN File Service

vSAN File Service provides file shares on top of vSAN storage. It supports the NFSv3 and NFSv4.1 protocols. It is built on the vSAN Distributed File System (vDFS) and is integrated with vSAN Storage Policy Based Management.

Enable vSAN File Services

At the cluster level, go to Configure - vSAN - Services and press Enable for File Service.

Select whether the File Server agent OVF appliance will be downloaded automatically or uploaded manually:

Provide a namespace for the file shares and type in the DNS server address and domain.

Select a portgroup, type the netmask and default gateway IP

Type in the IP addresses for each of the nodes

Review the configuration and press Finish. The installation process will start deploying the agents on each of the vSAN nodes.

Once it is finished, you will see 2 new components deployed on the vSAN cluster: the File Service agents. These are virtual appliances running Photon OS 1 and Docker, and they act as NFS file servers.

Create NFS file share

It's time to create the share and put some files on it. To create the share, go to Configure - vSAN - File Services Shares. Enter the name of the share, the storage policy, and soft and hard quota limits.

You can also assign labels to the file share. Quotas can be changed at a later time. Next, define access control: which IP addresses have access to this share, what type of access, and whether to protect the share with root squash.

Lastly, review the settings and create the share. Once created, you will see it under vSAN - File Services Shares:

Now the file share is ready to be used. Because in part 2 we will use Veeam's NAS backup, we've added a small set of files to the share using a Linux VM. First get the share path by selecting the file share and pressing Copy on the URL:

Next, SSH to your favorite Linux VM and mount the file share:

To create random files with random sizes, we've used a simple script found here, slightly adapted:

for n in {1..500}; do
    dd if=/dev/urandom of=file$( printf %03d "$n" ).bin bs=100 count=$(( RANDOM + 1024 ))
done

The workload tested was:
- 200 files ranging between a few hundred KB and a few MB
- 50 files ranging from a few MB to a few tens of MB

The script can be adapted to create larger workloads with thousands of files. Since the whole lab is nested, performance testing was not in scope.
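For reference, with the dd parameters exactly as shown above, the possible file sizes are bounded by bs=100 bytes and a count between 1024 ($RANDOM at its minimum of 0, plus 1024) and 33791 ($RANDOM at its maximum of 32767, plus 1024):

```shell
# smallest and largest possible file from: dd bs=100 count=$(( RANDOM + 1024 ))
min_bytes=$((100 * (0 + 1024)))       # -> ~100 KB
max_bytes=$((100 * (32767 + 1024)))   # -> ~3.3 MB
echo "file size range: $min_bytes to $max_bytes bytes"
```

Tweaking bs is the easy knob for producing larger files.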

Part 2 - Veeam NAS Backup 

The architecture for NAS backup requires, at a minimum, a file share (in our case NFS on vSAN), a file proxy, a cache repository, a backup repository and a Veeam backup server.

Additionally, secondary and archive repositories can be added to the infrastructure. In our lab, all components are deployed on a single VM - the backup server.

File proxy - the component that acts as the data mover, transporting data from the source (file share) to the backup repository. It is used in backup and restore activities.

Cache repository - a location to store temporary metadata. This metadata is used to reduce the load on the file share during incremental backups.

Backup repository - the main location of the backups.

Add File Share 
In the VBR console, go to Inventory - Add File Share.

Select NFS share and specify the path to the share

Select the File proxy, cache repository and backup speed

Review the summary and finish the configuration.

With the file share added, we will put it in a backup job. Right-click it and add it to a new backup job (or use an existing one).

Select the file share to backup

Select the destination repository, backup policy, and archive repository if any.

Select a secondary target if required (backup copy for short term)

Enable the job to run automatically

Review the settings and run the job.

The initial execution backs up all 250 files on the share. For the incremental run, we've removed and recreated the first 50 files on the share:

for n in {1..50}; do 
  rm -f file$( printf %03d "$n" ).bin
  dd if=/dev/urandom of=file$( printf %03d "$n" ).bin bs=100 count=$(( RANDOM + 1024 ))
done

As you can see, the cache repository is pretty effective in determining which files need to be backed up:

Thursday, May 28, 2020

Static route on dual homed vSphere Replication appliance

Recently I went through the process of upgrading and troubleshooting a vSphere Replication environment. What was particular about this environment is that the vSphere Replication appliances had 2 network interfaces.

The first interface (eth0) holds the default gateway, but it is not used for replication traffic. The second interface (eth1) is connected to the port group that also connects to the ESXi replication vmkernel port group, so replication traffic is supposed to go over eth1. The main site and DR site have networks in different subnets, but connectivity is possible over the replication network. Since hosts in the protected site (main site) need to communicate with the vSphere Replication server in the DR site, we need to force this communication to go over the replication network.

The solution is pretty simple: add a static route on each appliance to reach the opposite site over the replication network, as follows:

route add -net gw

The route is not persistent and will be lost upon reboot. To make it persistent, we need to add it to a configuration file. vSphere Replication 8.1 and 8.2 run on VMware Photon OS 2.0. Normally you add the static route to the configuration file of the network where you want to have it - in my case in /etc/systemd/network/.



However, this did not work and the route was not picked up at reboot. So I tried a different approach: to be sure the route add command runs every time the appliance restarts, I added it as a service. I first created the service configuration file, called staticroute.service (a name of my choice). The file is created in /lib/systemd/system/ and contains the following:

[Unit]
Description=Add static route for eth1

[Service]
Type=oneshot
ExecStart=/usr/sbin/route add -net gw

[Install]
WantedBy=multi-user.target


Finally, I've created a symbolic link to enable the service at boot (in the multi-user target):

cd /lib/systemd/system/multi-user.target.wants/
ln -s ../staticroute.service staticroute.service

Once you do that, you can run ls -la to display the files and you will see your staticroute.service.

This ensures the static route is created at every reboot. Make sure to add the routes in both sites. To test the communication, simply traceroute the ESXi replication IP from the opposite site.

Monday, May 11, 2020

VMs not Powering On in Nested ESXi Running on vSphere 7.0 and Options for Nested Lab

After upgrading my physical home lab to vSphere 7.0, I tried to power on the VMs in my nested environment to prepare a demo for an upcoming VMUG meeting. However, I couldn't get any VM to start in the nested ESXi 7.0 running on top of a physical ESXi 7.0. What actually happened is that the nested ESXi host crashed.

I found the following article warning that this issue affects an entire family of CPUs - Intel Skylake. My home lab runs Intel Coffee Lake CPUs on 8th-gen Intel NUCs, and it seems they are affected too. It does not affect older CPUs, as is the case with my Ivy Bridge i5. Bottom line: until a patch or fix comes into mainstream vSphere 7.0, you won't be able to power on VMs in a nested ESXi 7.0 running on top of an ESXi 7.0. The rest of the functionality is there and working.

I had to do my demo using the physical vSphere 7 and later come back to the lab to find a workaround. There are two options that actually work at the moment:

  • option 1 - physical ESXi 7.0 running nested ESXi 6.7
  • option 2 - physical ESXi 6.7 running nested ESXi 7.0
Keeping the physical ESXi on 7.0 and downgrading the nested hosts to 6.7 may seem the simpler path, unless your use case is to test the new features and products. You could test those on the physical hosts, but that means running all your tests on the base ESXi's, which could lead to a partial or full lab rebuild. That approach invalidates the idea of having a nested lab. So you are left with option 2: temporarily downgrade the physical ESXi to 6.7. My use case requires powering on nested VMs, so option 2 is my choice.

I keep the physical lab on a very simple configuration with the purpose of being able to easily rebuild (reconfigure) the hosts. Before going to downgrade, a few aspects need to be considered:
  • are any VMs upgraded to the latest virtual hardware (version 17)? Those VMs will not work on vSphere 6.7
  • clean up vCenter Server: remove the hosts from clusters and from the vCenter Server inventory. Reusing the same hardware will cause datastore conflicts if a cleanup is not done.
  • how the actual downgrade will take place (pressing Shift+R at boot will not find any older install, even if the host was upgraded from 6.7)
  • hostnames and IP addresses

Having all this in mind, I embarked on the journey of fresh ESXi 6.7 installs that will allow to run nested ESXi 7.0. 

Friday, May 8, 2020

vSphere Distributed Resource Scheduling - DRS

DRS is a core technology for resource management in a vSphere cluster. It has been around since ESX 3 and it's a battle-proven feature without which vSphere clusters would not look the same. But what does it actually do?
At a high level, it enables using the resources of the ESXi hosts in a cluster as an aggregated pool of resources. Drilling a bit into what it does, we'll see that:

  • it provides virtual machine admission control - are there enough resources in the cluster to power on a VM
  • it provides initial placement of a VM - what is the most appropriate host to power on the VM
  • it is responsible for resource pools - quantifiable and aggregated resources to be consumed by a VM or group of VMs
  • it is responsible for resource allocation to VMs or resource pools using shares, reservation and limits
  • it balances the load in the cluster 

vSphere 7 comes with an important change in the logic DRS uses. Until vSphere 7, DRS tried to balance the load looking at the cluster: if a host was overloaded at some point in time, it would try to rebalance by migrating VMs to less utilized hosts, on a 5-minute checking cycle. Starting with vSphere 7, the focus has shifted to the VM. DRS calculates a per-VM score called virtual machine happiness. Looking at each VM, and running every minute, provides a better way of load balancing and of ensuring VM placement.

Let's look at some of the DRS features as they appear in the UI. As you can see above, at the cluster level you see the score of the cluster (an average of the scores of all VMs) as well as the score buckets for VMs. All my VMs are happy in the 80-100% bucket, meaning they have all the resources they require. Going to the VM view, we'll see the individual VM scores as well as some of the monitored metrics, such as CPU ready %, swapped or ballooned memory:

DRS is enabled at cluster level. Once enabled, four tabs get activated.

Automation tab
The first choice is how much freedom you give to DRS: Automation Level.

There are 3 levels you can choose from:

  • manual - generates recommendations for initial placement and migrations, but you have to actually apply the recommendations. Hence it requires manual intervention every time. Very good when you need to do some troubleshooting. 
  • partially automated - initial placement of VMs is done by DRS, but migrations are kept at recommendation level. 
  • fully automated - DRS will take care of both initial placement and migrations
Once you have decided which automation level to use, you choose the threshold for which migrations should be made. The slider goes from conservative to aggressive. DRS looks at the imbalance in the cluster; the five levels on the slider determine how big that imbalance is allowed to be. A conservative setting will not generate migration recommendations for load balancing, while an aggressive setting will calculate a very small imbalance threshold. This translates from almost no migrations (except for specific cases like putting a host into maintenance mode) to a lot of migrations. 

Predictive DRS has been introduced with vSphere 6.5 and it utilizes metrics from vRealize Operations Manager to balance predicted cluster load and workload spikes. 

Virtual Machine Automation enables VM level override of DRS and HA settings. When enabled, you can specify at Cluster - Configure - VM overrides the VMs for which you would change the default settings like having them excluded from migration recommendations:

Additional Options tab

VM Distribution instructs DRS to try and evenly distribute the VMs on hosts. It is a soft limit that will not be enforced over migration recommendations. 

CPU Over-Commitment enforces the defined ratio of vCPU to core. When enabled, DRS will not allow VMs to power on if the ratio would be exceeded. This helps keep certain clusters in the realm of performance. The max value is 32, this being the maximum vCPU-per-core ratio for vSphere 7. 

Scalable shares is a new feature introduced in vSphere 7; you can find very good articles here and here. In a nutshell, scalable shares ensures that the shares allocated to a VM actually take the share priority (high, normal, low) into account, and avoids situations where VMs in resource pools with lower priority get more resources than VMs in resource pools with higher priority. This situation is called the resource pool priority-pie paradox.

Power Management tab

When activated, Distributed Power Management (DPM) looks at cluster utilization and consolidates VMs on fewer hosts in order to power off the freed-up hosts and save energy. For more details, you may look at this article. 

Advanced options tab

The tab displays advanced options that have been set for DRS through the UI or manually.

This has been a small introduction to DRS as it looks now in vSphere 7. There are many features and details that have barely been touched, or not touched at all. For a deep dive, I recommend the famous Clustering Deep Dive book, although I am waiting for an updated version.