Thursday, December 26, 2019

Check ESXi MTU settings with PowerCLI

Sometimes the simplest tasks can be time consuming in large environments. This time is about MTU settings on the ESXi host,

First let's see a bit about MTU (Maximum Transmission Unit). It is a setting that defines the largest protocol data unit that can be sent across a network (largest packet or frame). Default setting is 1500 bytes. Having a bigger MTU increases the performance for particular uses cases as large amounts of data transmission over Ethernet. So it's always been set to larger values (9000 bytes) for accessing iSCSI and NFS storage. For a vSphere environment it means it could (and should in some cases) be increased for almost all types of traffic: vMotion, vSAN, Provisioning, FT, iSCSI, NFS, VXLAN, FCoE.

Let's take the use case of accessing a NFS datastore, as seen in the picture below:

The biggest challenge with MTU is to have the environment properly configured end-to-end.This means when you want your ESXi to make use of large MTU for accessing a NFS datastore, you'll need to make sure that distributed virtual switches, physical network interfaces, vmkernel portgroups, physical switches at system level and per port level and filers are configured with the proper MTU. What happens in our example when some elements are configured with the default MTU (1500)? In the case the vmkernel portgroup is set at 1500, then no you will see no performance benefits at all. If one of the physical switches is configured with 1500 bytes, then you will get fragmentation of the packets (performance degradation).

Hoping this short theoretical intro to MTU was helpful, I will jump ahead to the topic: checking ESXi MTU with PowerCLI. We are not treating how to check physical switches and storage devices within present article.

At ESXi level we need to check 3 settings: distributed virtual switch, physical network interfaces (vmnic used for uplinks) and vmkernel portgroups. To accomplish this we make use of two different PowerCLI cmdlets: Get-EsxCli and Get-VMHostNetworkAdapter

The beauty of Get-EsxCli is that it exposes esxcli commands and you can access the host through vCenter Server connection (no root or direct login to the ESXi host is required). The not so nice part its you have to use esxcli syntax in PowerCLI as you will soon see.

Main checks

We will first look at the script checks. Please keep in mind $h is a variable initialized with Get-VMHost cmdlet.
  • distributed virtual switch - will get the switch name, configured MTU and used uplinks; the loop ensures all dvswitches are checked
(Get-EsxCli -VMHost $h -V2).network.vswitch.dvs.vmware.list.Invoke() | foreach {
    Write-Host "DVSName  $($_.Name) MTU  $($_.MTU) UPLINKS  $($_.Uplinks)"
}

  • vmnics - check configured MTU, admin and link status for each interface (there is no issue in having unused nics configured differently) 
(Get-EsxCli -VMHost $h -V2).network.nic.list.Invoke() |  foreach {
    Write-Host "NIC:"$_.Name "MTU:"$_.MTU "Admin:"$_.AdminStatus "Link:"$_.LinkStatus
}


  • vmkernel portgroups - check configure MTU and IP address
$vmks = $h | Get-VMHostNetworkAdapter | Where { $_.GetType().Name -eq "HostVMKernelVirtualNicImpl" }
foreach ($v in $vmks) {
    Write-Host "VMK $($v.Name) MTU $($v.MTU) IP $($v.IP)"
    }


The script

Putting it all together, we'll add the three checks in a foreach loop. The loop iterates through all the clusters and within each cluster through all the hosts. The script creates one log file per cluster containing all the hosts in that cluster and their details:


foreach ($cls in Get-Cluster) {
    $fileName = $cls.Name + ".log"
    Write-Host "# CLUSTER $($cls)" -ForegroundColor Yellow
    foreach ($h in $cls |   Get-VMHost) {
        Write-Host "$($h)" -ForegroundColor Yellow
        Add-Content -Path $fileName -Value "$($h)"

        (Get-EsxCli -VMHost $h -V2).network.vswitch.dvs.vmware.list.Invoke() | foreach {
            Write-Host "DVSName  $($_.Name) MTU  $($_.MTU) UPLINKS  $($_.Uplinks)"
            Add-Content -Path $fileName -Value "DVSName $($_.Name) MTU $($_.MTU) UPLINKS $($_.Uplinks)"
        }
        (Get-EsxCli -VMHost $h -V2).network.nic.list.Invoke() |  foreach {
            Write-Host "NIC:"$_.Name "MTU:"$_.MTU "Admin:"$_.AdminStatus "Link:"$_.LinkStatus
            Add-Content -Path $fileName -Value "NIC: $($_.Name) MTU: $($_.MTU) Admin: $($_.AdminStatus) Link: $($_.LinkStatus)"
        }
        $vmks = $h | Get-VMHostNetworkAdapter | Where { $_.GetType().Name -eq "HostVMKernelVirtualNicImpl" }
        foreach ($v in $vmks) {
            Write-Host "VMK $($v.Name) MTU $($v.MTU) IP $($v.IP)"
            Add-Content -Path $fileName -Value "VMK $($v.Name) MTU $($v.MTU) IP $($v.IP)"
         }
    }
}

Opening one of the log files, you see similar output to below:
esx-01.rio.lab
DVSName dvs-Data1 MTU 9000 UPLINKS vmnic3 vmnic2 vmnic1 vmnic0
NIC: vmnic0 MTU: 9000 Admin: Up Link: Up
NIC: vmnic1 MTU: 9000 Admin: Up Link: Up
NIC: vmnic2 MTU: 9000 Admin: Up Link: Up
NIC: vmnic3 MTU: 9000 Admin: Up Link: Up
VMK vmk0 MTU 9000 IP 192.168.10.11
VMK vmk1 MTU 9000 IP 192.168.20.11
VMK vmk2 MTU 9000 IP 192.168.30.11

In this case everything looks good at ESXi level. Easy part is over, so start digging in the physical switches CLI and any other equipment along the path to ensure end to end MTU consistency.

Saturday, November 30, 2019

PBKAC and Nested ESXi VSAN

This post is a short one about a PBKAC (Problem Between Keyboard and Chair) and how I discovered it while setting up a nested VSAN cluster in my home lab.

I've recently decided to install a nested VSAN environment. I've created 4 nested ESXi hosts running 6.7 U3, added disks to them (flash disks) and tried to configure VSAN. And it was not smooth at all.

First, I've noticed I couldn't add flash devices as capacity tier. Disks were seen only when I manually marked them as HDD in vCenter Server. Even after doing this, the "NEXT" button was grayed out and couldn't complete the cluster creation.

Being a stubborn engineer, I did not read the manual, but I've dropped the UI and went all the way esxcli. Created the VSAN cluster in command line with no luck this time also. The cluster ended up in a split cluster scenario while UI showed confusing health messages.

Decided there is something wrong with my 6.7 U3 image (?!?) and Sunday coming to an end, I've temporarily postponed the project. Restarted it a few days later to find out that I was using vCenter Server 6.7 build 13639324 (U2a).So yeah, vCenter Server 6.7 U2a and VSAN 6.7 U3a doesn't sound like a good match. Immediately, I've updated vCenter Server to U3a and VSAN cluster creation took 5 minutes.

It wasn't a waste of time, as I've dusted off some powershell and I've learnt a valuable lesson: always check the version of your components. But it was definitely a PBKAC.

Saturday, November 9, 2019

Surprises at VMworld 2019

It's been an interesting year where VMware announced a ton of new and exciting stuff from projects Pacific and Tanzu to integrating Carbon Black to almost all of the products. Hence this VMworld has been one of the best so far: kubernetes, containers, integrated security, even more networking, infrastructure as a code, terraform, ansible, machine learning... almost forgot, vSphere on ARM. A lot of words and talks not in the traditional virtualization admin's vocabulary. Things are changing and doing it fast. Looking at how VMware redefined itself, caught up with missed trains and how it will lead some of those trains, I think we are going through some of most radical shifts in IT and I personally feel like I've missed at least one train.



I am an old style IT guy: started installing servers in racks and learning by heart RJ45 cross and direct cabling schemes. When cloud came along, it was not very difficult to apply what I had learnt in datacenters. But I stayed somehow focused on the private and sometimes hybrid cloud. Never went fully cloud native, never gotten knee deep into deploying and operating applications in the public cloud. Naturally, DevOps went from hype to mainstream. And as time passed by, machine learning came along and AI started doing crazy stuff like beating humans at League of Legends (I did play a lot of DotA back in the days and not a single game of Go :-) ) A lot happened in the last 10 years.

I've been going to VMworlds since 2013 and kind of grew up with it. Still, this year I felt like I've been taken a bit by surprise. VMworld is, as always, a place to meet old and new friends and a place to learn about new trends. Barcelona adds to that whole vibe of the event. It was a blast to see people I've worked with on different projects years and years ago or to meet my fellow VMUG leaders.

As for the technical part:
- vSphere on ARM looks promising - comparing to last year there were several devices installed with ESXi  from tablets, to network cards to high performance ARM 2U rackable servers
- machine learning and Watson applied to log monitoring and analysis
- vRA 8 redesign - container based, no Windows
- NSX-T everywhere - in your DC, in the cloud and  with a single pane of glass
- can't wait for the next vSphere release

Solution exchange walk was also interesting:
- if AWS presence was expected, having Azure and Oracle (although the latter was a bit hidden) was cool
- most of the traditional networking/security vendors were not present anymore
- LG and Samsung had pretty big stands
- a lot of backup vendors old and new
- ComputerWeekly was also there with a stand
- Penetration Testing as a Service - a platform from where you could hire pen testers to do some... pen testing
- VMware's booth is getting bigger and bigger

My take away from last week is I have to do a lot of reading and learning, way more than the usual, but this is what makes IT interesting.

Sunday, September 29, 2019

Troubleshoot WordPress Connectivity Error to Database

I am an irregular user of Linux. For some demo scenario, I needed to install WordPress. All good and pretty straight forward. I've deployed two CentOS VMs, installed MySQL on one (DB layer) and Apache and PHP on the the other (front end). I've configured MySQL, created DB, added users. I've also tested connectivity from front end to the DB layer. I've also installed WordPress, but when connecting to wp-admin site, I got a connection error message to the DB.


Weird, since I could actually connect to the DB using mysql client and the same credentials.
So it is not a connection error or a problem with the DB.

Since mysql client from front end works, I've tested php connectivity. For this I've put together the following small script in a file called conn.php and put the file in /var/www/html/wordpress/

<?php
try {
    $dbh = new PDO('mysql:host=IP_ADDR;port:3306;dbname=WP_DB_NAME','USER', 'PASS');
    print "Connected succefully";
}
catch (PDOException $e) {
    print "Error!: " . $e->getMessage() . "<br/>";
    die();
}
?>

Next I opened the browser and executed the php script which resulted in an expected connection error:

At this point I just run the same script, but this time interactively from php command line:

And I got a less expected connection successful message. By default SELinux is enabled and it will not allow httpd to talk to the mysql service. A quick test is to temporarily disable SELinux:
setenforce 0

and running again the script will connect successfully: 


So the main culprit is SELinux. Going forward there are 2 options: disable SELinux permanently (which is acceptable in demo labs) or enable Apache to connect to other service by changing the SELinux policy (which will be another post)





Thursday, September 19, 2019

My First AWS Exam

Recently I took AWS Solution Architect Associate exam. It was one of my oldest dreams and somehow a personal challenge. I have been involved at some point in my career with integrating vRealize Automation and AWS. So I got a bit of experience working with EC2 instances, VPCs, ELBs. But then I changed project and position and I did not touch AWS for a while. Recently I've restarted to work with AWS but this time more with S3, IAM and Lambda functions. So I've decided to give it a try and get certified.

I've prepared for the exam taking acloud.guru training course. The course is well structured and educative. I have enjoyed the hands on labs (especially the Alexa skill building one). Practice tests will come in handy, but the course feels a bit insufficient on these. So if you choose to go through more practice tests, whizlabs can be one option. It's not a brain dump and the questions will help you get more knowledge about AWS services as well as get used to the scenario based type of questions.  Another step is read the FAQ for the services in the exam blueprint. And lastly you need to have hands-on experience - so get your hands dirty.

The exam was tougher than I expected. At some point I was thinking that I will not pass (yes, I did pass). Previous IT experience helps in choosing the right answer when you actually don't know it. You also need to read very carefully the question, which being a scenario based question can be harder to understand for English speakers like me. That's why I prefer to read it out loud (it helps being alone in the exam room).

All in all, it was a good experience and I am interested in trying my forces with another AWS exam.


Tuesday, August 27, 2019

Create vCenter Server Roles Using PowerCLI - Applied to Veeam Backup & Replication

Security is important and having a minimal set of permissions is a requirement, not an option. Having this in mind (and being asked a few times by customers), I put together a short script that will create the vCenter Server roles required by Veeam Backup & Replication service account and Veeam ONE service account. The two accounts have different requirements, with Veeam ONE being the most restrictive as it needs mostly read only.

The script itself is pretty straight forward, the more time consuming is getting the privilege lists. So here you are:


And now for the actual scripting part:


$role = "Veeam Backup Server role"
$rolePrivilegesFile = "veeam_vc_privileges.txt"
$vCenterServer = "your-vcenter-server-FQDN"
Connect-VIServer -server $vCenterServer
$roleIds = @()
Get-Content $rolePrivilegesFile | Foreach-Object{
    $roleIds += $_
}
New-VIRole -name $role -Privilege (Get-VIPrivilege -Server $vCenterServer -id $roleIds) -Server $vCenterServer

The script will create a new vCenter Server role assigning it privileges from the file given as input.

If you ever require to get the privileges from vCenter Server then the next piece of code will help (thanks to VMware communities)

$role = "VBR Role"
Get-VIPrivilege -Role $role | Select @{N="Privilege Name";E={$_.Name}},@{N="Privilege ID";E={$_.ID}}

You will use the privilege ID format for creating the new role.

Saturday, July 27, 2019

Veeam Replication - Automatically Fix Detected Invalid Snapshot Configuration

This one comes directly from the field. It is about  that moment when something goes wrong at a client's site and you need to fix. In this case, replicas went wrong. The link between the two sites not being reliable at all and some other networking issues at the DR site (yes, it's always network fault) determined replica VMs to end up in a corrupted state. If you want to keep the replica VMs that are already in the DR site then the fix is pretty simple: remove snapshots and remap vmdk's still pointing to delta files. Manually doing this is cumbersome, especially since some VMs can have multiple disks and when we are talking about tens or hundreds of  VMs. Luckily, some things can be scripted.

We will present some of the modules in the script, since the whole script is published here on GitHub. For ease of readability we will remove some of the try-catch blocks and logging messages from the original script and comment here only the logical part.

Script is given as input parameters vCenter Server where VM replicas are, backup server hostname, replica job status, replica job fail message and replica suffix. The replica suffix is important since it uses it to find VMs

$vbrServer = "vbr1"
$vcServer = "vc1"
$status = "Failed"
$reason = "Detected an invalid snapshot configuration."
$replicaSuffix = "_replica"

The script needs to be executed from a place where both vCenter Server and Veeam backup server are reachable and where PowerCLI module is imported as well as Veeam PowerShell snapin.

Add-PSSnapIn -Name VeeamPSSnapin
Connect-VBRServer -Server $vbrServer
Connect-VIServer -Server $vcServer

Next, the script will get all VMs for which the replication job has failed ($status) with the given reason ($status). For this, we use Get-VBRJob cmdlet, FindLastSession() and GetTaskSessions() methods. Once the VM in a replica job matches the chosen criteria, it i added to an array ($vmList)

$vmList = @()
# get failed replica VM names 
$jobs = Get-VBRJob  | Where {$_.JobType -eq "Replica"}
foreach($job in $jobs)
{
 $session = $job.FindLastSession()
 if(!$session){continue;}
 $tasks = $session.GetTaskSessions() 
 $tasks | foreach { 
        if (($_.Status -eq $status) -and ($_.Info.Reason -match $reason)) {
            $vmList += $_.Name
            }
        }
}

Once we have the list of VMs who's replica failed, it's time to get dirty. The VM name we are looking for is made of the original VM name (from the array) and the replica suffix.

$replicaName = $_ + $replicaSuffix

First, delete the snapshots. Fot this we use PowerCLI cmdlets: Get-VM, Get-Snapshots, Remove-Snapshot.


$replica = Get-VM -Name $replicaName -ea Stop
$replica | Get-Snapshot -ea Stop | Sort-Object -Property Created | Select -First 1 | Remove-Snapshot -RemoveChildren -Confirm:$false -ea Stop

Next, remap the disks if they are still pointing to the delta file. In order to do that, we get all the disks for the replica VM (Get-HardDisk) and we check if the disk name of the replica VM contains the specific delta file characters ("-0000"). This is how we determine if it's a delta disk or a normal disk. Delta disk name is parsed to generate the source disk name (removing the delta characters from vmdk name). Once this is done, it's just a matter of reattaching the source disk to the VM (Remove-HardDisk, New-HardDisk)

# get disks for a replica VM
$disk = $replica |  Get-HardDisk -ea Stop
# process each disk 
$disk | foreach {
    $diskPath = $_.Filename
    # check if disk is delta file
    if ($diskPath -Match "-0000") {
        $diskPath = $_.Filename
        # for each delta file parse the original vmdk name
        $sourceDisk = $diskPath.substring(0,$diskPath.length-12) + ".vmdk"
        # get the datastore where the delta is 
        $datastore = Get-Datastore -Id $_.ExtensionData.Backing.Datastore
        # check the original vmdk still exists on that datastore
        if (Get-HardDisk -Datastore $datastore.Name -DatastorePath $sourceDisk) {
            # remove delta
            Remove-HardDisk -HardDisk $_ -Confirm:$false -ea stop
            # attach original disk
            $newDisk = New-HardDisk -VM $replica -DiskPath $sourceDisk -ea Stop
        } else {
            Write-Host "WARN Could not find $($sourceDisk) on $($datastore.Name) "
            Add-Content -Path  $logFile -Value "WARN: Could not find $($sourceDisk) on $($datastore.Name) "
        }
    }
}

Last thing to do is to consolidate disks. We run Get-VM and check if consolidation is needed. If it's needed, we just run ConsolidateVMDisks_Task() method.


$vm = Get-VM -Name $replicaName -ea Stop
if ($vm.Extensiondata.Runtime.ConsolidationNeeded) {
    $vm.ExtensionData.ConsolidateVMDisks_Task()
}

Now the replica VMs are re-usable. There is some manual job to be done, though. You need to map the replica VM in the replication job and run the job.