Monday, July 1, 2013

TCP ports "greater than 65535"

Let me tell a nice story. Yes, there are no TCP ports above 65535 - but earlier today, after receiving the credentials, I connected without a second thought to A.B.C.D:73382. I was even happy it worked, because the first time I had tried port 63382 "by mistake" and had problems with the login. This time it worked. But there was a problem - the port number should have been no larger than 65535. Still, take a look at the following picture and notice the strange TCP/IP connection:

It is TCP port 73382. And the connection works (more or less optimally, but it works). What actually happens is the interesting part. The connection in the image is a port translation from 73382 to 3389, which means there is a device doing the translation - in our case a Linux box. And there is a program listening on port 73382. Well, not exactly...
The TCP header uses 16 bits to represent the source port and 16 bits to represent the destination port.

The 16 bits limit the maximum number of ports to 65536 (2^16), numbered from 0 to 65535. The number 73382 actually needs 17 bits in binary form - 10001111010100110. The only way to fit 17 bits into 16 bits is to truncate one of them. The only question is which bit gets dropped - the leftmost or the rightmost. To find the answer I used good old Wireshark:

The communication takes place on port 7846 - which on 16 bits looks like this: 0001111010100110. So the leftmost bit is truncated. What happens is that a small piece of code (Perl or Python) will happily open a socket using a number bigger than 65535 - it just fits the number into 16 bits. Remember, on a Linux system "everything is a file; if it is not a file, it is a process".
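The truncation is easy to reproduce with a few lines of Python - a sketch, not the actual client code: keeping only the low 16 bits of the number is exactly the same as dropping that leftmost 17th bit.

```python
# A TCP port is stored in a 16-bit field, so any larger number is
# effectively reduced modulo 2^16 (masked with 0xFFFF).
def effective_port(port):
    return port & 0xFFFF  # keep only the low 16 bits

print(format(73382, 'b'))                     # 10001111010100110 (17 bits)
print(effective_port(73382))                  # 7846
print(format(effective_port(73382), '016b'))  # 0001111010100110
```

Interestingly, the port I mistyped at first, 63382, fits in 16 bits, so it would have reached the server unchanged.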
There are client programs that use a simple input check and do not allow such strange numbers; RDC, however, is not one of them. And after all, it was actually a nice debugging session.

Thursday, June 20, 2013

Command line upgrade of ESXi 5.1 to 5.1 U1

The reason I've chosen this method is that I don't have VUM installed, and I also wanted to refresh my esxcli skills. The process is pretty simple:
  • download the zip bundle and put it on a datastore accessible by all hosts (I am using an NFS share)
  • check that the host has the datastore mounted; if not, add the datastore
~ # esxcli storage nfs list

~ # esxcli storage nfs add -H -s /mnt/vol1-nfs -v shared1-nfs-sata
  • check the host version
~ # esxcli system version get
   Product: VMware ESXi
   Version: 5.1.0
   Build: Releasebuild-799733
   Update: 0
  • check the update file to see whether it requires the host to be in maintenance mode (in this case it does not - however, since the update will request a system reboot, I think it is better to place the host in maintenance mode and have it evacuated of VMs)
~ # esxcli software sources vib get --depot=/vmfs/volumes/shared1-nfs-sata/
   Maintenance Mode Required: False
  • install the update
~ # esxcli software vib update --depot=/vmfs/volumes/shared1-nfs-sata/

Installation Result
   Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
   Reboot Required: true
  • reboot the system (after all VMs have been migrated to other hosts) and check the new host version
~ # reboot
~ # esxcli system version get
   Product: VMware ESXi
   Version: 5.1.0
   Build: Releasebuild-1065491
   Update: 1

Saturday, June 15, 2013

vCloud Director 5.1 upgrade and missing sysadmin permissions

I've recently managed to crash my test lab. One of the rebuild tasks was to reinstall vCloud Director the hard way: dropping the DB and recreating it. Next I ran the vCloud Director installation script (5.1.1), recreated the DB schema and so on. After a week I decided to upgrade the vCD installation from 5.1.1 to 5.1.2.
After downloading the bin file (vmware-vcloud-director-5.1.2-1068441.bin) from the VMware site and uploading it to my vCD cell, I executed the bin, and a long list of "Yes" answers ended with an error:

Unable to update database statistics. Database user has insufficient privileges to update database statistics. To complete this step manually, run 'EXEC sp_updatestats' as the DBO or a member of the sysadmin fixed server role.

Next, I looked at the DB server (MSSQL) - logging in with the vCD DB admin user (vcddbadmin) and checking whether the rights were OK. They were, since the user was db_owner. I ran 'EXEC sp_updatestats' in the query editor and ended up with another error:

Msg 15247, Level 16, State 1, Procedure sp_updatestats, Line 15
User does not have permission to perform this action.

It was stored procedure time: vcddb01 -> Programmability -> Stored Procedures -> System Stored Procedures and, after a few pages of scrolling down: sys.sp_updatestats. I opened the stored procedure (context menu - Modify) and on line 15 there was a check to see whether the current user was sysadmin. Which, obviously, it was not.

Below is a small SQL script that displays whether the current user is sysadmin:
if is_srvrolemember('sysadmin') = 1
    print 'It is sysadmin'
else if is_srvrolemember('sysadmin') = 0
    print 'It is NOT sysadmin'

To solve the problem, I added the sysadmin role to the vCloud Director DB admin user (vcddbadmin) and ran the stored procedure again:

This time it worked and updated the database statistics. I checked the installation procedure for the vCloud DB and took a look at KB 2034540 to see if I had forgotten anything while creating the user for the vCD DB. It seems I did not - the only reference is to the db_owner role. Since I am not a DB guy, I'll just leave it as is, happy to have solved my problem.

Wednesday, June 12, 2013

Command line installation of vSphere Auto Deploy

After a long absence, partially caused by changing jobs while organizing a VMUG meeting, I've found a bit of time to start working on the blog again.
In this post I present the command line installation of Auto Deploy. It is assumed that a vCenter Server already exists (in this case called vcsrv01.domain.test). The installation file is also needed - it can be found on the vCenter ISO at vctools\VMware-autodeploy.exe. Before running any command, copy the exe file to the server where it will be installed, or to a centralized repository (file share). Open a command prompt and run the command line installation:

start /wait D:\VC\vctools\VMware-autodeploy.exe /L1033 /v" /qn VM_PATH_DATA=D:\DATA VM_DATA_SIZE=4 VM_PORT=6501 VM_PORTADD=6502 VC_SERVER_ADDRESS=vcsrv01.domain.test VC_SERVER_PORT=80  VC_SERVER_USER=VC-Admin VC_SERVER_PASSWORD=AdminPass" 

Two types of arguments are passed on the command line:
  • first, arguments for the Windows installer:
    • /L1033 - English language
    • /qn - no user interface
    • /v" " - passes arguments to the MSI
  • second, software-specific arguments:
    • VM_PATH_DATA - root destination data directory for the installation 
    • VM_DATA_SIZE - space for VM_PATH_DATA in GB
    • VM_PORT - Auto Deploy service TCP port number
    • VM_PORTADD - Auto Deploy management TCP port number
    • VC_SERVER_ADDRESS - IP or FQDN for vCenter Server
    • VC_SERVER_PORT - vCenter Server port number
    • VC_SERVER_USER - vCenter Server user name
    • VC_SERVER_PASSWORD - vCenter Server user password
Several VMware components can be installed or upgraded using command line parameters. For more information, see the following technical paper from VMware.

Thursday, March 28, 2013

VMware Distributed Power Management (DPM)

One of the features the Enterprise license brings to VMware vSphere is Distributed Power Management (DPM). This feature saves power by dynamically adjusting cluster capacity according to the existing workloads. DPM powers hosts in the cluster off and on based on the load average of each host. During power-off, VMs are automatically consolidated onto the remaining hosts.

For powering on ESXi hosts, DPM uses one of the following technologies: iLO, IPMI or WoL. WoL (Magic) packets are sent over the vMotion interface by another host in the cluster. DPM puts hosts into so-called "standby" by actually entering the ACPI S5 power state. In this state power consumption is minimal, no user mode or system code runs and the system's context is not preserved. Some components keep running to allow the host to be started again. This state is also called "soft off".

DPM evaluates CPU and memory utilization for each host in the cluster and tries to keep each host within a specific range. By default the utilization range is 45% to 81%. The range is computed from two settings:
DemandCapacityRatioTarget = utilization target of the host - by default 63%
DemandCapacityRatioToleranceHost = allowed variation around the utilization target - by default 18%
Utilization range = DemandCapacityRatioTarget +/- DemandCapacityRatioToleranceHost
The default settings can be changed in DRS - Advanced settings: DemandCapacityRatioTarget can be set between 40% and 90%, and DemandCapacityRatioToleranceHost between 10% and 40%.
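As a quick sanity check, the default range can be derived from the two settings with trivial arithmetic:

```python
# Default DPM settings described above
target = 63     # DemandCapacityRatioTarget, percent
tolerance = 18  # DemandCapacityRatioToleranceHost, percent

low, high = target - tolerance, target + tolerance
print(f"utilization range: {low}% - {high}%")  # utilization range: 45% - 81%
```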

DPM uses two time intervals for evaluating power-on and power-off recommendations. For power-on the period is 300 seconds (5 minutes), while for power-off the period is 2400 seconds (40 minutes). This means DPM gives more weight to responding to increased load. However, it also means that a sudden increase in load will only be considered by DPM after 5 minutes and will be resolved after the host boots up, which may add another 5 to 10 minutes. The values can be changed by setting the parameters VmDemandHistorySecsHostOn (default 300 seconds) and VmDemandHistorySecsHostOff (default 2400 seconds) to a value between 0 and 3600 seconds.

DPM will ensure that at least one host is left running. The settings MinPoweredOnCpuCapacity (default 1 MHz) and MinPoweredOnMemCapacity (default 1 MB) control how many hosts are left running. The default values ensure a minimum of one host is up, and they can be changed. For example, if the cluster has hosts configured with 24 GHz and 128 GB each, setting the parameters to 24001 MHz and 131073 MB will keep 2 hosts running at all times. Even with the default values, when HA is enabled on the cluster, DPM will leave 2 hosts powered on to provide failover resources in case one host fails.
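The arithmetic behind the 2-host example can be sketched like this (the host sizes are the hypothetical ones from the example above):

```python
import math

# Hypothetical hosts from the example: 24 GHz CPU and 128 GB RAM each
host_cpu_mhz = 24 * 1000    # 24 GHz = 24000 MHz
host_mem_mb = 128 * 1024    # 128 GB = 131072 MB

# Setting the minimums just 1 unit above a single host's capacity
# forces DPM to keep a second host powered on.
min_cpu_mhz = 24001
min_mem_mb = 131073

hosts_needed = max(math.ceil(min_cpu_mhz / host_cpu_mhz),
                   math.ceil(min_mem_mb / host_mem_mb))
print(hosts_needed)  # 2
```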

For more details about how DPM selects hosts for power-off and about its cost/benefit analysis, I strongly recommend the book VMware vSphere Clustering Deepdive and the white paper VMware Distributed Power Management: Concepts and Usage.

Configuring DPM

Before enabling DPM on a cluster it is best to test that hosts can be powered off and started again. This can be done from the vSphere client by putting a host in standby. First, check that the network adapter supports WoL:

Put the host in standby and confirm moving all powered off and suspended VMs to other hosts in the cluster:

After the host is powered off, bring it back from the vSphere client:
After all hosts in the cluster have been tested, one last check can be done by going to Cluster - vSphere DRS - Host Options and verifying that the hosts succeeded in exiting standby:

Next, enable DPM and choose one of the 3 power management policies:

If DPM behavior needs to be controlled, a very good idea is to have it enabled during non-business hours and disabled during business hours. Set power management to Automatic and add 2 scheduled tasks of type Change cluster power settings in vSphere - one to enable DPM in the evening (19:00) and another to disable it in the morning (7:00). This way, during business hours all hosts run at full capacity, and during the night the data center takes full advantage of the power savings.

Wednesday, March 20, 2013

PowerCLI - Batch move templates

While preparing to configure DPM on one of the clusters, I noticed that templates were scattered across all the hosts. A DPM design best practice states: put all your templates on one host and disable DPM on that host - this way you will not be surprised to find some of your templates inaccessible because their host was powered off. Remembering that, I decided to bring all templates into one place. First I took a look at the PowerCLI cmdlets - get-template and move-template - and hoped for an easy job. However, the cmdlets are a bit tricky:
  • get-template - the documentation states that it takes as input for Location "vSphere container objects (such as folders, data centers, and clusters) you want to search for templates". This is only partially true, since it does not accept clusters, but it does accept hosts;
  • move-template - the Destination parameter, used to "specify a container object where you want to move the templates", accepts only Datacenters and Folders.
My use case is: take a cluster name as input and consolidate all its templates on a single host. The next best thing was to use the cmdlets above and write my own little script. The script takes 2 input parameters: cluster and destination host. It gets all templates from the cluster, converts them to VMs, migrates the VMs to the destination host and converts them back to templates.

Function MoveTemplate($cluster, $dsthost) {
    get-cluster $cluster | get-vmhost | foreach {
        get-template -Location $_ | foreach {
            Write-Host "moving $_"
            get-vm (set-template -Template $_ -ToVM).Name | move-vm -Destination $dsthost | set-vm -ToTemplate -Confirm:$false
        }
    }
}


Wednesday, March 13, 2013

vShield Edge configuration and operational issues

In this post I will cover some aspects of using vShield Edge (vSE) devices at the vApp level, mostly things I have learned the hard way. I am currently using vCloud 1.5 and the aspects herein relate to this version; however, I will point out changes that appear in vCD 5.1.
In the vApp diagram, vSE is shown interposed between the vApp network and the organization network:

The symbol displays a firewall if firewall services are active. vSE offers the following services:

  • DHCP
  • firewall
  • routing 
  • NAT

All configuration is done using the vCD portal. A DHCP service can be created for the vApp network and the configuration is basic:

It is not possible to create MAC reservations, but basic service functions are available.

The firewall configuration has a Deny policy enabled by default and allows any outbound traffic:

vSE can be configured to send syslog messages to a central server, and it is highly recommended to do so if you want to debug anything. But the syslog server can only be configured globally at vCD level, not separately for each organization. If you are an organization admin, it will be difficult, if not impossible, to get access to the syslog server. Another issue is that the graphical interface does not allow copying or cloning rules, which means a lot of manual work. To make things worse, you cannot add two different hosts or subnets in the source or destination field - there is no way to group objects. More manual work. I do not know why they provided such a bad interface for vSE.

The next service is NAT - by default vSE is configured in IP translation mode. VMs are assigned IPs from the Organization network and vSE does the rest: source and destination NAT.

Org network IPs can be assigned statically or automatically. But they cannot be unassigned from a VM. If there are 10 VMs connected to the vApp network, all 10 will be assigned IPs from the Organization network even if only 1 VM needs such an IP (a simple use case: a one-arm load balancing topology). Adding the 1 IP used by vSE itself, a total of 11 IPs will be used. This is a problem, since you might run out of IPs at the Organization level (I did).

The only solution is to change the NAT mode of vSE to port forwarding, in which case you are limited to the IP of the vSE. All requests are sent to vSE's IP and vSE does basic port forwarding:

Do not think of publishing two services in the vApp on the same port - you will not be able to do it.

A difference in 5.1 is that you can specify source and destination NAT separately.

The last feature vSE provides is static routing. There is not much to say about it - it allows adding static routes:

One more limitation of vSE in 1.5 is that it has only 2 network interfaces: one external and one internal. In 5.1, multiple interfaces to external networks are provided.

vCloud Director 5.1 brings other changes from 1.5:
  • IP masquerading setting was removed
  • there is no default rule on the firewall

More details about these two changes can be found here.

Unfortunately, vSE brings too many technical limitations to be a useful firewall, and in many cases architectural decisions have to be made not to use it. Replacing vSE with other technologies that provide better firewall, NAT and routing functions brings its own technological challenges, too.

Wednesday, March 6, 2013

PowerCLI - Get VMware tools status

This little piece of code displays information about the VMware Tools status of one or more VMs. The output can be configured to display the running status (running or not), the update status (outdated or up to date) and the Tools version. The script takes 2 input parameters: one of -clname and -vmname, plus the mandatory parameter -tools. The syntax is:
./ToolsStatus [-clname clname | -vmname vmname] -tools [stop | old | install | status | version | all]

Description of parameters:
  • clname - enter the name of cluster
  • vmname - enter the name of a VM
  • stop - displays VMs with tools not running
  • old - displays VMs with outdated tools
  • install - displays VMs with tools not installed
  • status - display running status and update status for a VM
  • version - displays tools version for a VM or more
  • all - displays running status, update status and tools version for a VM or more
There are 3 ways the script can be used:
  • if -clname parameter is chosen -tools can take the value stop, old, install, version or all
  • if -vmname parameter is chosen -tools can take the value status (prints both runstatus and update), version or all
  • if no value is entered for clname and vmname but -tools is all, the script displays tools information
    (runstatus, update and version) for all VMs in the datacenter.  
A few examples of usage:
Get all VMs from cluster01 where tools are not running:

./ToolsStatus -clname cluster01 -tools stop

Get tools version for all VMs in cluster02
./ToolsStatus -clname cluster02 -tools version

Get all tools information for VM dc01
./ToolsStatus -vmname dc01 -tools all

The script uses Get-View, which returns vSphere .NET view objects. The main advantage of Get-View is that it is a lot faster than the regular Get-Cluster and Get-VM cmdlets. Just for fun, you can use Measure-Command to time the execution of Get-View versus Get-VM.

The script:

param(
    [Parameter(Mandatory=$false)][string] $clname,
    [Parameter(Mandatory=$false)][string] $vmname,
    [Parameter(Mandatory=$true)][string] $tools
)
if ($clname) {
    if ($tools -eq "all") {
        get-view -ViewType "VirtualMachine" -SearchRoot (get-cluster $clname).Id | Select Name,@{N="ToolsRunningStatus";E={$_.Guest.ToolsRunningStatus}},@{N="ToolsStatus";E={$_.Guest.ToolsVersionStatus2}},@{N="ToolsVersion";E={$_.Guest.ToolsVersion}}
    }
    elseif ($tools -eq "stop") {
        get-view -ViewType "VirtualMachine" -SearchRoot (get-cluster $clname).Id -Filter @{"Guest.ToolsStatus" = "toolsNotRunning"} | Select Name,@{N="ToolsStatus";E={$_.Guest.ToolsStatus}}
    }
    elseif ($tools -eq "old") {
        get-view -ViewType "VirtualMachine" -SearchRoot (get-cluster $clname).Id -Filter @{"Guest.ToolsVersionStatus2" = "guestToolsSupportedOld"} | Select Name,@{N="ToolsStatus";E={$_.Guest.ToolsVersionStatus2}}
    }
    elseif ($tools -eq "install") {
        get-view -ViewType "VirtualMachine" -SearchRoot (get-cluster $clname).Id -Filter @{"Guest.ToolsVersionStatus2" = "guestToolsNotInstalled"} | Select Name,@{N="ToolsStatus";E={$_.Guest.ToolsVersionStatus2}}
    }
    elseif ($tools -eq "version") {
        get-view -ViewType "VirtualMachine" -SearchRoot (get-cluster $clname).Id | Select Name,@{N="ToolsVersion";E={$_.Guest.ToolsVersion}}
    }
    else {
        Write-Warning "Option not implemented $tools"
    }
}
elseif ($vmname) {
    if ($tools -eq "all") {
        get-view -ViewType "VirtualMachine" -Filter @{'name'=$vmname} | Select Name,@{N="ToolsRunningStatus";E={$_.Guest.ToolsRunningStatus}},@{N="ToolsStatus";E={$_.Guest.ToolsStatus}},@{N="ToolsVersion";E={$_.Guest.ToolsVersion}}
    }
    elseif ($tools -eq "status") {
        get-view -ViewType "VirtualMachine" -Filter @{'name'=$vmname} | Select Name,@{N="ToolsRunningStatus";E={$_.Guest.ToolsRunningStatus}},@{N="ToolsStatus";E={$_.Guest.ToolsStatus}}
    }
    elseif ($tools -eq "version") {
        get-view -ViewType "VirtualMachine" -Filter @{'name'=$vmname} | Select Name,@{N="ToolsVersion";E={$_.Guest.ToolsVersion}}
    }
    else {
        Write-Warning "Option not implemented $tools"
    }
}
elseif ($tools) {
    if ($tools -eq "all") {
        get-view -ViewType "VirtualMachine" | Select Name,@{N="ToolsRunningStatus";E={$_.Guest.ToolsRunningStatus}},@{N="ToolsStatus";E={$_.Guest.ToolsVersionStatus2}},@{N="ToolsVersion";E={$_.Guest.ToolsVersion}}
    }
    else {
        Write-Warning "Option not implemented $tools"
    }
}

Wednesday, February 27, 2013

NetApp SnapMirror traditional volumes to flex volumes

SnapMirror comes in two flavors: volume SnapMirror (VSM) and qtree SnapMirror (QSM). One major difference between the two is that VSM can be synchronous or asynchronous, while QSM is only asynchronous. Another important difference is that VSM is block-based replication, while QSM is logical replication. More differences can be found in TR-3446; however, the most important difference (for me and this post) is the type of volumes the two replication methods support:
  • VSM: TradVol - TradVol and FlexVol - FlexVol
  • QSM: TradVol - TradVol, FlexVol - FlexVol and TradVol - FlexVol
QSM supports replication between traditional and flexible volumes, and that is the reason I used QSM to migrate a client's data from an old filer using traditional volumes. The source filer is Filer1 and the destination filer is Filer2. On Filer1 there is vol1 (a traditional volume) and on Filer2 there is client1 (a flexible volume). Since this is QSM, the destination for vol1 will be a qtree on the client1 volume, but the qtree must not be created before the snapmirror operation.

For each filer, check that the SnapMirror license exists; if not, add it:
Filer1# license add XXXXX
Filer2# license add XXXXX

Check snapmirror status:
Filer1# snapmirror status
Snapmirror is on.
Filer2# snapmirror status
Snapmirror is on.

Make sure that each filer is resolvable via its /etc/hosts file (a best practice is to mount /vol/vol0 and edit the files; otherwise use the infamous rdfile and wrfile commands):
Filer1# wrfile /etc/hosts
# All other host entries that already exist on Filer 1
# Filer2 Filer2
Filer2# wrfile /etc/hosts
# All other host entries that already exist on Filer 2
# Filer1 Filer1

Authorize destination filer to perform replication on source filer:
Filer1# options snapmirror.access host=Filer2

To reduce replication volume and keep source and destination synchronized, configure the SnapMirror schedule (/etc/snapmirror.conf):
Filer2# wrfile /etc/snapmirror.conf
Filer1:/vol/vol1/- Filer2:/vol/client1/vol1 - 0 8,16,0 * *

Synchronizations will run automatically every day, every 8 hours starting at 0:00.

The first synchronization is the baseline and it is manually initialized on destination (snapmirror is destination driven):
Filer2# snapmirror initialize -S Filer1:/vol/vol1/- Filer2:/vol/client1/vol1

In both snapmirror.conf and when issuing snapmirror commands, the syntax for the source is a bit different: Filer1:/vol/vol1/-. The trailing hyphen ensures that the traditional volume vol1 is treated as a qtree and not as a volume. SnapMirror starts and you may check the status. When synchronization is finished or no longer needed, SnapMirror has to be stopped. Do a final synchronization:
Filer2# snapmirror update -S Filer1:/vol/vol1/- Filer2:/vol/client1/vol1

Pause the destination qtree:
Filer2# snapmirror quiesce /vol/client1/vol1

Break the SnapMirror relationship on the destination filer:
Filer2# snapmirror break /vol/client1/vol1

Stop snapmirror on source filer:
Filer1# snapmirror release /vol/vol1/- Filer2:/vol/client1/vol1

And do not forget to export the new qtree:
Filer2> exportfs -p sec=sys,rw=,root=,nosuid /vol/client1/vol1

As a best practice, use a different network link for SnapMirror traffic than for production networks. It tends to use a lot of bandwidth, even though QSM, by its nature, does not fill up the link. If a dedicated network is not possible, bandwidth usage can be limited:
  • add kbps=value_in_kbps in the snapmirror.conf file
  • use the -k option when issuing the snapmirror command