Thursday, December 10, 2020

NSX Load Balancer - Redirecting Traffic to Maintenance Page

In this post we'll look at 2 situations when we need to redirect vRealize Automation traffic to a maintenance page. The type of traffic doesn't really matter as long as the traffic goes through the load balancer, but for a less abstract post we'll use vRA 7.x. The use cases are: 

  • vRA services are down (for example IaaS manager pool is gone) - in this case it would help if traffic is redirected from the vRA server login portal to a "sorry server"
  • scheduled maintenance window (for patching)  - you need vRA working normally, but you don't want anyone else to login and start playing around 

For both cases we'll be using simple application rules in NSX load balancer (well, if the services are actually behind a NSX load balancer). In a highly available architecture, every service in VRA will be behind a load balancer. For simplicity we'll look only at VRA appliances as for the rest it can be easily extrapolated. 

When a user tries to connect to VRA portal, it will make a request using the virtual IP assigned to the load balancer virtual server. The virtual server (VRA Appliance Virtual Server) has a pool of servers (VRA Appliance pool) associated to which it can direct the traffic. The blue path represents normal situation, when the user reaches VRA appliances is the portal. The green path does not actually exist and is the subject of the post. What we need is in case all servers in the VRA Appliance pool are down to redirect the user to another page. For this we need a few additional elements.

First we need a VM that runs an HTTP server and is able to serve a simple html page, called in the diagram above "Sorry Server". We installed Apache, enabled SSL and created in the document root path a structure similar to VRA login URL (below document root is /var/www/html) to serve a custom index.html page.


At NSX level we add the "sorry server" to a new pool, called "vra-maintenance-pool". We also create application rules to check availability of  VRA appliances.  Application rules are written using HAProxy syntax and they are used to manipulate traffic at the load balancer side. It's a simple rule, where we first check if there are any servers up and running in VRA appliance pool using an access control list (acl). If the pool is down, acl becomes true and we use another backend pool - the maintenance one:

# detect if vra appliance is still up 

acl vra-appliance-down nbsrv(vra-appliance-pool) eq 0

# use pool "vra-maintenance-pool" if app is dead

use_backend vra-maintenance-pool if vra-appliance-down

The rule is then linked to the virtual server of the VRA appliances. Whenever a request comes to the virtual server, the rule is checked and if vra-appliance-pool is down, users will be redirected to the maintenance page. You can extend the rules and redirect users to maintenance pool for other situations that may render VRA useless such as IaaS manager servers down or other IaaS services are down. 

Another usage for application rules is restricting access to VRA during scheduled maintenance. In this case the rule will use ACL to restrict IP's accessing VRA virtual servers by matching the source IP of the request.

# allow only vra components and management server 

acl allowed-servers src

# send everything else to maintenance page

use_backend vra-maintenance-pool if !allowed-servers

Traffic is redirected to maintenance pool when it comes from a source different than the VRA itself or the management server. Happy patching! 

Wednesday, December 2, 2020

Using NSX load balancer as a monitoring tool for RESTful APIs

We are going to look at a different use case for NSX load balancer - monitoring tool for external API's. 

Our core platform is integrating with other systems using RESTful API's. These systems even though they are built with high availability in mind, they are sometimes highly unavailable. They are also part of the critical path for our core platform. Not being able to reach the systems creates troubles  in the form of incident tickets because we fail to deliver services to our customers. So we needed a way to monitor those API's.

We know that the systems are monitored, but we don't have access to those tools. We have our own tools, but they do not offer a simple and efficient way to check the status of a RESTful API. Ideally we don't want to introduce another monitoring tool. However, the core platform is running on top of NSX and it actually uses NSX load balancers for its internal service. So why not use load balancers to monitor the external services? 

We created a service monitor and a pool in the load balancer for each of the external systems. This way NSX monitors the status of the RESTful API of the system and generates alerts whenever it is down. The status of the pool is then checked by the core platform. All communication between core platform and APIs goes directly. It does not use the load balancer.

The pool contains the RESTful API endpoint of the system that we use to connect directly from the core platform. 

The service monitor uses GET requests to check the availability of the RESTful API. 

Nothing fancy, basic configuration for a load balancer. Half-configuration actually because here we stop as no traffic goes through the load balancer to these pools. But whenever the external system is not reachable, the load balancer knows it because now the external system is a member in on of its pools: 

The status of the member in the pool is accessible through the RESTful API of NSX Manager. 

GET /api/4.0/edges/{edge-id}/loadbalancer/statistics

<failureCause>layer 7 response error, code:400 Bad Request</failureCause>
<lastStateChangeTime>2020-12-02 18:20:43</lastStateChangeTime>

This way the core platform knows the status of its external systems before doing anything. More important core platform can now act on that status. In this case it will wait a specific period of time until trying again to use the system. 

It is a pretty simple solution. It is also pretty obvious that the APIs should have been monitored. We actually relied too much on the availability of those API's and used a fire and forget approach. The approach was far from optimal and it impacted our KPIs and created additional operational workload.