Wednesday, December 2, 2020

Using NSX load balancer as a monitoring tool for RESTful APIs

We are going to look at a different use case for NSX load balancer - monitoring tool for external API's. 

Our core platform is integrating with other systems using RESTful API's. These systems even though they are built with high availability in mind, they are sometimes highly unavailable. They are also part of the critical path for our core platform. Not being able to reach the systems creates troubles  in the form of incident tickets because we fail to deliver services to our customers. So we needed a way to monitor those API's.

We know that the systems are monitored, but we don't have access to those tools. We have our own tools, but they do not offer a simple and efficient way to check the status of a RESTful API. Ideally we don't want to introduce another monitoring tool. However, the core platform is running on top of NSX and it actually uses NSX load balancers for its internal service. So why not use load balancers to monitor the external services? 

We created a service monitor and a pool in the load balancer for each of the external systems. This way NSX monitors the status of the RESTful API of the system and generates alerts whenever it is down. The status of the pool is then checked by the core platform. All communication between core platform and APIs goes directly. It does not use the load balancer.

The pool contains the RESTful API endpoint of the system that we use to connect directly from the core platform. 

The service monitor uses GET requests to check the availability of the RESTful API. 

Nothing fancy, basic configuration for a load balancer. Half-configuration actually because here we stop as no traffic goes through the load balancer to these pools. But whenever the external system is not reachable, the load balancer knows it because now the external system is a member in on of its pools: 

The status of the member in the pool is accessible through the RESTful API of NSX Manager. 

GET /api/4.0/edges/{edge-id}/loadbalancer/statistics

<failureCause>layer 7 response error, code:400 Bad Request</failureCause>
<lastStateChangeTime>2020-12-02 18:20:43</lastStateChangeTime>

This way the core platform knows the status of its external systems before doing anything. More important core platform can now act on that status. In this case it will wait a specific period of time until trying again to use the system. 

It is a pretty simple solution. It is also pretty obvious that the APIs should have been monitored. We actually relied too much on the availability of those API's and used a fire and forget approach. The approach was far from optimal and it impacted our KPIs and created additional operational workload. 

No comments: