This post is part one of a series on VMware Integrated OpenStack and Resource Schedulers. Later posts will walk through each of the schedulers and how they affect the VIO Platform.
VMware Integrated OpenStack has several resource schedulers, and for the most part they have no knowledge of each other. A deep understanding of the various schedulers and their interactions is extremely helpful when designing, implementing, or operating a high-churn VIO-based cloud. There is nuance in how VMware implemented its drivers: each scheduler can place virtual instances into the same ESXi cluster without knowing what the other schedulers are doing.
I have a customer that is using vRealize Automation today as their cloud management platform. Their primary customer is a large software subsidiary that has used AWS EC2 to host their development cloud and CI/CD environment. The subsidiary's management recently decided to move away from AWS to my customer's internal cloud. My customer, like many customers, is new to operating a private cloud, especially a private cloud that will be consumed in a fully automated manner.
Historically the customer has not operated under a formal service-level agreement or a set of service-level objectives; instead, they operate under an informal "best available - don't do anything that would get us yelled at" agreement. This has driven them to the following operational mandate, which they believe is a good balance of operational readiness, available capacity overhead, and support for maintenance during some failure events:
- ESXi clusters of eight (8) or more servers must have an additional two (2) ESXi servers always available
- ESXi clusters of fewer than eight (8) servers must have an additional ESXi server always available
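The mandate above boils down to a simple rule. As a sketch (this helper and its name are my own illustration, not anything the customer actually runs):

```python
def spare_hosts_required(cluster_size: int) -> int:
    """Spare ESXi hosts the operational mandate requires for a cluster.

    Clusters of eight or more hosts run N+2; smaller clusters run N+1.
    """
    return 2 if cluster_size >= 8 else 1


# A ten-host cluster must keep two hosts' worth of capacity free.
print(spare_hosts_required(10))  # 2
print(spare_hosts_required(6))   # 1
```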
These configurations are normally referred to as "N+2" and "N+1" respectively. In every case I have seen or implemented previously, this configuration has been applied by enabling vSphere HA and configuring HA Admission Control. This customer has an alternative method: reducing the vRA reservation sizes to account for the N+2 and N+1 configurations. Allow me to explain.
For example, let's say they have ten (10) ESXi servers in a cluster, each with 32 CPU cores and 256 GB of memory, for a total of 320 CPU cores and 2.56 TB of memory (OK yes, this isn't entirely accurate, but it suffices for this example - the exact amount of available resources isn't the thing to concentrate on here). Their operational mandate states that they can only populate eight (8) of the servers. They enable vSphere HA but disable HA Admission Control, which means vSphere HA doesn't consider the virtualization and compute topologies during restart events. In vRA they reduce each reservation assigned to an ESXi cluster by 64 CPU cores and 512 GB of memory (once again, don't fixate on the fact that reservations are assigned to a Business Group and the ESXi cluster-to-Business Group ratio isn't normally 1-to-1 - they are solving this with some math). From their perspective this works just as well as HA Admission Control: vRA is the only provisioner into their infrastructure, and since vRA determines whether there are enough resources to deploy a new VM, they can assume there will always be enough resources to support VM restarts.
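The reservation math above can be sketched in a few lines. The numbers mirror the example (uniform hosts assumed); the variable names are mine, not the customer's:

```python
# Cluster inventory from the example: ten uniform ESXi hosts.
HOSTS = 10
CORES_PER_HOST = 32
MEM_GB_PER_HOST = 256
SPARE_HOSTS = 2  # N+2 mandate for clusters of eight or more hosts

total_cores = HOSTS * CORES_PER_HOST    # 320 cores
total_mem_gb = HOSTS * MEM_GB_PER_HOST  # 2560 GB (2.56 TB)

# Shrink the vRA reservation by the capacity of the spare hosts,
# standing in for what HA Admission Control would otherwise hold back.
reservation_cores = total_cores - SPARE_HOSTS * CORES_PER_HOST      # 256 cores
reservation_mem_gb = total_mem_gb - SPARE_HOSTS * MEM_GB_PER_HOST  # 2048 GB

print(reservation_cores, reservation_mem_gb)  # 256 2048
```

The reservation ends up sized to eight hosts' worth of capacity, matching the "only populate eight of the servers" mandate.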
At first glance this configuration is definitely, well, different. It requires some math to rebalance the consumable capacity, but it works for them, and ultimately that's what matters.
But you are probably a little confused as to why I'm talking about vRA when the title of this post is VMware Integrated OpenStack and Resource Schedulers Impacts - Part One - Overview. The subsidiary's technical team has standardized on HashiCorp Terraform as their provisioning engine. The customer realized that vRealize Automation doesn't have a mature Terraform provider, and a decision was made to transition from vRA to VMware Integrated OpenStack. This isn't a trivial decision, and there's nuance in how VMware developed VIO. VIO is NOT vRA; they each have specific use cases and really don't compete (for that matter, can we stop the "Who's going to win? VMware or OpenStack" arguments?). Complicating matters further is the customer's desire to maintain the operational mandate I described above.
Stay tuned: the next set of posts will explain whether they can….