This post is part two of a series on VMware Integrated OpenStack and resource schedulers. You can read VMware Integrated OpenStack and Resource Schedulers Impacts - Part One first for some context but it’s not required.
VMware Integrated OpenStack utilizes several resource schedulers that mostly do not have knowledge of each other. An understanding of the various schedulers and their interactions is helpful when designing, implementing, or operating a VIO cloud. A lack of understanding can result in platform performance bottlenecks, partial outages (due to accidental DDoS), and a lot of unnecessary troubleshooting.
Given that, there is nuance in how VMware implemented their OpenStack drivers that leads to the schedulers interacting with the OpenStack virtual instances1 and vSphere virtual machines1 without knowledge of what the other schedulers are doing.
VMware Integrated OpenStack uses several schedulers to provision and run virtual machines, in both OpenStack and vSphere:
- VMware vSphere DRS (compute)
- VMware vSphere HA Admission Control (compute)
- VMware ESX local-scheduler (compute)
- OpenStack nova-scheduler (compute and virtualization)
- OpenStack neutron-server (networking)
The VMware schedulers are tightly coupled and maintain separate topologies that overlap. This means that from the vSphere perspective VIO does not change how vSphere works.
The OpenStack schedulers don’t work together and do not know that each other exist; they maintain separate and distinct topologies. This isn’t a bad thing, it’s by design; one of OpenStack’s goals is to abstract resources via a system of loosely coupled RESTful services.
There are tradeoffs though. The OpenStack and VMware schedulers do not work together, they don’t know of each other, and that can lead to scenarios that are less than ideal.
Don’t run away yet; those “less than ideal scenarios” also come with some definite advantages: semi-hierarchical compute scheduling and all of the advanced vSphere features like HA and DRS that you know and love, coupled with an industry-standard RESTful API.
Infrastructure Workload Provisioning Profiles
Before I dive into scheduling, I want to lay some groundwork, specifically on what an infrastructure workload provisioning profile is (if you are interested in more information on IWPPs, I wrote a blog post about them here). Here are some definitions to start:
- Infrastructure Workload Provisioning Profile (IWPP)
A representation of the velocity, magnitude, concurrency, and churn of object lifecycles within a resource constrained system.
Each property is described below:

- Velocity

Represents the speed at which provisioning should occur over X amount of time. An example is a service-level objective (SLO) for provisioning: each virtual machine must be provisioned and available for use within 5 minutes of the original request.

- Magnitude

Represents the number of workloads to execute a task against at one time. An example is a CI/CD pipeline that uses a Heat template to deploy a testing environment consisting of 40 virtual machines.

- Concurrency

Represents how many parallel lifecycle tasks are running at the same time. An example is ten CI/CD pipelines, each using a Heat template to deploy a testing environment consisting of 40 virtual machines.

- Churn

Represents how often lifecycle tasks occur over X period of time. Churn ties velocity, magnitude, and concurrency together. An example is provisioning 100 workloads at 9 AM and deprovisioning them at 10 AM.
Types of IWPP
There are four general types of IWPPs:
- Low churn/low retention
Less frequent provisioning of short-lived workloads. An example is the lifecycle of a virtual machine hosting an actuarial table calculation application that is provisioned quarterly to build an insurance company’s actuarial tables and is destroyed within a day of its provisioning.
- Low churn/high retention
Less frequent provisioning of long-lived workloads. Historically this is what the IWPP in an enterprise most looks like. Generally, workloads are either manually provisioned by an infrastructure team or leverage a semi-automated method including checkpoints.
- High churn/low retention
Continual provisioning of short-lived workloads. Continuous Integration (CI) environments normally match this profile: workload provisioning is fully automated via testing frameworks, with workload life expectancy measured in minutes or hours.
- High churn/high retention
Continual provisioning of long-lived workloads. An excellent example of this is AWS: customers can continuously deploy AWS EC2 instances without degradation and keep them around.
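The four properties, and the churn/retention split used above, can be sketched as a toy model. Everything below is a hypothetical illustration: the class, field names, and thresholds are mine, not part of any OpenStack or VIO API.

```python
from dataclasses import dataclass

@dataclass
class IWPP:
    """Toy model of an infrastructure workload provisioning profile."""
    velocity_vms_per_hour: float   # speed at which provisioning occurs
    magnitude: int                 # workloads per lifecycle task
    concurrency: int               # parallel lifecycle tasks
    churn_tasks_per_day: float     # how often lifecycle tasks occur

def classify(churn_tasks_per_day: float, retention_hours: float) -> str:
    """Map churn and retention onto the four general IWPP types.
    The thresholds (10 tasks/day, one week) are arbitrary examples."""
    churn = "high churn" if churn_tasks_per_day >= 10 else "low churn"
    retention = "high retention" if retention_hours >= 24 * 7 else "low retention"
    return f"{churn}/{retention}"

# Ten CI/CD pipelines, each deploying 40-VM environments hourly and
# tearing them down within two hours:
ci = IWPP(velocity_vms_per_hour=400, magnitude=40, concurrency=10,
          churn_tasks_per_day=240)
print(classify(ci.churn_tasks_per_day, retention_hours=2))  # high churn/low retention
```

The point of the model is only that churn and retention are the two axes that matter for scheduling pressure; velocity, magnitude, and concurrency describe how hard each burst hits.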
Anyone designing or operating a cloud (public or private) has to establish their cloud’s IWPPs as early as possible and continually monitor for changes. Understanding IWPPs goes a long way in properly designing the scheduling required to meet the demands of your customers.
Now, onto scheduling.
Differences in Nova for KVM-based OpenStack and VIO
The OpenStack nova-scheduler process is tasked with a very important job: determining the placement of virtual instances within the OpenStack compute topology. The nova-scheduler can also live migrate/vMotion virtual instances, but for the purposes of this blog post I am only looking at the initial placement of virtual instances under OpenStack. The nova-scheduler process uses Python code, referred to as filters, to guide the placement. Two of the nova-scheduler filters are the CoreFilter and RamFilter. The CoreFilter uses CPU usage to determine which hypervisors can meet the provisioning request requirements; the RamFilter uses memory usage to determine the same. There are several other filters that can be used (a comprehensive list can be found on the Filter Scheduler page in the OpenStack docs), and once all of the configured filters have executed, the request is sent to a message queue.
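Stripped to their essence, the CoreFilter and RamFilter checks boil down to a comparison against capacity multiplied by an overcommit ratio. Here's a simplified sketch; the real filters in the nova source handle host aggregates, per-host ratio overrides, and more, and the function names below are mine:

```python
# Simplified sketch of the CoreFilter check: a host passes when the
# already-allocated vCPUs plus the request fit within physical cores
# multiplied by the overcommit ratio.
def core_filter_passes(host_pcpus: int, vcpus_used: int,
                       vcpus_requested: int, cpu_allocation_ratio: float) -> bool:
    limit = host_pcpus * cpu_allocation_ratio
    return vcpus_used + vcpus_requested <= limit

# Same idea for the RamFilter, using memory instead of CPU.
def ram_filter_passes(host_ram_mb: int, ram_mb_used: int,
                      ram_mb_requested: int, ram_allocation_ratio: float) -> bool:
    limit = host_ram_mb * ram_allocation_ratio
    return ram_mb_used + ram_mb_requested <= limit

# A 32-core host at 10x overcommit still accepts a 4-vCPU request
# with 300 vCPUs already allocated (300 + 4 <= 320):
print(core_filter_passes(32, 300, 4, 10.0))  # True
```

Notice that neither check asks whether the host can actually run the workload, only whether the request fits under the inflated ceiling. That detail matters later.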
Although the latest VIO major version (4.x) is based on Ocata, I’m purposely not talking about the Placement API. For the purposes of this blog post filters are still applicable.
The nova-compute service is tasked with host resource reconciliation and virtual instance management, including instance provisioning and de-provisioning. Nova-compute periodically checks the message queue for new requests and when one is found, acts upon the request. Most OpenStack implementations normally use KVM as the hypervisor and run the nova-compute process directly on the hypervisor, leading to a 1-to-1 mapping of a nova-compute instance to a hypervisor.
The vSphere nova-compute driver used in VIO has been constructed to interact with OpenStack in a slightly different manner than the KVM nova-compute driver. Two vSphere services, HA and DRS, are fully supported in VIO. However, they can only be enabled on an ESXi cluster, and, therefore, if VIO followed the standard OpenStack KVM model, nova-scheduler would effectively bypass the higher-level vSphere services for placement, selecting the individual ESXi server itself. This is not the optimal solution and drove VMware to construct an alternative model. Instead of running the nova-compute process locally on ESXi, VIO deploys a dedicated virtual appliance that runs a single instance of the nova-compute service and assigns that nova-compute instance to an ESXi cluster. This pairing results in a 1-to-1 mapping of a nova-compute instance to an ESXi cluster.
VIO can also consume resources from multiple vCenter Servers when NSX-T is used; in that configuration the object types mapped between nova-compute and vSphere differ, but the overall process and model are the same.
- KVM-based OpenStack
- nova-compute runs directly on the hypervisor
- every hypervisor runs only one nova-compute service
- nova-scheduler selects a hypervisor as a provisioning target
- an OpenStack hypervisor is a single hypervisor
- VMware Integrated OpenStack
- nova-compute runs on a virtual appliance, not on the hypervisor
- nova-compute is assigned to an ESXi cluster
- nova-scheduler still selects an OpenStack hypervisor as a provisioning target
- an OpenStack hypervisor is a single ESXi cluster
A diagram will probably help.
There are advantages to the VIO model: one is that VMware can leverage vSphere HA and vSphere DRS to provide higher availability for instances; another is that one can transparently add ESXi servers to (and remove them from) OpenStack, and with them CPU and memory resources. But as I stated in the TL;DR, there’s nuance that needs to be understood.
vSphere and Nova Compute Schedulers
Earlier I mentioned that VIO uses several compute schedulers: three in vSphere-land (vSphere HA, vSphere DRS, and the ESXi local scheduler) and one in OpenStack (nova). The vSphere and OpenStack schedulers are unaware of each other. This can lead to some peculiar scenarios, one being that the vSphere and OpenStack schedulers can independently undersubscribe or oversubscribe resources.
Compute Scheduling from the OpenStack perspective
Under OpenStack the nova-scheduler process is the single compute scheduling control point. As mentioned above, nova-scheduler uses a set of Python-based filters to determine virtual instance placement. I’m going to concentrate on the two specific filters I described above, the CoreFilter and RamFilter. They are backed by variables in the nova configuration file that set the CPU and memory subscription ratios from the OpenStack perspective:
- cpu_allocation_ratio - VIO default is 10x overcommit
- ram_allocation_ratio - VIO default is 1.5x overcommit
The allocation ratios are independent of each other and do not take into account whether there are sufficient resources to actually start or run the virtual instance; nova-scheduler assumes that if resources are available then the virtual instance can boot. Ultimately nova-scheduler is concerned with whether there are enough resources, multiplied by the allocation ratio, to fulfill the provisioning request.
There are some additional configuration variables in nova.conf that can also be used to reduce the amount of resources presented to VIO for consumption:
- reserved_host_memory_mb - VIO default is 0 MB
- reserved_host_cpus - VIO default is 0
The reserved_host_memory_mb variable is used to ensure that the hypervisor has enough memory available to execute its host processes. The default value is set to 0 as ESXi servers report all of the used memory, including the overhead that the hypervisor processes require. Since VIO maps an ESXi cluster to an OpenStack hypervisor, one could use this variable as an additional method of ensuring available memory resources at the vSphere layer. I’ll talk more about this in a bit.
The reserved_host_cpus variable is used to ensure that the hypervisor has X number of physical CPUs (pCPUs) available for host processes. Here a CPU means a physical core, not a CPU socket. The parameter reserves the entire pCPU only from the nova-scheduler perspective; the pCPU isn’t actually reserved within the hypervisor’s OS. The KVM hypervisor can use this variable to ensure that nova-scheduler always leaves X pCPUs available, normally to ensure virtual instances do not degrade the performance of the underlying OS processes. Under VIO this variable is less important: the local ESXi CPU scheduler provides a more advanced scheduling function that, when coupled with vSphere DRS, reduces the need for this parameter. Therefore, under VIO, the default value is set to 0 as there isn’t a generic use case that requires its use.
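To see how the two reserved_host_* variables interact with the allocation ratios, here's a rough sketch assuming nova subtracts the reservation before applying the overcommit ratio (a simplification of nova's resource tracking; the function names are mine):

```python
# Capacity nova-scheduler believes it can place against, after the
# host reservation is carved out and the overcommit ratio is applied.
def schedulable_ram_mb(total_mb: int, reserved_host_memory_mb: int,
                       ram_allocation_ratio: float) -> float:
    return (total_mb - reserved_host_memory_mb) * ram_allocation_ratio

def schedulable_vcpus(total_pcpus: int, reserved_host_cpus: int,
                      cpu_allocation_ratio: float) -> float:
    return (total_pcpus - reserved_host_cpus) * cpu_allocation_ratio

# A 256 GB host with the VIO defaults (nothing reserved, 1.5x memory):
print(schedulable_ram_mb(262144, 0, 1.5))      # 393216.0 MB (384 GB)
# Reserving 16 GB for the hypervisor shrinks the schedulable pool:
print(schedulable_ram_mb(262144, 16384, 1.5))  # 368640.0 MB (360 GB)
```

Note that the reservation shrinks the base before the ratio inflates it, so a reservation removes more schedulable capacity than its face value.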
Compute Scheduling from the vSphere perspective
The vSphere infrastructure layer uses three compute schedulers: vSphere HA, vSphere DRS, and the ESXi local scheduler. Although the ESXi local scheduler can impact the performance of a virtual machine, its functionality isn’t directly applicable since VIO treats an ESXi cluster as an OpenStack hypervisor; therefore I’m going to ignore it for this discussion.
Each vSphere scheduler does the following:
- vSphere HA - restarts virtual machines running on an ESXi server on other ESXi servers within an ESXi cluster if the original ESXi server fails
- vSphere Dynamic Resource Scheduling - periodically migrates virtual machines (using vMotion) to other ESXi servers based on resource utilization calculations in attempt to balance resource consumption
Some may argue that vSphere HA isn’t a scheduler but I disagree. It determines whether there are available resources to restart all virtual machines; hence it schedules some virtual machines and doesn’t schedule others.
The DRS calculations do not directly or indirectly affect VIO. Nova-scheduler only cares about the ESXi cluster as a whole and the nova-scheduler placement calculations are not affected by DRS.
In contrast, vSphere HA can affect VIO directly. vSphere HA has a policy function called Admission Control that places limits on virtual machine startup, ensuring there’s always available CPU and memory capacity to restart the cluster’s virtual machines. For Admission Control to work properly, vSphere virtual machine reservations must be used and should be passed to vCenter Server from nova-compute as extra-specs2 during provisioning.
Under vSphere 6.x there are three available Admission Control policies that can be selected:
- Host Failures Cluster Tolerates - This policy calculates the slot size of each virtual machine within an ESXi cluster - per the vSphere Availability guide3, a slot is defined as a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine in the cluster. Separate, distinct thresholds for CPU and memory can be configured for this policy; vSphere HA calculates how many slots are available within each ESXi cluster and uses this information to establish the failover capacity. If the failover capacity is less than the threshold set in Admission Control, the virtual machine cannot start. Virtual machine reservations are key: when a reservation does not exist, the virtual machine’s resource utilization is set to 32 MHz and 0 MB + the memory overhead.
- Percentage of Cluster Resources Reserved - This policy reserves a specific percentage of aggregate CPU and memory. As with the Host Failures Cluster Tolerates policy, separate and distinct CPU and memory thresholds can be configured. vSphere HA calculates how much memory and CPU is required by all powered-on virtual machines within the ESXi cluster and uses this information to establish the failover capacity. A calculation is then executed to determine the total CPU and memory available for allocation, and finally it determines whether the policy thresholds have been violated. If so, the virtual machine cannot start. Once again, vSphere HA uses the actual reservations of the virtual machines, and if a virtual machine doesn’t have a CPU and/or memory reservation, then the default values specified above are used (32 MHz and 0 MB). One additional note: when using this policy there must be at least two vSphere HA-enabled ESXi servers in the cluster, excluding any servers in maintenance mode. vSphere HA cannot fail over to nothing.
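The Percentage of Cluster Resources Reserved check can be approximated as follows. This is a CPU-only sketch based on the policy description above; memory works the same way, and the function name and exact arithmetic are my simplification of what vSphere HA does internally:

```python
# Approximate PCRR admission check: admit a power-on only if, after
# counting the new VM's reservation, the unreserved share of the
# cluster still meets the configured failover percentage.
def pcrr_admits(cluster_mhz: int, reserved_mhz_powered_on: int,
                new_vm_reservation_mhz: int, failover_pct: float) -> bool:
    # VMs without a CPU reservation count as 32 MHz by default.
    reserved_after = reserved_mhz_powered_on + max(new_vm_reservation_mhz, 32)
    current_failover_pct = 100 * (cluster_mhz - reserved_after) / cluster_mhz
    return current_failover_pct >= failover_pct

# A 1,126 GHz cluster holding 25% back for failover: a 4.4 GHz
# reservation is admitted while ~840 GHz is already reserved, but
# refused once ~850 GHz is.
print(pcrr_admits(1_126_000, 840_000, 4_400, 25.0))  # True
print(pcrr_admits(1_126_000, 850_000, 4_400, 25.0))  # False
```

The key observation: without real reservations, every VM contributes only 32 MHz to the calculation, so the policy effectively admits everything.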
- Specify Failover Hosts - This policy is the easiest of all to understand - you designate specific ESXi servers to act as standby servers. If a failure occurs vSphere HA will attempt to restart all of the virtual machines on the designated ESXi servers. If the resource requirements are too large vSphere HA will attempt to use other ESXi servers. There are no capacity calculations, there’s no need for them.
The general VMware best practice is to use the Percentage of Cluster Resources Reserved Admission Control policy4, and using this policy has proven beneficial in many clouds. By default, VIO’s CPU and memory allocation ratios are not synchronized with the vSphere HA Admission Control policy percentages. The lack of synchronization can lead to situations where the VIO layer and the vSphere layer look like they don’t match up, especially if vSphere virtual machine reservations are not being used fully or at all. An example will probably help:
The Tyrell Corporation runs VIO…
This is a general example, using general numbers, with general assertions. I do not recommend using this example to determine how to calculate resources; don’t get caught up at the 10-foot level, this is a 10,000-foot example. For instance, ESXi servers have CPU and memory overhead associated with them, which I ignore here. If you want to know more about these topics, Duncan Epping, Frank Denneman, and Niels Hagoort (to name a few) have written several blog posts that provide excellent vSphere resource management overviews.
The Tyrell Corporation wants to use VIO to provision and run vSphere virtual machines. Their median vSphere VM profile is 2 vCPUs using the full CPU allocation and 6 GB of vMemory, normally hosted on 16-node ESXi clusters, where each ESXi server provides 256 GB of RAM and 32 CPU cores, with each core providing 2.20 GHz. From the physical perspective each ESXi cluster presents 1126 GHz of CPU and 4096 GB of RAM for consumption. The Tyrell Corporation always applies vSphere VM reservations and has used the Percentage of Cluster Resources Reserved HA Admission Control policy to ensure high availability, configuring both the CPU and memory parameters to 25%. Starting from day one this configuration provides approximately 844 GHz of CPU and 3072 GB of memory for consumption.
As mentioned above, nova-compute does not understand the concept of vSphere HA Admission Control. Its job is to interact with vCenter Server for provisioning and the individual ESXi servers for resource reconciliation. In the code, the reconciliation is executed by querying the vSphere ESXi cluster object for the child ESXi servers, then looping through each ESXi server to retrieve the available resources for consumption and adding the data together. Therefore, from the VIO perspective, after applying the default VIO Nova allocation ratios (10x CPU and 1.5x memory), nova-scheduler sees 11260 GHz of CPU and 6144 GB of memory.
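Plugging the Tyrell numbers into both perspectives makes the gap concrete (numbers come straight from the example above):

```python
# One 16-node Tyrell cluster as each layer sees it.
nodes, cores_per_node, ghz_per_core, ram_gb_per_node = 16, 32, 2.20, 256

physical_ghz = nodes * cores_per_node * ghz_per_core  # ~1126 GHz
physical_ram_gb = nodes * ram_gb_per_node             # 4096 GB

# vSphere HA, with 25% of the cluster reserved for failover, will
# only ever power on reservations fitting within:
ha_ghz = physical_ghz * 0.75                          # ~844 GHz
ha_ram_gb = physical_ram_gb * 0.75                    # 3072 GB

# nova-scheduler, applying the default VIO allocation ratios,
# happily places against:
nova_ghz = physical_ghz * 10                          # ~11260 GHz
nova_ram_gb = physical_ram_gb * 1.5                   # 6144 GB
```

The two layers disagree by more than an order of magnitude on CPU and by a factor of two on memory, and nova will keep placing instances long after Admission Control has stopped powering them on.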
When the HA Admission Control policy is in play and the Nova CPU and memory allocation ratios are left at their defaults, there is always a point where nova-scheduler and nova-compute will happily continue to schedule and provision OpenStack virtual instances that can never start, because Admission Control refuses to power them on.
But, you say, our cloud has been designed to always cap resource utilization at 75% to provide enough time to purchase new hardware. Well, read on.
What I didn’t mention is the business case for deploying VIO - the Tyrell Corporation has deployed VIO as an on-premise alternative to AWS. The development groups internally at the Tyrell Corporation have used AWS extensively to host not only their CI/CD pipelines but also a few externally facing command and control applications that interface with their Replicants. Executive management has decreed that all AWS workloads must move internally to ~~limit the ability of the Government to take control~~ meet the Government’s requirements. The development teams expect VIO to provide, within reason, the same features as AWS. They especially do not expect to change their infrastructure workload provisioning profile, which historically has been high churn, low retention.
Under heritage enterprise infrastructure workload provisioning profiles, the migration of the workloads to an internal private cloud would result in a manually driven process matching the low churn, high retention IWPP. However, the Tyrell Corporation development teams leverage fully automated CI/CD pipelines, meaning that there’s a level of dynamicity and entropy that is unseen in a cloud consumed by a low churn, high retention IWPP. Once the VIO cloud is up and running, the devs just need to repoint their CI/CD pipelines at the new on-premise cloud. Overnight the IWPP shifts from low churn, high retention to high churn, low retention. This transformation can easily lead to temporary violations of the 75% resource utilization cap and requires a shift in thinking about capacity management and resource utilization.
How does this impact VIO and vSphere?
First, vSphere HA and OpenStack Nova scheduling may look the same but they have different functions:
- Nova scheduling controls the placement of the virtual machine
- vSphere HA scheduling controls the startup of the virtual machine
Second, remember that the Admission Control policies Host Failures Cluster Tolerates or Percentage of Cluster Resources Reserved are only effective when virtual machine reservations are used, else you can overprovision until the pets or cattle come home (see what I did there?).
Third, VIO’s default configuration will never violate the CPU and RAM allocation ratios when both these configurations are in-play:
- vSphere reservations are used for all virtual instances
- Either the Host Failures Cluster Tolerates or Percentage of Cluster Resources Reserved Admission Control policies are configured
How does the application of these configurations manifest itself? Violation of the Admission Control policy thresholds will always occur first, effectively rendering the OpenStack oversubscription scheduling null and void.
Why does this matter?
In most cases it may not. Historically, in my opinion and experience, the stereotypical VMware-based cloud (with the exception of CSPs) does not attempt to consume every resource available to it. Lately though, there are two scenarios that are becoming more and more prevalent for enterprises:
- CI/CD pipelines that result in a high churn/low retention (HC/LR) IWPP
- Autoscaling and autohealing of applications
The first scenario, HC/LR CI/CD pipelines, is becoming more and more prevalent today inside the enterprise. Over a small period of time (as short as minutes and as long as a week) entire development and testing environments consisting of hundreds or thousands of vSphere virtual machines are built and destroyed. This can result in temporary violations of the vSphere HA thresholds, resulting in failed provisioning tasks that, due to the external provisioner (e.g. Jenkins, Concourse CI, etc.), may DDoS a VIO system if the pipelines are not built to intelligently recognize downstream congestion. The Tyrell Corporation example describes the challenges in detail.
The second scenario has become an obtainable stretch goal for many clouds. As the agility and accessibility of application hosting and cloud management platforms mature, there’s a natural fit for additional automation to ensure uptime and availability from applications themselves. Upstream congestion is a real, direct issue, and providing an application with the ability to automatically add and remove resources on-the-fly can help solve this dilemma. The next step is implementing a health check and using its result to trigger the automatic addition or removal of the resources. Now the application metrics drive the expansion and contraction, not the OS memory or CPU usage.
So what does this mean and how can I fix it?
Out of the box VIO is configured for the 80/20 rule, 80% of the clouds running VIO don’t require additional scheduling changes, 20% normally do. For that 20%, there are some steps you can take to further tune your VIO environment:
- Use the Specify Failover Hosts Admission Control policy and the reserved_host_memory_mb variable. This simplifies the multi-layer scheduling by effectively removing X number of ESXi servers from the consumption pool. Whatever value is specified for the failover hosts, subtract the same amount of memory via reserved_host_memory_mb. In the Tyrell Corporation example their ESXi servers have 256 GB of memory; if Specify Failover Hosts is set to 2, then reserved_host_memory_mb would be set to 512 GB (524,288 MB).
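The arithmetic for the Tyrell example, remembering that reserved_host_memory_mb is expressed in MB:

```python
# Two designated failover hosts at 256 GB each should be mirrored
# into nova so the scheduler never counts their memory.
failover_hosts = 2
ram_gb_per_host = 256

reserved_host_memory_mb = failover_hosts * ram_gb_per_host * 1024
print(reserved_host_memory_mb)  # 524288 (i.e. 512 GB)
```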
- If that’s just too radical then use the de facto vSphere best practice Percentage of Cluster Resources Reserved Admission Control policy while assigning reservations to all virtual machines. The CoreFilter and RamFilter allocation ratios should map to the HA Admission Control Policy thresholds. If the PCRR value is 75% for CPU and 75% for memory then set the respective allocation ratios to .75 and .75.
- Don’t use Host Failures Cluster Tolerates unless you know what you are doing. I’ve tried to use this policy a couple of times and it hasn’t been pretty. The math involved to develop the required algorithm to synchronize the vSphere and VIO resource ratios is out of my reach. I recommend staying away from this one. If you figure out how to do it, could you please let me know?
Outside of VIO proper there are some additional steps to take:
- Make sure that you have modified your CPU and memory consumption thresholds to include the possibility of temporary, short-term violations.
- Talk with the Operations and Capacity/Demand Management teams so that they understand that this will happen and ask them to take steps to ensure it’s not a panic-driven situation.
- Develop CI/CD pipelines that are aware (or can be made aware) of downstream resource contention within the cloud management platform. If the downstream CMP is out of resources (in this case VIO), the provisioner needs to recognize this and react properly, through a queue and forward mechanism or something similar.
OpenStack nomenclature refers to the workloads deployed to a hypervisor as virtual instances, vSphere nomenclature uses virtual machines. Although there is a 1-to-1 mapping of virtual instance to virtual machine, the nomenclature difference is important as they represent unique objects within each cloud management platform.