Products

VMware vCenter Server

Issue/Introduction

This article provides information on resolving the vCLS health issues, so that DRS functions correctly in the cluster.

Symptoms:
Starting with vSphere 7.0 Update 1, vSphere DRS for a cluster depends on the health of vSphere Cluster Services (vCLS). vCLS configures a quorum of vCLS system VMs on the cluster. These VMs are necessary to maintain the health of the cluster services. If vCLS health is impacted because these VMs are unavailable in a cluster, vSphere DRS will not be functional in that cluster until the vCLS VMs are brought back up.

The operations listed below could fail if performed while DRS is not functional. Note also that on a new DRS-enabled cluster, these operations will not be available until the first vCLS VM is deployed and powered on in that cluster.

A new workload VM placement/power-on.
Host selection for a VM that is migrated from another cluster/host within the vCenter.
Migrated VM could get powered-on on a non-DRS selected host.
Placing a host into maintenance mode might get stuck if it has any powered-on VMs.
Invocation of DRS APIs such as ClusterComputeResource.placeVm() and ClusterComputeResource.enterMaintenanceMode() will fail with InvalidState.
Configuration of Workload Management, Supervisor Cluster and Tanzu Kubernetes Cluster will fail.

Note: If DRS is not enabled on such a cluster, then the vSphere Cluster health will be in the degraded state. In the vSphere Client UI, you see an error similar to:

vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.

For more information, see vSphere Cluster Services (vCLS) in vSphere 7.0 Update 1.

Cause

There could be multiple issues resulting in this error.

A user has powered off or deleted vCLS VMs from a DRS-enabled cluster.
vCLS VM deployment failed.
vCLS VM power-on failed.
vCLS was disabled on the cluster using Retreat Mode.
HA was unable to fail over vCLS VMs after a host or storage failure.

Resolution

The below workaround applies to 7.0U1 and higher. For any additional issues, please file a Support Request.

Additional Information

vCLS VMs will automatically be powered on or recreated by the vCLS service. In a greenfield (fresh) deployment, these VMs are deployed before any workload VMs. In an upgrade scenario, these VMs are deployed before vSphere DRS is configured to run on the clusters. When all the vCLS VMs are powered off or deleted, the vSphere Cluster status for that cluster turns to 'Degraded (Yellow)'. vSphere DRS needs at least one of the vCLS VMs to be running in a vSphere cluster to be healthy. If DRS runs before these VMs are brought back up, the cluster service will be 'Unhealthy (Red)' until the vCLS VMs are brought back up.
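The status transitions described above can be summarized as a small decision function. This is only an illustrative sketch, not how vCenter computes the status: the function name, its parameters, and the "Healthy" label are assumptions made for the example, while the 'Degraded (Yellow)' and 'Unhealthy (Red)' labels come from the paragraph above.

```python
def vcls_cluster_status(running_vcls_vms: int, drs_ran_while_down: bool) -> str:
    """Map the conditions described in the article to a status label.

    running_vcls_vms: number of vCLS VMs currently powered on in the cluster.
    drs_ran_while_down: True if DRS ran while no vCLS VM was running.
    """
    if running_vcls_vms >= 1:
        # DRS needs at least one running vCLS VM to be healthy.
        # "Healthy" is an assumed label; the article does not name this state.
        return "Healthy"
    if drs_ran_while_down:
        # DRS ran before the vCLS VMs were brought back up.
        return "Unhealthy (Red)"
    # All vCLS VMs are powered off or deleted, and DRS has not run since.
    return "Degraded (Yellow)"
```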

Scenarios where vCLS VM deployment could fail, with resolutions:

Not enough free resources in the cluster - Requires 400 MHz of CPU, 400 MB of memory and 2 GB of storage space on a cluster with more than 3 hosts. For more information on the resource requirements for these VMs, see the vCLS VM Resource Allocation section of the vSphere Resource Management Guide. vCLS reserves slots equal to the quorum size of the VMs + 1 per cluster. The cluster must have these additional resources available for the vCLS VMs to deploy successfully.
Deployment failures in 1-node and 2-node vSAN clusters - vCLS VMs fail to deploy on a 1-node or 2-node vSAN cluster with the error: Can't provision VM for ClusterAgent due to lack of suitable datastore. vCLS uses the datastore default policy for datastore selection, so if vSAN is the only available datastore within the cluster, the default policy requires a 3-node vSAN cluster and the deployment of these VMs will fail in a smaller cluster. If a 2-node vSAN cluster has a witness node, then deployment of the vCLS VMs succeeds. The workaround is to increase the size of the vSAN cluster or to change the datastore default policy.
Orphaned VM cases - If there are orphaned vCLS VMs in the vCenter Server because of disconnected and reconnected hosts, deployment of new vCLS VMs in such a cluster after adding the host might fail. The suggested workaround is to clean up any stale/orphaned vCLS VMs from the inventory.
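As a rough worked example of the resource figures in the first scenario, the sketch below multiplies per-VM values by the number of reserved slots (quorum size + 1). The per-VM values are taken from the vCLS VM specification table (100 MHz CPU reservation, 100 MB memory reservation, about 480 MB of storage); treating the cluster totals as a simple product of these values is an assumption of this example, and the function and constant names are hypothetical.

```python
# Per-VM figures from the vCLS VM specification table.
PER_VM_CPU_RESERVATION_MHZ = 100
PER_VM_MEMORY_RESERVATION_MB = 100
PER_VM_STORAGE_MB = 480  # approximate, thin-provisioned


def vcls_reserved_resources(quorum_size: int) -> dict:
    """Estimate the resources vCLS reserves for a given quorum size.

    Assumes totals scale linearly with the slot count (quorum size + 1).
    """
    slots = quorum_size + 1
    return {
        "slots": slots,
        "cpu_mhz": slots * PER_VM_CPU_RESERVATION_MHZ,
        "memory_mb": slots * PER_VM_MEMORY_RESERVATION_MB,
        "storage_mb": slots * PER_VM_STORAGE_MB,
    }


# A quorum of 3 vCLS VMs gives 4 slots: 400 MHz, 400 MB, and 1920 MB (about 2 GB).
print(vcls_reserved_resources(3))
```

Under these assumptions, a quorum of 3 lines up with the 400 MHz, 400 MB, and roughly 2 GB quoted above.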

Scenarios where vCLS VM power-on could fail, with resolutions:

Not enough free resources in the cluster.
Power-on of disconnected/orphaned vCLS VMs could fail - If there are orphaned vCLS VMs in vCenter because of disconnected and reconnected hosts, power-on of such orphaned VMs could fail because they are disconnected. The workaround is to manually delete these VMs so that new vCLS VMs are deployed automatically on properly connected hosts/datastores.
Power-on failure due to changes to the configuration of the VMs - If a user changes the configuration of vCLS VMs, power-on of such a VM could fail. Users are not supposed to change any configuration of these VMs.

https://knowledge.broadcom.com/external/article?legacyId=79892

https://knowledge.broadcom.com/external/article/312147

Products

VMware vCenter Server 7.0
VMware vCenter Server 8.0
VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Issue/Introduction

vSphere Cluster Services (vCLS) is a new feature in vSphere 7.0 Update 1. This feature ensures cluster services such as vSphere DRS and vSphere HA are all available to maintain the resources and health of the workloads running in the clusters.

In vSphere 7.0 Update 1, VMware has released a platform/framework intended to let these cluster services run independently of vCenter Server instance availability. In this release, vCenter Server is still required for running cluster services such as vSphere DRS, vSphere HA, etc.

Note: vSphere DRS depends on the health of the vSphere Cluster Services starting with vSphere 7.0 Update 1.

In vSphere 8.0 U3, VMware has released a newer version of vCLS known as Embedded vCLS, which will be used when both vCenter and ESXi are updated to 8.0 U3. The original version of vCLS (prior to 8.0 U3) will be referred to as External vCLS.

Environment

VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x
ESXi 7.0.x
ESXi 8.0.x

Resolution

Related KBs

vSphere Cluster Services

vCLS is a mandatory feature which is deployed on each vSphere cluster when vCenter Server is upgraded to vSphere 7.0 Update 1 or after a fresh deployment of vSphere 7.0 Update 1. The ESXi hosts can be of any older version that is compatible with vCenter Server 7.0 Update 1. For more information, see the vSphere Cluster Services (vCLS) section of the vSphere Resource Management Guide.

As explained in the documentation, there will be 1 to 3 vCLS VMs running on each vSphere cluster depending on the size of the cluster. vSphere DRS in a DRS-enabled cluster depends on the availability of at least 1 vCLS VM. Unlike your workload/application VMs, vCLS VMs should be treated as system VMs. Do not perform any operations on these VMs unless guided by VMware support or unless the operation is explicitly listed as supported in the documentation.

There is no way to disable vCLS on a vSphere cluster and still have vSphere DRS remain functional on that cluster. However, should it be necessary, you can disable vCLS on a cluster by following the Retreat Mode steps, but this will impact some of the cluster services for that cluster.

Reference: How to Disable vCLS on a Cluster via Retreat Mode

This feature has two revisions. The first, introduced in vSphere 7.0 Update 1, is known as "External vCLS". It will be deprecated in future versions of vSphere. The second, introduced in vSphere 8.0 Update 3, is known as "Embedded vCLS". These versions serve the same overall purpose, but have different runtimes, leading to differences in behaviors and supported operations.

Size of the vCLS VMs

vSphere Cluster Service VMs are very small compared to workload VMs. Each consumes 1 vCPU, 128 MB of memory, and about 500 MB of storage. The table below shows the specification of these VMs:

Memory	128 MB
Memory Reservation	100 MB
Swap Size	256 MB
CPU	1
CPU Reservation	100 MHz
Hard Disk	2 GB
Ethernet Adapter	0 (It is a No NIC VM)
VMDK Size	~245 MB (thin disk)
Storage Space	~480 MB (thin disk)

vCLS During Infrastructure Maintenance

Cluster compute maintenance (more details here - Automatic power-off of vCLS VMs during maintenance mode)

When there is only 1 host - vCLS VMs will be automatically powered-off when the single host cluster is put into Maintenance Mode, thus maintenance workflow is not blocked.
When there are 2 or more hosts - In a vSphere cluster with more than 1 host, if the host being considered for maintenance has running vCLS VMs, those vCLS VMs will be migrated to other hosts, provided the other hosts have free resources and storage connectivity (shared storage). If these VMs cannot be migrated because the other hosts lack free resources, or because the VMs are placed on a local datastore, they will be powered off automatically to give preference to the host Maintenance Mode operation. As stated before, vSphere DRS for a cluster will not be functional unless at least 1 vCLS VM is running in that cluster.
If you are decommissioning a cluster, put all the hosts into Maintenance Mode before deleting the cluster so that the vCLS VMs are cleaned up properly. If you delete the cluster without placing the hosts in Maintenance Mode, stale vCLS VMs will keep running on the hosts, causing issues when these hosts with running VMs are added to a new cluster.
Disconnect Host - When a host is disconnected, vCLS VMs are not cleaned up from that host because the disconnected host is not reachable. New vCLS VMs will not be created on the other hosts of the cluster, as it is not clear how long the host will remain disconnected. When the disconnected host is reconnected, the vCLS VM on that host will be registered again in the vCenter inventory. If a disconnected host is removed from inventory, then new vCLS VMs may be created on other hosts of the cluster if quorum is not reached.
Datastore maintenance. For more information, see Impact of vSphere Cluster Services on storage workflows
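The compute maintenance behavior described above can be read as a simple decision rule. The following sketch is only a simplified model of that description, not vCenter's actual logic; the function name, parameter names, and return strings are hypothetical.

```python
def vcls_vm_action_on_maintenance(
    cluster_host_count: int,
    other_hosts_have_free_resources: bool,
    vm_on_shared_storage: bool,
) -> str:
    """Return what happens to a vCLS VM when its host enters Maintenance Mode.

    Simplified model of the behavior described in this article.
    """
    if cluster_host_count <= 1:
        # Single-host cluster: there is nowhere to migrate, so the VM is powered off.
        return "power_off"
    if other_hosts_have_free_resources and vm_on_shared_storage:
        # Another host can run the VM and can reach its storage.
        return "migrate"
    # No free resources elsewhere, or the VM sits on a local datastore.
    return "power_off"
```

In both "power_off" outcomes the Maintenance Mode workflow is not blocked, but DRS is not functional if no vCLS VM remains running in the cluster.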

Other VMware Product Interop

SRM - Planned migration

SRM 8.3.1 is not supported with vSphere 7.0 update.
VMware Aria Operations
Capacity reclaim - The capacity optimization workflow of vRealize Operations Manager might detect vCLS VMs as idle VMs and might include them in the recommendations for reclaiming capacity. If vCLS VMs are deleted as part of the reclaim workflow, the vCLS service will recreate them. For a time, the vCLS status for that cluster might turn unhealthy if DRS runs before the VMs are brought back up. For more information, see the Reclaim section of the VMware Aria Operations Documentation. The recommended option is to exclude these VMs from the capacity reclaim workflow. These VMs can be identified by their names (vCLS) or by looking at additional properties as explained in the documentation.
Cross cluster services - vRealize Operations Manager Workload Placement (WLP) workflows might be impacted if DRS is not functional, due to unhealthy vCLS, on the cluster where WLP is recommending the placement of workloads.

vSAN

Check Datastore maintenance. For more information, see Impact of vSphere Cluster Services on storage workflows

vRealize Automation

vCLS should not impact any partner workflows such as backup, monitoring, etc. Since these VMs are managed by vCLS, there is no reason to configure backup for them, as restoring from backup during a recovery operation is not necessary and might fail. These VMs can be identified using APIs as listed in the “Identifying vCLS VMs” section.
Products/solutions without any interop issues

VMware Cloud Foundation - Cloud Builder and SDDC Manager will not have any impact; vRA, vROps and vSAN impact is addressed above.
NSX Data Center for vSphere
NSX-T Data Center for vSphere
vCPP
vCD
vCDA
VxRail
Horizon Enterprise

Partners Impact

vCLS should not impact any partner workflows such as backup, monitoring, etc. Since these VMs are managed by vCLS, there is no reason to configure backup for them, as restoring from backup during a recovery operation is not necessary and might fail. These VMs can be differentiated via API using the additional properties below.

Extra Config

vm.config.extraConfig["HDCS.agent"] = "true"

This is the most reliable way to identify a general vCLS VM. A previous version of this article also included using the VM's "ManagedByInfo" to identify it as vCLS. This was not incorrect, but it only works for External vCLS, as Embedded vCLS uses a different value. For more extensive information on identifying vCLS VMs and differentiating their type, refer to Script Identification for Embedded vCLS has Changed Identifiers Including ManagedByInfo.
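As a minimal sketch of the extra-config check, the function below scans a VM's extra config entries for the HDCS.agent key. It assumes each entry exposes key and value attributes (the shape of option values returned by SDKs such as pyVmomi); the function name is hypothetical, the case-insensitive comparison is a leniency added for this example, and the sample data uses stand-in objects rather than a live vCenter connection.

```python
from types import SimpleNamespace
from typing import Any, Iterable

VCLS_EXTRA_CONFIG_KEY = "HDCS.agent"


def is_vcls_vm(extra_config: Iterable[Any]) -> bool:
    """Return True if the extra config marks the VM as a vCLS VM.

    extra_config: entries with .key and .value attributes, for example the
    contents of vm.config.extraConfig as fetched through an SDK.
    """
    for option in extra_config:
        # The article documents the value "true"; other casings are
        # accepted here only as an assumption of this sketch.
        if option.key == VCLS_EXTRA_CONFIG_KEY and str(option.value).lower() == "true":
            return True
    return False


# Stand-in data for illustration only.
vcls_like = [SimpleNamespace(key="HDCS.agent", value="true")]
workload_like = [SimpleNamespace(key="svga.present", value="TRUE")]
print(is_vcls_vm(vcls_like), is_vcls_vm(workload_like))
```

A filter like this can be used to exclude vCLS VMs from backup or capacity-reclaim selections.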

Attachments

configure_retreat_mode

configure_retreat_mode_v2

retreatModeConfiguration

https://knowledge.broadcom.com/external/article?legacyId=91890

Products

VMware vCenter Server
VMware vCenter Server 7.0
VMware vCenter Server 8.0

Issue/Introduction

There is no way to disable vCLS on a vSphere cluster and still have vSphere DRS remain functional on that cluster.

However, should it be necessary, you can disable vCLS on a cluster by following the Retreat Mode steps below, but this will impact some of the cluster services for that cluster.

Impact/Risks:

Note: Retreat Mode should be used with extra caution and should be used only for the purposes mentioned in this document. Below are the details of the impacted cluster services due to the enablement of Retreat Mode on a cluster:

vSphere DRS will not function on that cluster if DRS is enabled for that cluster. That means workloads running inside that cluster are not load-balanced, hence will not be migrated to different hosts within the cluster when the current host running that VM is running out of resources. When a user wants to take down a host for maintenance, running VMs will not be automatically migrated to other hosts within that cluster.
vSphere HA will not perform optimal placement during a host failure scenario as HA depends on DRS for placement recommendations. HA will still power-on the VMs but these VMs might be powered on in a less optimal host.

Environment

VMware vCenter Server 8.0.x
VMware vCenter Server 7.0.x

Resolution

Retreat Mode Steps

Note: Starting in vSphere 7.0 U3o and 8.0 U2, entering Retreat Mode is now available as a Cluster setting within the vCenter Server UI.

vSphere 7.0 U3o/8.0 U2 and Later

Log into vCenter's HTML5 client
In Hosts and Clusters inventory, select a cluster.
Click on the Configure tab.
Under vSphere Cluster Services, select General.
In the top right, click on EDIT VCLS MODE.

In the Edit vCLS Mode pop up window, click on the second radio option Retreat Mode.

Click OK.

For Versions Prior to vSphere 7.0 U3o and 8.0 U2, Using the vSphere Client

Log in to the vSphere Client.
Navigate to the cluster on which vCLS should be disabled. Copy the cluster domain id from the URL of the browser. It should be similar to 'domain-c<number>', not the entire string.

Note: You only need to copy the domain-c<number> part of the URL. For example, when you navigate to the cluster in the vSphere Client, the URL will be similar to this:

https://<fqdn-of-vCenter-server>/ui/app/cluster;nav=h/urn:vmomi:ClusterComputeResource:domain-c1006:ce4a7b9f-768c-2222-3333-############/summary

In this example, you only need to copy domain-c1006 to use in the steps below.
Using other values, for example the cluster UUID, or a combination of the cluster ID and the UUID, will result in vpxd failing to start when you next restart it. Therefore please be careful to only use the ID domain-c<number>.
If you have already added the wrong value by accident, causing vpxd to no longer start, you can remove the vCLS Retreat Mode settings from the vpxd.cfg configuration file. Take a backup of vpxd.cfg, then run the following command:

# sed '/<vcls>/,/<\/vcls>/d' -i /etc/vmware-vpx/vpxd.cfg

This will remove the retreat mode settings for all of the clusters in this vCenter, but it will allow vpxd to start again.

Navigate to the vCenter Server and then to the Configure tab.
Click the Advanced Settings section and then the Edit Settings button.
Add a new entry with name = config.vcls.clusters.domain-c<number>.enabled and value = False.

Note: True and False are case insensitive, so any case of these two values should be accepted.
Click Save.
The vCLS monitoring service will initiate the clean-up of vCLS VMs, and you will start to see VM deletion tasks.
If this cluster has DRS enabled, then DRS will not be functional and an additional warning will be displayed in the cluster summary. DRS will be disabled until vCLS is re-enabled on this cluster.
To remove Retreat Mode from the cluster, change the value to True in step 5 above.

Note: True and False are case insensitive, so any case of these two values should be accepted.
Once you configure retreat mode on a cluster, the entry for the cluster will stay in the vCenter Advanced Settings. There is no way to delete this entry from the vSphere Client, but keeping the entry causes no issues.
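Because using anything other than domain-c<number> in the setting name can prevent vpxd from starting, it may help to derive the name programmatically. The sketch below extracts the cluster ID from a vSphere Client URL and builds the advanced setting name from the steps above. The function names are hypothetical and the URL pattern is assumed to match the example shown in this article; it only produces strings and does not change any vCenter setting.

```python
import re

# Matches the domain-c<number> part that follows "ClusterComputeResource:".
_CLUSTER_ID_IN_URL = re.compile(r"ClusterComputeResource:(domain-c\d+)(?=[:/]|$)")


def extract_cluster_domain_id(url: str) -> str:
    """Return only the domain-c<number> part of a vSphere Client cluster URL."""
    match = _CLUSTER_ID_IN_URL.search(url)
    if match is None:
        raise ValueError("no cluster ID of the form domain-c<number> found in URL")
    return match.group(1)


def retreat_mode_setting_name(domain_id: str) -> str:
    """Build the advanced setting name, rejecting anything but domain-c<number>."""
    if re.fullmatch(r"domain-c\d+", domain_id) is None:
        raise ValueError("cluster ID must be exactly domain-c<number>")
    return f"config.vcls.clusters.{domain_id}.enabled"


# vcenter.example.com is a placeholder host name.
example_url = (
    "https://vcenter.example.com/ui/app/cluster;nav=h/"
    "urn:vmomi:ClusterComputeResource:domain-c1006:"
    "ce4a7b9f-768c-2222-3333-############/summary"
)
cluster_id = extract_cluster_domain_id(example_url)
print(retreat_mode_setting_name(cluster_id))
```

The strict full match in retreat_mode_setting_name is deliberate: a value that includes the cluster UUID is rejected instead of being turned into a setting name.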

Using APIs/CLIs

Use the attached retreatModeConfiguration.py script to configure retreat mode on multiple clusters on the VC.

Usage: python retreatModeConfiguration.py -r disable or python retreatModeConfiguration.py -r enable

Identifying vCLS VMs

In the vSphere Client UI, vCLS VMs are named vCLS (<number>), where the number field is auto-generated. All vCLS VMs within a Datacenter are visible in the VMs and Templates tab of the vSphere Client, inside a VMs and Templates folder named vCLS.

If you click on the summary of these VMs, you will see a banner which reads vSphere Cluster Service VM is required to maintain the health of vSphere Cluster Services. Power state and resource of this VM is managed by vSphere Cluster Services, along with a Learn More link which takes you to the KB article.
Using vSphere Managed Object Browser (MOB)

Identifying all the vCLS VMs for a given datacenter

Sample MOB query examples:

Replace the IP address and moid in these sample queries with those of a vCLS VM:

https://<IP address>/mob/?moid=vm-1004&doPath=config.managedBy
https://<IP address>/mob/?moid=vm-1004&doPath=config.extraConfig%5b%22HDCS.agent%22%5d
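The sample queries above can also be generated with a small helper that percent-encodes the property path. This is a sketch only: the function name and the host value are hypothetical, and the URL layout is copied from the samples. Python's quote() emits uppercase hex digits (%5B rather than %5b), which is equivalent under percent-encoding rules.

```python
from urllib.parse import quote


def mob_property_url(host: str, moid: str, property_path: str) -> str:
    """Build a MOB URL that opens one property of a managed object."""
    # safe="" forces characters such as [ ] and " to be percent-encoded.
    encoded_moid = quote(moid, safe="")
    encoded_path = quote(property_path, safe="")
    return f"https://{host}/mob/?moid={encoded_moid}&doPath={encoded_path}"


# vcenter.example.com and vm-1004 are placeholders for a real host and vCLS VM.
print(mob_property_url("vcenter.example.com", "vm-1004", "config.managedBy"))
print(mob_property_url("vcenter.example.com", "vm-1004", 'config.extraConfig["HDCS.agent"]'))
```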