Managing Hardware Failures in the VMware Cloud on Dell Environment

The following processes are in place to ensure that hardware failures are detected and rectified:

Failure detection

The VMware Cloud Platform continuously monitors the health of your SDDCs for hardware and software defects. If a hardware failure or service degradation is detected, the monitoring system creates an incident ticket with an appropriate severity and notifies the VMware SRE on-call engineers, who begin working on the incident immediately.
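The detection flow can be pictured as a small function that turns a health event into a ticket. The following Python sketch is illustrative only. The names HealthEvent, IncidentTicket, Severity, and open_incident are hypothetical, and the severity rules are assumptions, not the rules the VMware monitoring system actually applies.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class HealthEvent:
    host: str
    component: str            # for example "disk", "nic", "hypervisor"
    state: str                # "degraded" or "failed"
    workloads_affected: bool


@dataclass(frozen=True)
class IncidentTicket:
    host: str
    summary: str
    severity: Severity


def open_incident(event: HealthEvent) -> IncidentTicket:
    """Map a health event to a ticket (assumed, simplified severity rules)."""
    if event.state == "failed" and event.workloads_affected:
        severity = Severity.HIGH
    elif event.state == "failed" or event.workloads_affected:
        severity = Severity.MEDIUM
    else:
        severity = Severity.LOW
    return IncidentTicket(
        host=event.host,
        summary=f"{event.component} {event.state} on {event.host}",
        severity=severity,
    )
```

For example, open_incident(HealthEvent("esx-01", "disk", "failed", True)) yields a HIGH severity ticket under these assumed rules.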

Failure mitigation

When hardware or software fails partially or completely on an ESXi host, the VMware SRE team adds a spare ESXi host to the cluster so that enough compute and storage capacity remains available to run your workloads. The workload VMs on the affected host are then either migrated to healthy hosts in the cluster using VMware vMotion live migration, with zero downtime, or restarted on healthy hosts.
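The choice between live migration and restart can be sketched as a simple planning step. This Python example is a minimal illustration and does not call any VMware API. The names Move and plan_evacuation are hypothetical, the round-robin placement is a simplification, and the rule that VMs are live-migrated only while the source host is still running is an assumption about how the two options are chosen.

```python
from itertools import cycle
from typing import NamedTuple


class Move(NamedTuple):
    vm: str
    action: str   # "vmotion" (live migration) or "restart"
    target: str   # healthy host that receives the VM


def plan_evacuation(vms, healthy_hosts, source_host_running):
    """Plan where each VM on the affected host should go.

    Assumption: live migration is used only while the source host is
    still running; otherwise the VMs are restarted elsewhere.
    """
    if not healthy_hosts:
        raise ValueError("no healthy hosts available; add a spare host first")
    action = "vmotion" if source_host_running else "restart"
    targets = cycle(healthy_hosts)  # naive round-robin placement
    return [Move(vm, action, next(targets)) for vm in vms]
```

With vms ["vm-a", "vm-b", "vm-c"], healthy hosts ["esx-02", "esx-spare"], and a source host that is still running, the plan live-migrates the three VMs to esx-02, esx-spare, and esx-02 in that order.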

Failure rectification

After the spare ESXi host is added to the cluster, the defective host is removed from the cluster and isolated for troubleshooting. If the issue is hardware related, the VMware SRE team coordinates with the VMware hardware infrastructure partner and the customer to dispatch the replacement parts and to ensure that a partner engineer is present at the customer location when the parts are delivered.
The hardware infrastructure partner engineer replaces the hardware during a maintenance window approved by the customer. After the defective hardware is replaced, the VMware SRE team images the ESXi host to match the software, driver, and firmware versions of the remaining hosts in the ESXi cluster. The SRE team then runs a set of validations to confirm that the repaired host can be added back as a spare ESXi host.
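One of the checks implied above, that the repaired host matches the rest of the cluster, can be sketched as a version comparison. This Python example is an assumption about what such a check might look like, not the SRE team's actual validation. The function name version_mismatches, the dictionary layout, and the placeholder version strings are all hypothetical.

```python
FIELDS = ("software", "driver", "firmware")


def version_mismatches(repaired, cluster_hosts):
    """Return {field: (repaired_value, expected_value)} for each mismatch.

    Each host is a dict with "software", "driver", and "firmware" keys.
    An empty result means the repaired host matches the cluster.
    """
    if not cluster_hosts:
        raise ValueError("no reference hosts to compare against")
    mismatches = {}
    for field in FIELDS:
        expected = {host[field] for host in cluster_hosts}
        if len(expected) != 1:
            # The remaining hosts should already agree with each other.
            raise ValueError(f"cluster hosts disagree on {field}: {sorted(expected)}")
        (value,) = expected
        if repaired[field] != value:
            mismatches[field] = (repaired[field], value)
    return mismatches
```

A repaired host is eligible to return as a spare, in this sketch, only when version_mismatches returns an empty dictionary. Any entry names the field that still needs to be reimaged or updated.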