Managing Hardware Failures in the VMware
Cloud on Dell Environment
The following processes are in place to ensure that hardware failures are detected
and rectified:
Failure detection
The VMware Cloud Platform continuously
monitors the system health of your SDDCs for any hardware and software defects. If
there is hardware failure or degradation of services, the monitoring system creates
an incident ticket with the severity marked appropriately. Also, the VMware SRE
on-call engineers are contacted to work on the incidents immediately.
Failure mitigation
When hardware or software fails partially
or completely on any of the ESXi hosts, the VMware SRE team adds a spare ESXi host
to the cluster. The spare ESXi host ensures that the compute and storage capacity
are available to manage your workloads. The workload VMs are then migrated to other
healthy hosts in the cluster using the VMware vMotion live migration with zero
downtime or restarted on other healthy hosts.
Failure rectification
After the spare ESXi host is added to the
cluster, the defective host is removed and isolated for troubleshooting. If the
issue is related to hardware, the VMware SRE team coordinates with the VMware
hardware infrastructure partner and the customer to dispatch the replacement parts
and ensure that a hardware infrastructure partner engineer is present at the
customer location when the replacement parts are delivered.
The hardware infrastructure partner
engineer performs hardware replacements during the maintenance window approved by
the customers. After the defective hardware is replaced, the VMware SRE team images
the ESXi host to match the software, driver, and firmware versions of the remaining
hosts in the VMware ESXi cluster. The SRE team also performs specific validations to
ensure that the repaired host can be added back as a spare ESXi host.