
vCloud Director vApp power on failure due to vim.fault.HAErrorsAtDest


When I was trying to power on a vApp with 12 VMs in vCloud Director, the power-on operation failed because one of the VMs could not be powered on on the ESXi host, with the error "The host is reporting errors in its attempts to provide vSphere HA support":
++++++++++++++++
Underlying system error: com.vmware.vim.binding.vim.fault.HAErrorsAtDest
vCenter Server task (moref: task-689) failed in vCenter Server 'TEST-VC1' (73dc8fb7-28d6-41b3-86dd-09126c88aebe).
- The host is reporting errors in its attempts to provide vSphere HA support.
+++++++++++++++

I searched for the fault message vim.fault.HAErrorsAtDest and found the following information on http://pubs.vmware.com/:
http://pubs.vmware.com/vsphere-6-5/index.jsp?topic=/com.vmware.wssdk.apiref.doc/vim.fault.HAErrorsAtDest.html 
Fault Description 
The destination compute resource is HA-enabled, and HA is not running properly. This will cause the following problems: 
1) The VM will not have HA protection. 
2) If this is an intracluster VMotion, HA will not be properly informed that the migration completed. 
This can have serious consequences for the functioning of HA.
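In API terms, this is the fault a vSphere client receives when the destination host's HA (FDM) agent is unhealthy. Below is a minimal pyVmomi sketch of how the same fault can surface on a power-on attempt; the vCenter address, credentials and VM name are placeholders rather than values from this environment, and depending on DRS placement the fault may actually be raised by the underlying relocate task instead of the power-on itself.
++++++++++++++++
# Minimal sketch (assumptions: placeholder vCenter, credentials and VM name)
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only; use proper certificates in production
si = SmartConnect(host='TEST-VC1', user='administrator@vsphere.local',
                  pwd='********', sslContext=ctx)

# Locate the VM by name with a container view (any inventory lookup works here)
view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == 'TEST-VM1')
view.Destroy()

try:
    # PowerOnVM_Task is asynchronous; WaitForTask re-raises the task's fault
    WaitForTask(vm.PowerOnVM_Task())
except vim.fault.HAErrorsAtDest as fault:
    # The same fault vCloud Director wraps as "Underlying system error"
    print('Power on failed, HA is unhealthy on the destination host:', fault.msg)
finally:
    Disconnect(si)
++++++++++++++++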

The error indicates that something had gone wrong with vSphere HA, so I went to the ESXi host on which this VM was trying to power on to find out why the operation failed.
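Before digging into the logs, the host's current HA agent state can also be read straight from the API. A hedged sketch, reusing the si connection from the earlier example and assuming the host name:
++++++++++++++++
from pyVmomi import vim

# Sketch only: reuses 'si' from the earlier example; the host name is illustrative
view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == 'TEST-ESX50')
view.Destroy()

# runtime.dasHostState is populated only when the host is in an HA-enabled cluster
fdm = host.runtime.dasHostState
if fdm is None:
    print('vSphere HA is not configured on this host')
else:
    # Typical values: 'master', 'connectedToMaster', 'fdmUnreachable',
    # 'networkPartitionedFromMaster', 'election', ...
    print('HA agent state on', host.name, ':', fdm.state)
++++++++++++++++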

From the tasks and events of the host TEST-ESX50, I found that the host was not communicating properly with the HA master node and kept flapping between the Slave and Unreachable states (a sketch for pulling these events through the API follows the listing).
++++++++++++++++++++++++++++++++++++++++++++++++++++++
11/22/2017 21:12 | The vSphere HA availability state of this host has changed to Unreachable 
11/22/2017 21:12 | vSphere HA agent on host TEST-ESX50 connected to the vSphere HA master on host TEST-ESX55 
11/22/2017 21:12 | The vSphere HA availability state of this host has changed to Slave 
11/22/2017 21:12 | vSphere HA agent is healthy 
11/22/2017 21:12 | Successfully restored access to volume 5540a613-1683deb8-5622-0025b552073e (TEST-Datastore2) following connectivity issues. 
11/22/2017 21:12 | Lost access to volume 56d03783-18ad0808-f7bc-0025b552073e (TEST-Datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 
11/22/2017 21:13 | The vSphere HA availability state of this host has changed to Unreachable 
11/22/2017 21:13 | vSphere HA agent on host TEST-ESX50 connected to the vSphere HA master on host TEST-ESX55 
11/22/2017 21:13 | The vSphere HA availability state of this host has changed to Slave 
11/22/2017 21:13 | vSphere HA agent is healthy 
11/22/2017 21:14 | The vSphere HA availability state of this host has changed to Unreachable 
11/22/2017 21:14 | Successfully restored access to volume 56d03785-18ad0808-f7bc-0025b552073e (TEST-Datastore1) following connectivity issues. 
11/22/2017 21:14 | Lost access to volume 56d037d2-b860fed8-5593-0025b559073e (TEST-Datastore3) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 
11/22/2017 21:14 | Successfully restored access to volume 56d037d2-b860fed8-5593-0025b559073e (TEST-Datastore3) following connectivity issues. 
11/22/2017 21:14 | Lost access to volume 546b6e62-30bd5d48-be59-0025c552073e (TEST-Datastore5) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 
11/22/2017 21:14 | DRS migrated TEST-VM1 (58c44bb8-bb81-4207-9ffa-b25de465b79c) from TEST-ESX46
11/22/2017 21:14 | Successfully restored access to volume 546b6e62-30bd5d48-be59-0025c552073e (TEST-Datastore5) following connectivity issues. 
11/22/2017 21:14 | Lost access to volume 546b6ea1-cdbde74a-8cf1-0025b552073e (TEST-Datastore10) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 
11/22/2017 21:14 | vSphere HA agent on host TEST-ESX50 connected to the vSphere HA master on host TEST-ESX55 
11/22/2017 21:14 | The vSphere HA availability state of this host has changed to Slave 
11/22/2017 21:14 | vSphere HA agent is healthy 
++++++++++++++++++++++++++++++++++++++++++++++++
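The same events can also be pulled through the API instead of the client. A short sketch using the EventManager, reusing si and the host object from the sketches above:
++++++++++++++++
from pyVmomi import vim

# Sketch only: list recent events recorded against the single host found above
spec = vim.event.EventFilterSpec(
    entity=vim.event.EventFilterSpec.ByEntity(entity=host, recursion='self'))
for event in si.content.eventManager.QueryEvents(spec):
    print(event.createdTime, '|', event.fullFormattedMessage)
++++++++++++++++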
From the fdm.log of the ESXi host TEST-ESX50 (SLAVE):
++++++++++++++++++++++++++++++++++++++++++++++++
2017-11-22T21:13:34.526Z TEST-ESX50 Fdm: error fdm[3E75B70] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] [ClusterPersistence::VersionChange] Fetch timeout of 2 seconds from host host-44,192.168.1.100 for version [1] 16698 
2017-11-22T21:13:34.526Z TEST-ESX50 Fdm: verbose fdm[3E75B70] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] [ClusterManagerImpl::AddBadIP] IP 192.168.1.100 marked bad for reason Unreachable IP 
2017-11-22T21:13:34.526Z TEST-ESX50 Fdm: info fdm[3E75B70] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] [ClusterPersistence::VersionChange] fetching version[1] 16698 from host-44,192.168.1.100 
2017-11-22T21:13:34.528Z TEST-ESX50 Fdm: info fdm[3E34B70] [Originator@6876 sub=Cluster] [ClusterManagerImpl::NewClusterConfig] version 16698 
2017-11-22T21:13:34.529Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Cluster] [ClusterManagerImpl::Uncompress] Uncompressed from size 8135 to size 116147 
2017-11-22T21:13:34.535Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Cluster] [ClusterManagerImpl::UpdatePersistentObject] name clusterconfig version (16698 ?> 16697) force false 
2017-11-22T21:13:34.536Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Invt] [InventoryManagerImpl::Handle(ClusterConfigNotification)] Processing cluster config 
2017-11-22T21:13:34.536Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Invt] [InventoryManagerImpl::UpdateAgentVms] Number of required agents vms changed to 0. 
2017-11-22T21:13:34.536Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Simulator] [Processing cluster config 
2017-11-22T21:13:34.536Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Simulator] numPowerOpsPerMinute=0, numResOpsPerMinute=0, sendInterval=0 
2017-11-22T21:13:34.536Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Simulator] waitTime=60000, numPowerOps=0, numResOps=0 
2017-11-22T21:13:34.536Z TEST-ESX50 Fdm: verbose fdm[3E34B70] [Originator@6876 sub=Cluster] Processing cluster config 
2017-11-22T21:13:34.536Z TEST-ESX50 Fdm: info fdm[3FBAB70] [Originator@6876 sub=Cluster opID=SWI-1b53be18] [ClusterManagerImpl::StoreDone] Wrote cluster-config version 16698 
2017-11-22T21:13:43.540Z TEST-ESX50 Fdm: verbose fdm[3E75B70] [Originator@6876 sub=Election opID=SWI-60b7acd9] CheckVersion: Version[2] Other host GT : 821635 > 821634 
2017-11-22T21:13:43.540Z TEST-ESX50 Fdm: verbose fdm[3E75B70] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] [ClusterPersistence::VersionChange] version[2] 821635 from host-44,192.168.1.100 
2017-11-22T21:13:43.540Z TEST-ESX50 Fdm: info fdm[3E75B70] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] [ClusterPersistence::VersionChange] fetching version[2] 821635 from host-44,192.168.1.100 
2017-11-22T21:13:44.541Z TEST-ESX50 Fdm: info fdm[3E75B70] [Originator@6876 sub=Election opID=SWI-60b7acd9] Slave timed out 
2017-11-22T21:13:44.541Z TEST-ESX50 Fdm: info fdm[3E75B70] [Originator@6876 sub=Election opID=SWI-60b7acd9] [ClusterElection::ChangeState] Slave => Startup : Lost master 
2017-11-22T21:13:44.541Z TEST-ESX50 Fdm: info fdm[3E75B70] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] Change state to Startup:0
+++++++++++++++++++++++++++++++++++++++++++++++
From the fdm.log of the ESXi host TEST-ESX45 (MASTER):
++++++++++++++++++++++++++++++++++++++++++++++++
2017-11-22T21:12:07.619Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Starting datastore heartbeat checking for slave host-32 
2017-11-22T21:12:08.620Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Heartbeat still pending for slave @ host-32 
2017-11-22T21:12:09.622Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Heartbeat still pending for slave @ host-32 
2017-11-22T21:12:10.625Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Heartbeat still pending for slave @ host-32 
2017-11-22T21:12:11.627Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Heartbeat still pending for slave @ host-32 
2017-11-22T21:12:12.630Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Heartbeat still pending for slave @ host-32 
2017-11-22T21:12:13.631Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Heartbeat still pending for slave @ host-32 
2017-11-22T21:12:14.632Z TEST-ESX45 Fdm: error fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Timeout for slave @ host-32 
2017-11-22T21:12:14.632Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Marking slave host-32 as unreachable 
2017-11-22T21:12:14.632Z TEST-ESX45 Fdm: verbose fdm[43EAB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Beginning ICMP pings every 1000000 microseconds to host-32 
2017-11-22T21:12:14.635Z TEST-ESX45 Fdm: info fdm[483AB70] [Originator@6876 sub=Invt opID=SWI-33316784] [HostStateChange::SaveToInventory] host host-32 changed state: FDMUnreachable 
2017-11-22T21:12:14.635Z TEST-ESX45 Fdm: verbose fdm[483AB70] [Originator@6876 sub=Vmcp opID=SWI-33316784] Canceling VM reservation for non-live host host-32 
2017-11-22T21:12:14.635Z TEST-ESX45 Fdm: verbose fdm[483AB70] [Originator@6876 sub=PropertyProvider opID=SWI-33316784] RecordOp ASSIGN: slave["host-32"], fdmService. Applied change to temp map. 
2017-11-22T21:12:15.448Z TEST-ESX45 Fdm: verbose fdm[48BCB70] [Originator@6876 sub=Cluster] FDM_SM_SLAVE_MSG with id host-32 (192.168.1.36) 
++++++++++++++++++++++++++++++++++++++++++++++++
So it is clear that the slave host was frequently losing communication with the HA master. The vApp VM from vCloud Director was attempting to power on right when the master declared this host unreachable ("Marking slave host-32 as unreachable").

Reconfiguring vSphere HA on the cluster fixed the HA issue, and after the reconfiguration the power-on operation completed successfully.
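For reference, the "Reconfigure for vSphere HA" action in the client corresponds to the ReconfigureHostForDAS_Task operation in the vSphere API (in pyVmomi it should be callable by the same name). A hedged sketch that reconfigures the HA agent on every host of the cluster; the cluster name is a placeholder and si is reused from the earlier sketches:
++++++++++++++++
from pyVim.task import WaitForTask
from pyVmomi import vim

# Sketch only: the cluster name is a placeholder; 'si' is reused from above
view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'TEST-Cluster')
view.Destroy()

for esx in cluster.host:
    # Equivalent of right-clicking the host and choosing "Reconfigure for vSphere HA"
    WaitForTask(esx.ReconfigureHostForDAS_Task())
    print('Reconfigured vSphere HA on', esx.name)
++++++++++++++++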

Read: http://pubs.vmware.com/vsphere-6-5/index.jsp?topic=/com.vmware.wssdk.apiref.doc/vim.fault.HAErrorsAtDest.html 
https://docs.vmware.com/en/VMware-vSphere/6.5/vsphere-esxi-vcenter-server-65-troubleshooting-guide.pdf (section: Troubleshooting Availability)
