VMware: Reconnecting a “Disconnected” VMware ESX Server to Virtual Infrastructure (vpx, vmware-hostd)
A lot of backups failed overnight, which was my first clue that something was wrong in the virtual environment. I checked the monitoring server, and there were no down VMs or alerts coming from the hosts. After logging into the management system through the Virtual Infrastructure Client, I saw that all the virtual machines were running, but about 30 showed “disconnected”. These VMs were displayed in a muted, grayed-out color and could not be managed. There was also a connection-state alarm on our second VMware ESX server.
Image of the Infrastructure Client showing VMs in disconnected state. VM names were replaced with underlines.
View when the malfunctioning ESX server is selected
Below is a snippet of the log showing an error from vpxa:
[2008-10-27 16:41:39.482 'App' 3076446112 verbose] [SchedulePolling] Last stats polling used [0] ms
[2008-10-27 16:41:39.482 'App' 3076446112 verbose] [VpxaHalCnxHostagent] Creating temporary connect spec: localhost:443
[2008-10-27 16:41:39.483 'App' 3076446112 error] [VpxaHalCnxHostagent] Failed to discover namespace: Connection refused
[2008-10-27 16:41:39.483 'App' 3076446112 warning] [VpxaHalCnxHostagent] Could not resolve namespace for authenticating to host agent
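The log is mostly verbose chatter, so it helps to filter it down to the entries that matter. Here is a minimal sketch of how that could be done. The function name vpxa_problems is my own, and the log path is passed in as an argument because it may differ between ESX versions (on my understanding it is /var/log/vmware/vpx/vpxa.log, but treat that as an assumption).

```shell
# Print only the error and warning entries from a vpxa-style log.
# vpxa_problems is a hypothetical helper name, not a VMware tool.
# Usage: vpxa_problems /path/to/vpxa.log
vpxa_problems() {
    # Entries look like: [timestamp 'App' thread-id level] [component] message
    # so match on the level field that closes the first bracket.
    grep -E "'App' [0-9]+ (error|warning)\]" "$1"
}
```

Run against the snippet above, this would keep only the "Failed to discover namespace" and "Could not resolve namespace" lines.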
Comparing the vmware-* process listing against that of a working ESX server showed the difference: on the working server vmware-hostd was running, and on the server having problems it was not.
# ps -auxc | grep vmware
root 2179 0.0 0.0 4256 228 ? S Sep29 0:00 vmware-watchdog
root 2823 0.0 0.1 4268 444 ? S Sep29 0:00 vmware-watchdog
root 2907 0.0 0.2 4256 568 ? S Sep29 0:00 vmware-watchdog
root 5801 0.0 0.4 4204 1144 ? S 16:41 0:00 vmware-watchdog
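Eyeballing the listing works, but the same check can be scripted. This is only a sketch: hostd_state is a name I made up, and it simply reads a process listing from stdin and looks for vmware-hostd as a whole word so the vmware-watchdog lines are not counted.

```shell
# Report whether vmware-hostd appears in a process listing read from stdin.
# hostd_state is a hypothetical helper name.
hostd_state() {
    # -w matches whole words only, so "vmware-watchdog" does not match.
    if grep -qw 'vmware-hostd'; then
        echo "running"
    else
        echo "not running"
    fi
}

# Usage on the host:
#   ps -auxc | hostd_state
```

Fed the listing above, it would report "not running", which is the symptom that explains the "Connection refused" error from vpxa.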
At first I could not determine which init.d script starts vmware-hostd. Update: it is the /etc/init.d/mgmt-vmware script.
Example
# /etc/init.d/mgmt-vmware status
vmware-hostd (pid 6047) is running
Since the system had 30 virtual machines that were running fine but unmanageable, we did not want to risk taking the whole system down. When starting vmware-hostd directly, nohup was used to run it as a background process so that it would stay alive after the SSH session was disconnected.
Restart vmware-hostd
# /etc/init.d/mgmt-vmware restart
or
# nohup /usr/sbin/vmware-hostd &
[1] 6047
# nohup: appending output to `nohup.out'
Re-check vmware processes
# ps -auxc | grep vmware
root 2179 0.0 0.0 4256 228 ? S Sep29 0:00 vmware-watchdog
root 2823 0.0 0.1 4268 444 ? S Sep29 0:00 vmware-watchdog
root 2907 0.0 0.2 4256 568 ? S Sep29 0:00 vmware-watchdog
root 5801 0.0 0.4 4204 1144 ? S 16:41 0:00 vmware-watchdog
root 6047 11.0 19.8 70396 53304 pts/2 R 16:54 0:01 vmware-hostd
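Instead of re-running ps by hand, the re-check could be scripted as a short polling loop. The sketch below is an assumption-laden example, not something from the original incident: wait_for_hostd and the MGMT_SCRIPT variable are names I introduced, and it relies only on the status output shown earlier containing the words "is running".

```shell
# Path taken from this post; override MGMT_SCRIPT if yours differs.
MGMT_SCRIPT=${MGMT_SCRIPT:-/etc/init.d/mgmt-vmware}

# Poll "status" until it reports hostd running, or give up.
# wait_for_hostd is a hypothetical helper name.
# Usage: wait_for_hostd [tries] [seconds-between-tries]
wait_for_hostd() {
    tries=${1:-10}
    delay=${2:-3}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Expected output per the post: "vmware-hostd (pid N) is running"
        if "$MGMT_SCRIPT" status 2>/dev/null | grep -q 'is running'; then
            echo "vmware-hostd is up"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "vmware-hostd did not come up after $tries checks" >&2
    return 1
}
```

Calling wait_for_hostd 20 3 right after the restart would check every three seconds for up to a minute and return non-zero if the agent never comes back, which is a reasonable cue to look at the kill-and-restart route.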
Connection restored to Virtual Infrastructure server!
Afterwards I migrated the virtual machines to the other ESX servers and put server 2 into maintenance mode. The server has generated core dumps, and this is an intermittent problem we have been having. We will re-install VMware ESX on it to see whether that fixes the issues. If not, it will be time to troubleshoot the hardware and OS.
Note: This process resolved the issue without taking down the running VMs on the box.



This saved me hours of hunting, thanks.
Ray said this on February 17, 2009 at 1:42 pm
Very good description. This information helped immediately and saved hours :-)
Anonymous said this on April 1, 2009 at 5:03 am
Thank you ! :-)
Anonymous said this on August 13, 2009 at 6:33 pm
Thanks for the description. Definitely helped me.
One thing to note: the mgmt-vmware restart didn’t work for me at first, so I had to kill the vmware-hostd process (ps -ef | grep vmware-hostd, then kill PID). After that the restart worked beautifully and the connection was restored.
Thanks once again.
Anonymous said this on August 24, 2009 at 2:44 am
a big thanks
Tiago Ramos said this on January 20, 2010 at 3:55 pm