Can’t console to frozen XenServer host but VMs are still running

Link: http://www.jasonsamuel.com/2011/12/13/cant-console-to-frozen-xenserver-host-but-vms-are-still-running/

Let’s say a host in your pool won’t restart a VM and freezes halfway (that wonderful yellow icon). If you hit the VM’s console tab, it might be blank. If you hit the console tab of the host, it might also be blank. If you SSH in, it may connect, but you can’t run any xe commands. They just sit there. If you attempt to migrate or stop a VM, it hangs. The host is essentially frozen, but the VMs are still running on it just fine.

This is all a pretty good sign that the XAPI service on the host is hung. XAPI (the “XenAPI” toolstack) is the XenServer management toolstack, which controls pretty much everything at the host layer. If it is hosed, XenCenter can’t talk to the host and you probably won’t be able to run any xe commands. Quick way to troubleshoot this:

1. SSH into the host with the issue.

2. Type:

df -h

which will show the disk space usage on the file system. The “-h” switch displays sizes in human-readable units (K, M, G) instead of 1K blocks. Much easier to read. We need to check the root partition and see if it is full. This is typically 4 GB and can be filled up by logs, which may cause the XAPI service to stop. If the XenServer root disk is full, you will probably see the host drop out of XenCenter because XAPI is stopped. You won’t be able to restart the XAPI service until you free up some space. A full root shows 100% in the Use% column for the filesystem mounted on /.

Extra tip: once you log in to one XenServer host, you can check other hosts remotely without having to SSH into each one in a different terminal. Just type:

ssh <RemoteXenServerIPorName> df -h
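If you would rather not eyeball the df output, here is a minimal sketch that pulls out just the root usage figure. The function name root_use_pct and the 90% threshold are made up for illustration. It uses df -P so that each filesystem stays on one line, which makes the output easier to parse.

```shell
# Print the Use% of the filesystem mounted on / as a bare number.
# Reads `df -P` style output on stdin, so it also works on saved output.
root_use_pct() {
    awk 'NR > 1 && $6 == "/" { sub(/%/, "", $5); print $5 }'
}

# Hypothetical threshold; pick whatever suits your environment.
THRESHOLD=90

pct=$(df -P / | root_use_pct)
if [ "${pct:-0}" -ge "$THRESHOLD" ]; then
    echo "root filesystem is ${pct}% full"
fi
```

The same function works on a remote host by piping the output of ssh <RemoteXenServerIPorName> df -P / into it.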

3. If the root is full, type:

cd /var/log

then

ls

to list the logs. Type:

du -sh *

to list the logs with their sizes (a plain * also catches logs that have no extension, such as messages). If you find one that is too big, delete it:

rm <logname>.log

From here you can skip ahead to step 6 and try restarting XAPI.
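The log hunt in step 3 can be sketched as a couple of small helpers. The names largest_entries and empty_log are made up for this example. Note that empty_log is a variation on the rm shown above, not the same thing: it empties the file in place. That can be handy when a daemon still has the log open, since space held by a deleted-but-open file is not released until the process closes it.

```shell
# Print the N largest entries under a directory, biggest first (sizes in KB).
# Defaults: /var/log and the top 10.
largest_entries() {
    dir=${1:-/var/log}
    count=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Empty a log in place instead of deleting it, so the file itself stays put.
empty_log() {
    : > "$1"
}
```

For example, largest_entries /var/log 5 shows the five biggest items, and empty_log /var/log/<logname>.log clears one of them.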

Also, you might want to consider moving your logs to a different volume. If you fill your dom0 root, you’re basically hosing the XenServer host. Citrix has a good article on how to move the /var/log directory to a different volume here:

http://support.citrix.com/article/CTX130245

or retain fewer logs by editing logrotate.conf here:

http://support.citrix.com/article/CTX131619

4. If your root is not full, the next thing you probably want to do is disable HA. You can do this in the XenCenter console or you can just type:

xe pool-ha-disable

or if you want to disable HA on a host (you’ll have to run this on each host though):

xe host-emergency-ha-disable force=true
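Since you have to remember to turn HA back on later, it can help to check whether it was on in the first place. Here is a rough sketch, assuming the pool object exposes an ha-enabled parameter through xe pool-list as described in the xe CLI reference. The function names are made up for illustration.

```shell
# Succeed if HA is currently enabled on the pool.
# Assumes `xe pool-list params=ha-enabled --minimal` prints "true" or "false".
ha_enabled() {
    [ "$(xe pool-list params=ha-enabled --minimal)" = "true" ]
}

# Disable HA only when it is on, and say which case we hit.
disable_ha_if_enabled() {
    if ha_enabled; then
        xe pool-ha-disable && echo "HA disabled"
    else
        echo "HA already off"
    fi
}
```

If XAPI is badly hung, the xe call in ha_enabled may itself sit there, in which case the per-host emergency command above is the fallback.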

5. After disabling HA, restart the toolstack:

xe-toolstack-restart

This will disconnect all the hosts in the pool in XenCenter, but don’t panic. Give it 10-20 seconds. Once the toolstack has restarted, the hosts will all reconnect to XenCenter. All pending actions like reboots, migrations, etc. will stop when the toolstack restarts, so you have a clean slate.
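Rather than counting to 20, you could poll until xe answers again. Below is a small generic retry helper as a sketch. The name wait_until is made up and nothing in it is XenServer-specific.

```shell
# Retry a command once per second until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <command> [args...]
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}
```

Something like wait_until 30 xe host-list would then return once the toolstack responds. Keep in mind that if xe hangs instead of failing, the loop blocks on that call and the attempt count will not save you.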

6. You should be able to console into your host with the issues now. Type:

service xapi status

and see if it is running. If you want to see how taxed XAPI is, type:

top

to see all the running processes. If XAPI is taking up 40% CPU or more, that is a good indication something is hung up on it.

If XAPI is not running or is very taxed, type:

service xapi restart

If it hangs at “Stopping xapi” or “Starting xapi”, you may need to kill the process.

Type:

kill <pid>

using the process ID from when you ran “service xapi status” or “top”. Then run service xapi status again to verify all xapi processes have stopped. Then you can type:

service xapi restart

again if it didn’t automatically try to start already. Eventually it will say:

Starting xapi: ....start-of-day complete.                  [  OK  ]

and you should see the host pop back in your XenCenter console. If you go back and run top, xapi should be taking up around 1% or less CPU.
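The kill step above can be wrapped so that it waits a bit and only gets forceful if the process ignores the polite request. This is a sketch with a made-up function name, stop_pid. The escalation to kill -9 is my addition, not something the steps above call for, so treat it as a last resort.

```shell
# Send SIGTERM to a PID, wait up to N seconds (default 10) for it to exit,
# then fall back to SIGKILL if it is still there.
stop_pid() {
    pid=$1
    wait_secs=${2:-10}
    kill "$pid" 2>/dev/null || return 0          # nothing to do, already gone
    while [ "$wait_secs" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        wait_secs=$((wait_secs - 1))
    done
    kill -9 "$pid" 2>/dev/null                   # still alive: force it
}
```

You would call it as stop_pid <pid> 15 with the process ID reported by service xapi status or top, then carry on with service xapi restart.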

You can type:

xe task-list

to see all the running tasks. There shouldn’t be many at this point. Don’t forget to re-enable HA after you’re done. Hope this helps someone.