Today one OVS server ran into a problem with ovs-agent and needed a reboot. Since there were VMs running on it, I tried to live migrate the Xen-based VMs with "xm migrate -l", but got the error below:
-bash-3.2# xm migrate -l vm1 server1
Error: can't connect: (111, 'Connection refused')
Usage: xm migrate <Domain> <Host>
Migrate a domain to another machine.
-h, --help Print this help.
-l, --live Use live migration.
-p=portnum, --port=portnum Use specified port for migration.
-n=nodenum, --node=nodenum Use specified NUMA node on target.
-s, --ssl Use ssl connection for migration.
Xen migration goes through xend's relocation server (xend-relocation-server), which listens on xend-relocation-port, so this "Connection refused" error was most likely related to it. Below is the active configuration in /etc/xen/xend-config.sxp:
-bash-3.2# egrep -v '^#|^$' /etc/xen/xend-config.sxp
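Only the relocation settings matter for this problem, so the filter can be narrowed to them. The following is a minimal sketch; the function name relocation_settings is my own and not part of Xen, and it assumes the settings appear as uncommented "(xend-relocation-...)" lines:

```shell
#!/bin/bash
# Print only the active (uncommented) relocation-related settings.
# The default path is the one used in this post; pass another file to override.
# relocation_settings is a hypothetical helper name, not a Xen tool.
relocation_settings() {
    local conf="${1:-/etc/xen/xend-config.sxp}"
    grep -E '^[[:space:]]*\(xend-relocation-' "$conf"
}

# Example: relocation_settings
```

Commented-out defaults are skipped, so whatever it prints is what xend actually uses.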
And to check the processes related to these:
-bash-3.2# lsof -i :8002
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
xend 12095 root 5u IPv4 146473964 TCP *:teradataordbms (LISTEN)
-bash-3.2# ps auxww|egrep '/opt/ovs-agent-2.3/utils/dlm.py|/opt/ovs-agent-2.3/utils/hook_vm_shutdown.py'
root 3501 0.0 0.0 3924 740 pts/0 S+ 08:37 0:00 egrep /opt/ovs-agent-2.3/utils/dlm.py|/opt/ovs-agent-2.3/utils/hook_vm_shutdown.py
root 19007 0.0 0.0 12660 5840 ? D 03:44 0:00 python /opt/ovs-agent-2.3/utils/dlm.py --lock --name vm1 --uuid 56f17372-0a86-4446-8603-d82423c54367
root 27446 0.0 0.0 12664 5956 ? D 05:11 0:00 python /opt/ovs-agent-2.3/utils/dlm.py --lock --name vm2 --uuid eb1a4e84-3572-4543-8b1d-685b856d98c7
Processes in D state (uninterruptible sleep) are troublesome, because they cannot be killed; the only way to get rid of them is to reboot the whole system. However, this server had many VMs running, and live migration/relocation was blocked by the very problem that called for the reboot, so I was in a deadlock. It seemed that rebooting with the VMs still on the host was the only way to "resolve" the issue.
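To spot such processes quickly, the STAT column of ps can be filtered for values beginning with "D". This is only a sketch under my own assumptions: the helper names filter_dstate and list_dstate are made up for illustration, and the wchan column (the kernel function the process is waiting in) assumes a procps-style ps:

```shell
#!/bin/bash
# Keep only rows whose STAT column (2nd field) starts with "D".
# Reads "PID STAT ..." rows on stdin; the header row is dropped as a side effect.
filter_dstate() {
    awk '$2 ~ /^D/'
}

# List every process in uninterruptible sleep, together with the kernel
# function it is waiting in (wchan) and its full command line.
list_dstate() {
    ps -eo pid,stat,wchan:32,args | filter_dstate
}

# Example: list_dstate
```

If the list is not empty and the entries stay there for hours, as the two dlm.py processes did here, kill -9 will not help.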
First, I tried bouncing xend (/etc/init.d/xend restart), but hit the error below, as logged in /var/log/messages:
[2015-11-04 04:39:43 24026] INFO (SrvDaemon:227) Xend stopped due to signal 15.
[2015-11-04 04:39:43 24115] INFO (SrvDaemon:332) Xend Daemon started
[2015-11-04 04:39:43 24115] INFO (SrvDaemon:336) Xend changeset: unavailable.
[2015-11-04 04:40:14 24115] ERROR (SrvDaemon:349) Exception starting xend ((98, 'Address already in use'))
Traceback (most recent call last):
File "/usr/lib/python2.4/site-packages/xen/xend/server/SrvDaemon.py", line 339, in run
File "/usr/lib/python2.4/site-packages/xen/xend/server/relocate.py", line 159, in listenRelocation
hosts_allow = hosts_allow)
File "/usr/lib/python2.4/site-packages/xen/web/tcp.py", line 36, in __init__
File "/usr/lib/python2.4/site-packages/xen/web/connection.py", line 89, in __init__
self.sock = self.createSocket()
File "/usr/lib/python2.4/site-packages/xen/web/tcp.py", line 49, in createSocket
File "<string>", line 1, in bind
error: (98, 'Address already in use')
Later, I realized I could try changing xend-relocation-port, so I made the change below to /etc/xen/xend-config.sxp:
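A change of this kind can be scripted roughly as follows. This is a sketch, not the exact edit I made: the helper name set_relocation_port is hypothetical, and the value 8003 in the example is taken from the port numbers mentioned at the end of this post. A .bak copy of the file is kept so the change is easy to revert.

```shell
#!/bin/bash
# Rewrite the active xend-relocation-port line in a config file,
# leaving commented-out lines alone and keeping a .bak copy.
# Usage: set_relocation_port NEW_PORT [CONFIG_FILE]
# set_relocation_port is a hypothetical helper name, not a Xen tool.
set_relocation_port() {
    local port="$1"
    local conf="${2:-/etc/xen/xend-config.sxp}"
    # In basic regular expressions the parentheses are literal characters.
    sed -i.bak "s/^(xend-relocation-port[[:space:]][[:space:]]*[0-9][0-9]*)/(xend-relocation-port ${port})/" "$conf"
}

# Example (assumed value): set_relocation_port 8003
```

The new port only takes effect once xend is restarted.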
Then I bounced xend:
/etc/init.d/xend stop; /etc/init.d/xend start
PS: bouncing xend does not affect running VMs; I confirmed this by comparing the qemu process list (ps -ef|grep qemu) before and after. A tip here: when Xen commands (xm list and so on) stop working, checking for the "qemu" emulator processes will still give you the list of VMs.
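The qemu tip can be turned into a small helper. This sketch assumes that each VM has a qemu-dm process whose command line carries a "-domain-name NAME" argument, which is how it looked on my hosts but may not hold everywhere (guests without a qemu process will not show up). The function names are my own:

```shell
#!/bin/bash
# Extract the value following "-domain-name" from each command line on stdin.
# Assumption: qemu is started with "-domain-name NAME".
domain_names_from_ps() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "-domain-name") print $(i + 1) }' | sort -u
}

# List VM names from the running qemu processes, without going through xend.
# The [q] trick keeps the grep process itself out of the result.
list_vms_from_qemu() {
    ps -eo args | grep '[q]emu' | domain_names_from_ps
}

# Example: list_vms_from_qemu
```

Running it before and after bouncing xend and comparing the two outputs is a quick way to confirm that no VM went away.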
After this, "xm migrate -l vm1 server1" still failed with the same can't connect: (111, 'Connection refused'). I resolved it by specifying the port explicitly (you may need to stop iptables too):
-bash-3.2# xm migrate -l -p 8002 vm1 server1
Now the live migration went smoothly. After all the VMs were migrated, I changed xend-relocation-port back to 8002 and rebooted the server to clear the processes stuck in D state (uninterruptible sleep).
If you still see "Error: can't connect: (111, 'Connection refused')" after the above workaround, try changing the port back from 8003 to 8002, or from 8003 to 8004, restart iptables, and try again.