[ClusterLabs] SLES11/cLVMd/OCFS2: Network problems after node fence cause another node fence

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Mar 26 09:54:03 EDT 2015


Hi!

Today I had a very unpleasant experience where two of three nodes were fenced in a cluster running SLES11 SP3:
I ran "rcopenais stop" on one node while a resource was blocked due to a user mistake, so that node was fenced (as expected).
However, the remaining nodes then reported communication problems (the famous "corosync[6723]:  [TOTEM ] Retransmit List: 2684 2686 2688 2672 2673 2674 2675 2676 2677 2678 2679 267a 267b 267c 267d 267e 267f 2680 2681 2682 2683 2685 2687").
This in turn caused one of the remaining nodes to self-fence:
--
corosync[15624]:  [TOTEM ] Retransmit List: 2683 2685 2687 2691
corosync[15624]:  [TOTEM ] FAILED TO RECEIVE
kernel: [4914102.487131] dlm: writequeue empty for nodeid 739512321
attrd[15664]:    error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)
cluster-dlm[16739]:    error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)
ocfs2_controld[16815]:    error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)
cib[15661]:    error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)
ocfs2_controld[16815]: pacemaker connection died
attrd[15664]:     crit: attrd_cs_destroy: Lost connection to Corosync service!
cluster-dlm[16739]: cluster_dead: cluster is down, exiting
cluster-dlm[16739]: process_cpg_daemon: daemon cpg_dispatch error 2
ocfs2_controld[16815]: pacemaker connection died
cib[15661]:    error: cib_cs_destroy: Corosync connection lost!  Exiting.
ocfs2_controld[16815]: client 1 fd 8 dead
attrd[15664]:   notice: main: Exiting...
cluster-dlm[16739]: loop: shutdown
ocfs2_controld[16815]: Unexpected leave of group 490B9FCAFA3D4B2F9A586A5893E00730
stonith-ng[15662]:    error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)
attrd[15664]:   notice: main: Disconnecting client 0x6145b0, pid=15666...
crmd[15666]:    error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)
ocfs2_controld[16815]: Unexpected leave of group 490B9FCAFA3D4B2F9A586A5893E00730
stonith-ng[15662]:    error: stonith_peer_cs_destroy: Corosync connection terminated
attrd[15664]:    error: attrd_cib_connection_destroy: Connection to the CIB terminated...
crmd[15666]:    error: crmd_cs_destroy: connection terminated
ocfs2_controld[16815]: Group 490B9FCAFA3D4B2F9A586A5893E00730 is live, exiting
lrmd[15663]:    error: crm_ipc_read: Connection to stonith-ng failed
crmd[15666]:   notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
ocfs2_controld[16815]: Group 490B9FCAFA3D4B2F9A586A5893E00730 is live, exiting
lrmd[15663]:    error: mainloop_gio_callback: Connection to stonith-ng[0x616030] closed (I/O condition=17)
lrmd[15663]:    error: stonith_connection_destroy_cb: LRMD lost STONITH connection
lrmd[15663]:  warning: qb_ipcs_event_sendv: new_event_notification (15663-15666-6): Bad file descriptor (9)
lrmd[15663]:  warning: send_client_notify: Notification of client crmd/aedcb919-f34f-41df-acd9-d518e753e176 failed
lrmd[15663]:  warning: send_client_notify: Notification of client crmd/aedcb919-f34f-41df-acd9-d518e753e176 failed
(last message in syslog before reset)
--
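
For what it's worth, retransmit-list storms like the one above are commonly worked around by tuning the totem section of corosync.conf. The snippet below is only an illustrative sketch of the options usually mentioned (token, token_retransmits_before_loss_const, window_size, fail_recv_const); the values shown are examples, not our production settings, and of course tuning does not excuse the underlying bug:

```
# /etc/corosync/corosync.conf -- illustrative totem tuning only;
# values are examples, not recommendations for any specific cluster.
totem {
    version: 2

    # Token timeout in ms; a larger value tolerates short network stalls.
    token: 5000

    # How many token retransmits before a node is declared lost.
    token_retransmits_before_loss_const: 10

    # Max messages per token rotation; lowering this can reduce
    # retransmit storms on congested or lossy links.
    window_size: 50

    # How many token rotations without receiving any of the missing
    # messages before corosync gives up with "FAILED TO RECEIVE".
    fail_recv_const: 2500
}
```

Whether raising fail_recv_const merely delays the failure or actually rides out the disturbance obviously depends on why the messages are being lost in the first place.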

I would very much appreciate it if these bugs in the communication protocol were fixed. The versions in use:
pacemaker-1.1.11-0.7.53
openais-1.1.4-5.19.7
resource-agents-3.9.5-0.34.57
kernel-default-3.0.101-0.40.1

Regards,
Ulrich
