[ClusterLabs] Q: SLES11 SP4: Node (cmirrord) refuses to join cluster (long)

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon May 29 07:29:12 UTC 2017


Hi!

I have a problem that isn't making any progress with SUSE support, so I want to ask here:
We have a three-node cluster that has had a stable configuration for a long time. When we updated one node (stop the cluster software on the node, update the software, reboot, start the cluster software during boot), the node failed to join the cluster (some CPG problem, it seems). While this was going on, the other two nodes refused to provide cLVM/OCFS2 service, and the software relying on it (Xen VMs) seemed to hang. I had to emergency-reboot the node that seemed to trigger the problem, and on the second boot everything was fine...

More than three months have passed since then, and today I was hit by the same problem again.

Has anyone else seen this problem? The cluster uses UDPU, since multicast proved to be even worse.
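
For context, the totem transport section of corosync.conf looks roughly like the sketch below. This is a reconstruction from what the logs show (UDPU transport, rrp_mode passive, ring 0 on 172.20.16.0/24, ring 1 on 10.2.2.0/24), not a verbatim copy of our file; the bind addresses and port are illustrative assumptions.

totem {
        version: 2
        transport: udpu
        # matches "totem/rrp_mode = 'passive'" seen in the cluster-dlm log below
        rrp_mode: passive
        interface {
                ringnumber: 0
                # assumed network/port, for illustration only
                bindnetaddr: 172.20.16.0
                mcastport: 5405
                member {
                        memberaddr: 172.20.16.1
                }
                member {
                        memberaddr: 172.20.16.5
                }
                member {
                        memberaddr: 172.20.16.10
                }
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.2.2.0
                mcastport: 5405
                member {
                        memberaddr: 10.2.2.1
                }
                member {
                        memberaddr: 10.2.2.5
                }
                member {
                        memberaddr: 10.2.2.10
                }
        }
}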

For reference, here is the status (when the cluster is OK):
Last updated: Mon May 29 09:23:52 2017
Last change: Tue May 23 16:05:16 2017 by root via cibadmin on h01
Stack: classic openais (with plugin)
Current DC: h05 - partition with quorum
Version: 1.1.12-f47ea56
3 Nodes configured, 3 expected votes
46 Resources configured


Online: [ h01 h05 h10 ]

Full list of resources:

 prm_stonith_sbd        (stonith:external/sbd): Started h05
 Clone Set: cln_DLM [prm_DLM]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_cLVMd [prm_cLVMd]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_LVM_CFS_VMs [prm_LVM_CFS_VMs]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_O2CB [prm_O2CB]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_CFS_VMs_fs [prm_CFS_VMs_fs]
     Started: [ h01 h05 h10 ]
 prm_xen_v01    (ocf::heartbeat:Xen):   Started h01
 prm_xen_v07    (ocf::heartbeat:Xen):   Started h01
 prm_xen_v05    (ocf::heartbeat:Xen):   Started h01
 prm_xen_v03    (ocf::heartbeat:Xen):   Started h01
 prm_xen_v04    (ocf::heartbeat:Xen):   Started h05
 prm_xen_v02    (ocf::heartbeat:Xen):   Started h01
 prm_xen_v06    (ocf::heartbeat:Xen):   Started h05
 prm_xen_v08    (ocf::heartbeat:Xen):   Started h05
 prm_xen_v09    (ocf::heartbeat:Xen):   Started h01
 Clone Set: cln_iotw_xen-LV [prm_iotw_xen-LV]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_iotw_xen-FS [prm_iotw_xen-FS]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_iotw_xen-LV-L1 [prm_iotw_xen-LV-L1]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_iotw_xen-LV-L2 [prm_iotw_xen-LV-L2]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_iotw_xen-ML [prm_iotw_xen-ML]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_iotw_xen-ML-L1 [prm_iotw_xen-ML-L1]
     Started: [ h01 h05 h10 ]
 Clone Set: cln_iotw_xen-ML-L2 [prm_iotw_xen-ML-L2]
     Started: [ h01 h05 h10 ]

Node Attributes:
* Node h01:
    + xen_migration_ip                  : 172.20.78.1
* Node h05:
    + xen_migration_ip                  : 172.20.78.5
* Node h10:
    + xen_migration_ip                  : 172.20.78.10

Migration summary:
* Node h01:
* Node h10:
* Node h05:

This is where the node should have joined the cluster (as seen from node 1 of 3, h01):
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] New Configuration:
May 29 08:30:07 h01 corosync[10873]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:30:07 h01 corosync[10873]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] Members Left:
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] Members Joined:
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 3444: memb=2, new=0, lost=0
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: pcmk_peer_update: memb: h01 739512321
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: pcmk_peer_update: memb: h05 739512325
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] New Configuration:
May 29 08:30:07 h01 corosync[10873]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:30:07 h01 corosync[10873]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:30:07 h01 cluster-dlm[11711]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h01 cluster-dlm[11711]: update_cluster: Processing membership 3444
May 29 08:30:07 h01 ocfs2_controld[12424]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h01 corosync[10873]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h01 crmd[10883]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h01 cluster-dlm[11711]: dlm_process_node: Skipped active node 739512325: born-on=3420, last-seen=3444, this-event=3444, last-event=3440
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] Members Left:
May 29 08:30:07 h01 corosync[10873]:  [CLM   ] Members Joined:
May 29 08:30:07 h01 corosync[10873]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 3444: memb=3, new=1, lost=0
May 29 08:30:07 h01 cluster-dlm[11711]: dlm_process_node: Adding address ip(172.20.16.10) to configfs for node 739512330
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: update_member: Node 739512330/h10 is now: member
May 29 08:30:07 h01 cluster-dlm[11711]: add_configfs_node: set_configfs_node 739512330 172.20.16.10 local 0
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: pcmk_peer_update: NEW:  h10 739512330
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: pcmk_peer_update: MEMB: h01 739512321
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: pcmk_peer_update: MEMB: h05 739512325
May 29 08:30:07 h01 cib[10878]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: pcmk_peer_update: MEMB: h10 739512330
May 29 08:30:07 h01 cluster-dlm[11711]: dlm_process_node: Adding address ip(10.2.2.10) to configfs for node 739512330
May 29 08:30:07 h01 cluster-dlm[11711]: add_configfs_node: set_configfs_node 739512330 10.2.2.10 local 0
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: send_member_notification: Sending membership update 3444 to 5 children
May 29 08:30:07 h01 cluster-dlm[11711]: create_path: /sys/kernel/config/dlm/cluster/comms/739512330: mkdir failed: 17
May 29 08:30:07 h01 cib[10878]:   notice: crm_update_peer_state: cib_peer_update_callback: Node h10[739512330] - state is now lost (was member)
May 29 08:30:07 h01 stonith-ng[10879]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h01 cib[10878]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h01 cluster-dlm[11711]: dlm_process_node: Added active node 739512330: born-on=3436, last-seen=3444, this-event=3444, last-event=3440
May 29 08:30:07 h01 corosync[10873]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 29 08:30:07 h01 stonith-ng[10879]:   notice: crm_update_peer_state: st_peer_update_callback: Node h10[739512330] - state is now lost (was member)
May 29 08:30:07 h01 cluster-dlm[11711]: dlm_process_node: Skipped active node 739512321: born-on=3428, last-seen=3444, this-event=3444, last-event=3440
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: update_member: 0x6aa5b0 Node 739512330 (h10) born on: 3444
May 29 08:30:07 h01 stonith-ng[10879]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: update_member: Node h10 now has process list: 00000000000000000000000000151312 (1381138)
May 29 08:30:07 h01 corosync[10873]:  [pcmk  ] info: send_member_notification: Sending membership update 3444 to 5 children
May 29 08:30:08 h01 cmirrord[12926]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:30:08 h01 cmirrord[12926]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:30:08 h01 cmirrord[12926]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
[...and so on ...]
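
For readers not familiar with the message: cpg_mcast_joined() is the corosync CPG call cmirrord uses to multicast its mirror-log requests to the process group, and TRY_AGAIN (printed above in its SA_AIS_* spelling) means the CPG/totem layer is not ready to accept the message yet, e.g. during a membership change or synchronization. The lines come from a retry loop of roughly the shape sketched below; this is a generic sketch of the pattern, not the actual cmirrord source, and the back-off and retry cap are made up for illustration:

#include <stdio.h>
#include <unistd.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* Multicast one message to the joined CPG group, retrying while the
 * CPG layer reports TRY_AGAIN.  This mirrors the retry pattern behind
 * the "Retry #N of cpg_mcast_joined" messages above. */
static cs_error_t mcast_with_retry(cpg_handle_t handle,
                                   const void *buf, size_t len)
{
        struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
        cs_error_t rc;
        int retries = 0;

        do {
                rc = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
                if (rc == CS_ERR_TRY_AGAIN) {
                        retries++;
                        fprintf(stderr,
                                "Retry #%d of cpg_mcast_joined: TRY_AGAIN\n",
                                retries);
                        usleep(1000);   /* brief back-off before retrying */
                }
        } while (rc == CS_ERR_TRY_AGAIN && retries < 100);

        return rc;
}

The point here is that cmirrord on h01 (and, as shown below, on h05) kept hitting TRY_AGAIN for minutes while the third node was trying to join.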

On the DC node (h05):
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] New Configuration:
May 29 08:30:07 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:30:07 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] Members Left:
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] Members Joined:
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 3444: memb=2, new=0, lost=0
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: memb: h01 739512321
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: memb: h05 739512325
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] New Configuration:
May 29 08:30:07 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:30:07 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:30:07 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] Members Left:
May 29 08:30:07 h05 corosync[24255]:  [CLM   ] Members Joined:
May 29 08:30:07 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 3444: memb=3, new=1, lost=0
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: update_member: Node 739512330/h10 is now: member
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: NEW:  h10 739512330
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: MEMB: h01 739512321
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: MEMB: h05 739512325
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: MEMB: h10 739512330
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: send_member_notification: Sending membership update 3444 to 5 children
May 29 08:30:07 h05 cib[24260]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h05 corosync[24255]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 29 08:30:07 h05 cluster-dlm[24772]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h05 ocfs2_controld[24901]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h05 cluster-dlm[24772]: update_cluster: Processing membership 3444
May 29 08:30:07 h05 cluster-dlm[24772]: dlm_process_node: Skipped active node 739512325: born-on=3420, last-seen=3444, this-event=3444, last-event=3440
May 29 08:30:07 h05 crmd[24265]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h05 cluster-dlm[24772]: dlm_process_node: Skipped active node 739512321: born-on=3428, last-seen=3444, this-event=3444, last-event=3440
May 29 08:30:07 h05 stonith-ng[24261]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h05 cluster-dlm[24772]: dlm_process_node: Adding address ip(172.20.16.10) to configfs for node 739512330
May 29 08:30:07 h05 cluster-dlm[24772]: add_configfs_node: set_configfs_node 739512330 172.20.16.10 local 0
May 29 08:30:07 h05 cluster-dlm[24772]: dlm_process_node: Adding address ip(10.2.2.10) to configfs for node 739512330
May 29 08:30:07 h05 cluster-dlm[24772]: add_configfs_node: set_configfs_node 739512330 10.2.2.10 local 0
May 29 08:30:07 h05 cluster-dlm[24772]: create_path: /sys/kernel/config/dlm/cluster/comms/739512330: mkdir failed: 17
May 29 08:30:07 h05 cluster-dlm[24772]: dlm_process_node: Added active node 739512330: born-on=3436, last-seen=3444, this-event=3444, last-event=3440
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: update_member: 0x6b59c0 Node 739512330 (h10) born on: 3444
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: update_member: Node h10 now has process list: 00000000000000000000000000151312 (1381138)
May 29 08:30:07 h05 corosync[24255]:  [pcmk  ] info: send_member_notification: Sending membership update 3444 to 5 children
May 29 08:30:07 h05 stonith-ng[24261]:   notice: crm_update_peer_state: st_peer_update_callback: Node h10[739512330] - state is now lost (was member)
May 29 08:30:07 h05 stonith-ng[24261]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h05 cib[24260]:   notice: crm_update_peer_state: cib_peer_update_callback: Node h10[739512330] - state is now lost (was member)
May 29 08:30:07 h05 cib[24260]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h05 crmd[24265]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
May 29 08:30:07 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:30:07 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:30:07 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:30:07 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:30:07 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
[...and so on...]

On the rebooted node (h10):
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] adding new UDPU member {172.20.16.1}
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] adding new UDPU member {172.20.16.5}
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] adding new UDPU member {172.20.16.10}
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] The network interface [10.2.2.10] is now up.
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] adding new UDPU member {10.2.2.1}
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] adding new UDPU member {10.2.2.5}
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] adding new UDPU member {10.2.2.10}
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] New Configuration:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Left:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Joined:
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 3440: memb=0, new=0, lost=0
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] New Configuration:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Left:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Joined:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 3440: memb=1, new=1, lost=0
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: NEW:  h10 739512330
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: MEMB: h10 739512330
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Node h10 now has process list: 00000000000000000000000000151312 (1381138)
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 29 08:30:07 h10 corosync[17254]:  [CPG   ] chosen downlist: sender r(0) ip(172.20.16.10) r(1) ip(10.2.2.10) ; members(old:0 left:0)
May 29 08:30:07 h10 corosync[17254]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 29 08:30:07 h10 attrd[17262]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
May 29 08:30:07 h10 attrd[17262]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] New Configuration:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Left:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Joined:
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 3444: memb=1, new=0, lost=0
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: memb: h10 739512330
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] New Configuration:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Left:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ] Members Joined:
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:30:07 h10 corosync[17254]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 3444: memb=3, new=2, lost=0
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Creating entry for node 739512321 born on 3444
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Node 739512321/unknown is now: member
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: NEW:  .pending. 739512321
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Creating entry for node 739512325 born on 3444
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Node 739512325/unknown is now: member
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: NEW:  .pending. 739512325
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: MEMB: .pending. 739512321
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: MEMB: .pending. 739512325
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_peer_update: MEMB: h10 739512330
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: send_member_notification: Sending membership update 3444 to 0 children
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: 0x6950d0 Node 739512330 ((null)) born on: 3444
May 29 08:30:07 h10 corosync[17254]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: 0x7fad3c001140 Node 739512325 (h05) born on: 3420
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: 0x7fad3c001140 Node 739512325 now known as h05 (was: (null))
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Node h05 now has process list: 00000000000000000000000000151312 (1381138)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Node h05 now has 1 quorum votes (was 0)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: send_member_notification: Sending membership update 3444 to 0 children
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: 0x6a4e80 Node 739512321 (h01) born on: 3428
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: 0x6a4e80 Node 739512321 now known as h01 (was: (null))
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Node h01 now has process list: 00000000000000000000000000151312 (1381138)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_member: Node h01 now has 1 quorum votes (was 0)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: update_expected_votes: Expected quorum votes 2 -> 3
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: send_member_notification: Sending membership update 3444 to 0 children
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 cib[17259]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 pengine[17263]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
May 29 08:30:07 h10 stonith-ng[17260]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
May 29 08:30:07 h10 attrd[17262]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_ipc: Recorded connection 0x6a6cb0 for attrd/17262
May 29 08:30:07 h10 stonith-ng[17260]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 attrd[17262]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:07 h10 attrd[17262]:   notice: main: Starting mainloop...
May 29 08:30:07 h10 stonith-ng[17260]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_ipc: Recorded connection 0x6b1260 for stonith-ng/17260
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_ipc: Sending membership update 3444 to stonith-ng
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 mgmtd: [17265]: info: Pacemaker-mgmt Git Version: 236fee1
May 29 08:30:07 h10 mgmtd: [17265]: WARN: Core dumps could be lost if multiple dumps occur.
May 29 08:30:07 h10 mgmtd: [17265]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
May 29 08:30:07 h10 mgmtd: [17265]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
May 29 08:30:07 h10 lrmd[17261]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
May 29 08:30:07 h10 mgmtd: [17265]: notice: connect to lrmd: 0, rc=-107
May 29 08:30:07 h10 stonith-ng[17260]:   notice: crm_update_peer_state: st_peer_update_callback: Node h10[739512330] - state is now lost (was (null))
May 29 08:30:07 h10 stonith-ng[17260]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:07 h10 crmd[17264]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
May 29 08:30:07 h10 crmd[17264]:   notice: main: CRM Git Version: f47ea56
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
May 29 08:30:07 h10 cib[17259]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 29 08:30:07 h10 cib[17259]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_ipc: Recorded connection 0x6b5df0 for cib/17259
May 29 08:30:07 h10 corosync[17254]:  [pcmk  ] info: pcmk_ipc: Sending membership update 3444 to cib
May 29 08:30:07 h10 cib[17259]:   notice: crm_update_peer_state: cib_peer_update_callback: Node h10[739512330] - state is now lost (was (null))
May 29 08:30:07 h10 cib[17259]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:07 h10 cib[17259]:   notice: plugin_handle_membership: Membership 3444: quorum acquired
May 29 08:30:07 h10 cib[17259]:   notice: crm_update_peer_state: cib_peer_update_callback: Node h01[739512321] - state is now lost (was (null))
May 29 08:30:07 h10 cib[17259]:   notice: crm_update_peer_state: plugin_handle_membership: Node h01[739512321] - state is now member (was lost)
May 29 08:30:07 h10 cib[17259]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:07 h10 cib[17259]:   notice: crm_update_peer_state: cib_peer_update_callback: Node h05[739512325] - state is now lost (was (null))
May 29 08:30:07 h10 cib[17259]:   notice: crm_update_peer_state: plugin_handle_membership: Node h05[739512325] - state is now member (was lost)
May 29 08:30:08 h10 stonith-ng[17260]:   notice: setup_cib: Watching for stonith topology changes
May 29 08:30:08 h10 stonith-ng[17260]:   notice: plugin_handle_membership: Membership 3444: quorum acquired
May 29 08:30:08 h10 stonith-ng[17260]:   notice: crm_update_peer_state: st_peer_update_callback: Node h01[739512321] - state is now lost (was (null))
May 29 08:30:08 h10 stonith-ng[17260]:   notice: crm_update_peer_state: plugin_handle_membership: Node h01[739512321] - state is now member (was lost)
May 29 08:30:08 h10 stonith-ng[17260]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was lost)
May 29 08:30:08 h10 stonith-ng[17260]:   notice: crm_update_peer_state: st_peer_update_callback: Node h05[739512325] - state is now lost (was (null))
May 29 08:30:08 h10 stonith-ng[17260]:   notice: crm_update_peer_state: plugin_handle_membership: Node h05[739512325] - state is now member (was lost)
May 29 08:30:08 h10 crmd[17264]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 29 08:30:08 h10 crmd[17264]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:08 h10 corosync[17254]:  [pcmk  ] info: pcmk_ipc: Recorded connection 0x6ba170 for crmd/17264
May 29 08:30:08 h10 corosync[17254]:  [pcmk  ] info: pcmk_ipc: Sending membership update 3444 to crmd
May 29 08:30:08 h10 crmd[17264]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:08 h10 crmd[17264]:   notice: plugin_handle_membership: Membership 3444: quorum acquired
May 29 08:30:08 h10 crmd[17264]:   notice: crm_update_peer_state: plugin_handle_membership: Node h01[739512321] - state is now member (was (null))
May 29 08:30:08 h10 crmd[17264]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now member (was (null))
May 29 08:30:08 h10 crmd[17264]:   notice: crm_update_peer_state: plugin_handle_membership: Node h05[739512325] - state is now member (was (null))
May 29 08:30:08 h10 crmd[17264]:   notice: do_started: The local CRM is operational
May 29 08:30:08 h10 crmd[17264]:   notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
May 29 08:30:08 h10 mgmtd: [17265]: info: Started.
May 29 08:30:08 h10 SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
May 29 08:30:08 h10 SuSEfirewall2: using default zone 'ext' for interface eth0
May 29 08:30:08 h10 SuSEfirewall2: using default zone 'ext' for interface eth1
May 29 08:30:08 h10 SuSEfirewall2: using default zone 'ext' for interface eth2
May 29 08:30:08 h10 SuSEfirewall2: using default zone 'ext' for interface eth3
May 29 08:30:09 h10 SuSEfirewall2: using default zone 'ext' for interface eth4
May 29 08:30:09 h10 SuSEfirewall2: using default zone 'ext' for interface eth5
May 29 08:30:10 h10 crmd[17264]:   notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
May 29 08:30:10 h10 attrd[17262]:   notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
May 29 08:30:10 h10 attrd[17262]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
May 29 08:30:10 h10 attrd[17262]:   notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
May 29 08:30:11 h10 SuSEfirewall2: batch committing...
May 29 08:30:11 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_stonith_sbd_monitor_0: not running (node=h10, call=6, rc=7, cib-update=13, confirmed=true)
May 29 08:30:11 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_DLM_monitor_0: not running (node=h10, call=11, rc=7, cib-update=14, confirmed=true)
May 29 08:30:11 h10 SuSEfirewall2: Firewall rules successfully set
May 29 08:30:11 h10 LVM(prm_LVM_CFS_VMs)[18708]: WARNING: LVM Volume CFS_VMs is not available (stopped)
May 29 08:30:11 h10 o2cb(prm_O2CB)[18738]: INFO: configfs not loaded
May 29 08:30:11 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_cLVMd_monitor_0: not running (node=h10, call=16, rc=7, cib-update=17, confirmed=true)
May 29 08:30:11 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_LVM_CFS_VMs_monitor_0: not running (node=h10, call=21, rc=7, cib-update=18, confirmed=true)
May 29 08:30:11 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_O2CB_monitor_0: not running (node=h10, call=26, rc=7, cib-update=19, confirmed=true)
May 29 08:30:11 h10 Filesystem(prm_CFS_VMs_fs)[18753]: WARNING: Couldn't find device [/dev/CFS_VMs/xen]. Expected /dev/??? to exist
May 29 08:30:11 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_CFS_VMs_fs_monitor_0: not running (node=h10, call=31, rc=7, cib-update=23, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v07_monitor_0: not running (node=h10, call=39, rc=7, cib-update=25, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v05_monitor_0: not running (node=h10, call=43, rc=7, cib-update=26, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v01_monitor_0: not running (node=h10, call=35, rc=7, cib-update=29, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v03_monitor_0: not running (node=h10, call=47, rc=7, cib-update=30, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v02_monitor_0: not running (node=h10, call=55, rc=7, cib-update=33, confirmed=true)
May 29 08:30:12 h10 attrd[17262]:   notice: attrd_perform_update: Sent update 3: shutdown=0
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v04_monitor_0: not running (node=h10, call=51, rc=7, cib-update=34, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v09_monitor_0: not running (node=h10, call=59, rc=7, cib-update=36, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v06_monitor_0: not running (node=h10, call=63, rc=7, cib-update=37, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_xen_v08_monitor_0: not running (node=h10, call=67, rc=7, cib-update=38, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_iotw_xen-LV_monitor_0: not running (node=h10, call=72, rc=7, cib-update=43, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_iotw_xen-FS_monitor_0: not running (node=h10, call=77, rc=7, cib-update=44, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_iotw_xen-LV-L1_monitor_0: not running (node=h10, call=82, rc=7, cib-update=45, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_iotw_xen-LV-L2_monitor_0: not running (node=h10, call=87, rc=7, cib-update=46, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_iotw_xen-ML-L1_monitor_0: not running (node=h10, call=98, rc=7, cib-update=51, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_iotw_xen-ML_monitor_0: not running (node=h10, call=93, rc=7, cib-update=52, confirmed=true)
May 29 08:30:12 h10 crmd[17264]:   notice: process_lrm_event: Operation prm_iotw_xen-ML-L2_monitor_0: not running (node=h10, call=103, rc=7, cib-update=53, confirmed=true)
May 29 08:30:12 h10 attrd[17262]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
May 29 08:30:12 h10 attrd[17262]:   notice: attrd_perform_update: Sent update 7: probe_complete=true
May 29 08:30:12 h10 kernel: [  194.613196] sctp: Hash tables configured (established 65536 bind 65536)
May 29 08:30:12 h10 kernel: [  194.626425] DLM installed
May 29 08:30:12 h10 cluster-dlm[19180]: main: dlm_controld master started
May 29 08:30:12 h10 cluster-dlm[19180]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 29 08:30:12 h10 cluster-dlm[19180]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:12 h10 cluster-dlm[19180]:   notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
May 29 08:30:12 h10 corosync[17254]:  [pcmk  ] info: pcmk_notify: Enabling node notifications for child 19180 (0x6be4f0)
May 29 08:30:12 h10 cluster-dlm[19180]: setup_misc_devices: found /dev/misc/dlm-control minor 54
May 29 08:30:12 h10 cluster-dlm[19180]: setup_misc_devices: found /dev/misc/dlm-monitor minor 53
May 29 08:30:12 h10 cluster-dlm[19180]: setup_misc_devices: found /dev/misc/dlm_plock minor 52
May 29 08:30:12 h10 cluster-dlm[19180]: setup_monitor: /dev/misc/dlm-monitor fd 11
May 29 08:30:12 h10 cluster-dlm[19180]: update_comms_nodes: /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
May 29 08:30:12 h10 cluster-dlm[19180]: clear_configfs_spaces: /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
May 29 08:30:12 h10 cluster-dlm[19180]: detect_protocol: totem/rrp_mode = 'passive'
May 29 08:30:12 h10 cluster-dlm[19180]: set_configfs_protocol: set protocol 1
May 29 08:30:12 h10 cluster-dlm[19180]: set_proc_rmem: set /proc/sys/net/core/rmem_default 4194304
May 29 08:30:12 h10 cluster-dlm[19180]: set_proc_rmem: set /proc/sys/net/core/rmem_max 4194304
May 29 08:30:22 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:30:32 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:30:33 h10 login[18729]: FAILED LOGIN 1 FROM /dev/tty1 FOR root, User not known to the underlying authentication module
May 29 08:30:42 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:30:52 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
[here I decided to shut down openais]
May 29 08:30:53 h10 sshd[19608]: Accepted keyboard-interactive/pam for root from 172.20.64.67 port 57512 ssh2
May 29 08:31:02 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:31:12 h10 corosync[17254]:  [SERV  ] Unloading all Corosync service engines.
May 29 08:31:12 h10 corosync[17254]:  [pcmk  ] notice: pcmk_shutdown: Shuting down Pacemaker
[...]
[as this seemed to hang, I tried a reboot via Ctrl+Alt+Del]
May 29 08:32:12 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:32:22 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:32:32 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:32:42 h10 corosync[17254]:  [pcmk  ] notice: pcmk_shutdown: Still waiting for crmd (pid=17264, seq=6) to terminate...
May 29 08:32:42 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:32:51 h10 shutdown[22154]: shutting down for system reboot
May 29 08:32:51 h10 init: Switching to runlevel: 6
[...]
May 29 08:33:32 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
May 29 08:33:42 h10 corosync[17254]:  [pcmk  ] notice: pcmk_shutdown: Still waiting for crmd (pid=17264, seq=6) to terminate...
May 29 08:33:42 h10 lrmd[17261]:  warning: child_timeout_callback: prm_DLM_stop_0 process (PID 20736) timed out
May 29 08:33:42 h10 lrmd[17261]:  warning: operation_finished: prm_DLM_stop_0:20736 - timed out after 120000ms
May 29 08:33:42 h10 crmd[17264]:    error: process_lrm_event: Operation prm_DLM_stop_0: Timed Out (node=h10, call=104, timeout=120000ms)
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_cs_dispatch: Update relayed from h05
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_cs_dispatch: Update relayed from h05
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-prm_DLM (1496039622)
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_perform_update: Sent update 19: last-failure-prm_DLM=1496039622
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_cs_dispatch: Update relayed from h05
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_cs_dispatch: Update relayed from h05
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-prm_DLM (1496039622)
May 29 08:33:42 h10 attrd[17262]:   notice: attrd_perform_update: Sent update 21: last-failure-prm_DLM=1496039622
May 29 08:33:42 h10 cluster-dlm[19180]: setup_cpg_daemon: daemon cpg_join error retrying
[here fencing kicked in]

Back on the DC (h05):
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 pengine[24264]:  warning: unpack_rsc_op_failure: Processing failed op stop for prm_DLM:1 on h10: unknown error (1)
May 29 08:33:42 h05 pengine[24264]:  warning: unpack_rsc_op_failure: Processing failed op stop for prm_DLM:1 on h10: unknown error (1)
May 29 08:33:42 h05 pengine[24264]:  warning: pe_fence_node: Node h10 will be fenced because of resource failure(s)
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
[...]
May 29 08:33:42 h05 pengine[24264]:  warning: stage6: Scheduling Node h10 for STONITH
May 29 08:33:42 h05 pengine[24264]:   notice: native_stop_constraints: Stop of failed resource prm_DLM:1 is implicit after h10 is fenced
[...]
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 crmd[24265]:   notice: do_te_invoke: Processing graph 1810 (ref=pe_calc-dc-1496039622-1956) derived from /var/lib/pacemaker/pengine/pe-warn-30.bz2
May 29 08:33:42 h05 crmd[24265]:   notice: te_fence_node: Executing reboot fencing operation (160) on h10 (timeout=240000)
May 29 08:33:42 h05 stonith-ng[24261]:   notice: handle_request: Client crmd.24265.56d8e329 wants to fence (reboot) 'h10' with device '(any)'
May 29 08:33:42 h05 stonith-ng[24261]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for h10: c1f89edc-fe28-47b5-9890-1c3e893ced7f (0)
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 pengine[24264]:  warning: process_pe_message: Calculated Transition 1810: /var/lib/pacemaker/pengine/pe-warn-30.bz2
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
[...]
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 sbd: [24857]: info: Delivery process handling /dev/disk/by-id/dm-name-SBD_1-E3
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 sbd: [24857]: info: Writing reset to node slot h10
May 29 08:33:42 h05 sbd: [24857]: info: Messaging delay: 180
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 sbd: [24858]: info: Delivery process handling /dev/disk/by-id/dm-name-SBD_1-E4
May 29 08:33:42 h05 sbd: [24858]: info: Writing reset to node slot h10
May 29 08:33:42 h05 sbd: [24858]: info: Messaging delay: 180
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:42 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
[...]
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] New Configuration:
May 29 08:33:54 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:33:54 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] Members Left:
May 29 08:33:54 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] Members Joined:
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 3448: memb=2, new=0, lost=1
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: memb: h01 739512321
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: memb: h05 739512325
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: lost: h10 739512330
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] CLM CONFIGURATION CHANGE
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] New Configuration:
May 29 08:33:54 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)
May 29 08:33:54 h05 corosync[24255]:  [CLM   ]  r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)
May 29 08:33:54 h05 ocfs2_controld[24901]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now lost (was member)
May 29 08:33:54 h05 cluster-dlm[24772]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now lost (was member)
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cluster-dlm[24772]: update_cluster: Processing membership 3448
May 29 08:33:54 h05 cib[24260]:   notice: crm_update_peer_state: cib_peer_update_callback: Node h10[739512330] - state is now lost (was member)
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] Members Left:
May 29 08:33:54 h05 crmd[24265]:   notice: peer_update_callback: Node return implies stonith of h10 (action 160) completed
May 29 08:33:54 h05 cluster-dlm[24772]: dlm_process_node: Skipped active node 739512325: born-on=3420, last-seen=3448, this-event=3448, last-event=3444
May 29 08:33:54 h05 stonith-ng[24261]:   notice: crm_update_peer_state: st_peer_update_callback: Node h10[739512330] - state is now lost (was member)
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 corosync[24255]:  [CLM   ] Members Joined:
May 29 08:33:54 h05 crmd[24265]:   notice: crm_update_peer_state: plugin_handle_membership: Node h10[739512330] - state is now lost (was member)
May 29 08:33:54 h05 cluster-dlm[24772]: dlm_process_node: Skipped active node 739512321: born-on=3428, last-seen=3448, this-event=3448, last-event=3444
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 3448: memb=2, new=0, lost=0
May 29 08:33:54 h05 cluster-dlm[24772]: del_configfs_node: del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/739512330"
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: MEMB: h01 739512321
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: pcmk_peer_update: MEMB: h05 739512325
May 29 08:33:54 h05 cluster-dlm[24772]: dlm_process_node: Removed inactive node 739512330: born-on=3444, last-seen=3444, this-event=3448, last-event=3444
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: ais_mark_unseen_peer_dead: Node h10 was not seen in the previous transition
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: update_member: Node 739512330/h10 is now: lost
May 29 08:33:54 h05 corosync[24255]:  [pcmk  ] info: send_member_notification: Sending membership update 3448 to 5 children
[??? 5 children???]
May 29 08:33:54 h05 corosync[24255]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 kernel: [1622638.010108] dlm: closing connection to node 739512330
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 corosync[24255]:  [CPG   ] chosen downlist: sender r(0) ip(172.20.16.1) r(1) ip(10.2.2.1) ; members(old:2 left:0)
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 cmirrord[25158]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May 29 08:33:54 h05 corosync[24255]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 29 08:33:54 h05 crmd[24265]:   notice: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
May 29 08:33:54 h05 crmd[24265]:   notice: abort_transition_graph: Transition aborted by deletion of transient_attributes[@id='h10']: Transient attribute change (cib=1.150.115, source=te_update_diff:395, path=/cib/status/node_state[@id='h10']/transient_attributes[@id='h10'], 0)
May 29 08:33:54 h05 crmd[24265]:   notice: crm_update_quorum: Updating quorum status to true (call=2080)
May 29 08:33:54 h05 attrd[24263]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
May 29 08:33:54 h05 attrd[24263]:   notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
May 29 08:33:54 h05 attrd[24263]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
[...from here the system behaved sanely again...]

Regards,
Ulrich