[ClusterLabs] hanging after node shutdown

Thu Sep 10 06:54:54 EDT 2015

Hi

For a few weeks now I have been trying to get a cluster using pacemaker
to work. We are using Ubuntu 14.04.2 LTS with
corosync 2.3.3-1ubuntu1
pacemaker 1.1.10+git2013

It is a 2 node cluster and it includes a gfs2 file system on top of
drbd.

After some initial problems with stonith not working because dlm_stonith
was missing (which I fixed by compiling it myself), it looked good. I
have set up the cluster to power off the other node through stonith
instead of rebooting it, which is the default.

I tested failures by running init 0, halt -f, and pkill -9 corosync on
one node, and that worked fine. But then I found that after the cluster
had been up (both nodes) for 2 days, running init 0 on one node resulted
in that node hanging during shutdown and the other node failing to
stonith it. After I forced the hanging node to power off and powered it
on again, pcs status on it reported that it could not talk to the other
node, and all resources were stopped. On the other node (which had been
running the whole time) pcs status hangs (crm status works and says that
everything is up) and the gfs2 file system is blocked. Running init 0 on
this node never shuts it down; reboot -f does work, and after the node
is up again the entire cluster is OK.

So in short, everything works fine after a fresh boot of both nodes,
but after 2 days a requested shutdown of one node (using init 0) hangs
and the other node stops working correctly.

Looking at the console on the node I ran init 0 on, dlm_controld reports
that the cluster is down, then drbd reports problems talking to the
other node, and then gfs2 is reported as blocked. So that is why that
node never powers off - gfs2 and drbd were not shut down correctly by
pacemaker before it stopped (or while it is trying to stop).

Looking through the logs (syslog and corosync.log; I have debug mode
enabled in corosync), I can see that node 1 (the one I left running the
whole time) logs:

stonith-ng:     info: crm_update_peer_proc:     pcmk_cpg_membership: Node node2[2] - corosync-cpg is now offline
crmd:     info: crm_update_peer_proc:     pcmk_cpg_membership: Node node2[2] - corosync-cpg is now offline
crmd:     info: peer_update_callback:     Client node2/peer now has status [offline] (DC=node2)

crmd:   notice: peer_update_callback:     Our peer on the DC is dead

stonith-ng notice: handle_request:   Client stonith-api.10797.41ef3128 wants to fence (off) '2' with device '(any)'
stonith-ng notice: initiate_remote_stonith_op:       Initiating remote operation off for node2: 20f62cf6-90eb-4c53-8da1-30ab048de495 (0)
stonith-ng:     info: stonith_command:  Processed st_fence from stonith-api.10797: Operation now in progress (-115)

corosyncdebug   [TOTEM ] Resetting old ring state
corosyncdebug   [TOTEM ] recovery to regular 1-0
corosyncdebug   [MAIN  ] Member left: r(0) ip(10.10.1.2) r(1) ip(192.168.12.142) 
corosyncdebug   [TOTEM ] waiting_trans_ack changed to 1
corosyncdebug   [TOTEM ] entering OPERATIONAL state.
corosyncnotice  [TOTEM ] A new membership (10.10.1.1:588) was formed. Members left: 2
corosyncdebug   [SYNC  ] Committing synchronization for corosync configuration map access
corosyncdebug   [QB    ] Not first sync -> no action
corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.10.1.1) r(1) ip(192.168.12.140) ; members(old:2 left:1)
corosyncdebug   [CPG   ] chosen downlist: sender r(0) ip(10.10.1.1) r(1) ip(192.168.12.140) ; members(old:2 left:1)
corosyncdebug   [CPG   ] got joinlist message from node 1
corosyncdebug   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01

and a little later most log entries are:
cib:     info: crm_cs_flush:     Sent 0 CPG messages  (3 remaining, last=25): Try again (6)

The "Sent 0 CPG messages" entry is logged forever until I force a reboot of this node.
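
To pin down when the stall begins, the repeating entry can be picked out of the log with a small script. This is only a minimal sketch: it assumes the lines look exactly like the crm_cs_flush entry quoted above, and the names FLUSH_RE and find_stalls are made up for this illustration.

```python
import re

# Assumed line shape, taken from the entry quoted above:
# cib:     info: crm_cs_flush:     Sent 0 CPG messages  (3 remaining, last=25): Try again (6)
FLUSH_RE = re.compile(
    r"crm_cs_flush:\s+Sent (\d+) CPG messages\s+\((\d+) remaining, last=(\d+)\)"
)


def find_stalls(lines):
    """Return (line_number, remaining, last) for every flush that sent 0 messages."""
    stalls = []
    for number, line in enumerate(lines, start=1):
        match = FLUSH_RE.search(line)
        if match and int(match.group(1)) == 0:
            stalls.append((number, int(match.group(2)), int(match.group(3))))
    return stalls


# Two sample lines copied from the log excerpts in this mail.
sample = [
    "cib:     info: crm_cs_flush:     Sent 0 CPG messages  (3 remaining, last=25): Try again (6)",
    "crmd:   notice: peer_update_callback:     Our peer on the DC is dead",
]
stalls = find_stalls(sample)
```

find_stalls accepts any iterable of lines, so an open log file can be passed directly. The first tuple returned marks the line where the queue stopped draining, which can then be compared with the membership change above.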

On node 2 (the one I ran init 0 on) I can find:
stonith-ng[1415]:   notice: log_operation: Operation 'monitor' [17088] for device 'ipmi-fencing-node1' returned: -201 (Generic Pacemaker error)
several lines from crmd, attrd, pengine about ipmi-fencing
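
To separate the fencing-device failures from the rest of the noise, the stonith-ng entries can be filtered in the same way. Again only a sketch: it assumes the log_operation line format shown above and treats any non-zero return code as a failure, which is an assumption on my part. OP_RE and failed_fence_ops are hypothetical names.

```python
import re

# Assumed line shape, taken from the stonith-ng entry quoted above.
OP_RE = re.compile(
    r"log_operation: Operation '([^']+)' \[(\d+)\] for device '([^']+)' returned: (-?\d+)"
)


def failed_fence_ops(lines):
    """Return (device, operation, return_code) for every non-zero result."""
    failures = []
    for line in lines:
        match = OP_RE.search(line)
        # Assumption: a return code of 0 means the operation succeeded.
        if match and int(match.group(4)) != 0:
            failures.append((match.group(3), match.group(1), int(match.group(4))))
    return failures


# Sample line copied from the log excerpt in this mail.
fence_sample = [
    "stonith-ng[1415]:   notice: log_operation: Operation 'monitor' [17088] "
    "for device 'ipmi-fencing-node1' returned: -201 (Generic Pacemaker error)",
]
fence_failures = failed_fence_ops(fence_sample)
```

Running this over the full log on both nodes would show whether the ipmi-fencing monitor was already failing before the shutdown test or only started failing afterwards.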

It is hard to know which log entries are important.

But as a summary: after power-on my 2 node cluster works fine, and
reboots and other node failure tests all work fine. But after letting
the cluster run for 2 days, when I do a node failure test, parts of the
cluster services fail to stop on the node where the failure is
simulated, and both nodes stop working (even though only one node was
shut down).

The versions of corosync and pacemaker are somewhat old - they are the
official versions available for our Ubuntu release. Is this a known
problem?

I have seen that there are newer versions available, and pacemaker has
had many changes, as I can see on GitHub. If this is a known problem,
which versions of corosync and pacemaker should I try to change to?

Or do you have any other ideas for what I can test or try to pin this down?

    Dan