[ClusterLabs] Antw: hanging after node shutdown
Ulrich.Windl at rz.uni-regensburg.de
Thu Sep 10 11:12:01 EDT 2015
>>> dan <dan.oscarsson at intraphone.com> wrote on 10.09.2015 at 12:54 in message
<1441882494.19690.53.camel at intraphone.com>:
> I have now for a few weeks been trying to get a cluster using pacemaker
> to work. We are using Ubuntu 14.04.2 LTS with
> corosync 2.3.3-1ubuntu1
> pacemaker 1.1.10+git2013
I don't know how current Ubuntu is, but SLES11 SP4 is already at pacemaker 1.1.12. You may have less trouble with a more recent version (if one is available for you).
> It is a 2 node cluster and it includes a gfs2 file system on top of
> After some initial problems with stonith not working due to dlm_stonith
> missing (which I fixed by compiling it myself), it looked good. I have
> set up the cluster to power off the other node through stonith instead
> of rebooting it, which is the default.
> I tested failures by doing init 0, halt -f, pkill -9 corosync on one
> node and it worked fine. But then I detected that after the cluster had
> been up (both nodes) for 2 days, doing init 0 on one node resulted in
> that node hanging during shutdown and the other node failing to stonith
> it. And after forcing the hanging node to power off and then powering it
Could you find out why? Maybe the cluster node tried a clean stop/migration of resources and was waiting for operations to finish. What's in the logs?
> on, doing pcs status on it reports not being able to talk to other node
> and all resources are stopped. And on the other node (which has been
> running the whole time) pcs status hangs (crm status works and says that
> all is up) and the gfs2 file system is blocking. Doing init 0 on this
> node never shuts it down, a reboot -f does work and after it is up
> again the entire cluster is ok.
I'm old school and always use "shutdown -h|-r now" ;-)
> So in short, everything works fine after a fresh boot of both nodes
> but after 2 days a requested shutdown of one node (using init 0) hangs
> and the other node stops working correctly.
> Looking at the console on the node I did init 0 on, dlm_controld reports
> that cluster is down and then that drbd has problems talking to the other
> node, and then that gfs2 is blocked. So that is why that node never
> powers off - gfs2 and drbd were not shut down correctly by pacemaker
> before it stopped (or is trying to stop).
> Looking through the logs (syslog and corosync.log) (I have debug mode on
> corosync) I can see that on node 1 (the one I left running the whole
> time) it does:
> stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node
> node2 - corosync-cpg is now offline
> crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node2
> - corosync-cpg is now offline
> crmd: info: peer_update_callback: Client node2/peer now has status
> [offline] (DC=node2)
> crmd: notice: peer_update_callback: Our peer on the DC is dead
If the node is actually alive at that time, you have a big configuration or software problem!
> stonith-ng notice: handle_request: Client stonith-api.10797.41ef3128 wants
> to fence (off) '2' with device '(any)'
> stonith-ng notice: initiate_remote_stonith_op: Initiating remote
> operation off for node2: 20f62cf6-90eb-4c53-8da1-30ab048de495 (0)
> stonith-ng: info: stonith_command: Processed st_fence from
> stonith-api.10797: Operation now in progress (-115)
> corosync debug [TOTEM ] Resetting old ring state
> corosync debug [TOTEM ] recovery to regular 1-0
Ah, the "good old corosync rings". My guess is that there are still lots of bugs to be fixed.
I can imagine that when you have NFS or cLVM or GFS or any other CFS on the same network that a corosync ring uses, corosync will go crazy under network load.
> corosync debug [MAIN ] Member left: r(0) ip(10.10.1.2) r(1)
> corosync debug [TOTEM ] waiting_trans_ack changed to 1
> corosync debug [TOTEM ] entering OPERATIONAL state.
> corosync notice [TOTEM ] A new membership (10.10.1.1:588) was formed.
> Members left: 2
> corosync debug [SYNC ] Committing synchronization for corosync
> configuration map access
> corosync debug [QB ] Not first sync -> no action
> corosync debug [CPG ] comparing: sender r(0) ip(10.10.1.1) r(1)
> ip(192.168.12.140) ; members(old:2 left:1)
> corosync debug [CPG ] chosen downlist: sender r(0) ip(10.10.1.1) r(1)
> ip(192.168.12.140) ; members(old:2 left:1)
> corosync debug [CPG ] got joinlist message from node 1
> corosync debug [SYNC ] Committing synchronization for corosync cluster
> closed process group service v1.01
> and a little later most log entries are:
> cib: info: crm_cs_flush: Sent 0 CPG messages (3 remaining,
> last=25): Try again (6)
> The "Sent 0 CPG messages" line is logged forever until I force a reboot of this node.
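To see which daemons are stuck and how often, the repeated flush lines can be counted per daemon. The following is only a minimal sketch: the function name stalled_flushes and the regular expression are my own, and it assumes the log lines look like the crm_cs_flush excerpt quoted above (daemon name, severity, function name, then "Sent N CPG messages (M remaining"). Real log lines carry extra prefixes such as timestamps, which the search should skip over, but please check it against your own log before trusting the numbers.

```python
import re

# Assumed line shape, taken from the excerpt quoted above:
#   cib: info: crm_cs_flush: Sent 0 CPG messages (3 remaining, last=25): Try again (6)
FLUSH_RE = re.compile(
    r"(?P<daemon>[\w-]+):\s+\w+:\s+crm_cs_flush:\s+"
    r"Sent (?P<sent>\d+) CPG messages\s+\((?P<remaining>\d+) remaining"
)


def stalled_flushes(lines):
    """Count, per daemon, flush attempts that sent nothing while messages were still queued."""
    counts = {}
    for line in lines:
        m = FLUSH_RE.search(line)
        if m and int(m.group("sent")) == 0 and int(m.group("remaining")) > 0:
            daemon = m.group("daemon")
            counts[daemon] = counts.get(daemon, 0) + 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python stalled.py < /var/log/corosync/corosync.log
    for daemon, count in sorted(stalled_flushes(sys.stdin).items()):
        print(daemon, count)
```

If only cib shows up, the problem is probably local to that daemon; if every pacemaker daemon is listed, that points more towards corosync itself not accepting CPG messages.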
> On node 2 (the one I did init 0) I can find:
> stonith-ng: notice: log_operation: Operation 'monitor' for
> device 'ipmi-fencing-node1' returned: -201 (Generic Pacemaker error)
> several lines from crmd, attrd, pengine about ipmi-fencing
> Hard to know what log entries are important.
Yes, I'm still learning, too.
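One thing that helps me is to drop the debug and info noise first and read only what is left. A rough sketch of that idea follows, under some assumptions: the severity appears as a separate word in each line (as in the pacemaker excerpts above), the first severity word in a line is the real one, and the ordering in SEVERITIES is my own simplification rather than anything taken from the pacemaker sources. The name at_least is made up for this example.

```python
import re

# Assumed ordering, lowest to highest; adjust to what your logs actually contain.
SEVERITIES = ["debug", "info", "notice", "warning", "error", "crit"]
LEVEL_RE = re.compile(r"\b(" + "|".join(SEVERITIES) + r")\b")


def at_least(lines, minimum="notice"):
    """Keep lines whose first severity word is at or above `minimum`."""
    threshold = SEVERITIES.index(minimum)
    kept = []
    for line in lines:
        m = LEVEL_RE.search(line)  # leftmost severity word wins
        if m and SEVERITIES.index(m.group(1)) >= threshold:
            kept.append(line)
    return kept
```

Lines without any recognizable severity word are dropped, so wrapped continuation lines will be lost; it is a first filter, not a replacement for reading the full log around the interesting timestamps.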
> But as a summary: after power on my 2 node cluster works fine, reboots
> and other node failure tests all work fine. But after letting the
> cluster run for 2 days, when I do a node failure test, parts of the cluster
> services fail to stop on the node where the failure is simulated and both nodes
> stop working (even though only one node was shut down).
I suspect it's not the running time, but the load at some point in these two days.
> The version of corosync and pacemaker is somewhat old - it is the
> official version available for our ubuntu version. Is this a known
Can't tell, sorry!
> I have seen that there are newer versions available, pacemaker has many
> changes done as I see on github. If this is a known problem, which
> versions of corosync and pacemaker should I try to change to?
> Or do you have some other idea what I can test/try to pin this down?