[ClusterLabs] Re: hanging after node shutdown

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Sep 10 15:12:01 UTC 2015


>>> dan <dan.oscarsson at intraphone.com> wrote on 10.09.2015 at 12:54 in message
<1441882494.19690.53.camel at intraphone.com>:
> Hi
> 
> I have now for a few weeks been trying to get a cluster using pacemaker
> to work. We are using Ubuntu 14.04.2 LTS with
> corosync 2.3.3-1ubuntu1
> pacemaker 1.1.10+git2013

I don't know how current Ubuntu is, but SLES11 SP4 already ships pacemaker 1.1.12. You may have less trouble with a more recent version (if one is available to you).
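
As a quick sanity check (just a rough sketch, nothing specific to your setup), you can see what is actually installed and running with something like:

  dpkg -l corosync pacemaker | grep ^ii   # package versions on Debian/Ubuntu
  corosync -v                             # version corosync itself reports
  crm_mon --version                       # version pacemaker reports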

> 
> It is a 2 node cluster and it includes a gfs2 file system on top of
> drbd.
> 
> After some initial problems with stonith not working due to dlm_stonith
> being missing (which I fixed by compiling it myself), it looked good. I have
> set up the cluster to power off the other node through stonith instead
> of reboot, which is the default.
> 
> I tested failures by doing init 0, halt -f, and pkill -9 corosync on one
> node and it worked fine. But then I detected that after the cluster had
> been up (both nodes) for 2 days, doing init 0 on one node resulted in
> that node hanging during shutdown and the other node failing to stonith
> it. And after forcing the hanging node to power off and then powering it

Could you find out why? Maybe the cluster node tried a clean stop/migration of resources and was waiting for operations to finish. What's in the logs?
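
As a rough sketch of where I would start (paths assumed from a default Ubuntu install, adjust as needed), pull the stop/fence related entries from around the time of the init 0:

  grep -E 'stonith|fence|stop' /var/log/syslog
  grep -E 'error|warning|pengine' /var/log/corosync/corosync.log

A stop operation that never returns (or a fence attempt that never completes) should stand out there.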

> on, pcs status on it reports not being able to talk to the other node
> and that all resources are stopped. And on the other node (which has been
> running the whole time) pcs status hangs (crm status works and says that
> all is up) and the gfs2 file system is blocking. Doing init 0 on this
> node never shuts it down; a reboot -f does work, and after it is up
> again the entire cluster is ok.

I'm old school and always use "shutdown -h|-r now" ;-)
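
Independent of that, you could try taking the node out of the cluster cleanly before powering it off, so gfs2/drbd are stopped by the cluster while corosync is still up instead of racing the init scripts. A sketch with the pcs tooling you already use:

  pcs cluster standby node2   # optional: move resources off the node first
  pcs cluster stop            # stop pacemaker and corosync on this node
  shutdown -h now

If the shutdown still hangs after a clean "pcs cluster stop", that points at dlm/gfs2 rather than at the shutdown ordering.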

> 
> So in short, everything works fine after a fresh boot of both nodes,
> but after 2 days a requested shutdown of one node (using init 0) hangs
> and the other node stops working correctly.
> 
> Looking at the console on the node I did init 0 on, dlm_controld reports
> that the cluster is down, then that drbd has problems talking to the other
> node, and then that gfs2 is blocked. So that is why that node never
> powers off - gfs2 and drbd were not shut down correctly by pacemaker
> before it stopped (or while it was trying to stop).
> 
> Looking through the logs (syslog and corosync.log) (I have debug mode on
> in corosync) I can see that on node 1 (the one I left running the whole
> time) it does:
> 
> stonith-ng:     info: crm_update_peer_proc:     pcmk_cpg_membership: Node 
> node2[2] - corosync-cpg is now offline
> crmd:     info: crm_update_peer_proc:     pcmk_cpg_membership: Node node2[2] 
> - corosync-cpg is now offline
> crmd:     info: peer_update_callback:     Client node2/peer now has status 
> [offline] (DC=node2)
> 
> crmd:   notice: peer_update_callback:     Our peer on the DC is dead

If the node is actually alive at that time, you have a big configuration or software problem!
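
When that happens it can help to compare what corosync and pacemaker each think the membership is, for example (a sketch, run on the surviving node):

  corosync-cmapctl | grep members   # corosync's view of the membership
  crm_mon -1                        # pacemaker's view of the nodes

If corosync still lists the peer while pacemaker already calls it offline (or the other way round), that narrows down which layer is confused.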

> 
> stonith-ng notice: handle_request:   Client stonith-api.10797.41ef3128 wants 
> to fence (off) '2' with device '(any)'
> stonith-ng notice: initiate_remote_stonith_op:       Initiating remote 
> operation off for node2: 20f62cf6-90eb-4c53-8da1-30ab048de495 (0)
> stonith-ng:     info: stonith_command:  Processed st_fence from 
> stonith-api.10797: Operation now in progress (-115)
> 
> corosyncdebug   [TOTEM ] Resetting old ring state
> corosyncdebug   [TOTEM ] recovery to regular 1-0

Ah, the "good old corosync rings". I just guess there a lots of bugs to be fixed.
I can imaging that when you have NFS or cLVM or GFS or any other CFS on the same net that a corosync ring uses, corosync will go crazy under network load.
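
For what it's worth, a rough corosync.conf sketch of what I mean (the addresses are taken from your log; whether 10.10.1.x really is a dedicated, quiet network is an assumption on my part). Keep ring 0 on a cluster-only network and ring 1 as a backup, and keep the rest of your existing totem settings as they are:

  totem {
      version: 2
      rrp_mode: passive
      # transport, mcastaddr/mcastport etc. stay as in your current config
      interface {
          ringnumber: 0
          bindnetaddr: 10.10.1.0      # dedicated cluster network, no DRBD/GFS2 traffic
      }
      interface {
          ringnumber: 1
          bindnetaddr: 192.168.12.0   # backup ring on the shared network
      }
  }

The important part is that the network your DRBD/GFS2 traffic runs over is not the only path corosync has.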

> corosyncdebug   [MAIN  ] Member left: r(0) ip(10.10.1.2) r(1) 
> ip(192.168.12.142) 
> corosyncdebug   [TOTEM ] waiting_trans_ack changed to 1
> corosyncdebug   [TOTEM ] entering OPERATIONAL state.
> corosyncnotice  [TOTEM ] A new membership (10.10.1.1:588) was formed. 
> Members left: 2
> corosyncdebug   [SYNC  ] Committing synchronization for corosync 
> configuration map access
> corosyncdebug   [QB    ] Not first sync -> no action
> corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.10.1.1) r(1) 
> ip(192.168.12.140) ; members(old:2 left:1)
> corosyncdebug   [CPG   ] chosen downlist: sender r(0) ip(10.10.1.1) r(1) 
> ip(192.168.12.140) ; members(old:2 left:1)
> corosyncdebug   [CPG   ] got joinlist message from node 1
> corosyncdebug   [SYNC  ] Committing synchronization for corosync cluster 
> closed process group service v1.01
> 
> and a little later most log entries are:
> cib:     info: crm_cs_flush:     Sent 0 CPG messages  (3 remaining, 
> last=25): Try again (6)
> 
> The "Sent 0 CPG messages" line is logged forever until I force a reboot of this node.
> 
> 
> On node 2 (the one I did init 0) I can find:
> stonith-ng[1415]:   notice: log_operation: Operation 'monitor' [17088] for 
> device 'ipmi-fencing-node1' returned: -201 (Generic Pacemaker error)
> several lines from crmd, attrd, pengine about ipmi-fencing
> 
> Hard to know what log entries are important.

Yes, I'm still learning, too.
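
One thing that helps me when I don't know what matters: let pacemaker collect and trim the logs itself, e.g. (a sketch; adjust the time window so it covers the failed shutdown):

  crm_report -f "2015-09-08 00:00" -t "2015-09-10 13:00" /tmp/cluster-report

The resulting tarball contains the logs from both nodes plus the CIB, and it is also the easiest thing to attach if you end up filing a bug.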

> 
> But as a summary: after power-on my 2 node cluster works fine; reboots
> and other node failure tests all work fine. But after letting the
> cluster run for 2 days, when I do a node failure test, parts of the cluster
> services fail to stop on the node where the failure is simulated, and both
> nodes stop working (even though only one node was shut down).

I suspect it's not the running time, but the load at some point in these two days.
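
If that's the case, the corosync logs from those two days should show it before anything actually breaks. A sketch of what I would look for (adjust the log path to wherever your corosync.log lives):

  corosync-cfgtool -s    # are both rings still "active with no faults"?
  grep -E 'Retransmit List|FAILED TO RECEIVE|FAULTY' /var/log/corosync/corosync.log

Retransmit lists and ring faults piling up under load would fit the picture much better than plain uptime.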

> 
> The versions of corosync and pacemaker are somewhat old - they are the
> official versions available for our Ubuntu release. Is this a known
> problem?

Can't tell, sorry!

> 
> I have seen that there are newer versions available; pacemaker has had many
> changes, as I can see on GitHub. If this is a known problem, which
> versions of corosync and pacemaker should I try to change to?
> 
> Or do you have some other idea what I can test/try to pin this down?

Regards,
Ulrich

> 
>     Dan
> 
> 
