[ClusterLabs] Re: Problem with stonith and starting services

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Jul 6 01:45:56 EDT 2017


>>> Cesar Hernandez <c.hernandez at medlabmg.com> wrote on 03.07.2017 at 09:34 in
message <BD70E5A4-B0B3-431D-B046-2C796574029F at medlabmg.com>:
> Hi
> 
> I have installed a pacemaker cluster with two nodes. The same type of 
> installation has been done many times before and the following error has 
> never appeared. The situation is the following:
> 
> both nodes running cluster services
> stop pacemaker&corosync on node 1
> stop pacemaker&corosync on node 2
> start corosync&pacemaker on node 1

I don't have answers, but questions:
Assuming node1 was DC when stopped: will its CIB still record it as DC after being stopped?
Obviously node1 cannot know about any changes node2 made. And node1, when started, will find that node2 is unexpectedly down, so it will fence it to be sure. node2, when started, will think it is DC also. That might trigger some communication with node1 to find out who is right. AFAIK node1 should win, because it has the longer uptime.
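
As a quick check (assuming the usual Pacemaker command-line tools are
installed on both nodes), you could ask each side what it thinks, e.g.:

  # prints the node the local crmd currently considers DC
  crmadmin -D
  # one-shot overview of node membership and resource status
  crm_mon -1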

> 
> Then node 1 starts, sees node2 down, and fences it, as expected. But the 
> problem comes when node 2 is rebooted and starts the cluster services: 
> sometimes corosync starts, but the pacemaker service starts and then stops. 
> The syslog shows the following error in these cases:
> 
> Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can 
> no longer be respawned, shutting the cluster down.
> Jul  3 09:07:04 node2 pacemakerd[597]:   notice: Shutting down Pacemaker
> 
> The preceding messages show some warnings that I'm not sure are related 
> to the shutdown:
> 
> 
> Jul  3 09:07:04 node2 stonith-ng[604]:   notice: Operation reboot of node2 by 
> node1 for crmd.2413 at node1.608d8118: OK
> Jul  3 09:07:04 node2 crmd[608]:     crit: We were allegedly just fenced by 
> node1 for node1!
> Jul  3 09:07:04 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
> crmd (conn=0x1471800, async-conn=0x1471800) left
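
It might also be worth looking at the fencing history stonithd keeps,
e.g. on node1 (assuming stonith_admin is available there):

  # list failed/completed fencing actions recorded for node2
  stonith_admin --history node2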
> 
> 
> On node1, all resources become unrunnable, and it stays that way until I 
> manually start the pacemaker service on node2.

What is the detailed status then?
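
For example, on node1 (assuming the standard Pacemaker tools), something
along these lines, so we can see why the resources are unrunnable:

  # one-shot status including inactive resources and fail counts
  crm_mon -1rf
  # dump the live CIB for inspection
  cibadmin -Q > cib.xml
  # replay the live cluster state and show the allocation scores
  crm_simulate -sL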

Regards,
Ulrich

> As I said, the same type of installation has been done before on other 
> servers and this never happened. The only difference is that in previous 
> installations I configured corosync with multicast, whereas now I have 
> configured it with unicast (my current network environment doesn't allow 
> multicast), but I don't think that is related to this behaviour.
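
For reference, with corosync 1.4 a unicast setup usually differs from a
multicast one only in the totem section; a minimal sketch (the addresses
below are placeholders) would look roughly like:

  # /etc/corosync/corosync.conf (excerpt)
  totem {
          version: 2
          transport: udpu
          interface {
                  ringnumber: 0
                  bindnetaddr: 10.0.0.0
                  mcastport: 5405
                  member {
                          memberaddr: 10.0.0.1
                  }
                  member {
                          memberaddr: 10.0.0.2
                  }
          }
  }

It may be worth double-checking that both nodes carry identical member
lists; a mismatch there can leave the nodes in separate memberships.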
> 
> Cluster software versions:
> corosync-1.4.8
> crmsh-2.1.5
> libqb-0.17.2
> Pacemaker-1.1.14
> resource-agents-3.9.6
> 
> 
> 
> Can you help me?
> 
> Thanks
> 
> Cesar
> 
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 
