[ClusterLabs] Problem with stonith and starting services
Ken Gaillot
kgaillot at redhat.com
Mon Jul 3 17:38:14 CEST 2017
On 07/03/2017 02:34 AM, Cesar Hernandez wrote:
> Hi
>
> I have installed a Pacemaker cluster with two nodes. I have done the same type of installation many times before, and the following error never appeared. The situation is as follows:
>
> both nodes running cluster services
> stop pacemaker&corosync on node 1
> stop pacemaker&corosync on node 2
> start corosync&pacemaker on node 1
>
> Node 1 then starts, sees node 2 down, and fences it, as expected. The problem comes when node 2 reboots and starts cluster services: sometimes the corosync service starts, but the pacemaker service starts and then stops. In those cases the syslog shows the following error:
>
> Jul 3 09:07:04 node2 pacemakerd[597]: warning: The crmd process (608) can no longer be respawned, shutting the cluster down.
> Jul 3 09:07:04 node2 pacemakerd[597]: notice: Shutting down Pacemaker
>
> Earlier messages show some warnings; I'm not sure whether they are related to the shutdown:
>
>
> Jul 3 09:07:04 node2 stonith-ng[604]: notice: Operation reboot of node2 by node1 for crmd.2413@node1.608d8118: OK
> Jul 3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by node1 for node1!
> Jul 3 09:07:04 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x1471800, async-conn=0x1471800) left
>
>
> On node1, all resources become unrunnable and stay that way until I manually start the pacemaker service on node2.
> As I said, I have done the same type of installation on other servers and this never happened. The only difference is that the previous installations used corosync with multicast, whereas this one uses unicast (my current network environment doesn't allow multicast), but I don't think that is related to this behaviour.
Agreed, I don't think it's multicast vs unicast.
I can't see from this what's going wrong. Possibly node1 is trying to
re-fence node2 when it comes back. Check that the fencing resources are
configured correctly, and check whether node1 sees the first fencing
succeed.
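
One way to check whether node1 saw the fencing succeed, and whether node2 was fenced more than once, is to scan node1's syslog for stonith-ng result lines. The sketch below is illustrative only. It assumes result lines have the same shape as the "Operation reboot of node2 by node1 ...: OK" line quoted above. The names FenceResult, fence_results and times_fenced_ok are made up for this example and are not part of Pacemaker.

```python
import re
from typing import Iterable, List, NamedTuple


class FenceResult(NamedTuple):
    action: str    # e.g. "reboot"
    target: str    # node that was fenced
    executor: str  # node that carried out the fencing
    result: str    # "OK" on success, otherwise whatever text was logged


# Assumed line shape, taken from the log excerpt in this thread:
#   ... stonith-ng[PID]: LEVEL: Operation ACTION of TARGET by EXECUTOR for CLIENT: RESULT
# The CLIENT part is matched loosely (anything without a colon).
FENCE_RE = re.compile(
    r"stonith-ng\[\d+\]:\s+\w+:\s+"
    r"Operation (?P<action>\S+) of (?P<target>\S+) by (?P<executor>\S+) "
    r"for [^:]+: (?P<result>.+)$"
)


def fence_results(lines: Iterable[str]) -> List[FenceResult]:
    """Return every fencing result found in the given log lines, in order."""
    results = []
    for line in lines:
        match = FENCE_RE.search(line.rstrip())
        if match:
            results.append(
                FenceResult(
                    match["action"],
                    match["target"],
                    match["executor"],
                    match["result"],
                )
            )
    return results


def times_fenced_ok(lines: Iterable[str], target: str) -> int:
    """Count successful ("OK") fencing operations against one node."""
    return sum(
        1 for r in fence_results(lines) if r.target == target and r.result == "OK"
    )
```

To use it, pass the lines of node1's syslog (the file location depends on the distribution) and call times_fenced_ok(lines, "node2"). A count above one around a single restart would be consistent with node2 being fenced again when it comes back. A count of zero would suggest node1 never saw the first fencing succeed.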
> Cluster software versions:
> corosync-1.4.8
> crmsh-2.1.5
> libqb-0.17.2
> Pacemaker-1.1.14
> resource-agents-3.9.6
>
>
>
> Can you help me?
>
> Thanks
>
> Cesar