[Pacemaker] Problems with Pacemaker + Corosync after reboot

Thu Dec 23 23:12:25 UTC 2010

On Wednesday, 22 December 2010 08:29:02 -0500,
Shravan Mishra wrote:

> Hi,

Hi, Shravan.

> What's happening is that corosync is forking but the exec is not
> happening.

Do you think that what the logs show is consistent with what ps shows?

> I used to see this problem in my case when syslog-ng process was not
> running.
> 
> Try checking that and starting it and then start corosync.

I now see that if I shut down the node that holds the resource
(failover-ip), the resource does not migrate to the other node. For this
test I made sure Pacemaker + Corosync were working correctly on both
nodes before shutting down Atlantis.

Before shutting down Atlantis:

-----------------------------------------------------------------------
daedalus:~# crm_mon --one-shot
============
Last updated: Thu Dec 23 19:24:09 2010
Stack: openais
Current DC: atlantis - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ atlantis daedalus ]

 failover-ip    (ocf::heartbeat:IPaddr):        Started atlantis
-----------------------------------------------------------------------

After shutting down Atlantis:

-----------------------------------------------------------------------
daedalus:~# crm_mon --one-shot
============
Last updated: Thu Dec 23 19:25:44 2010
Stack: openais
Current DC: daedalus - partition WITHOUT quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ daedalus ]
OFFLINE: [ atlantis ]
-----------------------------------------------------------------------
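
To compare these crm_mon outputs more easily, something like the
following could reduce each one to a single line. This is only a sketch:
summarize_crm_mon is a hypothetical helper name, and it assumes the
output format shown above ("Current DC: ... partition with/WITHOUT
quorum" and a "failover-ip ... Started <node>" line).

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): reads `crm_mon --one-shot`
# output on stdin and prints the quorum state and the node on which
# failover-ip is started. Assumes the output layout shown in this mail.
summarize_crm_mon() {
    awk '
        # "Current DC: <node> - partition with quorum" or "... WITHOUT quorum"
        /^Current DC:/ {
            quorum = ($0 ~ /partition WITHOUT quorum/) ? "no" : "yes"
        }
        # Resource line: " failover-ip (ocf::heartbeat:IPaddr): Started <node>"
        $1 == "failover-ip" && /Started/ {
            node = $NF
        }
        END {
            if (quorum == "") quorum = "unknown"
            if (node == "") node = "none"
            printf "quorum=%s failover-ip=%s\n", quorum, node
        }
    '
}
```

It would be used as "crm_mon --one-shot | summarize_crm_mon". For the
output after the shutdown it should report that the partition has no
quorum and that failover-ip is not started anywhere.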

Here I'm using a configuration like the one presented in the wiki [1].

I also notice that after Atlantis boots again, corosync forks without
doing the exec (as we assumed from what I showed in the previous mail),
and only then does the resource migrate to Daedalus:

-----------------------------------------------------------------------
daedalus:~# crm_mon --one-shot
============
Last updated: Thu Dec 23 19:49:11 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ daedalus ]
OFFLINE: [ atlantis ]

 failover-ip    (ocf::heartbeat:IPaddr):        Started daedalus
-----------------------------------------------------------------------

-----------------------------------------------------------------------
atlantis:~# crm_mon --one-shot

Connection to cluster failed: connection failed
-----------------------------------------------------------------------

I tried a "corosync stop", but the processes do not exit:

atlantis:~# ps auxf
[...]
root      1564  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1565  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1566  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1567  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1568  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1569  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
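
A rough way to spot this state from a script might look like the sketch
below. It is only a heuristic based on the ps listings in this mail:
check_corosync_ps is a hypothetical helper name, and the rule (several
/usr/sbin/corosync processes with none of the Pacemaker daemons running)
is my assumption about what the broken state looks like.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps auxf` output on stdin and classifies it.
# Heuristic (an assumption based on the listings in this mail):
#   stuck   = more than one /usr/sbin/corosync and no Pacemaker daemons
#   ok      = exactly one /usr/sbin/corosync with Pacemaker daemons present
#   unknown = anything else
check_corosync_ps() {
    awk '
        # The command is the last field; children appear as "\_ /path/to/daemon"
        $NF == "/usr/sbin/corosync" { corosync++ }
        $NF ~ /\/(crmd|cib|lrmd|attrd|pengine|stonithd)$/ { children++ }
        END {
            if (corosync > 1 && children == 0) print "stuck"
            else if (corosync == 1 && children > 0) print "ok"
            else print "unknown"
        }
    '
}
```

It would be used as "ps auxf | check_corosync_ps".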

The only way I found to start corosync correctly is to run "pkill -9
corosync" followed by "corosync start":

atlantis:~# ps auxf
[...]
root      2120  0.2  1.9 134288  5060 ?        Ssl  19:59   0:00 /usr/sbin/corosync
root      2128  0.0  4.5  76028 11600 ?        SLs  19:59   0:00  \_ /usr/lib/heartbeat/stonithd
105       2129  0.1  2.0  79104  5120 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/cib
root      2130  0.0  0.8  71580  2108 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/lrmd
105       2131  0.0  1.3  79968  3340 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/attrd
105       2132  0.0  1.1  80332  2892 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/pengine
105       2133  0.0  1.4  86216  3764 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/crmd

After this, the resource automatically migrates back to Atlantis:

-----------------------------------------------------------------------
daedalus:~# crm_mon --one-shot
============
Last updated: Thu Dec 23 20:03:18 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ atlantis daedalus ]

 failover-ip    (ocf::heartbeat:IPaddr):        Started atlantis
-----------------------------------------------------------------------

Any idea how to fix this problem with Corosync?

Why does the resource not migrate to Daedalus when I shut down
Atlantis?

Thanks for your reply.

Regards,
Daniel

[1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
-- 
Daniel Bareiro - GNU/Linux registered user #188.598
Proudly running Debian GNU/Linux with uptime:
17:52:45 up 71 days, 18:19, 10 users,  load average: 0.00, 0.01, 0.03