[Pacemaker] Problems with Pacemaker + Corosync after reboot

Fri Dec 24 17:47:53 EST 2010

Hi,

Your configuration is straightforward, nothing out of the ordinary.

Make sure that when your other box comes back up after being offline,
syslog-ng is started before corosync. It appears that by the time you kill
all the processes and restart corosync, syslog-ng has already started, which
is why everything then comes up properly.
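
As a rough sketch of that ordering check, something like the following could
be used on the node before starting corosync by hand. The wait_for_proc helper
and the 30-second timeout are made up for illustration, and the init script
path in the final comment assumes a Debian-style layout, so adjust it to your
setup.

```shell
#!/bin/sh
# Hypothetical helper: wait until a process with the given name is running.
# Usage: wait_for_proc NAME TIMEOUT_SECONDS
# Returns 0 as soon as the process is found, non-zero if it never appears.
wait_for_proc() {
    name=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        # pgrep -x matches the process name exactly
        if pgrep -x "$name" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    # One last check after the final sleep
    pgrep -x "$name" >/dev/null 2>&1
}

# Example (not run here): start corosync only once syslog-ng is up.
# wait_for_proc syslog-ng 30 && /etc/init.d/corosync start
```

This only helps for a manual start. For boot time, the order of the init
scripts themselves is what has to put syslog-ng ahead of corosync.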

Your resource migrates back because there is no reason for it to stick
where it is, i.e. no resource-stickiness is set.

You might want to look into how to set resource stickiness, which may mean
extending your config a little beyond what you have now. The configuration
manual explains it very nicely.
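
For instance, a cluster-wide default could be set from the crm shell roughly
like this. The value 100 is only an illustrative assumption, not a
recommendation; pick a value after reading the manual.

```shell
# Set a default stickiness for all resources (100 is an arbitrary example value).
crm configure rsc_defaults resource-stickiness=100

# Review the resulting configuration.
crm configure show
```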

There is a tool called ptest that you can use to see the scores that
determine where a resource is placed. For example, you can experiment with
different resource-stickiness values and then run

ptest -sL

to look at the scores.
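
A minimal sketch of such an experiment is below. The try_stickiness function,
the RUN variable, and the values 0, 100 and 200 are assumptions made up for
illustration. By default the script only prints the commands it would run, so
nothing changes on the cluster until you run it with RUN set to an empty
string on a cluster node.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Run with RUN="" in the environment to actually execute them.
RUN=${RUN-echo}

try_stickiness() {
    # $1 = resource-stickiness value to try
    $RUN crm configure rsc_defaults resource-stickiness="$1"
    # -s shows the allocation scores, -L reads the live cluster state
    $RUN ptest -sL
}

# Arbitrary example values; compare the scores reported for each one.
for value in 0 100 200; do
    try_stickiness "$value"
done
```

Comparing the scores for failover-ip on each node across the runs should show
at which value the resource stays where it is instead of moving back.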

You will have to go a bit deeper than your vanilla config to understand
this, and also read the manual.

Thanks
-Shravan

On Thu, Dec 23, 2010 at 6:12 PM, Daniel Bareiro <daniel-listas at gmx.net>
wrote:
> On Wednesday, 22 December 2010 08:29:02 -0500,
> Shravan Mishra wrote:
>
>> Hi,
>
> Hi, Shravan.
>
>> What's happening is that corosync is forking but the exec is not
>> happening.
>
> And do you think that what is shown in the logs is consistent with what
> is shown using ps?
>
>> I used to see this problem in my case when syslog-ng process was not
>> running.
>>
>> Try checking that and starting it and then start corosync.
>
> Now I see that if I do a shutdown of the node that has the resource
> (failover-ip), then this does not migrate to another node. By doing the
> test I made sure Pacemaker + Corosync are functioning correctly on both
> nodes before doing a shutdown of Atlantis.
>
> Before making a shutdown of Atlantis:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 19:24:09 2010
> Stack: openais
> Current DC: atlantis - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ atlantis daedalus ]
>
>  failover-ip    (ocf::heartbeat:IPaddr):        Started atlantis
> -----------------------------------------------------------------------
>
> After doing a shutdown of Atlantis:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 19:25:44 2010
> Stack: openais
> Current DC: daedalus - partition WITHOUT quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
> -----------------------------------------------------------------------
>
> Here I'm using a configuration like the one presented in the wiki [1].
>
> I am also noting that after the Atlantis launch, corosync makes the fork
> without exec (as we assume from what I showed in the previous mail) and
> only now is when the resource migrates to Daedalus:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 19:49:11 2010
> Stack: openais
> Current DC: daedalus - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
>
>  failover-ip    (ocf::heartbeat:IPaddr):        Started daedalus
> -----------------------------------------------------------------------
>
>
> -----------------------------------------------------------------------
> atlantis:~# crm_mon --one-shot
>
> Connection to cluster failed: connection failed
> -----------------------------------------------------------------------
>
> I tried doing a "corosync stop", but the processes are not closed:
>
> atlantis:~# ps auxf
> [...]
> root      1564  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1565  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1566  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1567  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1568  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1569  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
>
>
> The only way I found to correctly start corosync is doing a "pkill -9
> corosync" and "corosync start":
>
>
> atlantis:~# ps auxf
> [...]
> root      2120  0.2  1.9 134288  5060 ?        Ssl  19:59   0:00 /usr/sbin/corosync
> root      2128  0.0  4.5  76028 11600 ?        SLs  19:59   0:00  \_ /usr/lib/heartbeat/stonithd
> 105       2129  0.1  2.0  79104  5120 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/cib
> root      2130  0.0  0.8  71580  2108 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/lrmd
> 105       2131  0.0  1.3  79968  3340 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/attrd
> 105       2132  0.0  1.1  80332  2892 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/pengine
> 105       2133  0.0  1.4  86216  3764 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/crmd
>
>
> After this, the resource automatically migrates back to Atlantis:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 20:03:18 2010
> Stack: openais
> Current DC: daedalus - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ atlantis daedalus ]
>
>  failover-ip    (ocf::heartbeat:IPaddr):        Started atlantis
> -----------------------------------------------------------------------
>
>
> Any idea how to fix this problem with Corosync?
>
> Why does the resource not migrate to Daedalus when I shut down
> Atlantis?
>
>
>
> Thanks for your reply.
>
> Regards,
> Daniel
>
> [1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
> --
> Daniel Bareiro - GNU/Linux registered user #188.598
> Proudly running Debian GNU/Linux with uptime:
> 17:52:45 up 71 days, 18:19, 10 users,  load average: 0.00, 0.01, 0.03
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (GNU/Linux)
>
> iEYEARECAAYFAk0T11kACgkQZpa/GxTmHTejywCfdVBAfru12t1LL8kvDiSCYGpJ
> c9YAnjlbFMF9NzFWKCsA1vkzdCfOCmJr
> =7Gh3
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>