[Pacemaker] cluster deadlock/malfunctioning

Andrew Beekhof beekhof at gmail.com
Tue Dec 2 09:22:07 EST 2008


On Fri, Nov 14, 2008 at 16:25, Raoul Bhatia [IPAX] <r.bhatia at ipax.at> wrote:
> Dear list,
>
> My cluster is malfunctioning again.
>
> I upgraded my configuration to use pam-ldap and libnss-ldap, which
> did not work out of the box.
>
> Pacemaker tried to recover from some errors and then STONITHed both
> hosts.
>
> I am now left with only:
>> Node: wc01 (31de4ab3-2d05-476e-8f9a-627ad6cd94ca): online
>> Node: wc02 (f36760d8-d84a-46b2-b452-4c8cac8b3396): online
>>
>> Clone Set: clone_nfs-common
>>     Resource Group: group_nfs-common:0
>>         nfs-common:0    (lsb:nfs-common):       Started wc02
>>     Resource Group: group_nfs-common:1
>>         nfs-common:1    (lsb:nfs-common):       Started wc01
>> Clone Set: DoFencing
>>     stonith_rackpdu:0   (stonith:external/rackpdu):     Started wc02
>>     stonith_rackpdu:1   (stonith:external/rackpdu):     Started wc01
>
> and Pacemaker seems happy :)
>
> (Please note that I normally have several groups, clones, and
> master/slave resources active.)
>
> I took a look at pe-warn-12143.bz2 but do not know how to interpret the
> three different threads I see in the corresponding .dot file.
>
> Can anyone explain how I can debug such a deadlock?

What exactly were you trying to determine here?
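
As for debugging this kind of thing yourself: the pe-warn/pe-input files
can be replayed offline with ptest and the resulting .dot file rendered
with graphviz. The separate "threads" you see in the .dot file are just
disconnected parts of the transition graph, i.e. sets of actions with no
ordering constraints between them. Roughly like this (option names are
from memory, check ptest --help; you may need to bunzip2 the file first):

    ptest --xml-file pe-warn-12143 --save-dotfile pe-warn-12143.dot --show-scores
    dot -Tpng pe-warn-12143.dot -o pe-warn-12143.png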

Running through ptest, I see two major areas of concern:

Clones clone_nfs-common contains non-OCF resource nfs-common:0 and so
can only be used as an anonymous clone. Set the globally-unique meta
attribute to false
Clones clone_mysql-proxy contains non-OCF resource mysql-proxy:0 and
so can only be used as an anonymous clone. Set the globally-unique
meta attribute to false

and

Hard error - drbd_www:0_monitor_0 failed with rc=4: Preventing
drbd_www:0 from re-starting on wc01
Hard error - drbd_www:1_monitor_0 failed with rc=4: Preventing
drbd_www:1 from re-starting on wc01
Hard error - drbd_mysql:1_monitor_0 failed with rc=4: Preventing
drbd_mysql:1 from re-starting on wc01
Hard error - drbd_mysql:0_monitor_0 failed with rc=4: Preventing
drbd_mysql:0 from re-starting on wc01
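
For what it's worth, rc=4 from a monitor op is OCF_ERR_PERM (insufficient
privileges), which the PE treats as a hard error. You can see what the
agent itself thinks by running the monitor action by hand on wc01,
something like the following (the agent path and the drbd_resource value
are guesses based on your resource names, adjust to your setup):

    cat /proc/drbd
    OCF_ROOT=/usr/lib/ocf OCF_RESKEY_drbd_resource=www \
        /usr/lib/ocf/resource.d/heartbeat/drbd monitor; echo rc=$?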

If drbd is failing, then I can imagine that would prevent much of the
rest of the cluster from being started.
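
As for the clone warnings, setting the meta attribute is enough once the
names are right; a minimal sketch in 1.0-style CIB XML (the ids here are
only illustrative):

    <clone id="clone_nfs-common">
      <meta_attributes id="clone_nfs-common-meta">
        <nvpair id="clone_nfs-common-gu" name="globally-unique" value="false"/>
      </meta_attributes>
      <group id="group_nfs-common">
        <primitive id="nfs-common" class="lsb" type="nfs-common"/>
      </group>
    </clone>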

Also, you might want to look into:

Operation nfs-kernel-server_monitor_0 found resource nfs-kernel-server
active on wc01
Operation nfs-common:0_monitor_0 found resource nfs-common:0 active on wc01
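
Those two usually mean the services were already running before the
cluster probed the node, i.e. they were started outside of Pacemaker's
control (typically by the distro init scripts at boot). If that is what
happened, something along these lines (Debian-style, adjust to your
setup) keeps init from starting them behind the cluster's back:

    update-rc.d -f nfs-kernel-server remove
    update-rc.d -f nfs-common remove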


Having said all that, I just looked at the config, and all of the above
is more than likely caused by the issue we spoke about the other day:
loading 0.6 config fragments into a 1.0 cluster (where all the meta
attributes now have dashes instead of underscores).
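
To make the renaming concrete, here is a purely illustrative pair of
nvpairs, old names first, 1.0 names second; with the old names the
values presumably never take effect (which would explain the warnings
above):

    <!-- 0.6-style fragment: not recognised by a 1.0 cluster -->
    <nvpair id="o1" name="clone_max"       value="2"/>
    <nvpair id="o2" name="globally_unique" value="false"/>

    <!-- what a 1.0 cluster expects -->
    <nvpair id="n1" name="clone-max"       value="2"/>
    <nvpair id="n2" name="globally-unique" value="false"/>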



