[Pacemaker] Stonithd segfaulting and causing unclean?

Michał Margula alchemyx at uznam.net.pl
Thu Mar 20 09:40:39 EDT 2014


Some time ago we had many unresolved issues with Pacemaker. I think
almost all of them were solved by fixing the link between the cluster nodes
(removed media converters, replaced them with NICs with SFP+, upgraded to
10 Gbps).

Now it seems to be working fine, with a few exceptions:

- if I kill one node manually (power off, but IPMI is still operational,
  so stonith is working fine)
- if I put one of the nodes into standby while it has a few Xen domUs

In both cases the node becomes UNCLEAN. The funny thing is that if I kill
node B (or put it into standby), node A also becomes UNCLEAN. So I end up
with crm_mon showing Node-A: UNCLEAN (online), Node-B: UNCLEAN (offline).
To be honest, I am having a lot of trouble diagnosing it. (BTW: is there
any documentation on how to read Pacemaker's logs?)

One thing I found that worries me is this kernel message:

Mar 20 04:16:39 rivendell-A kernel: [  774.635312] stonithd[10089]:
segfault at 0 ip 00007f51a1aa5bd4 sp 00007fff20c7fb50 error 4 in
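
For anyone trying to read that kernel line: below is a minimal Python
sketch of how it can be decoded. It assumes the usual x86 page-fault
error-code bits (bit 0 = protection fault, bit 1 = write, bit 2 = user
mode) and that the kernel prints the address and error code in hex. The
function name decode_segfault and the "near_null" heuristic are my own
inventions for illustration, not part of Pacemaker or the kernel. The
part after "in" is cut off in my log, so the sketch does not try to
parse it.

```python
import re

# Matches the kernel's user-space segfault report. Whitespace between the
# process tag and "segfault" is flexible because the line may be wrapped.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[\]]+)\[(?P<pid>\d+)\]:\s+segfault at (?P<addr>[0-9a-f]+)\s+"
    r"ip (?P<ip>[0-9a-f]+)\s+sp (?P<sp>[0-9a-f]+)\s+error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    err = int(m.group("err"), 16)
    return {
        "process": m.group("comm"),
        "pid": int(m.group("pid")),
        "address": addr,
        "ip": int(m.group("ip"), 16),
        # Assumed x86 error-code bits:
        "protection_fault": bool(err & 1),  # 0 means the page was not mapped
        "write": bool(err & 2),             # 0 means the access was a read
        "user_mode": bool(err & 4),         # 1 means the fault came from user space
        # Heuristic (assumption): an address in the first page suggests NULL.
        "near_null": addr < 4096,
    }


if __name__ == "__main__":
    sample = (
        "Mar 20 04:16:39 rivendell-A kernel: [  774.635312] stonithd[10089]: "
        "segfault at 0 ip 00007f51a1aa5bd4 sp 00007fff20c7fb50 error 4 in"
    )
    print(decode_segfault(sample))
```

Under those assumptions, "segfault at 0 ... error 4" reads as a user-mode
read of an unmapped page at address 0, which usually points to a NULL
pointer dereference inside stonithd.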

The segfault happens on both nodes. It also seems to happen only when I
define a manual fencing device (meatware) like this:

primitive manual-fencing-of-A stonith:meatware \
        params hostlist="rivendell-B" \
        op monitor interval="60s" \
        meta target-role="Started"
primitive manual-fencing-of-B stonith:meatware \
        params hostlist="rivendell-A" \
        op monitor interval="60s" \
        meta target-role="Started"
location location-manual-fencing-of-A manual-fencing-of-A -inf: rivendell-A
location location-manual-fencing-of-B manual-fencing-of-B -inf: rivendell-B

Here is the configuration we currently use (without manual fencing):
http://pastebin.com/CudX6wx3

BTW, is there a way to recover from such a situation? I can only fix it
by restarting corosync or rebooting a node, but that then kills the other
node because of the UNCLEAN state.

Also, if it is a Pacemaker bug, how can I debug or fix it? We are
currently using Pacemaker 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
on Debian Wheezy.

I see there are more up-to-date versions, but they are not packaged for
Debian. Should I consider upgrading?

Thank you!

Michał Margula, alchemyx at uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" ("In life, only moments are beautiful") [Ryszard Riedel]
