[Pacemaker] server lockup failures

Fri Nov 6 07:12:16 EST 2009

On Fri, Oct 30, 2009 at 11:25 AM, Bernd Schubert
<bernd.schubert at fastmail.fm> wrote:
> On Friday 30 October 2009, Lars Marowsky-Bree wrote:
>> On 2009-10-29T09:58:13, Andrew Beekhof <andrew at beekhof.net> wrote:
>> > > Heartbeat based, I still didn't have the time to look into openais.
>> >
>> > I guess heartbeat wasn't hung then... otherwise it would have stopped
>> > sending "i'm here" packets (and dropped out of the membership list).
>>
>> Both heartbeat and OpenAIS try quite hard not to touch the IO layers, to
>> avoid being struck by IO latencies.
>>
>> Probably not even crmd needs to touch the fs, so it would still send its
>> DC keepalive packets and/or respond as the DC. Things like this need to
>> be caught via resource agent monitoring.
>
> I'm afraid it is not that simple. One of the resources was marked as failed in
> crm_mon output, but still pacemaker didn't do anything to migrate the
> resource. Manual attempts to stop resources also failed. Only after I invoked
> stonith myself to reboot the failed server did the DC also migrate and
> pacemaker start to work again. I hope I will have some time in the afternoon
> to start to debug this.

Do you have the logs that span the problem?
In theory everything should have still worked.  Even the CIB spawns a
new process for writing config updates to disk so it's hard to imagine
why we couldn't recover.
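
For illustration only, the general "spawn a process for the disk write"
pattern can be sketched roughly as below. This is not the actual CIB code:
the function names, the temp-file-then-rename step and the use of Python are
all assumptions made for the example. The point is just that the parent never
blocks on disk IO, so a hung filesystem stalls only the child.

```python
import os


def write_config_async(path, data):
    """Fork a child that writes `data` (bytes) to `path`; return the child's pid.

    Hypothetical helper for illustration; not part of Pacemaker.
    """
    pid = os.fork()
    if pid != 0:
        # Parent: return at once, without touching the disk.
        return pid

    # Child: do the slow, possibly blocking write, then exit.
    status = 1
    try:
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomically swap the new file into place
        status = 0
    finally:
        os._exit(status)  # never fall back into the parent's code path


def reap(pid):
    """Non-blocking check: None while the child runs, else its exit code."""
    done, status = os.waitpid(pid, os.WNOHANG)
    if done == 0:
        return None
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

The parent would call reap() from its main loop now and then; if the child
is stuck in IO, reap() keeps returning None and the parent carries on.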