[Pacemaker] server lockup failures

Bernd Schubert bernd.schubert at fastmail.fm
Wed Oct 28 18:51:47 EDT 2009


On Wednesday 28 October 2009, Andrew Beekhof wrote:
> On Wed, Oct 28, 2009 at 2:44 PM, Bernd Schubert
> <bs_lists at aakef.fastmail.fm> wrote:
> > On Wednesday 28 October 2009, Andrew Beekhof wrote:
> >> On Wed, Oct 28, 2009 at 1:05 PM, Bernd Schubert
> >> <bs_lists at aakef.fastmail.fm> wrote:
> >> > Hello,
> >> >
> >> > I think there is a severe server failure that Pacemaker doesn't
> >> > detect. Overnight a Lustre server failed in shrink_icache_memory(),
> >> > probably while holding dcache_lock. This is a global filesystem
> >> > lock, and when a filesystem fails while it is held, any IO on the
> >> > system just hangs.
> >>
> >> And the FS in question was / so Pacemaker basically hung?
> >
> > I couldn't log in any more, but my guess is 'yes, it hung'. But no, it
> > was not the root (/) FS. If any FS crashes while it holds dcache_lock,
> > every other filesystem will hang as well.
> 
> ooohhhhh
> 
> > There is nothing we can do about that except
> > rewriting the Linux VFS ;) My question is just what we can do to get
> > Pacemaker to stonith that node.
> 
> Hmmm.  Was this an openais or heartbeat based cluster?
> If all the processes hung I'd have expected it to drop out of the
> membership list and get shot by the new DC...

Heartbeat-based; I still haven't had the time to look into openais. But I can 
test on my virtual machines over the next few days. Since I have a good idea 
of how to lock up a node using dcache_lock, it should also be easy for me to 
reproduce :)
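
For such a test, a minimal sketch of a probe that notices hung IO might look
like the following. It is only an illustration, not something Pacemaker or
Heartbeat ships: the name probe_io, the 5 second timeout, and the choice of
stat() as the test operation are all assumptions, and whether a plain stat()
actually blocks in the dcache_lock case would have to be verified on the
test machines.

```python
import subprocess
import sys
import time


def probe_io(path, timeout_s=5.0, poll_s=0.1):
    """Return True if a stat() of `path` succeeds within timeout_s seconds.

    Hypothetical helper for illustration; not part of Pacemaker/Heartbeat.
    """
    # Run the stat() in a child process, so that a hang inside the kernel
    # blocks only the child and never this monitoring process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if child.poll() is not None:
            # The child finished in time; exit status 0 means stat() worked.
            return child.returncode == 0
        time.sleep(poll_s)
    # Timed out. Send SIGKILL but deliberately do not wait() for the child:
    # a process stuck in uninterruptible sleep cannot be reaped, and
    # waiting for it would hang the monitor as well.
    child.kill()
    return False
```

A monitor could call probe_io("/some/mount") periodically and treat a False
result as a failure. What happens then (for example, no longer feeding a
watchdog so that the node resets itself) is left open here.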



Cheers,
Bernd



