[Pacemaker] server lockup failures
    Bernd Schubert
    bs_lists at aakef.fastmail.fm
    Wed Oct 28 11:05:24 UTC 2009

Hello,
I think there is a severe server failure that Pacemaker does not detect.
Overnight, a Lustre server failed in shrink_icache_memory(), probably while
holding dcache_lock. This is a global filesystem lock, so when a filesystem
fails while holding it, all IO on that system simply hangs. I think Pacemaker
does not detect this failure. The failed node was the DC and, of course, I
could no longer log in, although ping still worked. On the other server,
crm_mon showed one failed resource (monitor), but Pacemaker simply did not
act on it.
This is with Pacemaker 1.04.
I think I should be able to reproduce this rather quickly by adding a
deliberately wrong dcache_lock to Lustre. The question now is: how can we fix
this in Pacemaker?
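
To make the failure mode concrete: a node in this state still answers ping,
so a check that is meant to notice it has to touch the filesystem, and it has
to be guarded by a timeout because the IO itself never returns. The following
is only a minimal Python sketch of such a check, not Pacemaker code. The
function name io_probe, the probe command and the timeout are hypothetical
choices for illustration. The child is deliberately not reaped after a
timeout, on the assumption that a process blocked in uninterruptible IO does
not react to SIGKILL, so waiting for it would hang the checker as well.

```python
import subprocess
import sys


def io_probe(cmd, timeout_s):
    """Run cmd in a child process and classify the result.

    Returns "ok" if the child exits with status 0 within timeout_s,
    "failed" if it exits with a non-zero status, and "hung" if it does
    not exit in time. (Hypothetical helper, for illustration only.)
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible IO may not
        # die, so do not wait for it after sending the signal.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "failed"


if __name__ == "__main__":
    # Assumed probe: list the root directory from a separate process.
    probe = [sys.executable, "-c", "import os; os.listdir('/')"]
    print(io_probe(probe, timeout_s=10.0))
```

A result of "hung" is the case described above: the machine is reachable over
the network, but filesystem IO on it does not complete.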
Thanks,
Bernd
-- 
Bernd Schubert
DataDirect Networks
    
    