No subject

Sun Apr 3 02:52:37 EDT 2011

> sbd -d /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 list
> 0       multix244       clear
> 1       multix245       clear
> 2       multix246       reset   multix245

This output suggests that multix246 actually was sent the reset request
and thus should be considered 'fenced' by the remaining cluster.

Looking back in your mails further:

>> /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 dump
>> Header version     : 2
>> Number of slots    : 255
>> Sector size        : 512
>> Timeout (watchdog) : 60
>> Timeout (allocate) : 2
>> Timeout (loop)     : 1
>> Timeout (msgwait)  : 120

You've set extremely long timeouts for the watchdog and, in particular,
for msgwait. This means that sbd only considers a fence completed after
120s. At the same time, you've set stonith-timeout to 60s, so any fence
that takes longer than that is considered failed.

You've set up your cluster so that it can never complete a successful
fence - congratulations! ;-)

If you've got a legitimate reason for setting the msgwait timeout to
120s, you need to set stonith-timeout to more than 120s; 140s, for
example.
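
As a rough illustration of the constraint (a sketch only; the function
name is made up here and is not part of sbd or Pacemaker), the check
boils down to comparing the two timeouts:

```python
def fence_can_succeed(msgwait_s: int, stonith_timeout_s: int) -> bool:
    """Hypothetical helper: True if the cluster waits longer than sbd's msgwait.

    sbd only reports a fence as completed after msgwait seconds, so
    stonith-timeout has to be strictly larger than msgwait.
    """
    return stonith_timeout_s > msgwait_s


# Values from the dump quoted above: msgwait 120s, stonith-timeout 60s.
print(fence_can_succeed(120, 60))   # False: the fence always times out first
# With the suggested stonith-timeout of 140s.
print(fence_can_succeed(120, 140))  # True
```

How much headroom to leave above msgwait is a judgment call; 140s is
just the example figure from above.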

Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde