[ClusterLabs] Set "start-failure-is-fatal=false" on only one resource?

Sam Gardner SGardner at trustwave.com
Fri Mar 25 12:08:48 EDT 2016


On 3/25/16, 10:26 AM, "Lars Ellenberg" <lars.ellenberg at linbit.com> wrote:


>On Thu, Mar 24, 2016 at 09:01:18PM +0000, Sam Gardner wrote:
>> I'm having some trouble on a few of my clusters in which the DRBD Slave
>>resource does not want to come up after a reboot until I manually run
>>resource cleanup.
>
>Logs?

syslog has some relevant info, but I can't tease anything out of the
pacemaker logs that looks more useful than the following:

Mar 25 15:58:49 ha-d2 Filesystem(DRBDFS)[29570]: WARNING: Couldn't find device [/dev/drbd/by-res/wwwdata/0]. Expected /dev/??? to exist
Mar 25 15:58:49 ha-d2 drbd(DRBDSlave)[29689]: ERROR: wwwdata: Called drbdadm -c /etc/drbd.conf syncer wwwdata
Mar 25 15:58:49 ha-d2 drbd(DRBDSlave)[29689]: ERROR: wwwdata: Exit code 1
Mar 25 15:58:49 ha-d2 drbd(DRBDSlave)[29689]: ERROR: wwwdata: Command output:
Mar 25 15:58:49 ha-d2 kernel:[   56.596595] drbd wwwdata: Starting worker thread (from drbdsetup-84 [29938])
Mar 25 15:58:49 ha-d2 kernel:[   56.597705] drbd wwwdata: Method to ensure write ordering: flush
Mar 25 15:58:49 ha-d2 kernel:[   56.606513] drbd wwwdata: conn( StandAlone -> Unconnected )
Mar 25 15:58:49 ha-d2 kernel:[   56.606524] drbd wwwdata: Starting receiver thread (from drbd_w_wwwdata [29940])
Mar 25 15:58:49 ha-d2 kernel:[   56.612984] drbd wwwdata: receiver (re)started
Mar 25 15:58:49 ha-d2 kernel:[   56.612996] drbd wwwdata: conn( Unconnected -> WFConnection )
Mar 25 15:58:49 ha-d2 kernel:[   56.721683] drbd wwwdata: conn( WFConnection -> Disconnecting )
Mar 25 15:58:49 ha-d2 kernel:[   56.721733] drbd wwwdata: sock_recvmsg returned -4
Mar 25 15:58:49 ha-d2 kernel:[   56.721999] drbd wwwdata: Connection closed
Mar 25 15:58:49 ha-d2 kernel:[   56.722009] drbd wwwdata: conn( Disconnecting -> StandAlone )
Mar 25 15:58:49 ha-d2 kernel:[   56.722017] drbd wwwdata: State change failed: Need a connection to start verify or resync
Mar 25 15:58:49 ha-d2 kernel:[   56.722146] drbd wwwdata:  mask = 0x1f0 val = 0x80
Mar 25 15:58:49 ha-d2 kernel:[   56.722222] drbd wwwdata:  old_conn:StandAlone wanted_conn:WFConnection
Mar 25 15:58:49 ha-d2 kernel:[   56.722314] drbd wwwdata: receiver terminated
Mar 25 15:58:49 ha-d2 kernel:[   56.722316] drbd wwwdata: Terminating drbd_r_wwwdata
Mar 25 15:58:49 ha-d2 kernel:[   56.722410] drbd wwwdata: Terminating drbd_w_wwwdata


dmesg has a little bit more info:

[   56.394204] drbd: failed to initialize debugfs -- will not be available
[   56.394208] drbd: initialized. Version: 8.4.5 (api:1/proto:86-101)
[   56.394209] drbd: srcversion: 315FB2BBD4B521D13C20BF4
[   56.394210] drbd: registered as block device major 147
[   56.596595] drbd wwwdata: Starting worker thread (from drbdsetup-84 [29938])
[   56.597662] block drbd1: disk( Diskless -> Attaching )
[   56.597705] drbd wwwdata: Method to ensure write ordering: flush
[   56.597707] block drbd1: max BIO size = 1048576
[   56.597711] block drbd1: drbd_bm_resize called with capacity == 2097016
[   56.597718] block drbd1: resync bitmap: bits=262127 words=4096 pages=8
[   56.597720] block drbd1: size = 1024 MB (1048508 KB)
[   56.603630] block drbd1: recounting of set bits took additional 0 jiffies
[   56.603634] block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
[   56.603639] block drbd1: disk( Attaching -> UpToDate )
[   56.603642] block drbd1: attached to UUIDs 04121320C308E9D6:0000000000000000:876FDA03C925F772:876EDA03C925F773
[   56.606513] drbd wwwdata: conn( StandAlone -> Unconnected )
[   56.606524] drbd wwwdata: Starting receiver thread (from drbd_w_wwwdata [29940])
[   56.612984] drbd wwwdata: receiver (re)started
[   56.612996] drbd wwwdata: conn( Unconnected -> WFConnection )
[   56.721683] drbd wwwdata: conn( WFConnection -> Disconnecting )
[   56.721733] drbd wwwdata: sock_recvmsg returned -4
[   56.721999] drbd wwwdata: Connection closed
[   56.722009] drbd wwwdata: conn( Disconnecting -> StandAlone )
[   56.722017] drbd wwwdata: State change failed: Need a connection to start verify or resync
[   56.722146] drbd wwwdata:  mask = 0x1f0 val = 0x80
[   56.722222] drbd wwwdata:  old_conn:StandAlone wanted_conn:WFConnection
[   56.722314] drbd wwwdata: receiver terminated
[   56.722316] drbd wwwdata: Terminating drbd_r_wwwdata
[   56.722333] block drbd1: disk( UpToDate -> Failed )
[   56.722344] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
[   56.722346] block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
[   56.722349] block drbd1: disk( Failed -> Diskless )
[   56.722406] block drbd1: drbd_bm_resize called with capacity == 0
[   56.722410] drbd wwwdata: Terminating drbd_w_wwwdata


Here's the DRBD config (again, the DRBD pairing works fine after startup):

resource wwwdata {
 meta-disk internal;
 device  /dev/drbd1;
 disk   /dev/VolGroup/lv_drbd;
 syncer {
  verify-alg sha1;
 }
 net {
  protocol C;
  allow-two-primaries;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
 }
 on ha-d2.dev.com {
  address  XXX.XXX.XXX.XXX:YYYY;
 }
 on ha-d1.dev.com {
  address XXX.XXX.XXX.XXX:YYYY;
 }
}


>I mean, to get a failure count,
>you have to have some operation fail.
>And you should figure out which, when, and why.
>
>Is it the start that fails?
>Why does it fail?

Yes, it is the start that fails for the DRBDSlave resource. The DRBDMaster
resource comes up fine, no matter which node is primary.

No idea what is causing it not to come up at boot (a manual pcs resource
cleanup immediately fixes it once I have SSH access to do so).

We're using Corosync 1.4.8 and Pacemaker 1.1.12 with DRBD 8.4.5 and
drbd-utils 8.9.3 (all compiled from source; this is an internal
Red Hat-like OS).

This particular pair is two VMs, but we will eventually transfer whatever
config we come up with to multiple pairs running on actual hardware.

The following config actually works:
start-failure-is-fatal=false
failure-timeout=33s on DRBDSlave

However, I'm concerned that setting start-failure-is-fatal=false will
eventually cause problems on one of my other resources.
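
For reference, here is a rough sketch of how those two settings can be
applied with pcs. It assumes the primitive's resource ID is DRBDSlave (as
it appears in the logs above) and that the installed pcs supports the
"property set" and "resource meta" subcommands. The run and apply_settings
helpers are hypothetical wrappers added only so the snippet prints the
commands by default instead of changing a live cluster.
start-failure-is-fatal is a cluster-wide property, while failure-timeout is
a meta attribute on the one resource:

```shell
#!/bin/sh
# Sketch only: prints the pcs commands by default (DRY_RUN=1).
# Set DRY_RUN=0 on a cluster node to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_settings() {
    # Cluster-wide property: applies to start failures of every resource.
    run pcs property set start-failure-is-fatal=false
    # Per-resource meta attribute: only DRBDSlave's failures expire after 33s.
    # DRBDSlave is assumed to be the resource ID; adjust if yours differs.
    run pcs resource meta DRBDSlave failure-timeout=33s
}

apply_settings
```

Only the second command is scoped to a single resource; the first changes
behavior for the whole cluster, which is the part I am worried about.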

Thanks for the help,
Sam

