[ClusterLabs] Pacemaker on-fail standby recovery does not start DRBD slave resource

Ken Gaillot kgaillot at redhat.com
Wed Mar 30 12:46:21 EDT 2016


On 03/30/2016 11:20 AM, Sam Gardner wrote:
> I have configured some network resources to automatically put their node into standby if the system detects a failure on them. However, the DRBD slave that I have configured does not automatically restart once the node comes out of standby after the failure-timeout expires.
> Is there any way to make the "stopped" DRBDSlave resource automatically start again once the node is recovered?
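
For reference, on-fail=standby with a failure-timeout is usually
configured along these lines (a sketch in crm shell syntax; the 7s
interval matches your dmz1_monitor_7000 action, but the failure-timeout
value here is an assumption):

    # a failed monitor puts the whole node in standby; the failure
    # record expires after failure-timeout and the node comes back
    primitive dmz1 ocf:custom:ip.sh \
        op monitor interval=7s on-fail=standby \
        meta failure-timeout=120s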
> 
> See the progression of events below:
> 
> Running cluster:
> Wed Mar 30 16:04:20 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:04:20 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
> 
> 
> Online: [ ha-d1.tw.com ha-d2.tw.com ]
> 
> Full list of resources:
> 
>  Resource Group: network
>      inif       (ocf::custom:ip.sh):       Started ha-d1.tw.com
>      outif      (ocf::custom:ip.sh):       Started ha-d1.tw.com
>      dmz1       (ocf::custom:ip.sh):       Started ha-d1.tw.com
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>      Masters: [ ha-d1.tw.com ]
>      Slaves: [ ha-d2.tw.com ]
>  Resource Group: filesystem
>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d1.tw.com
>  Resource Group: application
>      service_failover   (ocf::custom:service_failover):    Started ha-d1.tw.com
> 
> 
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> 
>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> [153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> [153766.568316] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
> [153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
> [153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> 
> Failure detected:
> Wed Mar 30 16:08:22 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:08:22 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
> 
> 
> Node ha-d1.tw.com: standby (on-fail)
> Online: [ ha-d2.tw.com ]
> 
> Full list of resources:
> 
>  Resource Group: network
>      inif       (ocf::custom:ip.sh):       Started ha-d1.tw.com
>      outif      (ocf::custom:ip.sh):       Started ha-d1.tw.com
>      dmz1       (ocf::custom:ip.sh):       FAILED ha-d1.tw.com
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>      Masters: [ ha-d1.tw.com ]
>      Slaves: [ ha-d2.tw.com ]
>  Resource Group: filesystem
>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d1.tw.com
>  Resource Group: application
>      service_failover   (ocf::custom:service_failover):    Started ha-d1.tw.com
> 
> Failed actions:
>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms
> 
> 
> 
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> 
>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:4 nr:0 dw:4 dr:765 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
> [153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> [154057.455270] e1000: eth2 NIC Link is Down
> [154057.455451] e1000 0000:02:02.0 eth2: Reset adapter
> 
> Failover complete:
> Wed Mar 30 16:09:02 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:09:02 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
> 
> 
> Node ha-d1.tw.com: standby (on-fail)
> Online: [ ha-d2.tw.com ]
> 
> Full list of resources:
> 
>  Resource Group: network
>      inif       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>      outif      (ocf::custom:ip.sh):       Started ha-d2.tw.com
>      dmz1       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>      Masters: [ ha-d2.tw.com ]
>      Stopped: [ ha-d1.tw.com ]
>  Resource Group: filesystem
>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d2.tw.com
>  Resource Group: application
>      service_failover   (ocf::custom:service_failover):    Started ha-d2.tw.com
> 
> Failed actions:
>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms
> 
> 
> 
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> [154094.894524] drbd wwwdata: conn( Disconnecting -> StandAlone )
> [154094.894525] drbd wwwdata: receiver terminated
> [154094.894527] drbd wwwdata: Terminating drbd_r_wwwdata
> [154094.894559] block drbd1: disk( UpToDate -> Failed )
> [154094.894569] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
> [154094.894571] block drbd1: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
> [154094.894574] block drbd1: disk( Failed -> Diskless )
> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
> 
> Standby node recovered, with DRBDSlave stopped (I want DRBDSlave started here):
> Wed Mar 30 16:13:01 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:13:01 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
> 
> 
> Online: [ ha-d1.tw.com ha-d2.tw.com ]
> 
> Full list of resources:
> 
>  Resource Group: network
>      inif       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>      outif      (ocf::custom:ip.sh):       Started ha-d2.tw.com
>      dmz1       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>      Masters: [ ha-d2.tw.com ]
>      Stopped: [ ha-d1.tw.com ]
>  Resource Group: filesystem
>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d2.tw.com
>  Resource Group: application
>      service_failover   (ocf::custom:service_failover):    Started ha-d2.tw.com
> 
> 
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> [154094.894574] block drbd1: disk( Failed -> Diskless )
> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
> 
> --
> Sam Gardner
> Trustwave | SMART SECURITY ON DEMAND

This might be a bug. A crm_report covering a few minutes around the
time the failure-timeout expires might help.
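
Something like this should capture the relevant window (timestamps
taken from your output; adjust as needed):

    # run as root on one node; gathers logs and the CIB from both nodes
    crm_report -f "2016-03-30 16:08" -t "2016-03-30 16:20" /tmp/standby-recovery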

Does the slave start after the next cluster-recheck-interval?
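
You can check (and, as a test, shorten) that interval with
crm_attribute; the default is 15 minutes, so an expired failure-timeout
can sit unnoticed for a while:

    # query the current value (no value set means the 15min default)
    crm_attribute --type crm_config --name cluster-recheck-interval --query
    # temporarily shorten it so expired failures are re-evaluated sooner
    crm_attribute --type crm_config --name cluster-recheck-interval --update 2min

As a manual workaround, cleaning up the failed resource should clear
the failure right away and let the stopped slave start:

    crm_resource --cleanup --resource dmz1 --node ha-d1.tw.com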



