[ClusterLabs] Antw: Re: Pacemaker on-fail standby recovery does not start DRBD slave resource

Thu Apr 7 08:20:33 CEST 2016

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 07.04.2016 at 00:04 in message
<57058805.8050307 at redhat.com>:
> On 03/30/2016 12:18 PM, Sam Gardner wrote:
>> I'll check about the cluster-recheck-interval. Attached is a crm_report.
>> 
>> In the meantime, what exactly is performed on that interval? The Red Hat docs
>> say the following, which doesn't make much sense to me:

In my understanding, the cluster re-probes resources, and if the actual state
differs from the state the cluster believes, actions are triggered and
performed. This can be good or bad: if you deliberately do not monitor some
resources, a cluster recheck will effectively "monitor" them and perform
actions...

> 
> Normally, the cluster only recalculates what actions need to be taken
> when an interesting event occurs -- node or resource failure,
> configuration change, node attribute change, etc.
> 
> The cluster-recheck-interval allows that recalculation to happen
> regardless of (the lack of) events. For example, let's say you have
> rules that specify that certain constraints only apply between 9am and
> 5pm. If there are no events happening at 9am, the rules won't actually
> be noticed or take effect. So the cluster-recheck-interval is the
> granularity of such "time-based changes". A cluster-recheck-interval of
> 5m ensures the rules kick in no later than 9:05am.
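To make the granularity concrete, here is a minimal Python sketch of the worst case described above. It is not Pacemaker code: it assumes rechecks fire at fixed multiples of the interval counted from the last recalculation and that no other event triggers a recalculation in between. The function name and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta


def first_recheck_at_or_after(last_calc: datetime, interval: timedelta,
                              rule_start: datetime) -> datetime:
    """Return the first periodic recheck at or after rule_start.

    Assumption: rechecks fire at last_calc + k * interval (k = 1, 2, ...)
    and nothing else causes a recalculation in between.
    """
    if interval <= timedelta(0):
        raise ValueError("a zero interval disables polling")
    elapsed = rule_start - last_calc
    if elapsed <= timedelta(0):
        # The rule was already in effect at the last recalculation.
        return last_calc
    # Ceiling division on timedeltas: number of intervals needed to reach rule_start.
    steps = -((-elapsed) // interval)
    return last_calc + steps * interval


# Hypothetical timeline: last recalculation at 08:58, rule starts at 09:00.
last_calc = datetime(2016, 3, 30, 8, 58)
rule_start = datetime(2016, 3, 30, 9, 0)
effective = first_recheck_at_or_after(last_calc, timedelta(minutes=5), rule_start)
delay = effective - rule_start
```

Under these assumptions the delay is always shorter than one interval, which is the "no later than 9:05am" bound for a 5m setting.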
> 
> Looking at the crm_report:
> 
> I see "Configuration ERRORs found during PE processing.  Please run
> "crm_verify -L" to identify issues." The offending bit is described a
> little earlier: "error: RecurringOp: Invalid recurring action
> DRBDSlave-start-interval-30s wth name: 'start'". There was a discussion
> on the mailing list recently about this -- a recurring start action is
> meaningless.
> 
> That constraint will be ignored. If you want to set on-fail=standby for
> DRBD starts, use an interval of 0.
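As an illustration of why that operation is rejected, the following Python sketch mimics the kind of check involved. It is not Pacemaker's implementation. The XML fragment is a hypothetical reconstruction: only the id DRBDSlave-start-interval-30s comes from the log, the other ids and values are assumptions, and the interval parser is deliberately simplistic.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical fragment: apart from the first id, nothing here is taken
# from the crm_report.
OPS_XML = """
<operations>
  <op id="DRBDSlave-start-interval-30s" name="start" interval="30s" on-fail="standby"/>
  <op id="DRBDSlave-start-interval-0" name="start" interval="0" on-fail="standby"/>
  <op id="DRBDSlave-monitor-interval-29s" name="monitor" interval="29s"/>
</operations>
"""


def interval_is_zero(value: str) -> bool:
    """Simplified parse: a number with an optional unit suffix ('0', '0s', '30s')."""
    match = re.fullmatch(r"(\d+)\s*[A-Za-z]*", value.strip())
    if match is None:
        raise ValueError(f"unrecognised interval: {value!r}")
    return int(match.group(1)) == 0


def invalid_recurring_starts(ops_xml: str) -> List[str]:
    """Return the ids of 'start' operations that have a non-zero interval."""
    root = ET.fromstring(ops_xml)
    return [
        op.get("id")
        for op in root.iter("op")
        if op.get("name") == "start"
        and not interval_is_zero(op.get("interval", "0"))
    ]


flagged = invalid_recurring_starts(OPS_XML)
```

Only the 30s start is flagged. The interval-0 start with on-fail="standby" passes, as does the recurring monitor.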
> 
> I'd recommend running "crm_verify -L" to see if there are any other
> issues, and take care of them. Once you have a clean crm_verify, run
> "cibadmin --upgrade" to upgrade the XML of your configuration to the
> latest schema. This is just good housekeeping when keeping an older
> configuration after pacemaker upgrades.
> 
> I see "e1000: eth2 NIC Link is Down" shortly before the issue. If you're
> using ifdown/ifup to test failure, be aware that corosync can't recover
> from that particular scenario (known issue, nontrivial to fix). It's
> recommended to simulate a network failure by blocking corosync traffic
> via the local firewall (both inbound and outbound). Or of course you can
> unplug a network cable.
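A rough sketch of the firewall approach, written as a Python helper that only builds the iptables command lines and does not run them. The port numbers are an assumption (5404 and 5405 are commonly seen defaults); check the mcastport setting in your own corosync.conf. The helper name is made up for illustration.

```python
from typing import Iterable, List


def corosync_block_rules(ports: Iterable[int], action: str = "-I") -> List[List[str]]:
    """Build iptables argument lists that drop UDP traffic to the given ports,
    both inbound (INPUT) and outbound (OUTPUT).

    Pass action="-D" to build the matching delete rules for cleanup.
    """
    if action not in ("-I", "-D"):
        raise ValueError("action must be '-I' (insert) or '-D' (delete)")
    rules = []
    for port in ports:
        for chain in ("INPUT", "OUTPUT"):
            rules.append(["iptables", action, chain, "-p", "udp",
                          "--dport", str(port), "-j", "DROP"])
    return rules


# Assumed ports; adjust to your configuration.
block = corosync_block_rules([5404, 5405])
unblock = corosync_block_rules([5404, 5405], action="-D")
```

Each list can be handed to subprocess.run() as root on the node under test. Running the "-D" variants afterwards restores traffic.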
> 
> Are you limited to the "classic openais (with plugin)" cluster stack?
> Corosync 2 is preferred these days, and even corosync 1 + CMAN gets more
> testing than the old plugin.
> 
> If it still happens after looking into those items, I'd need logs from
> both nodes from the failure time to a couple minutes after the
> unstandby. The other node will be the DC at this point and will have the
> more interesting bits.
> 
>>         Polling interval for time-based changes to options, resource
>> parameters and constraints. Allowed values: Zero disables polling, positive
>> values are an interval in seconds (unless other SI units are specified, such
>> as 5min).
>> --
>> Sam Gardner
>> Trustwave | SMART SECURITY ON DEMAND
>> 
>> 
>> 
>> On 3/30/16, 11:46 AM, "Ken Gaillot" <kgaillot at redhat.com> wrote:
>> 
>>> On 03/30/2016 11:20 AM, Sam Gardner wrote:
>>>> I have configured some network resources to automatically standby their
>>>> node if the system detects a failure on them. However, the DRBD slave
>>>> that I have configured does not automatically restart after the node is
>>>> "unstandby-ed" after the failure-timeout expires.
>>>> Is there any way to make the "stopped" DRBDSlave resource automatically
>>>> start again once the node is recovered?
>>>>
>>>> See the progression of events below:
>>>>
>>>> Running cluster:
>>>> Wed Mar 30 16:04:20 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:04:20 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>>  Resource Group: network
>>>>      inif       (ocf::custom:ip.sh):       Started ha-d1.tw.com
>>>>      outif      (ocf::custom:ip.sh):       Started ha-d1.tw.com
>>>>      dmz1       (ocf::custom:ip.sh):       Started ha-d1.tw.com
>>>>  Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>      Masters: [ ha-d1.tw.com ]
>>>>      Slaves: [ ha-d2.tw.com ]
>>>>  Resource Group: filesystem
>>>>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d1.tw.com
>>>>  Resource Group: application
>>>>      service_failover   (ocf::custom:service_failover):    Started ha-d1.tw.com
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>>
>>>>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>>>     ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>>> [153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain
>>>> 0(0), RLE 21(1), total 21; compression: 100.0%
>>>> [153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]:
>>>> plain 0(0), RLE 21(1), total 21; compression: 100.0%
>>>> [153766.568316] block drbd1: helper command: /sbin/drbdadm
>>>> before-resync-source minor-1
>>>> [153766.568356] block drbd1: helper command: /sbin/drbdadm
>>>> before-resync-source minor-1 exit code 255 (0xfffffffe)
>>>> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk(
>>>> Consistent -> Inconsistent )
>>>> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB
>>>> [1 bits set]).
>>>> [153766.568444] block drbd1: updated sync UUID
>>>> B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
>>>> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4
>>>> K/sec)
>>>> [153766.577700] block drbd1: updated UUIDs
>>>> B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
>>>> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk(
>>>> Inconsistent -> UpToDate )
>>>>
>>>> Failure detected:
>>>> Wed Mar 30 16:08:22 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:08:22 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Node ha-d1.tw.com: standby (on-fail)
>>>> Online: [ ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>>  Resource Group: network
>>>>      inif       (ocf::custom:ip.sh):       Started ha-d1.tw.com
>>>>      outif      (ocf::custom:ip.sh):       Started ha-d1.tw.com
>>>>      dmz1       (ocf::custom:ip.sh):       FAILED ha-d1.tw.com
>>>>  Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>      Masters: [ ha-d1.tw.com ]
>>>>      Slaves: [ ha-d2.tw.com ]
>>>>  Resource Group: filesystem
>>>>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d1.tw.com
>>>>  Resource Group: application
>>>>      service_failover   (ocf::custom:service_failover):    Started ha-d1.tw.com
>>>>
>>>> Failed actions:
>>>>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7):
>>>> call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016',
>>>> queued=0ms, exec=0ms
>>>>
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>>
>>>>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>>>     ns:4 nr:0 dw:4 dr:765 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>>> [153766.568356] block drbd1: helper command: /sbin/drbdadm
>>>> before-resync-source minor-1 exit code 255 (0xfffffffe)
>>>> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk(
>>>> Consistent -> Inconsistent )
>>>> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB
>>>> [1 bits set]).
>>>> [153766.568444] block drbd1: updated sync UUID
>>>> B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
>>>> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4
>>>> K/sec)
>>>> [153766.577700] block drbd1: updated UUIDs
>>>> B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
>>>> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk(
>>>> Inconsistent -> UpToDate )
>>>> [154057.455270] e1000: eth2 NIC Link is Down
>>>> [154057.455451] e1000 0000:02:02.0 eth2: Reset adapter
>>>>
>>>> Failover complete:
>>>> Wed Mar 30 16:09:02 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:09:02 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Node ha-d1.tw.com: standby (on-fail)
>>>> Online: [ ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>>  Resource Group: network
>>>>      inif       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>>>>      outif      (ocf::custom:ip.sh):       Started ha-d2.tw.com
>>>>      dmz1       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>>>>  Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>      Masters: [ ha-d2.tw.com ]
>>>>      Stopped: [ ha-d1.tw.com ]
>>>>  Resource Group: filesystem
>>>>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d2.tw.com
>>>>  Resource Group: application
>>>>      service_failover   (ocf::custom:service_failover):    Started ha-d2.tw.com
>>>>
>>>> Failed actions:
>>>>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7):
>>>> call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016',
>>>> queued=0ms, exec=0ms
>>>>
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>> [154094.894524] drbd wwwdata: conn( Disconnecting -> StandAlone )
>>>> [154094.894525] drbd wwwdata: receiver terminated
>>>> [154094.894527] drbd wwwdata: Terminating drbd_r_wwwdata
>>>> [154094.894559] block drbd1: disk( UpToDate -> Failed )
>>>> [154094.894569] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
>>>> [154094.894571] block drbd1: 4 KB (1 bits) marked out-of-sync by on
>>>> disk bit-map.
>>>> [154094.894574] block drbd1: disk( Failed -> Diskless )
>>>> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
>>>> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>>>>
>>>> Standby node recovered, with DRBDSlave stopped (I want DRBDSlave
>>>> started here):
>>>> Wed Mar 30 16:13:01 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:13:01 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>>  Resource Group: network
>>>>      inif       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>>>>      outif      (ocf::custom:ip.sh):       Started ha-d2.tw.com
>>>>      dmz1       (ocf::custom:ip.sh):       Started ha-d2.tw.com
>>>>  Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>      Masters: [ ha-d2.tw.com ]
>>>>      Stopped: [ ha-d1.tw.com ]
>>>>  Resource Group: filesystem
>>>>      DRBDFS     (ocf::heartbeat:Filesystem):    Started ha-d2.tw.com
>>>>  Resource Group: application
>>>>      service_failover   (ocf::custom:service_failover):    Started ha-d2.tw.com
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>> [154094.894574] block drbd1: disk( Failed -> Diskless )
>>>> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
>>>> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>>>>
>>>> --
>>>> Sam Gardner
>>>> Trustwave | SMART SECURITY ON DEMAND
>>>
>>> This might be a bug. A crm_report covering a few minutes around when the
>>> failure expires might help.
>>>
>>> Does the slave start after the next cluster-recheck-interval?
> 
> 