[Pacemaker] Trying to figure out a constraint

Digimer lists at alteeve.ca
Thu Jun 19 00:16:54 EDT 2014


On 19/06/14 12:06 AM, Digimer wrote:
<snip>
>
> After sending this, I found that adding:
>
> handlers {
>      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
>
> allowed the constraint to be removed, so node 2 (an-a04n02) eventually
> promoted, but not before going into the failed state shown above.
>
> A subsequent stop -> start of pacemaker on both nodes started cleanly, with
> no fence action reported in /var/log/messages. I noticed this time that the
> drbd module was loaded; not sure if that made a difference.
>
> Will keep testing... Any insight is much appreciated.

Ok, that didn't help... It's still resource-fencing on start *most* (not 
all) of the time.

When I start pacemaker, and pacemaker starts DRBD (nearly simultaneously 
on both nodes), I see this:

====
Jun 19 00:14:22 an-a04n01 crmd[16895]:   notice: do_state_transition: 
State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 19 00:14:22 an-a04n01 attrd[16893]:   notice: attrd_local_callback: 
Sending full refresh (origin=crmd)
Jun 19 00:14:22 an-a04n01 pengine[16894]:   notice: unpack_config: On 
loss of CCM Quorum: Ignore
Jun 19 00:14:22 an-a04n01 pengine[16894]:   notice: LogActions: Start 
fence_n01_ipmi#011(an-a04n01.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]:   notice: LogActions: Start 
fence_n02_ipmi#011(an-a04n02.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]:   notice: LogActions: Start 
drbd_r0:0#011(an-a04n01.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]:   notice: LogActions: Start 
drbd_r0:1#011(an-a04n02.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]:   notice: process_pe_message: 
Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-230.bz2
Jun 19 00:14:22 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 8: monitor fence_n01_ipmi_monitor_0 on 
an-a04n02.alteeve.ca
Jun 19 00:14:22 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 4: monitor fence_n01_ipmi_monitor_0 on 
an-a04n01.alteeve.ca (local)
Jun 19 00:14:22 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 9: monitor fence_n02_ipmi_monitor_0 on 
an-a04n02.alteeve.ca
Jun 19 00:14:22 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 5: monitor fence_n02_ipmi_monitor_0 on 
an-a04n01.alteeve.ca (local)
Jun 19 00:14:22 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 6: monitor drbd_r0:0_monitor_0 on an-a04n01.alteeve.ca 
(local)
Jun 19 00:14:22 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 10: monitor drbd_r0:1_monitor_0 on an-a04n02.alteeve.ca
Jun 19 00:14:23 an-a04n01 crmd[16895]:   notice: process_lrm_event: LRM 
operation drbd_r0_monitor_0 (call=14, rc=7, cib-update=28, 
confirmed=true) not running
Jun 19 00:14:23 an-a04n01 crmd[16895]:   notice: process_lrm_event: 
an-a04n01.alteeve.ca-drbd_r0_monitor_0:14 [ \n ]
Jun 19 00:14:23 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 3: probe_complete probe_complete on 
an-a04n01.alteeve.ca (local) - no waiting
Jun 19 00:14:23 an-a04n01 attrd[16893]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Jun 19 00:14:23 an-a04n01 attrd[16893]:   notice: attrd_perform_update: 
Sent update 4: probe_complete=true
Jun 19 00:14:23 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 7: probe_complete probe_complete on 
an-a04n02.alteeve.ca - no waiting
Jun 19 00:14:23 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 11: start fence_n01_ipmi_start_0 on 
an-a04n01.alteeve.ca (local)
Jun 19 00:14:23 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 13: start fence_n02_ipmi_start_0 on an-a04n02.alteeve.ca
Jun 19 00:14:23 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 15: start drbd_r0:0_start_0 on an-a04n01.alteeve.ca 
(local)
Jun 19 00:14:24 an-a04n01 stonith-ng[16891]:   notice: 
stonith_device_register: Device 'fence_n01_ipmi' already existed in 
device list (2 active devices)
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 17: start drbd_r0:1_start_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: process_lrm_event: LRM 
operation fence_n01_ipmi_start_0 (call=19, rc=0, cib-update=29, 
confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 12: monitor fence_n01_ipmi_monitor_60000 on 
an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 14: monitor fence_n02_ipmi_monitor_60000 on 
an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: process_lrm_event: LRM 
operation fence_n01_ipmi_monitor_60000 (call=24, rc=0, cib-update=30, 
confirmed=false) ok
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Starting worker thread 
(from cqueue [3265])
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: disk( Diskless -> 
Attaching )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Found 4 transactions (126 
active extents) in activity log.
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Method to ensure write 
ordering: flush
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: drbd_bm_resize called 
with capacity == 909525832
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: resync bitmap: 
bits=113690729 words=1776418 pages=3470
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: size = 434 GB (454762916 KB)
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: bitmap READ of 3470 pages 
took 8 jiffies
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: recounting of set bits 
took additional 16 jiffies
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: 0 KB (0 bits) marked 
out-of-sync by on disk bit-map.
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: disk( Attaching -> 
Consistent )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: attached to UUIDs 
561F3328043888C0:0000000000000000:052A1A6B59936EC5:05291A6B59936EC5
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: conn( StandAlone -> 
Unconnected )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Starting receiver thread 
(from drbd0_worker [17045])
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: receiver (re)started
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: conn( Unconnected -> 
WFConnection )
Jun 19 00:14:24 an-a04n01 attrd[16893]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-drbd_r0 (5)
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: process_lrm_event: LRM 
operation drbd_r0_start_0 (call=21, rc=0, cib-update=31, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 48: notify drbd_r0:0_post_notify_start_0 on 
an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 attrd[16893]:   notice: attrd_perform_update: 
Sent update 9: master-drbd_r0=5
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 49: notify drbd_r0:1_post_notify_start_0 on 
an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 attrd[16893]:   notice: attrd_perform_update: 
Sent update 11: master-drbd_r0=5
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: process_lrm_event: LRM 
operation drbd_r0_notify_0 (call=28, rc=0, cib-update=0, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: run_graph: Transition 0 
(Complete=23, Pending=0, Fired=0, Skipped=2, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-230.bz2): Stopped
Jun 19 00:14:24 an-a04n01 pengine[16894]:   notice: unpack_config: On 
loss of CCM Quorum: Ignore
Jun 19 00:14:24 an-a04n01 pengine[16894]:   notice: LogActions: Promote 
drbd_r0:0#011(Slave -> Master an-a04n01.alteeve.ca)
Jun 19 00:14:24 an-a04n01 pengine[16894]:   notice: LogActions: Promote 
drbd_r0:1#011(Slave -> Master an-a04n02.alteeve.ca)
Jun 19 00:14:24 an-a04n01 pengine[16894]:   notice: process_pe_message: 
Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-231.bz2
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 52: notify drbd_r0_pre_notify_promote_0 on 
an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 54: notify drbd_r0_pre_notify_promote_0 on 
an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: process_lrm_event: LRM 
operation drbd_r0_notify_0 (call=31, rc=0, cib-update=0, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 13: promote drbd_r0_promote_0 on an-a04n01.alteeve.ca 
(local)
Jun 19 00:14:24 an-a04n01 crmd[16895]:   notice: te_rsc_command: 
Initiating action 16: promote drbd_r0_promote_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: helper command: 
/sbin/drbdadm fence-peer minor-0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: Handshake successful: 
Agreed network protocol version 97
Jun 19 00:14:25 an-a04n01 crm-fence-peer.sh[17156]: invoked for r0
Jun 19 00:14:25 an-a04n01 cibadmin[17188]:   notice: crm_log_args: 
Invoked: cibadmin -C -o constraints -X <rsc_location rsc="drbd_r0_Clone" 
id="drbd-fence-by-handler-r0-drbd_r0_Clone">#012  <rule role="Master" 
score="-INFINITY" id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">#012 
    <expression attribute="#uname" operation="ne" 
value="an-a04n01.alteeve.ca" 
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>#012 
</rule>#012</rsc_location>
Jun 19 00:14:25 an-a04n01 crmd[16895]:   notice: handle_request: Current 
ping state: S_TRANSITION_ENGINE
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: Diff: --- 0.94.19
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: Diff: +++ 
0.95.1 4f095b8add6dcbb173de1254bf02fcf6
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: -- <cib 
admin_epoch="0" epoch="94" num_updates="19"/>
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: ++ 
<rsc_location rsc="drbd_r0_Clone" 
id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: ++ 
<rule role="Master" score="-INFINITY" 
id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: ++ 
<expression attribute="#uname" operation="ne" 
value="an-a04n01.alteeve.ca" 
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: ++         </rule>
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: ++ 
</rsc_location>
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]:   notice: unpack_config: On 
loss of CCM Quorum: Ignore
Jun 19 00:14:25 an-a04n01 crm-fence-peer.sh[17156]: INFO peer is 
reachable, my disk is Consistent: placed constraint 
'drbd-fence-by-handler-r0-drbd_r0_Clone'
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: helper command: 
/sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: fence-peer helper 
returned 4 (peer was fenced)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: role( Secondary -> 
Primary ) disk( Consistent -> UpToDate ) pdsk( DUnknown -> Outdated )
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: new current UUID 
25DF173CF8D89023:561F3328043888C0:052A1A6B59936EC5:05291A6B59936EC5
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: conn( WFConnection -> 
WFReportParams )
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: Starting asender thread 
(from drbd0_receiver [17062])
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: data-integrity-alg: 
<not-used>
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]:   notice: 
stonith_device_register: Device 'fence_n01_ipmi' already existed in 
device list (2 active devices)
Jun 19 00:14:25 an-a04n01 cib[16890]:  warning: update_results: Action 
cib_create failed: Name not unique on network (cde=-76)
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures   <failed>
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures     <failed_update 
id="drbd-fence-by-handler-r0-drbd_r0_Clone" object_type="rsc_location" 
operation="cib_create" reason="Name not unique on network">
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures       <rsc_location rsc="drbd_r0_Clone" 
id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures         <rule role="Master" score="-INFINITY" 
id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures           <expression attribute="#uname" operation="ne" 
value="an-a04n02.alteeve.ca" 
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures         </rule>
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures       </rsc_location>
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures     </failed_update>
Jun 19 00:14:25 an-a04n01 cib[16890]:    error: cib_process_create: CIB 
Update failures   </failed>
Jun 19 00:14:25 an-a04n01 cib[16890]:  warning: cib_process_request: 
Completed cib_create operation for section constraints: Name not unique 
on network (rc=-76, origin=an-a04n02.alteeve.ca/cibadmin/2, version=0.95.1)
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]:   notice: 
stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 
active devices)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: drbd_sync_handshake:
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: self 
25DF173CF8D89023:561F3328043888C0:052A1A6B59936EC5:05291A6B59936EC5 
bits:0 flags:0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: peer 
561F3328043888C0:0000000000000000:052A1A6B59936EC4:05291A6B59936EC5 
bits:0 flags:0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: uuid_compare()=1 by rule 70
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: peer( Unknown -> 
Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> 
Consistent )
Jun 19 00:14:25 an-a04n01 crmd[16895]:   notice: process_lrm_event: LRM 
operation drbd_r0_promote_0 (call=34, rc=0, cib-update=33, 
confirmed=true) ok
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: helper command: 
/sbin/drbdadm before-resync-source minor-0
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: helper command: 
/sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: conn( WFBitMapS -> 
SyncSource ) pdsk( Consistent -> Inconsistent )
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: Began resync as 
SyncSource (will sync 0 KB [0 bits set]).
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: updated sync UUID 
25DF173CF8D89023:56203328043888C0:561F3328043888C0:052A1A6B59936EC5
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: Diff: --- 0.95.2
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: Diff: +++ 
0.96.1 86f147e11a7e9934f7b2a686715dcca6
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: -- 
<rsc_location rsc="drbd_r0_Clone" 
id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: -- 
<rule role="Master" score="-INFINITY" 
id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: -- 
<expression attribute="#uname" operation="ne" 
value="an-a04n01.alteeve.ca" 
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: --         </rule>
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: -- 
</rsc_location>
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: ++ <cib 
admin_epoch="0" cib-last-written="Thu Jun 19 00:14:26 2014" 
crm_feature_set="3.0.7" epoch="96" have-quorum="1" num_updates="1" 
update-client="cibadmin" update-origin="an-a04n02.alteeve.ca" 
validate-with="pacemaker-1.2" dc-uuid="an-a04n01.alteeve.ca"/>
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]:   notice: unpack_config: On 
loss of CCM Quorum: Ignore
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]:   notice: 
stonith_device_register: Device 'fence_n01_ipmi' already existed in 
device list (2 active devices)
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]:   notice: 
stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 
active devices)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: Resync done (total 1 sec; 
paused 0 sec; 0 K/sec)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: updated UUIDs 
25DF173CF8D89023:0000000000000000:56203328043888C0:561F3328043888C0
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: conn( SyncSource -> 
Connected ) pdsk( Inconsistent -> UpToDate )
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: bitmap WRITE of 3470 
pages took 9 jiffies
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: 0 KB (0 bits) marked 
out-of-sync by on disk bit-map.
====
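
A note for anyone reading the log above: the "#011" and "#012" sequences are 
the syslog daemon's octal escapes for tab and newline (rsyslog on RHEL 6 does 
this by default, as far as I know), which is why the cibadmin XML looks 
mangled. Below is a minimal sketch of a helper to undo that; the function 
name unescape_syslog is my own and is not part of Pacemaker or DRBD.

```python
import re

def unescape_syslog(line: str) -> str:
    """Replace '#NNN' octal escapes (e.g. #011, #012) with the real character."""
    # Only three octal digits after '#' are treated as an escape, so
    # attribute="#uname" in the constraint XML is left untouched.
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)

if __name__ == "__main__":
    sample = '<rule role="Master" score="-INFINITY">#012</rule>#012</rsc_location>'
    print(unescape_syslog(sample))
```

Run against the cibadmin line, this should make it easier to see the 
rsc_location constraint that crm-fence-peer.sh asked for.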

It seems to immediately fence as soon as DRBD starts, and I can't see 
why it feels the need to do this...

RHEL 6.5, DRBD 8.3.16.

I am really stumped... any help would be much appreciated!

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
