[Pacemaker] Pacemaker fencing and DLM/cLVM

Michael Schwartzkopff ms at sys4.de
Mon Nov 24 09:44:25 EST 2014


On Monday, 24 November 2014, 15:14:26, Daniel Dehennin wrote:
> Hello,
> 
> In my pacemaker/corosync cluster it looks like I have some issues with
> fencing ACK on DLM/cLVM.
> 
> When a node is fenced, DLM/cLVM are not aware of the fencing result and
> LVM commands hang until I run “dlm_tool fence_ack <ID_OF_THE_NODE>”.
> 
> Here are some logs from around the fencing of nebula1:
> 
> Nov 24 09:51:06 nebula3 crmd[6043]:  warning: update_failcount: Updating failcount for clvm on nebula1 after failed stop: rc=1 (update=INFINITY, time=1416819066)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node nebula1 for STONITH
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    dlm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    clvm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-warn-1.bz2
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node nebula1 for STONITH
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    dlm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    clvm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Nov 24 09:51:06 nebula3 crmd[6043]:   notice: te_fence_node: Executing reboot fencing operation (79) on nebula1 (timeout=30000)
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: handle_request: Client crmd.6043.5ec58277 wants to fence (reboot) 'nebula1' with device '(any)'
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula1: 50c93bed-e66f-48a5-bd2f-100a9e7ca7a1 (0)
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-nebula1-IPMILAN can fence nebula1: static-list
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can not fence nebula1: static-list
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-ONE-Frontend can not fence nebula1: static-list
> Nov 24 09:51:09 nebula3 corosync[5987]:   [TOTEM ] A processor failed, forming new configuration.
> Nov 24 09:51:13 nebula3 corosync[5987]:   [TOTEM ] A new membership (192.168.231.71:81200) was formed. Members left: 1084811078
> Nov 24 09:51:13 nebula3 lvm[6311]: confchg callback. 0 joined, 1 left, 2 members
> Nov 24 09:51:13 nebula3 corosync[5987]:   [QUORUM] Members[2]: 1084811079 1084811080
> Nov 24 09:51:13 nebula3 corosync[5987]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Nov 24 09:51:13 nebula3 pacemakerd[6036]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
> Nov 24 09:51:13 nebula3 crmd[6043]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
> Nov 24 09:51:13 nebula3 kernel: [  510.140107] dlm: closing connection to node 1084811078
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811079 walltime 1416819073 local 509
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 pid 7142 nodedown time 1416819073 fence_all dlm_stonith
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence result 1084811078 pid 7142 result 1 exit status
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811080 walltime 1416819073 local 509
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 no actor
> Nov 24 09:51:13 nebula3 stonith-ng[6039]:   notice: remote_op_done: Operation reboot of nebula1 by nebula2 for crmd.6043@nebula3.50c93bed: OK
> Nov 24 09:51:13 nebula3 crmd[6043]:   notice: tengine_stonith_callback: Stonith operation 4/79:5:0:817919e5-fa6d-4381-b0bd-42141ce0bb41: OK (0)
> Nov 24 09:51:13 nebula3 crmd[6043]:   notice: tengine_stonith_notify: Peer nebula1 was terminated (reboot) by nebula2 for nebula3: OK (ref=50c93bed-e66f-48a5-bd2f-100a9e7ca7a1) by client crmd.6043
> Nov 24 09:51:13 nebula3 crmd[6043]:   notice: te_rsc_command: Initiating action 22: start Stonith-nebula3-IPMILAN_start_0 on nebula2
> Nov 24 09:51:14 nebula3 crmd[6043]:   notice: run_graph: Transition 5 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Nov 24 09:51:14 nebula3 pengine[6042]:   notice: process_pe_message: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-2.bz2
> Nov 24 09:51:14 nebula3 crmd[6043]:   notice: te_rsc_command: Initiating action 21: monitor Stonith-nebula3-IPMILAN_monitor_1800000 on nebula2
> Nov 24 09:51:15 nebula3 crmd[6043]:   notice: run_graph: Transition 6 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2.bz2): Complete
> Nov 24 09:51:15 nebula3 crmd[6043]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 datastores wait for fencing
> Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 clvmd wait for fencing
> Nov 24 09:55:10 nebula3 dlm_controld[6263]: 747 fence status 1084811078 receive -125 from 1084811079 walltime 1416819310 local 747
> 
> When the node is fenced I get “clvmd wait for fencing” and “datastores
> wait for fencing” (datastores is my GFS2 volume).
> 
> Any idea what I can check when this happens?
> 
> Regards.

Yes. You have to tell all the underlying infrastructure (DLM and cLVM) to use 
Pacemaker's fencing. I assume that you are working on a RH clone.
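
In practice that means dlm_controld must obtain the fencing result from
Pacemaker's stonithd instead of running (and waiting for) its own fencing.
A minimal sketch of /etc/dlm/dlm.conf, assuming a dlm_controld 4.x with the
dlm_stonith proxy (the path is illustrative; verify the option names against
dlm.conf(5) on your distribution):

  # /etc/dlm/dlm.conf -- sketch only
  # Delegate fencing to Pacemaker: dlm_stonith asks stonithd whether the
  # failed node has already been fenced and reports the result back to dlm.
  fence_all /usr/sbin/dlm_stonith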

See: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02s03.html
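
As for what to check when the lockspaces hang: the commands below show both
sides of the hand-over (the node id is taken from your log; treat the exact
invocations as a sketch):

  dlm_tool ls                      # lockspaces still waiting for a fence ack
  dlm_tool dump | grep fence       # dlm_controld's record of the fence exchange
  stonith_admin --history nebula1  # what Pacemaker's stonithd knows about it
  dlm_tool fence_ack 1084811078    # the manual acknowledgement you already use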


Kind regards,

Michael Schwartzkopff

-- 
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein