[Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh

David Vossel dvossel at redhat.com
Thu Jan 2 12:10:59 EST 2014

----- Original Message -----
> From: "Digimer" <lists at alteeve.ca>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, December 23, 2013 6:57:53 PM
> Subject: Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 +	fence_virsh
> 
> On 23/12/13 04:31 PM, Digimer wrote:
> > On 23/12/13 02:31 PM, David Vossel wrote:
> >>
> >>
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Digimer" <lists at alteeve.ca>
> >>> To: "The Pacemaker cluster resource manager"
> >>> <pacemaker at oss.clusterlabs.org>
> >>> Sent: Monday, December 23, 2013 12:42:23 PM
> >>> Subject: Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker
> >>> 1.1.10 +    fence_virsh
> >>>
> >>> On 23/12/13 01:30 PM, David Vossel wrote:
> >>>> ----- Original Message -----
> >>>>> From: "Digimer" <lists at alteeve.ca>
> >>>>> To: "The Pacemaker cluster resource manager"
> >>>>> <pacemaker at oss.clusterlabs.org>
> >>>>> Sent: Saturday, December 21, 2013 2:39:46 PM
> >>>>> Subject: [Pacemaker] Problem with stonith in rhel7 + pacemaker
> >>>>> 1.1.10 +
> >>>>>     fence_virsh
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>>      I'm trying to learn pacemaker (still) using a pair of RHEL 7 beta
> >>>>> VMs. I've got stonith configured and it technically works (crashed
> >>>>> node
> >>>>> reboots), but pacemaker hangs...
> >>>>>
> >>>>> Here is the config:
> >>>>>
> >>>>> ====
> >>>>> Cluster Name: rhel7-pcmk
> >>>>> Corosync Nodes:
> >>>>>     rhel7-01.alteeve.ca rhel7-02.alteeve.ca
> >>>>> Pacemaker Nodes:
> >>>>>     rhel7-01.alteeve.ca rhel7-02.alteeve.ca
> >>>>>
> >>>>> Resources:
> >>>>>
> >>>>> Stonith Devices:
> >>>>>     Resource: fence_n01_virsh (class=stonith type=fence_virsh)
> >>>>>      Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot
> >>>>> login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
> >>>>>      Operations: monitor interval=60s
> >>>>>      (fence_n01_virsh-monitor-interval-60s)
> >>>>>     Resource: fence_n02_virsh (class=stonith type=fence_virsh)
> >>>>>      Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot
> >>>>
> >>>>
> >>>> When using fence_virt, the easiest way to configure everything is to
> >>>> name
> >>>> the actual virtual machines the same as what their corosync node
> >>>> names are
> >>>> going to be.
> >>>>
> >>>> If you run this command in a virtual machine, you can see the names
> >>>> fence_virt thinks all the nodes are.
> >>>> fence_xvm -o list
> >>>> node1          c4dbe904-f51a-d53f-7ef0-2b03361c6401 on
> >>>> node2          c4dbe904-f51a-d53f-7ef0-2b03361c6402 on
> >>>> node3          c4dbe904-f51a-d53f-7ef0-2b03361c6403 on
> >>>>
> >>>> If you name the vm the same as the node name, you don't even have to
> >>>> list
> >>>> the static host list. Stonith will do all that magic behind the
> >>>> scenes. If
> >>>> the node names do not match, try the 'pcmk_host_map' option. I
> >>>> believe you
> >>>> should be able to map the corosync node name to the vm's name using
> >>>> that
> >>>> option.
> >>>>
> >>>> Hope that helps :)
> >>>>
> >>>> -- Vossel
> >>>
> >>> Hi David,
> >>>
> >>>     I'm using fence_virsh,
> >>
> >> ah sorry, missed that.
> >>
> >>> not fence_virtd/fence_xvm. For reasons I've
> >>> not been able to resolve, fence_xvm has been unreliable on Fedora for
> >>> some time now.
> >>
> >> the multicast bug :(
> >
> > That's the one.
> >
> > I'm rebuilding the nodes now with VM/virsh names that match the host
> > name. Will see if that helps/makes a difference.
> >
> 
> This looks a little better:
> 
> ====
> Dec 23 19:53:33 an-c03n02 corosync[1652]: [TOTEM ] A processor failed,
> forming new configuration.
> Dec 23 19:53:34 an-c03n02 corosync[1652]: [TOTEM ] A new membership
> (192.168.122.102:24) was formed. Members left: 1
> Dec 23 19:53:34 an-c03n02 corosync[1652]: [QUORUM] Members[1]: 2
> Dec 23 19:53:34 an-c03n02 corosync[1652]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Dec 23 19:53:34 an-c03n02 pacemakerd[1667]: notice:
> crm_update_peer_state: pcmk_quorum_notification: Node
> an-c03n01.alteeve.ca[1] - state is now lost (was member)
> Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node an-c03n01.alteeve.ca[1] - state is now
> lost (was member)
> Dec 23 19:53:34 an-c03n02 crmd[1673]: warning: match_down_event: No
> match for shutdown action on 1
> Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: peer_update_callback:
> Stonith/shutdown of an-c03n01.alteeve.ca not matched
> Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Dec 23 19:53:34 an-c03n02 crmd[1673]: warning: match_down_event: No
> match for shutdown action on 1
> Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: peer_update_callback:
> Stonith/shutdown of an-c03n01.alteeve.ca not matched
> Dec 23 19:53:34 an-c03n02 attrd[1671]: notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Dec 23 19:53:34 an-c03n02 attrd[1671]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Dec 23 19:53:35 an-c03n02 pengine[1672]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: pe_fence_node: Node
> an-c03n01.alteeve.ca will be fenced because the node is no longer part
> of the cluster
> Dec 23 19:53:35 an-c03n02 pengine[1672]: warning:
> determine_online_status: Node an-c03n01.alteeve.ca is unclean
> Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: custom_action: Action
> fence_n01_virsh_stop_0 on an-c03n01.alteeve.ca is unrunnable (offline)
> Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: stage6: Scheduling
> Node an-c03n01.alteeve.ca for STONITH
> Dec 23 19:53:35 an-c03n02 pengine[1672]: notice: LogActions: Move
> fence_n01_virsh    (Started an-c03n01.alteeve.ca -> an-c03n02.alteeve.ca)
> Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: process_pe_message:
> Calculated Transition 1: /var/lib/pacemaker/pengine/pe-warn-0.bz2
> Dec 23 19:53:35 an-c03n02 crmd[1673]: notice: te_fence_node: Executing
> reboot fencing operation (11) on an-c03n01.alteeve.ca (timeout=60000)
> Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice: handle_request:
> Client crmd.1673.ebd55f11 wants to fence (reboot) 'an-c03n01.alteeve.ca'
> with device '(any)'
> Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> an-c03n01.alteeve.ca: 12d11de0-ba58-4b28-b0ce-90069b49a177 (0)
> Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
> can_fence_host_with_device: fence_n01_virsh can fence
> an-c03n01.alteeve.ca: static-list
> Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
> can_fence_host_with_device: fence_n02_virsh can not fence
> an-c03n01.alteeve.ca: static-list
> Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
> can_fence_host_with_device: fence_n01_virsh can fence
> an-c03n01.alteeve.ca: static-list
> Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
> can_fence_host_with_device: fence_n02_virsh can not fence
> an-c03n01.alteeve.ca: static-list
> Dec 23 19:53:35 an-c03n02 fence_virsh: Parse error: Ignoring unknown
> option 'nodename=an-c03n01.alteeve.ca
> Dec 23 19:53:52 an-c03n02 stonith-ng[1669]: notice: log_operation:
> Operation 'reboot' [1767] (call 2 from crmd.1673) for host
> 'an-c03n01.alteeve.ca' with device 'fence_n01_virsh' returned: 0 (OK)
> Dec 23 19:53:52 an-c03n02 stonith-ng[1669]: notice: remote_op_done:
> Operation reboot of an-c03n01.alteeve.ca by an-c03n02.alteeve.ca for
> crmd.1673 at an-c03n02.alteeve.ca.12d11de0: OK
> Dec 23 19:53:52 an-c03n02 crmd[1673]: notice: tengine_stonith_callback:
> Stonith operation 2/11:1:0:e2533a5d-933a-4c0b-bbba-ca59493a09bd: OK (0)
> Dec 23 19:53:52 an-c03n02 crmd[1673]: notice: tengine_stonith_notify:
> Peer an-c03n01.alteeve.ca was terminated (reboot) by
> an-c03n02.alteeve.ca for an-c03n02.alteeve.ca: OK
> (ref=12d11de0-ba58-4b28-b0ce-90069b49a177) by client crmd.1673
> Dec 23 19:53:52 an-c03n02 crmd[1673]: notice: te_rsc_command: Initiating
> action 6: start fence_n01_virsh_start_0 on an-c03n02.alteeve.ca (local)
> Dec 23 19:53:52 an-c03n02 stonith-ng[1669]: notice:
> stonith_device_register: Device 'fence_n01_virsh' already existed in
> device list (2 active devices)
> Dec 23 19:53:54 an-c03n02 crmd[1673]: notice: process_lrm_event: LRM
> operation fence_n01_virsh_start_0 (call=12, rc=0, cib-update=46,
> confirmed=true) ok
> Dec 23 19:53:54 an-c03n02 crmd[1673]: notice: run_graph: Transition 1
> (Complete=5, Pending=0, Fired=0, Skipped=1, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
> Dec 23 19:53:54 an-c03n02 pengine[1672]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Dec 23 19:53:54 an-c03n02 pengine[1672]: notice: process_pe_message:
> Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-2.bz2
> Dec 23 19:53:54 an-c03n02 crmd[1673]: notice: te_rsc_command: Initiating
> action 7: monitor fence_n01_virsh_monitor_60000 on an-c03n02.alteeve.ca
> (local)
> Dec 23 19:53:55 an-c03n02 crmd[1673]: notice: process_lrm_event: LRM
> operation fence_n01_virsh_monitor_60000 (call=13, rc=0, cib-update=48,
> confirmed=false) ok
> Dec 23 19:53:55 an-c03n02 crmd[1673]: notice: run_graph: Transition 2
> (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-2.bz2): Complete
> Dec 23 19:53:55 an-c03n02 crmd[1673]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> ====
> 
> Once the node booted back up, it was able to rejoin the surviving peer.
> I've not tested much more yet, but that's already an improvement, so far
> as I can tell.
> 
> So if the failure was caused by the VM name (as seen by virsh) not
> matching the node's hostname, would that be a pacemaker or fence_virsh bug?

fence_virsh. But before you file a bug, make sure to give the pcmk_host_map fence option a try. It lets you map cluster node names to VM names for the fencing agent.
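
For example, with your original setup (corosync node rhel7-01.alteeve.ca, virsh domain rhel7_01), something along these lines should do it (untested sketch; adjust the names to whatever 'virsh list --all' shows on the host):

pcs stonith update fence_n01_virsh pcmk_host_map="rhel7-01.alteeve.ca:rhel7_01"
pcs stonith update fence_n02_virsh pcmk_host_map="rhel7-02.alteeve.ca:rhel7_02"

With the map in place you shouldn't need pcmk_host_list as well; stonithd will translate the cluster node name to the mapped value and pass that to the agent as the port. You can also double-check which domain names the agent itself sees with something like:

fence_virsh -a lemass -l root -S /root/lemass.pw -o list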

-- Vossel

> Thanks for the help, fellow "what's a holiday?"er!
> 
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



