[ClusterLabs] fence_scsi No devices found in cluster to fence

Ken Gaillot kgaillot at redhat.com
Wed Jun 17 13:58:25 UTC 2015



----- Original Message -----
> Hi
> 
> I set up a two-node cluster based on kvm (it is a test).
> 
> Now I'm trying to set up fence_scsi to prevent split brain. One of my
> test scenarios is to suspend the kvm instance and then resume it (if I
> kill corosync, something restarts it). I see that the other node does a
> failover, but after I resume the kvm instance the cluster node joins
> again (the service group switches back :-( ) and the node does not die
> (reboot, cluster framework down, etc.) :-)

Hi,

fence_scsi does not reboot or shut down the node; it cuts off the node's
access to the SCSI device by removing the registration key. The idea is
that if the node can't write to shared disk, then it can't cause any
harm.
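As a quick check (device path taken from your configuration below), sg_persist can show both the registered keys and the reservation itself; after a node is fenced, its key disappears from the first listing:

```shell
# List registered keys -- one per node that has been unfenced.
sg_persist --in --no-inquiry --read-key \
    --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc

# Show the active persistent reservation (holder key and type).
sg_persist --in --no-inquiry --read-reservation \
    --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
```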

> I also created an extremely high system load (while true; do yes >/dev/null&
> done), and the resources switched to the other node. After I killed the
> thousands of yes processes, the pcs service was down, but I think that is
> not caused by fence_scsi. In the log I found "notice:
> too_many_st_failures: No devices found in cluster to fence server1in1,
> giving up"
> 
> What is wrong in my configuration? Do I need two stonith devices? Or what
> is wrong?

One stonith device is fine (assuming cutting off disk access is enough
in your setup to prevent any trouble), but you can set up more if you
want. If you prefer that the node be shut down, you could set up a device
that does that as the preferred stonith device and keep fence_scsi as
a backup, or any other configuration that meets your needs.
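For example (the fence_ipmilan device and its IPMI details here are placeholders, not from your setup), you could register a power-based fence device and use fencing levels so it is tried before fence_scsi:

```shell
# Hypothetical power fence device for server1in1 (addresses/credentials
# are placeholders).
pcs stonith create ipmi_server1 fence_ipmilan \
    pcmk_host_list="server1in1" ipaddr="10.0.0.11" \
    login="admin" passwd="secret"

# Level 1 is tried first; level 2 (fence_scsi) only if level 1 fails.
pcs stonith level add 1 server1in1 ipmi_server1
pcs stonith level add 2 server1in1 scsi_server
```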

> See below for the logs and other output... Many thanks in advance!
> 
> Regards
> 
> Frank
> 
> I created the stonith resource with this command:
> 
> pcs stonith create scsi_server fence_scsi pcmk_host_list="server1in1
> server2in1" pcmk_monitor_action="metadata"
> devices="/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc" meta
> provides="unfencing"
> 
> I see two keys on my iSCSI device
> 
> [root at server1 ~]# sg_persist --in --no-inquiry --read-key
> --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
>    PR generation=0x2, 2 registered reservation keys follow:
>      0xfe290001
>      0xfe290000
> 
> The current cluster state looks good to me.
> 
> Cluster name: nonvcs_cl
> Last updated: Tue Jun 16 10:11:41 2015
> Last change: Tue Jun 16 10:11:37 2015
> Stack: corosync
> Current DC: server2in1 (2) - partition with quorum
> Version: 1.1.12-a14efad
> 2 Nodes configured
> 3 Resources configured
> 
> 
> Online: [ server1in1 server2in1 ]
> 
> Full list of resources:
> 
>   Resource Group: servicea_sg
>       ClusterIP2    (ocf::heartbeat:IPaddr2):    Started server1in1
>       www2_mnt    (ocf::heartbeat:Filesystem):    Started server1in1
>   scsi_server    (stonith:fence_scsi):    Started server2in1
> 
> PCSD Status:
>    server1in1: Online
>    server2in1: Online
> 
> Daemon Status:
>    corosync: active/enabled
>    pacemaker: active/disabled
>    pcsd: active/enabled
> 
> ##################
> messages | grep stonith
> 
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> stonith_device_register: Added 'scsi_server' to the device list (1
> active devices)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (on) 'server2in1' with device '(any)'
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation on for
> server2in1: fd8b714f-6ac3-4227-9937-0d4e7b98e454 (0)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (on) 'server1in1' with device '(any)'
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation on for
> server1in1: e54b60cf-87a3-403f-9061-a4cac2fa7d0d (0)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (on) server2in1:
> static-list
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (on) server2in1:
> static-list
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: log_operation:
> Operation 'on' [13198] (call 22 from crmd.1087) for host 'server2in1'
> with device 'scsi_server' returned: 0 (OK)

Note that fence_scsi is one of the rare fence agents that needs an "on"
operation (aka "unfencing") when a node first joins the cluster. You can
see it returned OK here, which means it registered server2in1's key
successfully.

> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: remote_op_done:
> Operation on of server2in1 by <no-one> for crmd.1087 at server2in1.fd8b714f: OK
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 22/4:91:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_notify:
> server2in1 was successfully unfenced by <anyone> (at the request of
> server2in1)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: remote_op_done:
> Operation on of server1in1 by <no-one> for crmd.1087 at server2in1.e54b60cf: OK
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 23/3:91:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_notify:
> server1in1 was successfully unfenced by <anyone> (at the request of
> server2in1)
> 
> ##################
> Verify is OK
> 
> [root at server1 ~]# crm_verify -L
> [root at server1 ~]#
> 
> ##################
> After resume one key is deleted
> [root at server1 ~]# sg_persist --in --no-inquiry --read-key
> --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
>    PR generation=0x3, 1 registered reservation key follows:
>      0xfe290001
> 
> ##################
> Log yes > /dev/null extreme system load
> 
> Jun 16 10:45:39 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:39 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1668) was formed. Members joined: 1
> Jun 16 10:45:41 server2 corosync[959]: [QUORUM] Members[2]: 2 1
> Jun 16 10:45:41 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:45:41 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> 
> ########################
> pcs status during "while true.. yes > /dev/null&" extreme system load
> 
> [root at server1 ~]# pcs status
> Error: cluster is not currently running on this node
> 
> [root at server2 ~]# pcs status | grep server1in1
> Node server1in1 (1): pending
>    server1in1: Online
> 
> ########################
> Full log
> 
> Jun 16 10:31:04 server2 corosync[959]: [TOTEM ] A processor failed,
> forming new configuration.
> Jun 16 10:31:05 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1184) was formed. Members left: 1
> Jun 16 10:31:05 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:31:05 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:31:05 server2 attrd[1085]: notice: crm_update_peer_state:
> attrd_peer_change_cb: Node server1in1[1] - state is now lost (was member)
> Jun 16 10:31:05 server2 attrd[1085]: notice: attrd_peer_remove: Removing
> all server1in1 attributes for attrd_peer_change_cb
> Jun 16 10:31:05 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now lost (was
> member)
> Jun 16 10:31:05 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now lost (was
> member)
> Jun 16 10:31:05 server2 crmd[1087]: warning: match_down_event: No match
> for shutdown action on 1
> Jun 16 10:31:05 server2 crmd[1087]: notice: peer_update_callback:
> Stonith/shutdown of server1in1 not matched
> Jun 16 10:31:05 server2 crmd[1087]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jun 16 10:31:05 server2 crmd[1087]: warning: match_down_event: No match
> for shutdown action on 1
> Jun 16 10:31:05 server2 crmd[1087]: notice: peer_update_callback:
> Stonith/shutdown of server1in1 not matched
> Jun 16 10:31:06 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:31:06 server2 pengine[1086]: warning: pe_fence_node: Node
> server1in1 will be fenced because the node is no longer part of the cluster
> Jun 16 10:31:06 server2 pengine[1086]: warning: determine_online_status:
> Node server1in1 is unclean
> Jun 16 10:31:06 server2 pengine[1086]: warning: custom_action: Action
> ClusterIP2_stop_0 on server1in1 is unrunnable (offline)
> Jun 16 10:31:06 server2 pengine[1086]: warning: custom_action: Action
> www2_mnt_stop_0 on server1in1 is unrunnable (offline)
> Jun 16 10:31:06 server2 pengine[1086]: warning: stage6: Scheduling Node
> server1in1 for STONITH
> Jun 16 10:31:06 server2 pengine[1086]: notice: LogActions: Move
> ClusterIP2    (Started server1in1 -> server2in1)
> Jun 16 10:31:06 server2 pengine[1086]: notice: LogActions: Move
> www2_mnt    (Started server1in1 -> server2in1)
> Jun 16 10:31:06 server2 pengine[1086]: warning: process_pe_message:
> Calculated Transition 93: /var/lib/pacemaker/pengine/pe-warn-53.bz2
> Jun 16 10:31:06 server2 crmd[1087]: notice: te_fence_node: Executing
> reboot fencing operation (20) on server1in1 (timeout=60000)
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (reboot) 'server1in1' with device '(any)'
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> server1in1: 910d86d6-a53c-4d14-8b66-3e8ef2043bbf (0)
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (reboot) server1in1:
> static-list
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (reboot) server1in1:
> static-list
> Jun 16 10:31:06 server2 stonith-ng[1083]: warning:
> stonith_device_execute: Agent 'fence_scsi' does not advertise support
> for 'reboot', performing 'off' action instead
> Jun 16 10:31:07 server2 stonith-ng[1083]: notice: log_operation:
> Operation 'reboot' [13384] (call 24 from crmd.1087) for host
> 'server1in1' with device 'scsi_server' returned: 0 (OK)

Here you can see that server1in1 was fenced successfully.

Pacemaker still refers to it as a "reboot", but each fence agent
can implement that however it chooses. You can see just above this
that fence_scsi doesn't support "reboot" actions at all, so it
performs an "off" instead, which in its case unregisters the key.
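Under the hood, that "off" amounts to a SCSI-3 preempt-and-abort, roughly like this (the key values are illustrative; fence_scsi derives the real ones from the node IDs):

```shell
# The surviving node (registered key 0xfe290001) preempts and aborts the
# victim's registration (0xfe290000), cutting off its writes to the LUN.
sg_persist --out --no-inquiry --preempt-abort \
    --param-rk=0xfe290001 --param-sark=0xfe290000 \
    --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
```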

> Jun 16 10:31:07 server2 stonith-ng[1083]: notice: remote_op_done:
> Operation reboot of server1in1 by <no-one> for
> crmd.1087 at server2in1.910d86d6: OK
> Jun 16 10:31:07 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 24/20:93:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
> Jun 16 10:31:07 server2 crmd[1087]: notice: tengine_stonith_notify: Peer
> server1in1 was terminated (reboot) by <anyone> for server2in1: OK
> (ref=910d86d6-a53c-4d14-8b66-3e8ef2043bbf) by client crmd.1087
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 8: start ClusterIP2_start_0 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: abort_transition_graph:
> Transition aborted by deletion of lrm[@id='1']: Resource state removal
> (cib=0.320.17, source=te_update_diff:429,
> path=/cib/status/node_state[@id='1']/lrm[@id='1'], 0)
> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO: Adding inet
> address 192.168.122.112/32 with broadcast address 192.168.122.255 to
> device 122er
> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO: Bringing
> device 122er up
> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO:
> /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
> /var/run/resource-agents/send_arp-192.168.122.112 122er 192.168.122.112
> auto not_used not_used
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_start_0: ok (node=server2in1, call=118, rc=0, cib-update=601,
> confirmed=true)
> Jun 16 10:31:07 server2 crmd[1087]: notice: run_graph: Transition 93
> (Complete=9, Pending=0, Fired=0, Skipped=4, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-53.bz2): Stopped
> Jun 16 10:31:07 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:31:07 server2 pengine[1086]: notice: LogActions: Start
> www2_mnt    (server2in1)
> Jun 16 10:31:07 server2 pengine[1086]: notice: process_pe_message:
> Calculated Transition 94: /var/lib/pacemaker/pengine/pe-input-280.bz2
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 9: monitor ClusterIP2_monitor_30000 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 10: start www2_mnt_start_0 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_monitor_30000: ok (node=server2in1, call=119, rc=0,
> cib-update=603, confirmed=false)
> Jun 16 10:31:07 server2 Filesystem(www2_mnt)[13490]: INFO: Running start
> for /dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc-part1 on
> /var/www2
> Jun 16 10:31:07 server2 kernel: EXT4-fs (sda1): recovery complete
> Jun 16 10:31:07 server2 kernel: EXT4-fs (sda1): mounted filesystem with
> ordered data mode. Opts: (null)
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> www2_mnt_start_0: ok (node=server2in1, call=120, rc=0, cib-update=604,
> confirmed=true)
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 11: monitor www2_mnt_monitor_20000 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> www2_mnt_monitor_20000: ok (node=server2in1, call=121, rc=0,
> cib-update=605, confirmed=false)
> Jun 16 10:31:07 server2 crmd[1087]: notice: run_graph: Transition 94
> (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-280.bz2): Complete
> Jun 16 10:31:07 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 16 10:31:55 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1188) was formed. Members joined: 1
> Jun 16 10:31:55 server2 attrd[1085]: notice: crm_update_peer_state:
> attrd_peer_change_cb: Node server1in1[1] - state is now member (was lost)
> Jun 16 10:31:55 server2 crmd[1087]: error: pcmk_cpg_membership: Node
> server1in1[1] appears to be online even though we think it is dead

Here server1in1 rejoins the cluster. Without seeing server1in1's logs
it's hard to say exactly what happened, but the node probably should not
have rejoined without manual intervention on your part.

> Jun 16 10:31:55 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_cpg_membership: Node server1in1[1] - state is now member (was lost)
> Jun 16 10:31:55 server2 corosync[959]: [QUORUM] Members[2]: 2 1
> Jun 16 10:31:55 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:31:55 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:31:55 server2 crmd[1087]: notice: do_state_transition: State
> transition S_IDLE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
> origin=do_election_count_vote ]
> Jun 16 10:31:56 server2 crmd[1087]: notice: do_state_transition: State
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_TIMER_POPPED origin=election_timeout_popped ]
> Jun 16 10:31:56 server2 crmd[1087]: warning: do_log: FSA: Input
> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> Jun 16 10:31:58 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:31:58 server2 pengine[1086]: error: native_create_actions:
> Resource ClusterIP2 (ocf::IPaddr2) is active on 2 nodes attempting recovery

Here you can see that the IP address resource did not die when the
node was fenced, because it does not depend on the shared storage.
If you plan on keeping just fence_scsi, make the IP address depend on
the shared filesystem resource (via a group or ordering constraint)
so it won't start unless the filesystem is active.
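Since both resources are already in the group servicea_sg (group members are implicitly ordered and colocated), the simplest fix is to move the filesystem ahead of the IP within the group; the exact pcs syntax may vary with your pcs version:

```shell
# Move ClusterIP2 after www2_mnt within the existing group, so the
# start order becomes: www2_mnt first, then ClusterIP2.
pcs resource group add servicea_sg ClusterIP2 --after www2_mnt
```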

> Jun 16 10:31:58 server2 pengine[1086]: warning: native_create_actions:
> See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
> information.
> Jun 16 10:31:58 server2 pengine[1086]: error: native_create_actions:
> Resource www2_mnt (ocf::Filesystem) is active on 2 nodes attempting recovery
> Jun 16 10:31:58 server2 pengine[1086]: warning: native_create_actions:
> See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
> information.
> Jun 16 10:31:58 server2 pengine[1086]: notice: LogActions: Restart
> ClusterIP2    (Started server1in1)
> Jun 16 10:31:58 server2 pengine[1086]: notice: LogActions: Restart
> www2_mnt    (Started server1in1)
> Jun 16 10:31:58 server2 pengine[1086]: error: process_pe_message:
> Calculated Transition 95: /var/lib/pacemaker/pengine/pe-error-3.bz2
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 16: stop www2_mnt_stop_0 on server2in1 (local)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 15: stop www2_mnt_stop_0 on server1in1
> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: Running stop
> for /dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc-part1 on
> /var/www2
> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: Trying to
> unmount /var/www2
> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: unmounted
> /var/www2 successfully
> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
> www2_mnt_stop_0: ok (node=server2in1, call=123, rc=0, cib-update=637,
> confirmed=true)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 13: stop ClusterIP2_stop_0 on server2in1 (local)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 12: stop ClusterIP2_stop_0 on server1in1
> Jun 16 10:31:58 server2 IPaddr2(ClusterIP2)[13866]: INFO: IP status =
> ok, IP_CIP=
> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_stop_0: ok (node=server2in1, call=125, rc=0, cib-update=638,
> confirmed=true)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 14: start ClusterIP2_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 2: monitor ClusterIP2_monitor_30000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 17: start www2_mnt_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 1: monitor www2_mnt_monitor_20000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: run_graph: Transition 95
> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
> Jun 16 10:31:58 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> ^C
> [root at server2 ~]# tail -f /var/log/messages
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 13: stop ClusterIP2_stop_0 on server2in1 (local)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 12: stop ClusterIP2_stop_0 on server1in1
> Jun 16 10:31:58 server2 IPaddr2(ClusterIP2)[13866]: INFO: IP status =
> ok, IP_CIP=
> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_stop_0: ok (node=server2in1, call=125, rc=0, cib-update=638,
> confirmed=true)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 14: start ClusterIP2_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 2: monitor ClusterIP2_monitor_30000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 17: start www2_mnt_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 1: monitor www2_mnt_monitor_20000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: run_graph: Transition 95
> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
> Jun 16 10:31:58 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> 
> #####################################################
> messages yes > /dev/null&
> 
> Jun 16 10:44:37 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:44:49 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1612) was formed. Members
> Jun 16 10:44:49 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:44:49 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:00 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1620) was formed. Members
> Jun 16 10:45:00 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:00 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:12 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1628) was formed. Members
> Jun 16 10:45:12 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:12 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:25 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1636) was formed. Members
> Jun 16 10:45:25 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:25 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:30 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1644) was formed. Members
> Jun 16 10:45:30 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:30 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:35 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1652) was formed. Members
> Jun 16 10:45:35 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:35 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:39 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1660) was formed. Members
> Jun 16 10:45:39 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:39 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1668) was formed. Members joined: 1
> Jun 16 10:45:41 server2 corosync[959]: [QUORUM] Members[2]: 2 1
> Jun 16 10:45:41 server2 corosync[959]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:45:41 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:55:00 server2 crmd[1087]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 16 10:55:00 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
> ClusterIP2_monitor_0 on server1in1 is unrunnable (pending)
> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
> www2_mnt_monitor_0 on server1in1 is unrunnable (pending)
> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
> scsi_server_monitor_0 on server1in1 is unrunnable (pending)
> Jun 16 10:55:00 server2 pengine[1086]: notice: trigger_unfencing:
> Unfencing server1in1: node discovery
> Jun 16 10:55:00 server2 pengine[1086]: notice: process_pe_message:
> Calculated Transition 101: /var/lib/pacemaker/pengine/pe-input-284.bz2
> Jun 16 10:55:00 server2 crmd[1087]: notice: te_fence_node: Executing on
> fencing operation (4) on server1in1 (timeout=60000)
> Jun 16 10:55:00 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (on) 'server1in1' with device '(any)'
> Jun 16 10:55:00 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation on for
> server1in1: 3b0b3967-6f33-4b68-9f4d-246b69e0370a (0)
> Jun 16 10:55:00 server2 stonith-ng[1083]: notice: stonith_choose_peer:
> Couldn't find anyone to fence server1in1 with <any>
> Jun 16 10:55:00 server2 stonith-ng[1083]: error: remote_op_done:
> Operation on of server1in1 by <no-one> for
> crmd.1087 at server2in1.3b0b3967: No such device
> Jun 16 10:55:00 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 27/4:101:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: No
> such device (-19)
> Jun 16 10:55:00 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 27 for server1in1 failed (No such device): aborting
> transition.
> Jun 16 10:55:00 server2 crmd[1087]: notice: abort_transition_graph:
> Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> Jun 16 10:55:00 server2 crmd[1087]: error: tengine_stonith_notify:
> Unfencing of server1in1 by <anyone> failed: No such device (-19)
> Jun 16 10:55:00 server2 crmd[1087]: notice: run_graph: Transition 101
> (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-284.bz2): Stopped
> Jun 16 10:55:00 server2 crmd[1087]: notice: too_many_st_failures: No
> devices found in cluster to fence server1in1, giving up
> Jun 16 10:55:00 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]




