[ClusterLabs] fence_scsi No devices found in cluster to fence

frbnrw at gmx.de
Wed Jun 17 21:55:14 UTC 2015


Hi

Thanks, Ken, for the description of how it works. I changed the configuration and ran some tests.

First I suspended server2, and the resources switched to server1. After I resumed server2, it tried to fail back, but the failback failed because fence_scsi had deleted its registration key.

That works fine for me!

Jun 17 23:19:00 server2 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jun 17 23:19:01 server2 chronyd[619]: Selected source 131.234.137.23
Jun 17 23:19:02 server2 chronyd[619]: Selected source 144.76.4.122
Jun 17 23:19:02 server2 chronyd[619]: System clock wrong by 236.567632 seconds, adjustment started
Jun 17 23:19:02 server2 chronyd[619]: Selected source 131.234.137.23
Jun 17 23:19:02 server2 chronyd[619]: System clock wrong by 1.713843 seconds, adjustment started
Jun 17 23:19:02 server2 kernel: sd 2:0:0:0: reservation conflict
Jun 17 23:19:02 server2 kernel: sd 2:0:0:0: [sda]  
Jun 17 23:19:02 server2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jun 17 23:19:02 server2 kernel: sd 2:0:0:0: [sda] CDB: 
Jun 17 23:19:02 server2 kernel: Write(10): 2a 00 00 00 08 00 00 00 08 00
Jun 17 23:19:02 server2 kernel: end_request: critical nexus error, dev sda, sector 2048
Jun 17 23:19:02 server2 kernel: Buffer I/O error on device sda1, logical block 0
Jun 17 23:19:02 server2 kernel: lost page write due to I/O error on sda1
Jun 17 23:19:02 server2 Filesystem(www2_mnt)[2191]: INFO: unmounted /var/www2 successfully
Jun 17 23:19:02 server2 iscsid: connection1:0 is operational after recovery (1 attempts)
 
But I have a new issue :-)

If I set "pcs constraint order www2_mnt then ClusterIP2", I can no longer move the resources or the group.

[root at server2 ~]# pcs constraint order www2_mnt then ClusterIP2
Adding www2_mnt ClusterIP2 (kind: Mandatory) (Options: first-action=start then-action=start)

Resource     Action   Before/After     Action
ClusterIP2   starts   after www2_mnt   starts

The cluster has problems with the dependency when I run "pcs resource move servicea_sg server2in1" or "pcs resource move servicea_sg server1in1":

Jun 17 22:11:44 server2 crmd[1102]: notice: print_graph: Graph 9 with 11 actions: batch-limit=11 jobs, network-delay=0ms
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   19]: Pending pseudo op servicea_sg_stopped_0            on N/A (priority: 0, waiting:  10 13)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   18]: Completed pseudo op servicea_sg_stop_0             on N/A (priority: 0, waiting: none)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   17]: Pending pseudo op servicea_sg_running_0            on N/A (priority: 0, waiting:  11 14 16)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   16]: Pending pseudo op servicea_sg_start_0              on N/A (priority: 0, waiting:  19)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   12]: Pending rsc op ClusterIP2_monitor_30000            on server2in1 (priority: 0, waiting:  11)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   11]: Pending rsc op ClusterIP2_start_0                  on server2in1 (priority: 0, waiting:  10 14 16)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   10]: Pending rsc op ClusterIP2_stop_0                   on server1in1 (priority: 0, waiting:  13)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   15]: Pending rsc op www2_mnt_monitor_20000              on server2in1 (priority: 0, waiting:  14)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   14]: Pending rsc op www2_mnt_start_0                    on server2in1 (priority: 0, waiting:  11 13 16)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   13]: Pending rsc op www2_mnt_stop_0                     on server1in1 (priority: 0, waiting:  10)
Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action    6]: Pending pseudo op all_stopped                      on N/A (priority: 0, waiting:  10 13)
Jun 17 22:11:44 server2 crmd[1102]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 17 22:12:35 server2 crmd[1102]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
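
(To see what the stuck transition is waiting for, replaying the saved policy-engine input can help; the file name is the one from the pengine log below:)

# replay the saved transition and print the resulting actions
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-296.bz2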

I also tried adding a stop ordering (ClusterIP2 down, then www2_mnt, and so on), but it doesn't help.
If I remove the dependency, the switch works.
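
I remove it like this (the constraint ID below is just what "pcs constraint --full" printed for me; yours may differ):

# list all constraints together with their auto-generated IDs
pcs constraint --full
# remove the ordering constraint by its ID
pcs constraint remove order-www2_mnt-ClusterIP2-mandatory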

Jun 17 22:12:35 server2 pengine[1101]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 17 22:12:35 server2 pengine[1101]: notice: LogActions: Move    ClusterIP2   (Started server1in1 -> server2in1)
Jun 17 22:12:35 server2 pengine[1101]: notice: LogActions: Move    www2_mnt     (Started server1in1 -> server2in1)
Jun 17 22:12:35 server2 pengine[1101]: notice: process_pe_message: Calculated Transition 10: /var/lib/pacemaker/pengine/pe-input-296.bz2
Jun 17 22:12:35 server2 crmd[1102]: notice: te_rsc_command: Initiating action 13: stop www2_mnt_stop_0 on server1in1
Jun 17 22:12:35 server2 crmd[1102]: notice: te_rsc_command: Initiating action 10: stop ClusterIP2_stop_0 on server1in1
Jun 17 22:12:35 server2 crmd[1102]: notice: te_rsc_command: Initiating action 11: start ClusterIP2_start_0 on server2in1 (local)
Jun 17 22:12:35 server2 IPaddr2(ClusterIP2)[2395]: INFO: Adding inet address 192.168.122.112/32 with broadcast address 192.168.122.255 to device 122er
Jun 17 22:12:35 server2 IPaddr2(ClusterIP2)[2395]: INFO: Bringing device 122er up 

Regards

Frank

Sent: Wednesday, 17 June 2015 at 15:58
From: "Ken Gaillot" <kgaillot at redhat.com>
To: "Cluster Labs - All topics related to open-source clustering welcomed" <users at clusterlabs.org>
Subject: Re: [ClusterLabs] fence_scsi No devices found in cluster to fence

----- Original Message -----
> Hi
>
> I set up a two-node cluster based on kvm (it is a test).
>
> Now I'm trying to set up fence_scsi to prevent split brain. One of my
> test scenarios is to suspend the kvm instance and resume it (if I kill
> corosync, something restarts it). I see that the other node does a
> failover, but after I resume the kvm instance the cluster node joins
> again (the service group switches back :-( ) and the node does not die
> (reboot, cluster framework down, etc.) :-)

Hi,

fence_scsi does not reboot or shut down the node; it cuts off the node's
access to the SCSI device by removing the registration key. The idea is
that if the node can't write to shared disk, then it can't cause any
harm.
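
You can check the state of the device at any time with sg_persist (using
the device path from your configuration); as a sketch:

# show all registered keys (one per node that has been unfenced)
sg_persist --in --read-keys --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
# show the active reservation (fence_scsi normally holds "write exclusive, registrants only")
sg_persist --in --read-reservation --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc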

> I also created extremely high system load (while true; do yes >/dev/null&
> done), and the resources switched to the other node. After I killed the
> thousands of yes processes, the pcs service was down, but I don't think
> that was caused by fence_scsi. In the log I found: "notice:
> too_many_st_failures: No devices found in cluster to fence server1in1,
> giving up"
>
> What is wrong in my configuration? Do I need two stonith devices?

One stonith device is fine (assuming cutting off disk access is enough
in your setup to prevent any trouble), but you can set up more if you
want. If you prefer that the node be shut down, you could set up a device
that does that as the preferred stonith device and keep fence_scsi as
a backup, or any other configuration that meets your needs.
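
As a rough sketch (the fence_xvm device here is only an illustration of a
power-fencing agent for KVM guests; use whatever agent fits your environment):

# example power-fencing device for KVM guests (illustration only)
pcs stonith create xvm_server fence_xvm pcmk_host_list="server1in1 server2in1"
# try power fencing first (level 1), fall back to fence_scsi (level 2)
pcs stonith level add 1 server1in1 xvm_server
pcs stonith level add 1 server2in1 xvm_server
pcs stonith level add 2 server1in1 scsi_server
pcs stonith level add 2 server2in1 scsi_server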

> See below for the logs and other output... Many thanks in advance!
>
> Regards
>
> Frank
>
> I created the stonith resource with this command:
>
> pcs stonith create scsi_server fence_scsi pcmk_host_list="server1in1
> server2in1" pcmk_monitor_action="metadata"
> devices="/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc" meta
> provides="unfencing"
>
> I see two keys on my iSCSI device
>
> [root at server1 ~]# sg_persist --in --no-inquiry --read-key
> --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
> PR generation=0x2, 2 registered reservation keys follow:
> 0xfe290001
> 0xfe290000
>
> The current cluster state looks good to me.
>
> Cluster name: nonvcs_cl
> Last updated: Tue Jun 16 10:11:41 2015
> Last change: Tue Jun 16 10:11:37 2015
> Stack: corosync
> Current DC: server2in1 (2) - partition with quorum
> Version: 1.1.12-a14efad
> 2 Nodes configured
> 3 Resources configured
>
>
> Online: [ server1in1 server2in1 ]
>
> Full list of resources:
>
> Resource Group: servicea_sg
> ClusterIP2 (ocf::heartbeat:IPaddr2): Started server1in1
> www2_mnt (ocf::heartbeat:Filesystem): Started server1in1
> scsi_server (stonith:fence_scsi): Started server2in1
>
> PCSD Status:
> server1in1: Online
> server2in1: Online
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
> ##################
> messages | grep stonith
>
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> stonith_device_register: Added 'scsi_server' to the device list (1
> active devices)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (on) 'server2in1' with device '(any)'
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation on for
> server2in1: fd8b714f-6ac3-4227-9937-0d4e7b98e454 (0)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (on) 'server1in1' with device '(any)'
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation on for
> server1in1: e54b60cf-87a3-403f-9061-a4cac2fa7d0d (0)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (on) server2in1:
> static-list
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (on) server2in1:
> static-list
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: log_operation:
> Operation 'on' [13198] (call 22 from crmd.1087) for host 'server2in1'
> with device 'scsi_server' returned: 0 (OK)

Note that fence_scsi is one of the rare fence agents that need an "on"
operation (aka "unfencing") when the node first joins the cluster. You can
see it returned OK here, which means it registered server2in1's key
successfully.

> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: remote_op_done:
> Operation on of server2in1 by <no-one> for crmd.1087 at server2in1.fd8b714f: OK
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 22/4:91:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_notify:
> server2in1 was successfully unfenced by <anyone> (at the request of
> server2in1)
> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: remote_op_done:
> Operation on of server1in1 by <no-one> for crmd.1087 at server2in1.e54b60cf: OK
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 23/3:91:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_notify:
> server1in1 was successfully unfenced by <anyone> (at the request of
> server2in1)
>
> ##################
> Verify is OK
>
> [root at server1 ~]# crm_verify -L
> [root at server1 ~]#
>
> ##################
> After resume one key is deleted
> [root at server1 ~]# sg_persist --in --no-inquiry --read-key
> --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
> PR generation=0x3, 1 registered reservation key follows:
> 0xfe290001
>
> ##################
> Log during the "yes > /dev/null" extreme system load
>
> Jun 16 10:45:39 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:39 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1668) was formed. Members joined: 1
> Jun 16 10:45:41 server2 corosync[959]: [QUORUM] Members[2]: 2 1
> Jun 16 10:45:41 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:45:41 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
>
> ########################
> pcs status during the "while true; do yes > /dev/null& done" extreme system load
>
> [root at server1 ~]# pcs status
> Error: cluster is not currently running on this node
>
> [root at server2 ~]# pcs status | grep server1in1
> Node server1in1 (1): pending
> server1in1: Online
>
> ########################
> Full log
>
> Jun 16 10:31:04 server2 corosync[959]: [TOTEM ] A processor failed,
> forming new configuration.
> Jun 16 10:31:05 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1184) was formed. Members left: 1
> Jun 16 10:31:05 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:31:05 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:31:05 server2 attrd[1085]: notice: crm_update_peer_state:
> attrd_peer_change_cb: Node server1in1[1] - state is now lost (was member)
> Jun 16 10:31:05 server2 attrd[1085]: notice: attrd_peer_remove: Removing
> all server1in1 attributes for attrd_peer_change_cb
> Jun 16 10:31:05 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now lost (was
> member)
> Jun 16 10:31:05 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now lost (was
> member)
> Jun 16 10:31:05 server2 crmd[1087]: warning: match_down_event: No match
> for shutdown action on 1
> Jun 16 10:31:05 server2 crmd[1087]: notice: peer_update_callback:
> Stonith/shutdown of server1in1 not matched
> Jun 16 10:31:05 server2 crmd[1087]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jun 16 10:31:05 server2 crmd[1087]: warning: match_down_event: No match
> for shutdown action on 1
> Jun 16 10:31:05 server2 crmd[1087]: notice: peer_update_callback:
> Stonith/shutdown of server1in1 not matched
> Jun 16 10:31:06 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:31:06 server2 pengine[1086]: warning: pe_fence_node: Node
> server1in1 will be fenced because the node is no longer part of the cluster
> Jun 16 10:31:06 server2 pengine[1086]: warning: determine_online_status:
> Node server1in1 is unclean
> Jun 16 10:31:06 server2 pengine[1086]: warning: custom_action: Action
> ClusterIP2_stop_0 on server1in1 is unrunnable (offline)
> Jun 16 10:31:06 server2 pengine[1086]: warning: custom_action: Action
> www2_mnt_stop_0 on server1in1 is unrunnable (offline)
> Jun 16 10:31:06 server2 pengine[1086]: warning: stage6: Scheduling Node
> server1in1 for STONITH
> Jun 16 10:31:06 server2 pengine[1086]: notice: LogActions: Move
> ClusterIP2 (Started server1in1 -> server2in1)
> Jun 16 10:31:06 server2 pengine[1086]: notice: LogActions: Move
> www2_mnt (Started server1in1 -> server2in1)
> Jun 16 10:31:06 server2 pengine[1086]: warning: process_pe_message:
> Calculated Transition 93: /var/lib/pacemaker/pengine/pe-warn-53.bz2
> Jun 16 10:31:06 server2 crmd[1087]: notice: te_fence_node: Executing
> reboot fencing operation (20) on server1in1 (timeout=60000)
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (reboot) 'server1in1' with device '(any)'
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> server1in1: 910d86d6-a53c-4d14-8b66-3e8ef2043bbf (0)
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (reboot) server1in1:
> static-list
> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
> can_fence_host_with_device: scsi_server can fence (reboot) server1in1:
> static-list
> Jun 16 10:31:06 server2 stonith-ng[1083]: warning:
> stonith_device_execute: Agent 'fence_scsi' does not advertise support
> for 'reboot', performing 'off' action instead
> Jun 16 10:31:07 server2 stonith-ng[1083]: notice: log_operation:
> Operation 'reboot' [13384] (call 24 from crmd.1087) for host
> 'server1in1' with device 'scsi_server' returned: 0 (OK)

Here you can see that server1in1 was fenced, successfully.

Pacemaker still refers to it as a "reboot", but each fence agent
can implement that however it chooses. You can see just above this
that fence_scsi doesn't support "reboot" actions at all, so it
performs an "off" instead, which in its case unregisters the key.
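
You can confirm which actions an agent advertises from its metadata, for
example:

# list the actions fence_scsi advertises -- "reboot" is not among them
fence_scsi -o metadata | grep '<action '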

> Jun 16 10:31:07 server2 stonith-ng[1083]: notice: remote_op_done:
> Operation reboot of server1in1 by <no-one> for
> crmd.1087 at server2in1.910d86d6: OK
> Jun 16 10:31:07 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 24/20:93:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
> Jun 16 10:31:07 server2 crmd[1087]: notice: tengine_stonith_notify: Peer
> server1in1 was terminated (reboot) by <anyone> for server2in1: OK
> (ref=910d86d6-a53c-4d14-8b66-3e8ef2043bbf) by client crmd.1087
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 8: start ClusterIP2_start_0 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: abort_transition_graph:
> Transition aborted by deletion of lrm[@id='1']: Resource state removal
> (cib=0.320.17, source=te_update_diff:429,
> path=/cib/status/node_state[@id='1']/lrm[@id='1'], 0)
> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO: Adding inet
> address 192.168.122.112/32 with broadcast address 192.168.122.255 to
> device 122er
> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO: Bringing
> device 122er up
> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO:
> /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
> /var/run/resource-agents/send_arp-192.168.122.112 122er 192.168.122.112
> auto not_used not_used
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_start_0: ok (node=server2in1, call=118, rc=0, cib-update=601,
> confirmed=true)
> Jun 16 10:31:07 server2 crmd[1087]: notice: run_graph: Transition 93
> (Complete=9, Pending=0, Fired=0, Skipped=4, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-53.bz2): Stopped
> Jun 16 10:31:07 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:31:07 server2 pengine[1086]: notice: LogActions: Start
> www2_mnt (server2in1)
> Jun 16 10:31:07 server2 pengine[1086]: notice: process_pe_message:
> Calculated Transition 94: /var/lib/pacemaker/pengine/pe-input-280.bz2
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 9: monitor ClusterIP2_monitor_30000 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 10: start www2_mnt_start_0 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_monitor_30000: ok (node=server2in1, call=119, rc=0,
> cib-update=603, confirmed=false)
> Jun 16 10:31:07 server2 Filesystem(www2_mnt)[13490]: INFO: Running start
> for /dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc-part1 on
> /var/www2
> Jun 16 10:31:07 server2 kernel: EXT4-fs (sda1): recovery complete
> Jun 16 10:31:07 server2 kernel: EXT4-fs (sda1): mounted filesystem with
> ordered data mode. Opts: (null)
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> www2_mnt_start_0: ok (node=server2in1, call=120, rc=0, cib-update=604,
> confirmed=true)
> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 11: monitor www2_mnt_monitor_20000 on server2in1 (local)
> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
> www2_mnt_monitor_20000: ok (node=server2in1, call=121, rc=0,
> cib-update=605, confirmed=false)
> Jun 16 10:31:07 server2 crmd[1087]: notice: run_graph: Transition 94
> (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-280.bz2): Complete
> Jun 16 10:31:07 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 16 10:31:55 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1188) was formed. Members joined: 1
> Jun 16 10:31:55 server2 attrd[1085]: notice: crm_update_peer_state:
> attrd_peer_change_cb: Node server1in1[1] - state is now member (was lost)
> Jun 16 10:31:55 server2 crmd[1087]: error: pcmk_cpg_membership: Node
> server1in1[1] appears to be online even though we think it is dead

Here server1in1 rejoins the cluster. Without seeing server1in1's logs
it's hard to say what happened here, but this probably shouldn't happen
without manual intervention by you.

> Jun 16 10:31:55 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_cpg_membership: Node server1in1[1] - state is now member (was lost)
> Jun 16 10:31:55 server2 corosync[959]: [QUORUM] Members[2]: 2 1
> Jun 16 10:31:55 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:31:55 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:31:55 server2 crmd[1087]: notice: do_state_transition: State
> transition S_IDLE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
> origin=do_election_count_vote ]
> Jun 16 10:31:56 server2 crmd[1087]: notice: do_state_transition: State
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_TIMER_POPPED origin=election_timeout_popped ]
> Jun 16 10:31:56 server2 crmd[1087]: warning: do_log: FSA: Input
> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> Jun 16 10:31:58 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:31:58 server2 pengine[1086]: error: native_create_actions:
> Resource ClusterIP2 (ocf::IPaddr2) is active on 2 nodes attempting recovery

Here you can see that the IP address resource did not die when the
node was fenced, because it does not depend on the shared storage.
If you plan on keeping just fence_scsi, make the IP address depend on
the shared filesystem resource (via a group or ordering constraint)
so it won't start unless the filesystem is active.
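
For example, something along these lines (with your resource names):

# start the IP only after the filesystem is up ...
pcs constraint order start www2_mnt then start ClusterIP2
# ... and keep it on the node where the filesystem runs
pcs constraint colocation add ClusterIP2 with www2_mnt INFINITY

Since both resources are already in the group servicea_sg, reordering the
group members so the filesystem comes before the IP achieves the same
thing, because group members are implicitly ordered and colocated in their
listed order.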

> Jun 16 10:31:58 server2 pengine[1086]: warning: native_create_actions:
> See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
> information.
> Jun 16 10:31:58 server2 pengine[1086]: error: native_create_actions:
> Resource www2_mnt (ocf::Filesystem) is active on 2 nodes attempting recovery
> Jun 16 10:31:58 server2 pengine[1086]: warning: native_create_actions:
> See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
> information.
> Jun 16 10:31:58 server2 pengine[1086]: notice: LogActions: Restart
> ClusterIP2 (Started server1in1)
> Jun 16 10:31:58 server2 pengine[1086]: notice: LogActions: Restart
> www2_mnt (Started server1in1)
> Jun 16 10:31:58 server2 pengine[1086]: error: process_pe_message:
> Calculated Transition 95: /var/lib/pacemaker/pengine/pe-error-3.bz2
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 16: stop www2_mnt_stop_0 on server2in1 (local)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 15: stop www2_mnt_stop_0 on server1in1
> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: Running stop
> for /dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc-part1 on
> /var/www2
> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: Trying to
> unmount /var/www2
> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: unmounted
> /var/www2 successfully
> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
> www2_mnt_stop_0: ok (node=server2in1, call=123, rc=0, cib-update=637,
> confirmed=true)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 13: stop ClusterIP2_stop_0 on server2in1 (local)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 12: stop ClusterIP2_stop_0 on server1in1
> Jun 16 10:31:58 server2 IPaddr2(ClusterIP2)[13866]: INFO: IP status =
> ok, IP_CIP=
> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_stop_0: ok (node=server2in1, call=125, rc=0, cib-update=638,
> confirmed=true)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 14: start ClusterIP2_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 2: monitor ClusterIP2_monitor_30000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 17: start www2_mnt_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 1: monitor www2_mnt_monitor_20000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: run_graph: Transition 95
> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
> Jun 16 10:31:58 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> ^C
> [root at server2 ~]# tail -f /var/log/messages
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 13: stop ClusterIP2_stop_0 on server2in1 (local)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 12: stop ClusterIP2_stop_0 on server1in1
> Jun 16 10:31:58 server2 IPaddr2(ClusterIP2)[13866]: INFO: IP status =
> ok, IP_CIP=
> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
> ClusterIP2_stop_0: ok (node=server2in1, call=125, rc=0, cib-update=638,
> confirmed=true)
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 14: start ClusterIP2_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 2: monitor ClusterIP2_monitor_30000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 17: start www2_mnt_start_0 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
> action 1: monitor www2_mnt_monitor_20000 on server1in1
> Jun 16 10:31:58 server2 crmd[1087]: notice: run_graph: Transition 95
> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
> Jun 16 10:31:58 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
> #####################################################
> messages yes > /dev/null&
>
> Jun 16 10:44:37 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:44:49 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1612) was formed. Members
> Jun 16 10:44:49 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:44:49 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:00 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1620) was formed. Members
> Jun 16 10:45:00 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:00 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:12 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1628) was formed. Members
> Jun 16 10:45:12 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:12 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:25 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1636) was formed. Members
> Jun 16 10:45:25 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:25 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:30 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1644) was formed. Members
> Jun 16 10:45:30 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:30 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:35 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1652) was formed. Members
> Jun 16 10:45:35 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:35 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:39 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1660) was formed. Members
> Jun 16 10:45:39 server2 corosync[959]: [QUORUM] Members[1]: 2
> Jun 16 10:45:39 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 corosync[959]: [TOTEM ] A new membership
> (192.168.200.131:1668) was formed. Members joined: 1
> Jun 16 10:45:41 server2 corosync[959]: [QUORUM] Members[2]: 2 1
> Jun 16 10:45:41 server2 corosync[959]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 16 10:45:41 server2 crmd[1087]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:45:41 server2 pacemakerd[1081]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
> lost)
> Jun 16 10:55:00 server2 crmd[1087]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 16 10:55:00 server2 pengine[1086]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
> ClusterIP2_monitor_0 on server1in1 is unrunnable (pending)
> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
> www2_mnt_monitor_0 on server1in1 is unrunnable (pending)
> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
> scsi_server_monitor_0 on server1in1 is unrunnable (pending)
> Jun 16 10:55:00 server2 pengine[1086]: notice: trigger_unfencing:
> Unfencing server1in1: node discovery
> Jun 16 10:55:00 server2 pengine[1086]: notice: process_pe_message:
> Calculated Transition 101: /var/lib/pacemaker/pengine/pe-input-284.bz2
> Jun 16 10:55:00 server2 crmd[1087]: notice: te_fence_node: Executing on
> fencing operation (4) on server1in1 (timeout=60000)
> Jun 16 10:55:00 server2 stonith-ng[1083]: notice: handle_request: Client
> crmd.1087.b3e11b2e wants to fence (on) 'server1in1' with device '(any)'
> Jun 16 10:55:00 server2 stonith-ng[1083]: notice:
> initiate_remote_stonith_op: Initiating remote operation on for
> server1in1: 3b0b3967-6f33-4b68-9f4d-246b69e0370a (0)
> Jun 16 10:55:00 server2 stonith-ng[1083]: notice: stonith_choose_peer:
> Couldn't find anyone to fence server1in1 with <any>
> Jun 16 10:55:00 server2 stonith-ng[1083]: error: remote_op_done:
> Operation on of server1in1 by <no-one> for
> crmd.1087 at server2in1.3b0b3967: No such device
> Jun 16 10:55:00 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 27/4:101:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: No
> such device (-19)
> Jun 16 10:55:00 server2 crmd[1087]: notice: tengine_stonith_callback:
> Stonith operation 27 for server1in1 failed (No such device): aborting
> transition.
> Jun 16 10:55:00 server2 crmd[1087]: notice: abort_transition_graph:
> Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> Jun 16 10:55:00 server2 crmd[1087]: error: tengine_stonith_notify:
> Unfencing of server1in1 by <anyone> failed: No such device (-19)
> Jun 16 10:55:00 server2 crmd[1087]: notice: run_graph: Transition 101
> (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-284.bz2): Stopped
> Jun 16 10:55:00 server2 crmd[1087]: notice: too_many_st_failures: No
> devices found in cluster to fence server1in1, giving up
> Jun 16 10:55:00 server2 crmd[1087]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]


_______________________________________________
Users mailing list: Users at clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org[http://bugs.clusterlabs.org]



