[ClusterLabs] fence_scsi No devices found in cluster to fence

Andrew Beekhof andrew at beekhof.net
Mon Aug 3 23:47:31 EDT 2015


> On 18 Jun 2015, at 7:55 am, frbnrw at gmx.de wrote:
> 
> Hi
> 
> thanks Ken for the description of the function. I changed the configuration and did some tests.
> 
> First I suspended server2 and the resources switched to server1. After I resumed server2, it tried to fail back, but the failback failed because fence_scsi had deleted its registration key.
> 
> So for me it works as intended!
> 
> Jun 17 23:19:00 server2 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Jun 17 23:19:01 server2 chronyd[619]: Selected source 131.234.137.23
> Jun 17 23:19:02 server2 chronyd[619]: Selected source 144.76.4.122
> Jun 17 23:19:02 server2 chronyd[619]: System clock wrong by 236.567632 seconds, adjustment started
> Jun 17 23:19:02 server2 chronyd[619]: Selected source 131.234.137.23
> Jun 17 23:19:02 server2 chronyd[619]: System clock wrong by 1.713843 seconds, adjustment started
> Jun 17 23:19:02 server2 kernel: sd 2:0:0:0: reservation conflict
> Jun 17 23:19:02 server2 kernel: sd 2:0:0:0: [sda]  
> Jun 17 23:19:02 server2 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Jun 17 23:19:02 server2 kernel: sd 2:0:0:0: [sda] CDB: 
> Jun 17 23:19:02 server2 kernel: Write(10): 2a 00 00 00 08 00 00 00 08 00
> Jun 17 23:19:02 server2 kernel: end_request: critical nexus error, dev sda, sector 2048
> Jun 17 23:19:02 server2 kernel: Buffer I/O error on device sda1, logical block 0
> Jun 17 23:19:02 server2 kernel: lost page write due to I/O error on sda1
> Jun 17 23:19:02 server2 Filesystem(www2_mnt)[2191]: INFO: unmounted /var/www2 successfully
> Jun 17 23:19:02 server2 iscsid: connection1:0 is operational after recovery (1 attempts)
>  
> But I've got a new issue :-)
> 
> If I set "pcs constraint order www2_mnt then ClusterIP2", I can no longer switch the resources or the group.
> 
> [root at server2 ~]# pcs constraint order www2_mnt then ClusterIP2
> Adding www2_mnt ClusterIP2 (kind: Mandatory) (Options: first-action=start then-action=start)

crm_report?  It's hard to say what's going on from a small snapshot of the logs and config.
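
For example, something like this (the time window and destination name are
placeholders, adjust them to the window around the failed move):

  crm_report -f "2015-06-17 22:00:00" -t "2015-06-17 22:20:00" /tmp/move-failure

That gathers the logs, the CIB and the pe-input files from every node for
that window into one tarball.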

> 
> Resource     Action   Before/After      Action
> ClusterIP2   starts   after www2_mnt    starts
> 
> The cluster has problems with the dependency when I run "pcs resource move
> servicea_sg server2in1" or "pcs resource move servicea_sg server1in1".
> 
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_graph: Graph 9 with 11 actions: batch-limit=11 jobs, network-delay=0ms
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   19]: Pending pseudo op servicea_sg_stopped_0            on N/A (priority: 0, waiting:  10 13)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   18]: Completed pseudo op servicea_sg_stop_0             on N/A (priority: 0, waiting: none)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   17]: Pending pseudo op servicea_sg_running_0            on N/A (priority: 0, waiting:  11 14 16)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   16]: Pending pseudo op servicea_sg_start_0              on N/A (priority: 0, waiting:  19)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   12]: Pending rsc op ClusterIP2_monitor_30000            on server2in1 (priority: 0, waiting:  11)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   11]: Pending rsc op ClusterIP2_start_0                  on server2in1 (priority: 0, waiting:  10 14 16)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   10]: Pending rsc op ClusterIP2_stop_0                   on server1in1 (priority: 0, waiting:  13)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   15]: Pending rsc op www2_mnt_monitor_20000              on server2in1 (priority: 0, waiting:  14)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   14]: Pending rsc op www2_mnt_start_0                    on server2in1 (priority: 0, waiting:  11 13 16)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action   13]: Pending rsc op www2_mnt_stop_0                     on server1in1 (priority: 0, waiting:  10)
> Jun 17 22:11:44 server2 crmd[1102]: notice: print_synapse: [Action    6]: Pending pseudo op all_stopped                      on N/A (priority: 0, waiting:  10 13)
> Jun 17 22:11:44 server2 crmd[1102]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 17 22:12:35 server2 crmd[1102]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> 
> I also tried to set a stop rule (ClusterIP2 stops first, then www2_mnt,
> etc.), but it doesn't help.
> If I remove the dependency, the switch works.
> 
> Jun 17 22:12:35 server2 pengine[1101]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jun 17 22:12:35 server2 pengine[1101]: notice: LogActions: Move    ClusterIP2   (Started server1in1 -> server2in1)
> Jun 17 22:12:35 server2 pengine[1101]: notice: LogActions: Move    www2_mnt     (Started server1in1 -> server2in1)
> Jun 17 22:12:35 server2 pengine[1101]: notice: process_pe_message: Calculated Transition 10: /var/lib/pacemaker/pengine/pe-input-296.bz2
> Jun 17 22:12:35 server2 crmd[1102]: notice: te_rsc_command: Initiating action 13: stop www2_mnt_stop_0 on server1in1
> Jun 17 22:12:35 server2 crmd[1102]: notice: te_rsc_command: Initiating action 10: stop ClusterIP2_stop_0 on server1in1
> Jun 17 22:12:35 server2 crmd[1102]: notice: te_rsc_command: Initiating action 11: start ClusterIP2_start_0 on server2in1 (local)
> Jun 17 22:12:35 server2 IPaddr2(ClusterIP2)[2395]: INFO: Adding inet address 192.168.122.112/32 with broadcast address 192.168.122.255 to device 122er
> Jun 17 22:12:35 server2 IPaddr2(ClusterIP2)[2395]: INFO: Bringing device 122er up 
> 
> Regards
> 
> Frank
> 
> Sent: Wednesday, 17 June 2015 at 15:58
> From: "Ken Gaillot" <kgaillot at redhat.com>
> To: "Cluster Labs - All topics related to open-source clustering welcomed" <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] fence_scsi No devices found in cluster to fence
> 
> ----- Original Message -----
>> Hi
>> 
>> I set up a two-node cluster based on KVM (it is a test).
>> 
>> Now I'm trying to set up fence_scsi to prevent split brain. One of my
>> test scenarios is to suspend the KVM instance and then resume it (if I
>> kill corosync, something just restarts it). I see that the other node
>> does a failover, but after I resume the KVM instance the cluster node
>> joins again (the service group switches back :-( ) and the node does
>> not die (reboot, cluster framework down, etc.) :-)
> 
> Hi,
> 
> fence_scsi does not reboot or shut down the node; it cuts off the node's
> access to the SCSI device by removing the registration key. The idea is
> that if the node can't write to shared disk, then it can't cause any
> harm.
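> 
> For illustration, the "off" action boils down to SCSI-3 persistent
> reservation operations; a rough sketch with sg_persist, using the device
> and keys from your own output (the agent's exact internals may differ):
> 
>   # read the current registrations
>   sg_persist --in --read-keys \
>     --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
> 
>   # preempt (remove) the victim's key 0xfe290000 using the survivor's
>   # key 0xfe290001; type 5 = write exclusive, registrants only
>   sg_persist --out --preempt-abort --param-rk=0xfe290001 \
>     --param-sark=0xfe290000 --prout-type=5 \
>     --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc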
> 
>> I also created an extremely high system load (while true; do yes >/dev/null&
>> done) and the resources switched to the other node. After I killed the
>> thousands of yes processes, the pcs service was down, but I think that is
>> not caused by fence_scsi. In the log I found "notice:
>> too_many_st_failures: No devices found in cluster to fence server1in1,
>> giving up"
>> 
>> What is wrong in my configuration? Do I need two stonith devices? Or
>> what else is wrong?
> 
> One stonith device is fine (assuming cutting off disk access is enough
> in your setup to prevent any trouble), but you can set up more if you
> want. If you prefer that the node be shut down, you could set up a device
> that does that as the preferred stonith device and keep fence_scsi as
> a backup, or any other configuration that meets your needs.
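> 
> For example, since your nodes are KVM guests, you could layer a
> power-fencing device above fence_scsi with fencing levels; a sketch (the
> fence_xvm device and its name are hypothetical, use whatever can actually
> power off your guests):
> 
>   pcs stonith create xvm_server fence_xvm pcmk_host_list="server1in1 server2in1"
>   # level 1 is tried first; level 2 is used only if level 1 fails
>   pcs stonith level add 1 server1in1 xvm_server
>   pcs stonith level add 2 server1in1 scsi_server
>   pcs stonith level add 1 server2in1 xvm_server
>   pcs stonith level add 2 server2in1 scsi_server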
> 
>> See below for the logs and other output... Many thanks in advance!
>> 
>> Regards
>> 
>> Frank
>> 
>> I created the stonith resource with this command:
>> 
>> pcs stonith create scsi_server fence_scsi pcmk_host_list="server1in1
>> server2in1" pcmk_monitor_action="metadata"
>> devices="/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc" meta
>> provides="unfencing"
>> 
>> I see two keys on my iSCSI device
>> 
>> [root at server1 ~]# sg_persist --in --no-inquiry --read-key
>> --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
>> PR generation=0x2, 2 registered reservation keys follow:
>> 0xfe290001
>> 0xfe290000
>> 
>> The current cluster state looks good to me.
>> 
>> Cluster name: nonvcs_cl
>> Last updated: Tue Jun 16 10:11:41 2015
>> Last change: Tue Jun 16 10:11:37 2015
>> Stack: corosync
>> Current DC: server2in1 (2) - partition with quorum
>> Version: 1.1.12-a14efad
>> 2 Nodes configured
>> 3 Resources configured
>> 
>> 
>> Online: [ server1in1 server2in1 ]
>> 
>> Full list of resources:
>> 
>> Resource Group: servicea_sg
>> ClusterIP2 (ocf::heartbeat:IPaddr2): Started server1in1
>> www2_mnt (ocf::heartbeat:Filesystem): Started server1in1
>> scsi_server (stonith:fence_scsi): Started server2in1
>> 
>> PCSD Status:
>> server1in1: Online
>> server2in1: Online
>> 
>> Daemon Status:
>> corosync: active/enabled
>> pacemaker: active/disabled
>> pcsd: active/enabled
>> 
>> ##################
>> messages | grep stonith
>> 
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
>> stonith_device_register: Added 'scsi_server' to the device list (1
>> active devices)
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: handle_request: Client
>> crmd.1087.b3e11b2e wants to fence (on) 'server2in1' with device '(any)'
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
>> initiate_remote_stonith_op: Initiating remote operation on for
>> server2in1: fd8b714f-6ac3-4227-9937-0d4e7b98e454 (0)
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: handle_request: Client
>> crmd.1087.b3e11b2e wants to fence (on) 'server1in1' with device '(any)'
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
>> initiate_remote_stonith_op: Initiating remote operation on for
>> server1in1: e54b60cf-87a3-403f-9061-a4cac2fa7d0d (0)
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
>> can_fence_host_with_device: scsi_server can fence (on) server2in1:
>> static-list
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice:
>> can_fence_host_with_device: scsi_server can fence (on) server2in1:
>> static-list
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: log_operation:
>> Operation 'on' [13198] (call 22 from crmd.1087) for host 'server2in1'
>> with device 'scsi_server' returned: 0 (OK)
> 
> Notice that fence_scsi is the rare fence agent that needs an "on" operation
> (aka "unfencing") when the node first joins the cluster. You can see it
> returned OK here, which means it registered server2in1's key successfully.
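> 
> If a healthy node's key ever goes missing, you can also run the agent's
> "on" action by hand to re-register it; a sketch (check man fence_scsi
> for the exact option names in your version):
> 
>   fence_scsi -n server2in1 \
>     -d /dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc -o on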
> 
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: remote_op_done:
>> Operation on of server2in1 by <no-one> for crmd.1087 at server2in1.fd8b714f: OK
>> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_callback:
>> Stonith operation 22/4:91:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
>> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_notify:
>> server2in1 was successfully unfenced by <anyone> (at the request of
>> server2in1)
>> Jun 16 10:11:37 server2 stonith-ng[1083]: notice: remote_op_done:
>> Operation on of server1in1 by <no-one> for crmd.1087 at server2in1.e54b60cf: OK
>> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_callback:
>> Stonith operation 23/3:91:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
>> Jun 16 10:11:37 server2 crmd[1087]: notice: tengine_stonith_notify:
>> server1in1 was successfully unfenced by <anyone> (at the request of
>> server2in1)
>> 
>> ##################
>> Verify is OK
>> 
>> [root at server1 ~]# crm_verify -L
>> [root at server1 ~]#
>> 
>> ##################
>> After the resume, one key is deleted
>> [root at server1 ~]# sg_persist --in --no-inquiry --read-key
>> --device=/dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc
>> PR generation=0x3, 1 registered reservation key follows:
>> 0xfe290001
>> 
>> ##################
>> Log during the "yes > /dev/null" extreme-system-load test
>> 
>> Jun 16 10:45:39 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:45:39 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:41 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1668) was formed. Members joined: 1
>> Jun 16 10:45:41 server2 corosync[959]: [QUORUM] Members[2]: 2 1
>> Jun 16 10:45:41 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:41 server2 crmd[1087]: notice: crm_update_peer_state:
>> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
>> lost)
>> Jun 16 10:45:41 server2 pacemakerd[1081]: notice: crm_update_peer_state:
>> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
>> lost)
>> 
>> ########################
>> pcs status "whilte true.. yes > /dev/null&" extreme system load
>> 
>> [root at server1 ~]# pcs status
>> Error: cluster is not currently running on this node
>> 
>> [root at server2 ~]# pcs status | grep server1in1
>> Node server1in1 (1): pending
>> server1in1: Online
>> 
>> ########################
>> Full log
>> 
>> Jun 16 10:31:04 server2 corosync[959]: [TOTEM ] A processor failed,
>> forming new configuration.
>> Jun 16 10:31:05 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1184) was formed. Members left: 1
>> Jun 16 10:31:05 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:31:05 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:31:05 server2 attrd[1085]: notice: crm_update_peer_state:
>> attrd_peer_change_cb: Node server1in1[1] - state is now lost (was member)
>> Jun 16 10:31:05 server2 attrd[1085]: notice: attrd_peer_remove: Removing
>> all server1in1 attributes for attrd_peer_change_cb
>> Jun 16 10:31:05 server2 pacemakerd[1081]: notice: crm_update_peer_state:
>> pcmk_quorum_notification: Node server1in1[1] - state is now lost (was
>> member)
>> Jun 16 10:31:05 server2 crmd[1087]: notice: crm_update_peer_state:
>> pcmk_quorum_notification: Node server1in1[1] - state is now lost (was
>> member)
>> Jun 16 10:31:05 server2 crmd[1087]: warning: match_down_event: No match
>> for shutdown action on 1
>> Jun 16 10:31:05 server2 crmd[1087]: notice: peer_update_callback:
>> Stonith/shutdown of server1in1 not matched
>> Jun 16 10:31:05 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Jun 16 10:31:05 server2 crmd[1087]: warning: match_down_event: No match
>> for shutdown action on 1
>> Jun 16 10:31:05 server2 crmd[1087]: notice: peer_update_callback:
>> Stonith/shutdown of server1in1 not matched
>> Jun 16 10:31:06 server2 pengine[1086]: notice: unpack_config: On loss of
>> CCM Quorum: Ignore
>> Jun 16 10:31:06 server2 pengine[1086]: warning: pe_fence_node: Node
>> server1in1 will be fenced because the node is no longer part of the cluster
>> Jun 16 10:31:06 server2 pengine[1086]: warning: determine_online_status:
>> Node server1in1 is unclean
>> Jun 16 10:31:06 server2 pengine[1086]: warning: custom_action: Action
>> ClusterIP2_stop_0 on server1in1 is unrunnable (offline)
>> Jun 16 10:31:06 server2 pengine[1086]: warning: custom_action: Action
>> www2_mnt_stop_0 on server1in1 is unrunnable (offline)
>> Jun 16 10:31:06 server2 pengine[1086]: warning: stage6: Scheduling Node
>> server1in1 for STONITH
>> Jun 16 10:31:06 server2 pengine[1086]: notice: LogActions: Move
>> ClusterIP2 (Started server1in1 -> server2in1)
>> Jun 16 10:31:06 server2 pengine[1086]: notice: LogActions: Move
>> www2_mnt (Started server1in1 -> server2in1)
>> Jun 16 10:31:06 server2 pengine[1086]: warning: process_pe_message:
>> Calculated Transition 93: /var/lib/pacemaker/pengine/pe-warn-53.bz2
>> Jun 16 10:31:06 server2 crmd[1087]: notice: te_fence_node: Executing
>> reboot fencing operation (20) on server1in1 (timeout=60000)
>> Jun 16 10:31:06 server2 stonith-ng[1083]: notice: handle_request: Client
>> crmd.1087.b3e11b2e wants to fence (reboot) 'server1in1' with device '(any)'
>> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
>> initiate_remote_stonith_op: Initiating remote operation reboot for
>> server1in1: 910d86d6-a53c-4d14-8b66-3e8ef2043bbf (0)
>> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
>> can_fence_host_with_device: scsi_server can fence (reboot) server1in1:
>> static-list
>> Jun 16 10:31:06 server2 stonith-ng[1083]: notice:
>> can_fence_host_with_device: scsi_server can fence (reboot) server1in1:
>> static-list
>> Jun 16 10:31:06 server2 stonith-ng[1083]: warning:
>> stonith_device_execute: Agent 'fence_scsi' does not advertise support
>> for 'reboot', performing 'off' action instead
>> Jun 16 10:31:07 server2 stonith-ng[1083]: notice: log_operation:
>> Operation 'reboot' [13384] (call 24 from crmd.1087) for host
>> 'server1in1' with device 'scsi_server' returned: 0 (OK)
> 
> Here you can see that server1in1 was fenced, successfully.
> 
> Pacemaker still refers to it as a "reboot", but each fence agent
> can implement that however it chooses. You can see just above this
> that fence_scsi doesn't support "reboot" actions at all, so it
> performs an "off" instead, which in its case unregisters the key.
> 
>> Jun 16 10:31:07 server2 stonith-ng[1083]: notice: remote_op_done:
>> Operation reboot of server1in1 by <no-one> for
>> crmd.1087 at server2in1.910d86d6: OK
>> Jun 16 10:31:07 server2 crmd[1087]: notice: tengine_stonith_callback:
>> Stonith operation 24/20:93:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: OK (0)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: tengine_stonith_notify: Peer
>> server1in1 was terminated (reboot) by <anyone> for server2in1: OK
>> (ref=910d86d6-a53c-4d14-8b66-3e8ef2043bbf) by client crmd.1087
>> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 8: start ClusterIP2_start_0 on server2in1 (local)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: abort_transition_graph:
>> Transition aborted by deletion of lrm[@id='1']: Resource state removal
>> (cib=0.320.17, source=te_update_diff:429,
>> path=/cib/status/node_state[@id='1']/lrm[@id='1'], 0)
>> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO: Adding inet
>> address 192.168.122.112/32 with broadcast address 192.168.122.255 to
>> device 122er
>> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO: Bringing
>> device 122er up
>> Jun 16 10:31:07 server2 IPaddr2(ClusterIP2)[13411]: INFO:
>> /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
>> /var/run/resource-agents/send_arp-192.168.122.112 122er 192.168.122.112
>> auto not_used not_used
>> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
>> ClusterIP2_start_0: ok (node=server2in1, call=118, rc=0, cib-update=601,
>> confirmed=true)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: run_graph: Transition 93
>> (Complete=9, Pending=0, Fired=0, Skipped=4, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-warn-53.bz2): Stopped
>> Jun 16 10:31:07 server2 pengine[1086]: notice: unpack_config: On loss of
>> CCM Quorum: Ignore
>> Jun 16 10:31:07 server2 pengine[1086]: notice: LogActions: Start
>> www2_mnt (server2in1)
>> Jun 16 10:31:07 server2 pengine[1086]: notice: process_pe_message:
>> Calculated Transition 94: /var/lib/pacemaker/pengine/pe-input-280.bz2
>> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 9: monitor ClusterIP2_monitor_30000 on server2in1 (local)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 10: start www2_mnt_start_0 on server2in1 (local)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
>> ClusterIP2_monitor_30000: ok (node=server2in1, call=119, rc=0,
>> cib-update=603, confirmed=false)
>> Jun 16 10:31:07 server2 Filesystem(www2_mnt)[13490]: INFO: Running start
>> for /dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc-part1 on
>> /var/www2
>> Jun 16 10:31:07 server2 kernel: EXT4-fs (sda1): recovery complete
>> Jun 16 10:31:07 server2 kernel: EXT4-fs (sda1): mounted filesystem with
>> ordered data mode. Opts: (null)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
>> www2_mnt_start_0: ok (node=server2in1, call=120, rc=0, cib-update=604,
>> confirmed=true)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 11: monitor www2_mnt_monitor_20000 on server2in1 (local)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: process_lrm_event: Operation
>> www2_mnt_monitor_20000: ok (node=server2in1, call=121, rc=0,
>> cib-update=605, confirmed=false)
>> Jun 16 10:31:07 server2 crmd[1087]: notice: run_graph: Transition 94
>> (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-280.bz2): Complete
>> Jun 16 10:31:07 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Jun 16 10:31:55 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1188) was formed. Members joined: 1
>> Jun 16 10:31:55 server2 attrd[1085]: notice: crm_update_peer_state:
>> attrd_peer_change_cb: Node server1in1[1] - state is now member (was lost)
>> Jun 16 10:31:55 server2 crmd[1087]: error: pcmk_cpg_membership: Node
>> server1in1[1] appears to be online even though we think it is dead
> 
> Here server1in1 rejoins the cluster. Without seeing server1in1's logs
> it's hard to say what happened here, but this probably shouldn't happen
> without manual intervention by you.
> 
>> Jun 16 10:31:55 server2 crmd[1087]: notice: crm_update_peer_state:
>> pcmk_cpg_membership: Node server1in1[1] - state is now member (was lost)
>> Jun 16 10:31:55 server2 corosync[959]: [QUORUM] Members[2]: 2 1
>> Jun 16 10:31:55 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:31:55 server2 pacemakerd[1081]: notice: crm_update_peer_state:
>> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
>> lost)
>> Jun 16 10:31:55 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_IDLE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
>> origin=do_election_count_vote ]
>> Jun 16 10:31:56 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>> Jun 16 10:31:56 server2 crmd[1087]: warning: do_log: FSA: Input
>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>> Jun 16 10:31:58 server2 pengine[1086]: notice: unpack_config: On loss of
>> CCM Quorum: Ignore
>> Jun 16 10:31:58 server2 pengine[1086]: error: native_create_actions:
>> Resource ClusterIP2 (ocf::IPaddr2) is active on 2 nodes attempting recovery
> 
> Here you can see that the IP address resource did not die when the
> node was fenced, because it does not depend on the shared storage.
> If you plan on keeping just fence_scsi, make the IP address depend on
> the shared filesystem resource (via a group or ordering constraint)
> so it won't start unless the filesystem is active.
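> 
> A sketch of the constraint approach, with your resource names (note that
> a group already implies ordering and colocation among its members in
> their listed order, so any explicit constraint should agree with that
> order):
> 
>   pcs constraint order www2_mnt then ClusterIP2
>   pcs constraint colocation add ClusterIP2 with www2_mnt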
> 
>> Jun 16 10:31:58 server2 pengine[1086]: warning: native_create_actions:
>> See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>> information.
>> Jun 16 10:31:58 server2 pengine[1086]: error: native_create_actions:
>> Resource www2_mnt (ocf::Filesystem) is active on 2 nodes attempting recovery
>> Jun 16 10:31:58 server2 pengine[1086]: warning: native_create_actions:
>> See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>> information.
>> Jun 16 10:31:58 server2 pengine[1086]: notice: LogActions: Restart
>> ClusterIP2 (Started server1in1)
>> Jun 16 10:31:58 server2 pengine[1086]: notice: LogActions: Restart
>> www2_mnt (Started server1in1)
>> Jun 16 10:31:58 server2 pengine[1086]: error: process_pe_message:
>> Calculated Transition 95: /var/lib/pacemaker/pengine/pe-error-3.bz2
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 16: stop www2_mnt_stop_0 on server2in1 (local)
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 15: stop www2_mnt_stop_0 on server1in1
>> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: Running stop
>> for /dev/disk/by-id/wwn-0x6001405516563e3d75b5d3cceda0a1dc-part1 on
>> /var/www2
>> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: Trying to
>> unmount /var/www2
>> Jun 16 10:31:58 server2 Filesystem(www2_mnt)[13787]: INFO: unmounted
>> /var/www2 successfully
>> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
>> www2_mnt_stop_0: ok (node=server2in1, call=123, rc=0, cib-update=637,
>> confirmed=true)
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 13: stop ClusterIP2_stop_0 on server2in1 (local)
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 12: stop ClusterIP2_stop_0 on server1in1
>> Jun 16 10:31:58 server2 IPaddr2(ClusterIP2)[13866]: INFO: IP status =
>> ok, IP_CIP=
>> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
>> ClusterIP2_stop_0: ok (node=server2in1, call=125, rc=0, cib-update=638,
>> confirmed=true)
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 14: start ClusterIP2_start_0 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 2: monitor ClusterIP2_monitor_30000 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 17: start www2_mnt_start_0 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 1: monitor www2_mnt_monitor_20000 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: run_graph: Transition 95
>> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
>> Jun 16 10:31:58 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>> ^C
>> [root at server2 ~]# tail -f /var/log/messages
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 13: stop ClusterIP2_stop_0 on server2in1 (local)
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 12: stop ClusterIP2_stop_0 on server1in1
>> Jun 16 10:31:58 server2 IPaddr2(ClusterIP2)[13866]: INFO: IP status =
>> ok, IP_CIP=
>> Jun 16 10:31:58 server2 crmd[1087]: notice: process_lrm_event: Operation
>> ClusterIP2_stop_0: ok (node=server2in1, call=125, rc=0, cib-update=638,
>> confirmed=true)
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 14: start ClusterIP2_start_0 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 2: monitor ClusterIP2_monitor_30000 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 17: start www2_mnt_start_0 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: te_rsc_command: Initiating
>> action 1: monitor www2_mnt_monitor_20000 on server1in1
>> Jun 16 10:31:58 server2 crmd[1087]: notice: run_graph: Transition 95
>> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
>> Jun 16 10:31:58 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>> 
>> 
>> #####################################################
>> messages during the "yes > /dev/null &" load test
>> 
>> Jun 16 10:44:37 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:44:49 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1612) was formed. Members
>> Jun 16 10:44:49 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:44:49 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:00 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1620) was formed. Members
>> Jun 16 10:45:00 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:45:00 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:12 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1628) was formed. Members
>> Jun 16 10:45:12 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:45:12 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:25 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1636) was formed. Members
>> Jun 16 10:45:25 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:45:25 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:30 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1644) was formed. Members
>> Jun 16 10:45:30 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:45:30 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:35 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1652) was formed. Members
>> Jun 16 10:45:35 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:45:35 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:39 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1660) was formed. Members
>> Jun 16 10:45:39 server2 corosync[959]: [QUORUM] Members[1]: 2
>> Jun 16 10:45:39 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:41 server2 corosync[959]: [TOTEM ] A new membership
>> (192.168.200.131:1668) was formed. Members joined: 1
>> Jun 16 10:45:41 server2 corosync[959]: [QUORUM] Members[2]: 2 1
>> Jun 16 10:45:41 server2 corosync[959]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Jun 16 10:45:41 server2 crmd[1087]: notice: crm_update_peer_state:
>> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
>> lost)
>> Jun 16 10:45:41 server2 pacemakerd[1081]: notice: crm_update_peer_state:
>> pcmk_quorum_notification: Node server1in1[1] - state is now member (was
>> lost)
>> Jun 16 10:55:00 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_TIMER_POPPED origin=crm_timer_popped ]
>> Jun 16 10:55:00 server2 pengine[1086]: notice: unpack_config: On loss of
>> CCM Quorum: Ignore
>> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
>> ClusterIP2_monitor_0 on server1in1 is unrunnable (pending)
>> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
>> www2_mnt_monitor_0 on server1in1 is unrunnable (pending)
>> Jun 16 10:55:00 server2 pengine[1086]: warning: custom_action: Action
>> scsi_server_monitor_0 on server1in1 is unrunnable (pending)
>> Jun 16 10:55:00 server2 pengine[1086]: notice: trigger_unfencing:
>> Unfencing server1in1: node discovery
>> Jun 16 10:55:00 server2 pengine[1086]: notice: process_pe_message:
>> Calculated Transition 101: /var/lib/pacemaker/pengine/pe-input-284.bz2
>> Jun 16 10:55:00 server2 crmd[1087]: notice: te_fence_node: Executing on
>> fencing operation (4) on server1in1 (timeout=60000)
>> Jun 16 10:55:00 server2 stonith-ng[1083]: notice: handle_request: Client
>> crmd.1087.b3e11b2e wants to fence (on) 'server1in1' with device '(any)'
>> Jun 16 10:55:00 server2 stonith-ng[1083]: notice:
>> initiate_remote_stonith_op: Initiating remote operation on for
>> server1in1: 3b0b3967-6f33-4b68-9f4d-246b69e0370a (0)
>> Jun 16 10:55:00 server2 stonith-ng[1083]: notice: stonith_choose_peer:
>> Couldn't find anyone to fence server1in1 with <any>
>> Jun 16 10:55:00 server2 stonith-ng[1083]: error: remote_op_done:
>> Operation on of server1in1 by <no-one> for
>> crmd.1087 at server2in1.3b0b3967: No such device
>> Jun 16 10:55:00 server2 crmd[1087]: notice: tengine_stonith_callback:
>> Stonith operation 27/4:101:0:ee8fb283-e55b-40f1-ae89-83b84c76efac: No
>> such device (-19)
>> Jun 16 10:55:00 server2 crmd[1087]: notice: tengine_stonith_callback:
>> Stonith operation 27 for server1in1 failed (No such device): aborting
>> transition.
>> Jun 16 10:55:00 server2 crmd[1087]: notice: abort_transition_graph:
>> Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
>> Jun 16 10:55:00 server2 crmd[1087]: error: tengine_stonith_notify:
>> Unfencing of server1in1 by <anyone> failed: No such device (-19)
>> Jun 16 10:55:00 server2 crmd[1087]: notice: run_graph: Transition 101
>> (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-284.bz2): Stopped
>> Jun 16 10:55:00 server2 crmd[1087]: notice: too_many_st_failures: No
>> devices found in cluster to fence server1in1, giving up
>> Jun 16 10:55:00 server2 crmd[1087]: notice: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




