[ClusterLabs] Fence agent ends up stopped with no clear reason why
Ken Gaillot
kgaillot at redhat.com
Wed Aug 1 17:03:54 EDT 2018
On Wed, 2018-08-01 at 13:43 -0600, Casey Allen Shobe wrote:
> Here is the corosync.log for the first host in the list at the
> indicated time. Not sure what it's doing or why - all cluster nodes
> were up and running the entire time...no fencing events.
>
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: --- 0.700.4 2
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: +++ 0.700.5 (null)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: + /cib: @num_updates=5
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info: cib_perform_op: + /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='vmware_fence']/lrm_rsc_op[@id='vmware_fence_last_0']: @operation_key=vmware_fence_start_0, @operation=start, @transition-key=42:5084:0:68fc0c5a-8a09-4d53-90d5-c1a237542060, @transition-magic=4:1;42:5084:0:68fc0c5a-8a09-4d53-90d5-c1a237542060, @call-id=42, @rc-code=1, @op-status=4, @exec-time=1510
This says that the start operation failed with exit code 1 (a generic error), and
Pacemaker's status for the operation was 4 (also an error).
For fence devices, a start first registers the device with stonithd
(which should never fail). There should be a log message from stonithd
like "Added 'vmware_fence' to the device list". The cluster then does
an initial monitor. That is most likely what failed.
If you're lucky, the fence agent logged some detail about why that
monitor failed, or has a debug option to do so.
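If not, you can usually reproduce the initial monitor by hand on the
node itself. A minimal sketch -- the vCenter address and credentials
below are placeholders, so substitute whatever parameters your
vmware_fence resource is actually configured with:

    # Run the same monitor action the cluster runs, with verbose output
    fence_vmware_rest --ip=vcenter.example.com --username=fence-user \
        --password=secret --ssl-insecure --action=monitor --verbose

    # Or enable the agent's own debug output within the cluster
    pcs stonith update vmware_fence verbose=1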
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info: cib_perform_op: + /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='vmware_fence']/lrm_rsc_op[@id='vmware_fence_last_failure_0']: @operation_key=vmware_fence_start_0, @operation=start, @transition-key=42:5084:0:68fc0c5a-8a09-4d53-90d5-c1a237542060, @transition-magic=4:1;42:5084:0:68fc0c5a-8a09-4d53-90d5-c1a237542060, @call-id=42, @interval=0, @last-rc-change=1532987187, @exec-time=1510, @op-digest=8653f310a5c96a63ab95a
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_process_request: Completed cib_modify operation for
> section status: OK (rc=0, origin=q-gp2-dbpg57-3/crmd/32,
> version=0.700.5)
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: notice:
> abort_transition_graph: Transition aborted by
> vmware_fence_start_0 'modify' on q-gp2-dbpg57-3: Event failed
> (magic=4:1;42:5084:0:68fc0c5a-8a09-4d53-90d5-c1a237542060,
> cib=0.700.5, source=match_graph_event:381, 0)
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: info:
> abort_transition_graph: Transition aborted by
> vmware_fence_start_0 'modify' on q-gp2-dbpg57-3: Event failed
> (magic=4:1;42:5084:0:68fc0c5a-8a09-4d53-90d5-c1a237542060,
> cib=0.700.5, source=match_graph_event:381, 0)
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: notice: run_graph: Transition 5084 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-729.bz2): Complete
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: info:
> do_state_transition: State transition S_TRANSITION_ENGINE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_process_request: Forwarding cib_modify operation for
> section status to master (origin=local/attrd/46)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status_fencing: Node q-gp2-dbpg57-1 is active
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status: Node q-gp2-dbpg57-1 is online
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status_fencing: Node q-gp2-dbpg57-3 is active
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status: Node q-gp2-dbpg57-3 is online
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status_fencing: Node q-gp2-dbpg57-2 is active
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status: Node q-gp2-dbpg57-2 is online
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-master-vip active on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-master-vip active on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:0 active in master mode on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:0 active in master mode on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:1 active on q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:1 active on q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning:
> unpack_rsc_op_failure: Processing failed op start for
> vmware_fence on q-gp2-dbpg57-3: unknown error (1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning:
> unpack_rsc_op_failure: Processing failed op start for
> vmware_fence on q-gp2-dbpg57-3: unknown error (1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning:
> unpack_rsc_op_failure: Processing failed op monitor for
> vmware_fence on q-gp2-dbpg57-2: unknown error (1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:2 active on q-gp2-dbpg57-2
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:2 active on q-gp2-dbpg57-2
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: native_print: postgresql-master-vip (ocf::heartbeat:IPaddr2): Started q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: clone_print: Master/Slave Set: postgresql-ha [postgresql-10-main]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> short_print: Masters: [ q-gp2-dbpg57-1 ]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> short_print: Slaves: [ q-gp2-dbpg57-2 q-gp2-dbpg57-3 ]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: native_print: vmware_fence (stonith:fence_vmware_rest): FAILED q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> get_failcount_full: vmware_fence has failed 5 times on q-gp2-dbpg57-2
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning: common_apply_stickiness: Forcing vmware_fence away from q-gp2-dbpg57-2 after 5 failures (max=5)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> get_failcount_full: vmware_fence has failed 1 times on q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: common_apply_stickiness: vmware_fence can fail 4 more times on q-gp2-dbpg57-3 before being forced off
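Those two messages are the heart of your question: every failure bumps
a per-node fail count that is compared against migration-threshold
(max=5 here), and a failed start sets the count to INFINITY by
default, which bans the device from that node until the failures are
cleared. You can inspect the counts directly:

    pcs resource failcount show vmware_fence   # per-node fail counts
    crm_mon --failcounts                       # same counts in the status output

That is also why the cleanup you mentioned brings the device back --
it zeroes these counters and re-probes the resource.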
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: master_color: Promoting postgresql-10-main:0 (Master q-gp2-dbpg57-1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> master_color: postgresql-ha: Promoted 1 instances of a possible
> 1 to master
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> RecurringOp: Start recurring monitor (60s) for vmware_fence
> on q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-master-vip (Started q-gp2-dbpg57-1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-10-main:0 (Master q-gp2-dbpg57-1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-10-main:1 (Slave q-gp2-dbpg57-3)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-10-main:2 (Slave q-gp2-dbpg57-2)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: notice:
> LogActions: Recover vmware_fence (Started q-gp2-dbpg57-3)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: --- 0.700.5 2
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: +++ 0.700.6 (null)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: + /cib: @num_updates=6
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info: cib_perform_op: + /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']/nvpair[@id='status-3-fail-count-vmware_fence']: @value=INFINITY
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: notice:
> process_pe_message: Calculated Transition 5085:
> /var/lib/pacemaker/pengine/pe-input-730.bz2
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: notice: abort_transition_graph: Transition aborted by status-3-fail-count-vmware_fence, fail-count-vmware_fence=INFINITY: Transient attribute change (modify cib=0.700.6, source=abort_unless_down:329, path=/cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']/nvpair[@id='status-3-fail-count-vmware_fence'], 0)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_process_request: Completed cib_modify operation for
> section status: OK (rc=0, origin=q-gp2-dbpg57-1/attrd/46,
> version=0.700.6)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_process_request: Forwarding cib_modify operation for
> section status to master (origin=local/attrd/47)
> Jul 30 21:46:30 [3881] q-gp2-dbpg57-1 attrd: info:
> attrd_cib_callback: Update 46 for fail-count-vmware_fence: OK (0)
> Jul 30 21:46:30 [3881] q-gp2-dbpg57-1 attrd: info: attrd_cib_callback: Update 46 for fail-count-vmware_fence[q-gp2-dbpg57-2]=5: OK (0)
> Jul 30 21:46:30 [3881] q-gp2-dbpg57-1 attrd: info: attrd_cib_callback: Update 46 for fail-count-vmware_fence[q-gp2-dbpg57-3]=INFINITY: OK (0)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: --- 0.700.6 2
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: +++ 0.700.7 (null)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: + /cib: @num_updates=7
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info: cib_perform_op: + /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='vmware_fence']/lrm_rsc_op[@id='vmware_fence_last_0']: @operation_key=vmware_fence_stop_0, @operation=stop, @transition-key=4:5085:0:68fc0c5a-8a09-4d53-90d5-c1a237542060, @transition-magic=0:0;4:5085:0:68fc0c5a-8a09-4d53-90d5-c1a237542060, @call-id=43, @rc-code=0, @op-status=0, @last-run=1532987190, @last-rc-change=1532987190, @exec-time=0
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: notice: run_graph: Transition 5085 (Complete=2, Pending=0, Fired=0, Skipped=1, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-730.bz2): Stopped
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: info:
> do_state_transition: State transition S_TRANSITION_ENGINE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_process_request: Completed cib_modify operation for
> section status: OK (rc=0, origin=q-gp2-dbpg57-3/crmd/33,
> version=0.700.7)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: --- 0.700.7 2
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: Diff: +++ 0.700.8 (null)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_perform_op: + /cib: @num_updates=8
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info: cib_perform_op: + /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']/nvpair[@id='status-3-last-failure-vmware_fence']: @value=1532987190
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: info: abort_transition_graph: Transition aborted by status-3-last-failure-vmware_fence, last-failure-vmware_fence=1532987190: Transient attribute change (modify cib=0.700.8, source=abort_unless_down:329, path=/cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']/nvpair[@id='status-3-last-failure-vmware_fence'], 1)
> Jul 30 21:46:30 [3878] q-gp2-dbpg57-1 cib: info:
> cib_process_request: Completed cib_modify operation for
> section status: OK (rc=0, origin=q-gp2-dbpg57-1/attrd/47,
> version=0.700.8)
> Jul 30 21:46:30 [3881] q-gp2-dbpg57-1 attrd: info:
> attrd_cib_callback: Update 47 for last-failure-vmware_fence: OK (0)
> Jul 30 21:46:30 [3881] q-gp2-dbpg57-1 attrd: info: attrd_cib_callback: Update 47 for last-failure-vmware_fence[q-gp2-dbpg57-2]=1532448714: OK (0)
> Jul 30 21:46:30 [3881] q-gp2-dbpg57-1 attrd: info: attrd_cib_callback: Update 47 for last-failure-vmware_fence[q-gp2-dbpg57-3]=1532987190: OK (0)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status_fencing: Node q-gp2-dbpg57-1 is active
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status: Node q-gp2-dbpg57-1 is online
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status_fencing: Node q-gp2-dbpg57-3 is active
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status: Node q-gp2-dbpg57-3 is online
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status_fencing: Node q-gp2-dbpg57-2 is active
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_online_status: Node q-gp2-dbpg57-2 is online
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-master-vip active on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-master-vip active on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:0 active in master mode on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:0 active in master mode on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:1 active on q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:1 active on q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning:
> unpack_rsc_op_failure: Processing failed op start for
> vmware_fence on q-gp2-dbpg57-3: unknown error (1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning:
> unpack_rsc_op_failure: Processing failed op monitor for
> vmware_fence on q-gp2-dbpg57-2: unknown error (1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:2 active on q-gp2-dbpg57-2
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> determine_op_status: Operation monitor found resource
> postgresql-10-main:2 active on q-gp2-dbpg57-2
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: native_print: postgresql-master-vip (ocf::heartbeat:IPaddr2): Started q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: clone_print: Master/Slave Set: postgresql-ha [postgresql-10-main]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> short_print: Masters: [ q-gp2-dbpg57-1 ]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> short_print: Slaves: [ q-gp2-dbpg57-2 q-gp2-dbpg57-3 ]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: native_print: vmware_fence (stonith:fence_vmware_rest): Stopped
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> get_failcount_full: vmware_fence has failed 5 times on q-gp2-dbpg57-2
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning: common_apply_stickiness: Forcing vmware_fence away from q-gp2-dbpg57-2 after 5 failures (max=5)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: get_failcount_full: vmware_fence has failed INFINITY times on q-gp2-dbpg57-3
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: warning: common_apply_stickiness: Forcing vmware_fence away from q-gp2-dbpg57-3 after 1000000 failures (max=5)
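By this point the device is banned from both q-gp2-dbpg57-2 and
q-gp2-dbpg57-3, so q-gp2-dbpg57-1 is the only node left to try -- and
when the start fails there as well (below), the device has nowhere
left to run and stays Stopped. Besides manual cleanups, you can let
recorded failures expire so that occasional hiccups don't slowly
accumulate into a ban; the 10-minute value here is just an example:

    # Expire recorded failures automatically (example value -- tune to taste)
    pcs resource meta vmware_fence failure-timeout=10min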
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info: master_color: Promoting postgresql-10-main:0 (Master q-gp2-dbpg57-1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> master_color: postgresql-ha: Promoted 1 instances of a possible
> 1 to master
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> RecurringOp: Start recurring monitor (60s) for vmware_fence
> on q-gp2-dbpg57-1
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-master-vip (Started q-gp2-dbpg57-1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-10-main:0 (Master q-gp2-dbpg57-1)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-10-main:1 (Slave q-gp2-dbpg57-3)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: info:
> LogActions: Leave postgresql-10-main:2 (Slave q-gp2-dbpg57-2)
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: notice:
> LogActions: Start vmware_fence (q-gp2-dbpg57-1)
> Jul 30 21:46:30 [3883] q-gp2-dbpg57-1 crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Jul 30 21:46:30 [3882] q-gp2-dbpg57-1 pengine: notice:
> process_pe_message: Calculated Transition 5086:
> /var/lib/pacemaker/pengine/pe-input-731.bz2
> Jul 30 21:46:30 [3880] q-gp2-dbpg57-1 lrmd: info:
> log_execute: executing - rsc:vmware_fence action:start
> call_id:77
> Jul 30 21:46:30 [3879] q-gp2-dbpg57-1 stonith-ng: warning:
> log_action: fence_vmware_rest[5739] stderr: [ 2018-07-30 21:46:30,895
> ERROR: Unable to connect/login to fencing device ]
> Jul 30 21:46:30 [3879] q-gp2-dbpg57-1 stonith-ng: warning:
> log_action: fence_vmware_rest[5739] stderr: [ ]
> Jul 30 21:46:30 [3879] q-gp2-dbpg57-1 stonith-ng: warning:
> log_action: fence_vmware_rest[5739] stderr: [ ]
> Jul 30 21:46:30 [3879] q-gp2-dbpg57-1 stonith-ng: info:
> internal_stonith_action_execute: Attempt 2 to execute
> fence_vmware_rest (monitor). remaining timeout is 20
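And there is the real answer, in the agent's stderr: it could not log
in to the fencing device, i.e. vCenter. I'd verify credentials and
connectivity from each node outside the cluster. One quick check
against the vSphere REST API session endpoint (6.5+); hostname and
credentials are placeholders again:

    # A successful login returns a session token; anything else points at
    # credentials, DNS, firewall, or certificate problems
    curl -k -u 'fence-user:secret' -X POST \
        https://vcenter.example.com/rest/com/vmware/cis/session

If that only works intermittently, it would explain monitor failures
accruing until the device is banned everywhere.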
>
>
> > On 2018-08-01, at 1:39 PM, Casey Allen Shobe <casey.allen.shobe at icloud.com> wrote:
> >
> > Across our clusters, I see the fence agent stop working, with no
> > apparent reason. It looks like shown below. I've found that I can
> > do a `pcs resource cleanup vmware_fence` to cause it to start back
> > up again in a few seconds, but why is this happening and how can I
> > prevent it?
> >
> > vmware_fence (stonith:fence_vmware_rest): Stopped
> >
> > Failed Actions:
> > * vmware_fence_start_0 on q-gp2-dbpg57-1 'unknown error' (1):
> > call=77, status=Error, exitreason='none',
> > last-rc-change='Mon Jul 30 21:46:30 2018', queued=1ms,
> > exec=1862ms
> > * vmware_fence_start_0 on q-gp2-dbpg57-3 'unknown error' (1):
> > call=42, status=Error, exitreason='none',
> > last-rc-change='Mon Jul 30 21:46:27 2018', queued=0ms,
> > exec=1510ms
> > * vmware_fence_monitor_60000 on q-gp2-dbpg57-2 'unknown error' (1):
> > call=84, status=Error, exitreason='none',
> > last-rc-change='Tue Jul 24 16:11:42 2018', queued=0ms,
> > exec=12142ms
> >
> > Thank you,
> > --
> > Casey
--
Ken Gaillot <kgaillot at redhat.com>