[ClusterLabs] Pengine always trying to start the resource on the standby node.
Ken Gaillot
kgaillot at redhat.com
Thu Jun 7 16:49:31 EDT 2018
On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> Hi Andrei,
>
> Thanks for your quick reply. I still need help, as below:
>
> On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > 06.06.2018 04:27, Albert Weng wrote:
> > > Hi All,
> > >
> > > I have created an active/passive pacemaker cluster on RHEL 7.
> > >
> > > Here are my environment:
> > > clustera : 192.168.11.1 (passive)
> > > clusterb : 192.168.11.2 (master)
> > > clustera-ilo4 : 192.168.11.10
> > > clusterb-ilo4 : 192.168.11.11
> > >
> > > cluster resource status :
> > > cluster_fs started on clusterb
> > > cluster_vip started on clusterb
> > > cluster_sid started on clusterb
> > > cluster_listnr started on clusterb
> > >
> > > Both cluster nodes are in online status.
> > >
> > > I found my corosync.log contains many records like the ones below:
> > >
> > > clustera pengine: info: determine_online_status_fencing: Node clusterb is active
> > > clustera pengine: info: determine_online_status: Node clusterb is online
> > > clustera pengine: info: determine_online_status_fencing: Node clustera is active
> > > clustera pengine: info: determine_online_status: Node clustera is online
> > >
> > > *clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> > >
> >
> > Pacemaker does not have a concept of a "passive" or "master" node - it
> > is up to you to decide placement when you configure resources. By
> > default pacemaker will attempt to spread resources across all eligible
> > nodes. You can influence node selection by using constraints. See
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > for details.
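> >
> > For example, a rough sketch with pcs (untested; it assumes your resource
> > group is named "cluster", as your logs suggest) that makes the group
> > prefer one node while still allowing failover:
> >
> >     # prefer clusterb with a finite score so failover to clustera stays possible
> >     pcs constraint location cluster prefers clusterb=100
> >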
> >
> > But in any case - all your resources MUST be capable of running on both
> > nodes, otherwise the cluster makes no sense. If one resource A depends
> > on something that another resource B provides and can be started only
> > together with resource B (and after it is ready), you must tell
> > pacemaker that by using resource colocation and ordering constraints.
> > See the same document for details.
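> >
> > As a rough illustration (not taken from your configuration): a resource
> > group like your "cluster" group already implies colocation and ordering
> > among its members, but for standalone resources the pcs equivalents
> > would look something like this:
> >
> >     # keep the listener with the database, and start it only after the database
> >     pcs constraint colocation add cluster_listnr with cluster_sid INFINITY
> >     pcs constraint order cluster_sid then cluster_listnr
> >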
> >
> > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: group_print: Resource Group: cluster
> > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb
> > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb
> > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb
> > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb
> > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > >
> > >
> > > *clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > > *=> Question: have too many failed attempts forbidden the resource from starting on clustera?*
> > >
> >
> > Yes.
>
> How do I find out the root cause of the 1000000 failures? Which log will
> contain the error message?
As an aside, 1,000,000 is "infinity" to pacemaker. It could mean
1,000,000 actual failures, or a "fatal" failure that causes pacemaker
to set the fail count to infinity.
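You can check the recorded fail count directly; for example (a quick
sketch, assuming pcs as shipped with RHEL 7):

    # show fail counts for cluster_sid on each node
    pcs resource failcount show cluster_sid

    # or cluster-wide status output including fail counts
    crm_mon --one-shot --failcounts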
The most recent failure of each resource will be shown in the status
display (crm_mon, pcs status, etc.). They will have a basic exit code
(which you can use to distinguish a timeout from an error received from
the agent), and if the agent provided one, an "exit-reason". That's the
first place to look.
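For example, assuming the usual tools on RHEL 7:

    # one-shot status with inactive resources and fail counts;
    # failed operations (with exit code and exit-reason) are listed near the end
    crm_mon -1rf

    # roughly equivalent via pcs
    pcs status --full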
Failures will remain in the status display, and affect the placement of
resources, until one of two things happens: you manually clean up the
failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if
you configured a failure-timeout for the resource, that much time has
passed with no more failures.
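For example (a sketch; the resource name is taken from your logs, and the
timeout value is only an illustration):

    # clear the failure so clustera becomes eligible again
    pcs resource cleanup cluster_sid
    # or, with the lower-level tool:
    crm_resource --cleanup --resource cluster_sid

    # optionally expire failures automatically after 10 minutes
    pcs resource update cluster_sid meta failure-timeout=10min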
For deeper investigation, check the system log (wherever it's kept on
your distro). You can use the timestamp from the failure in the status
to know where to look.
For even more detail, you can look at pacemaker's detail log (the one
you posted excerpts from). This will have additional messages beyond
the system log, but they are harder to follow and more intended for
developers and advanced troubleshooting.
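For example (log paths vary by distro; these are the usual RHEL 7
locations):

    # system log around the time of the failed start
    grep cluster_sid /var/log/messages

    # pacemaker's detail log, the one your excerpts came from
    grep cluster_sid /var/log/cluster/corosync.log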
>
> > > A couple of days ago, clusterb was fenced (STONITH) for an unknown
> > > reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> > > successfully; "cluster_sid" and "cluster_listnr" went to "Stopped"
> > > status. As in the messages below, is this related to "op start for
> > > cluster_sid on clustera..."?
> > >
> >
> > Yes. Node clustera is now marked as being incapable of running the
> > resource, so if node clusterb fails, the resource cannot be started
> > anywhere.
> >
> >
>
> How can I fix it? I need some hints for troubleshooting.
>
> > > clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
> > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: group_print: Resource Group: cluster
> > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb (UNCLEAN)
> > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > > clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)
> > > clustera pengine: info: rsc_merge_weights: cluster_fs: Rolling back scores from cluster_sid
> > > clustera pengine: info: rsc_merge_weights: cluster_vip: Rolling back scores from cluster_sid
> > > clustera pengine: info: rsc_merge_weights: cluster_sid: Rolling back scores from cluster_listnr
> > > clustera pengine: info: native_color: Resource cluster_sid cannot run anywhere
> > > clustera pengine: info: native_color: Resource cluster_listnr cannot run anywhere
> > > clustera pengine: warning: custom_action: Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
> > > clustera pengine: warning: custom_action: Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
> > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: stage6: Scheduling Node clusterb for STONITH
> > > clustera pengine: info: native_stop_constraints: cluster_fs_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_vip_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_sid_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: LogActions: Leave ipmi-fence-db01 (Started clustera)
> > > clustera pengine: info: LogActions: Leave ipmi-fence-db02 (Started clustera)
> > > clustera pengine: notice: LogActions: Move cluster_fs (Started clusterb -> clustera)
> > > clustera pengine: notice: LogActions: Move cluster_vip (Started clusterb -> clustera)
> > > clustera pengine: notice: LogActions: Stop cluster_sid (clusterb)
> > > clustera pengine: notice: LogActions: Stop cluster_listnr (clusterb)
> > > clustera pengine: warning: process_pe_message: Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > clustera crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > clustera crmd: info: do_te_invoke: Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > clustera crmd: notice: te_fence_node: Executing reboot fencing operation (23) on clusterb (timeout=60000)
> > >
> > >
> > > Thanks ~~~~
Ken Gaillot <kgaillot at redhat.com>