[ClusterLabs] Pengine always trying to start the resource on the standby node.
Ken Gaillot
kgaillot at redhat.com
Thu Jun 7 16:49:31 EDT 2018
On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> Hi Andrei,
>
> Thanks for your quick reply. I still need help, as below:
>
> On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > 06.06.2018 04:27, Albert Weng wrote:
> > > Hi All,
> > >
> > > I have created an active/passive pacemaker cluster on RHEL 7.
> > >
> > > Here are my environment:
> > > clustera : 192.168.11.1 (passive)
> > > clusterb : 192.168.11.2 (master)
> > > clustera-ilo4 : 192.168.11.10
> > > clusterb-ilo4 : 192.168.11.11
> > >
> > > cluster resource status :
> > > cluster_fs started on clusterb
> > > cluster_vip started on clusterb
> > > cluster_sid started on clusterb
> > > cluster_listnr started on clusterb
> > >
> > > Both cluster nodes are in online status.
> > >
> > > I found my corosync.log contains many records like the ones below:
> > >
> > > clustera pengine: info: determine_online_status_fencing: Node clusterb is active
> > > clustera pengine: info: determine_online_status: Node clusterb is online
> > > clustera pengine: info: determine_online_status_fencing: Node clustera is active
> > > clustera pengine: info: determine_online_status: Node clustera is online
> > >
> > > *clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> > >
> >
> > Pacemaker does not have a concept of a "passive" or "master" node - it
> > is up to you to decide placement when you configure resources. By
> > default pacemaker will attempt to spread resources across all eligible
> > nodes. You can influence node selection by using constraints. See
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > for details.
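> >
> > For example, a rough sketch with pcs (untested; it assumes your resource
> > group is named "cluster", as your logs suggest) that makes the group
> > prefer one node while still allowing failover:
> >
> >     # prefer clusterb with a finite score so failover to clustera stays possible
> >     pcs constraint location cluster prefers clusterb=100
> >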
> >
> > But in any case - all your resources MUST be capable of running on both
> > nodes, otherwise the cluster makes no sense. If one resource A depends
> > on something that another resource B provides and can be started only
> > together with resource B (and after it is ready), you must tell
> > pacemaker that by using resource colocation and ordering constraints.
> > See the same document for details.
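> >
> > As a rough illustration (not taken from your configuration): a resource
> > group like your "cluster" group already implies colocation and ordering
> > among its members, but for standalone resources the pcs equivalents
> > would look something like this:
> >
> >     # keep the listener with the database, and start it only after the database
> >     pcs constraint colocation add cluster_listnr with cluster_sid INFINITY
> >     pcs constraint order cluster_sid then cluster_listnr
> >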
> >
> > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: group_print: Resource Group: cluster
> > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb
> > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb
> > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb
> > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb
> > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > >
> > >
> > > *clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > > *=> Question: have too many failed attempts forbidden the resource from starting on clustera?*
> > >
> >
> > Yes.
>
> How do I find out the root cause of the 1000000 failures? Which log will
> contain the error message?
As an aside, 1,000,000 is "infinity" to pacemaker. It could mean
1,000,000 actual failures, or a "fatal" failure that causes pacemaker
to set the fail count to infinity.
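You can check the recorded fail count directly; for example (a quick
sketch, assuming pcs as shipped with RHEL 7):

    # show fail counts for cluster_sid on each node
    pcs resource failcount show cluster_sid

    # or cluster-wide status output including fail counts
    crm_mon --one-shot --failcounts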
The most recent failure of each resource will be shown in the status
display (crm_mon, pcs status, etc.). They will have a basic exit code
(which you can use to distinguish a timeout from an error received from
the agent), and if the agent provided one, an "exit-reason". That's the
first place to look.
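For example, assuming the usual tools on RHEL 7:

    # one-shot status with inactive resources and fail counts;
    # failed operations (with exit code and exit-reason) are listed near the end
    crm_mon -1rf

    # roughly equivalent via pcs
    pcs status --full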
Failures will remain in the status display, and affect the placement of
resources, until one of two things happens: you manually clean up the
failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if
you configured a failure-timeout for the resource, that much time has
passed with no more failures.
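For example (a sketch; the resource name is taken from your logs, and the
timeout value is only an illustration):

    # clear the failure so clustera becomes eligible again
    pcs resource cleanup cluster_sid
    # or, with the lower-level tool:
    crm_resource --cleanup --resource cluster_sid

    # optionally expire failures automatically after 10 minutes
    pcs resource update cluster_sid meta failure-timeout=10min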
For deeper investigation, check the system log (wherever it's kept on
your distro). You can use the timestamp from the failure in the status
to know where to look.
For even more detail, you can look at pacemaker's detail log (the one
you posted excerpts from). This will have additional messages beyond
the system log, but they are harder to follow and more intended for
developers and advanced troubleshooting.
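For example (log paths vary by distro; these are the usual RHEL 7
locations):

    # system log around the time of the failed start
    grep cluster_sid /var/log/messages

    # pacemaker's detail log, the one your excerpts came from
    grep cluster_sid /var/log/cluster/corosync.log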
>
> > > A couple of days ago, clusterb was fenced (STONITH) for an unknown
> > > reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> > > successfully; "cluster_sid" and "cluster_listnr" went to "Stopped"
> > > status. As in the messages below, is this related to "op start for
> > > cluster_sid on clustera..."?
> > >
> >
> > Yes. Node clustera is now marked as being incapable of running the
> > resource, so if node clusterb fails, the resource cannot be started
> > anywhere.
> >
> >
>
> How can I fix it? I need some hints for troubleshooting.
>
> > > clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
> > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: group_print: Resource Group: cluster
> > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb (UNCLEAN)
> > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > > clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)
> > > clustera pengine: info: rsc_merge_weights: cluster_fs: Rolling back scores from cluster_sid
> > > clustera pengine: info: rsc_merge_weights: cluster_vip: Rolling back scores from cluster_sid
> > > clustera pengine: info: rsc_merge_weights: cluster_sid: Rolling back scores from cluster_listnr
> > > clustera pengine: info: native_color: Resource cluster_sid cannot run anywhere
> > > clustera pengine: info: native_color: Resource cluster_listnr cannot run anywhere
> > > clustera pengine: warning: custom_action: Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
> > > clustera pengine: warning: custom_action: Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
> > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: stage6: Scheduling Node clusterb for STONITH
> > > clustera pengine: info: native_stop_constraints: cluster_fs_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_vip_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_sid_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: LogActions: Leave ipmi-fence-db01 (Started clustera)
> > > clustera pengine: info: LogActions: Leave ipmi-fence-db02 (Started clustera)
> > > clustera pengine: notice: LogActions: Move cluster_fs (Started clusterb -> clustera)
> > > clustera pengine: notice: LogActions: Move cluster_vip (Started clusterb -> clustera)
> > > clustera pengine: notice: LogActions: Stop cluster_sid (clusterb)
> > > clustera pengine: notice: LogActions: Stop cluster_listnr (clusterb)
> > > clustera pengine: warning: process_pe_message: Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > clustera crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > clustera crmd: info: do_te_invoke: Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > clustera crmd: notice: te_fence_node: Executing reboot fencing operation (23) on clusterb (timeout=60000)
> > >
> > >
> > > Thanks ~~~~
Ken Gaillot <kgaillot at redhat.com>