[ClusterLabs] Pengine always trying to start the resource on the standby node.
Albert Weng
weng.albert at gmail.com
Wed Jun 13 05:09:06 EDT 2018
Hi All,
Thanks for reply.
Recently, I ran the following command:
(clustera) # crm_simulate --xml-file pe-warn.last
It returned the following results:
error: crm_abort: xpath_search: Triggered assert at xpath.c:153 :
xml_top != NULL
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 :
data != NULL
Configuration validation is currently disabled. It is highly encouraged
and prevents many common cluster issues.
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 :
data != NULL
error: crm_element_value: Couldn't find ignore-dtd in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 :
data != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 :
xml != NULL
error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node
!= NULL
error: write_xml_stream: Cannot write NULL to
/var/lib/pacemaker/cib/shadow.20008
Could not create '/var/lib/pacemaker/cib/shadow.20008': Success
Could anyone help me interpret these messages and explain what is going
on with my server?
Thanks a lot.
On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > Hi Andrei,
> >
> > Thanks for your quick reply. I still need help, as below:
> >
> > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidjaar at gmail.co
> > m> wrote:
> > > On 06.06.2018 04:27, Albert Weng wrote:
> > > > Hi All,
> > > >
> > > > I have created active/passive pacemaker cluster on RHEL 7.
> > > >
> > > > Here are my environment:
> > > > clustera : 192.168.11.1 (passive)
> > > > clusterb : 192.168.11.2 (master)
> > > > clustera-ilo4 : 192.168.11.10
> > > > clusterb-ilo4 : 192.168.11.11
> > > >
> > > > cluster resource status :
> > > > cluster_fs started on clusterb
> > > > cluster_vip started on clusterb
> > > > cluster_sid started on clusterb
> > > > cluster_listnr started on clusterb
> > > >
> > > > Both cluster node are online status.
> > > >
> > > > i found my corosync.log contain many records like below:
> > > >
> > > > clustera pengine: info:
> > > determine_online_status_fencing:
> > > > Node clusterb is active
> > > > clustera pengine: info: determine_online_status:
> > > Node
> > > > clusterb is online
> > > > clustera pengine: info:
> > > determine_online_status_fencing:
> > > > Node clustera is active
> > > > clustera pengine: info: determine_online_status:
> > > Node
> > > > clustera is online
> > > >
> > > > *clustera pengine: warning: unpack_rsc_op_failure:
> > > Processing
> > > > failed op start for cluster_sid on clustera: unknown error (1)*
> > > > *=> Question : Why pengine always trying to start cluster_sid on
> > > the
> > > > passive node? how to fix it? *
> > > >
> > >
> > > Pacemaker does not have a concept of a "passive" or "master" node;
> > > it is up to you to decide resource placement when you configure the
> > > cluster. By default, Pacemaker will attempt to spread resources
> > > across all eligible nodes. You can influence node selection by
> > > using constraints. See
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > > for details.
> > >
> > > But in any case, all your resources MUST be capable of running on
> > > both nodes; otherwise the cluster makes no sense. If one resource A
> > > depends on something that another resource B provides, and can only
> > > be started together with resource B (and after B is ready), you
> > > must tell Pacemaker this by using resource colocation and ordering
> > > constraints. See the same document for details.
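[Editor's note: with pcs on RHEL 7, constraints like the ones described above could look roughly as follows. The resource and node names are taken from this thread; the score is illustrative, not prescribed by the original posters.]

```shell
# Prefer running the "cluster" group on clusterb (score 100 is a soft
# preference; use INFINITY only if it must never run elsewhere).
pcs constraint location cluster prefers clusterb=100

# Keep cluster_sid on the same node as cluster_fs, and start it only
# after the filesystem is up (colocation + ordering).
pcs constraint colocation add cluster_sid with cluster_fs INFINITY
pcs constraint order start cluster_fs then start cluster_sid
```

Note that resources placed in the same group (as cluster_fs, cluster_vip, cluster_sid, and cluster_listnr are here) are already implicitly colocated and ordered, so explicit constraints like these are mainly needed for resources outside the group.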
> > >
> > > > clustera pengine: info: native_print: ipmi-fence-
> > > clustera
> > > > (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: native_print: ipmi-fence-
> > > clusterb
> > > > (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: group_print: Resource
> > > Group: cluster
> > > > clustera pengine: info: native_print:
> > > cluster_fs
> > > > (ocf::heartbeat:Filesystem): Started clusterb
> > > > clustera pengine: info: native_print:
> > > cluster_vip
> > > > (ocf::heartbeat:IPaddr2): Started clusterb
> > > > clustera pengine: info: native_print:
> > > cluster_sid
> > > > (ocf::heartbeat:oracle): Started clusterb
> > > > clustera pengine: info: native_print:
> > > > cluster_listnr (ocf::heartbeat:oralsnr): Started
> > > clusterb
> > > > clustera pengine: info: get_failcount_full:
> > > cluster_sid has
> > > > failed INFINITY times on clustera
> > > >
> > > >
> > > > *clustera pengine: warning: common_apply_stickiness:
> > > Forcing
> > > > cluster_sid away from clustera after 1000000 failures
> > > (max=1000000)*
> > > > *=> Question: too much trying result in forbid the resource start
> > > on
> > > > clustera ?*
> > > >
> > >
> > > Yes.
> >
> > How can I find the root cause of the 1000000 failures? Which log
> > will contain the error message?
>
> As an aside, 1,000,000 is "infinity" to pacemaker. It could mean
> 1,000,000 actual failures, or a "fatal" failure that causes pacemaker
> to set the fail count to infinity.
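[Editor's note: the current fail count can be inspected directly; a sketch, assuming pcs is available (the crm_failcount options shown require a reasonably recent Pacemaker 1.1 release):]

```shell
# Show the recorded fail count for cluster_sid; a value of INFINITY
# here corresponds to the 1000000 seen in the pengine logs.
pcs resource failcount show cluster_sid

# Lower-level tool shipped with Pacemaker itself:
crm_failcount --query --resource cluster_sid --node clustera
```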
>
> The most recent failure of each resource will be shown in the status
> display (crm_mon, pcs status, etc.). They will have a basic exit code
> (which you can use to distinguish a timeout from an error received from
> the agent), and if the agent provided one, an "exit-reason". That's the
> first place to look.
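[Editor's note: the status display mentioned above can be produced one-shot from the command line, for example:]

```shell
# One-shot cluster status, including recent failed actions with their
# exit codes and any "exit-reason" strings from the resource agents.
crm_mon --one-shot
pcs status --full
```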
>
> Failures will remain in the status display, and affect the placement of
> resources, until one of two things happens: you manually clean up the
> failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if
> you configured a failure-timeout for the resource, that much time has
> passed with no more failures.
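[Editor's note: concretely, the two cleanup paths described above could be run like this; the failure-timeout value is illustrative:]

```shell
# Clear the recorded failure so Pacemaker will try clustera again:
pcs resource cleanup cluster_sid
# or, using the Pacemaker tool directly:
crm_resource --cleanup --resource cluster_sid

# Optionally have failures expire automatically after 10 minutes of
# no further failures:
pcs resource update cluster_sid meta failure-timeout=600s
```

Note that cleaning up only clears the record of the failure; if the underlying cause (e.g. the Oracle instance failing to start on clustera) is not fixed, the fail count will simply climb back to INFINITY.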
>
> For deeper investigation, check the system log (wherever it's kept on
> your distro). You can use the timestamp from the failure in the status
> to know where to look.
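[Editor's note: on RHEL 7 this would typically mean searching the journal and the cluster detail log around the failure timestamp; the timestamps and log path below are placeholders/assumptions, adjust for your system:]

```shell
# System journal around the failure time shown in the status output:
journalctl --since "2018-05-21 10:00" --until "2018-05-21 10:15" | grep -i cluster_sid

# Pacemaker detail log (default location on RHEL 7 cluster installs):
grep cluster_sid /var/log/cluster/corosync.log
```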
>
> For even more detail, you can look at pacemaker's detail log (the one
> you posted excerpts from). This will have additional messages beyond
> the system log, but they are harder to follow and more intended for
> developers and advanced troubleshooting.
>
> >
> > > > Couple days ago, the clusterb has been stonith by unknown reason,
> > > but only
> > > > "cluster_fs", "cluster_vip" moved to clustera successfully, but
> > > > "cluster_sid" and "cluster_listnr" go to "STOP" status.
> > > > like below messages, is it related with "op start for cluster_sid
> > > on
> > > > clustera..." ?
> > > >
> > >
> > > Yes. Node clustera is now marked as incapable of running the
> > > resource, so if node clusterb fails, the resource cannot be started
> > > anywhere.
> > >
> > >
> >
> > How can I fix it? I need some hints for troubleshooting.
> >
> > > > clustera pengine: warning: unpack_rsc_op_failure: Processing
> > > failed op
> > > > start for cluster_sid on clustera: unknown error (1)
> > > > clustera pengine: info: native_print: ipmi-fence-
> > > clustera
> > > > (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: native_print: ipmi-fence-
> > > clusterb
> > > > (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: group_print: Resource Group:
> > > cluster
> > > > clustera pengine: info: native_print: cluster_fs
> > > > (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: native_print: cluster_vip
> > > > (ocf::heartbeat:IPaddr2): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: native_print: cluster_sid
> > > > (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: native_print:
> > > cluster_listnr
> > > > (ocf::heartbeat:oralsnr): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: get_failcount_full:
> > > cluster_sid has
> > > > failed INFINITY times on clustera
> > > > clustera pengine: warning: common_apply_stickiness:
> > > Forcing
> > > > cluster_sid away from clustera after 1000000 failures
> > > (max=1000000)
> > > > clustera pengine: info: rsc_merge_weights:
> > > cluster_fs: Rolling
> > > > back scores from cluster_sid
> > > > clustera pengine: info: rsc_merge_weights:
> > > cluster_vip: Rolling
> > > > back scores from cluster_sid
> > > > clustera pengine: info: rsc_merge_weights:
> > > cluster_sid: Rolling
> > > > back scores from cluster_listnr
> > > > clustera pengine: info: native_color: Resource
> > > cluster_sid cannot
> > > > run anywhere
> > > > clustera pengine: info: native_color: Resource
> > > cluster_listnr
> > > > cannot run anywhere
> > > > clustera pengine: warning: custom_action: Action
> > > cluster_fs_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera pengine: info: RecurringOp: Start recurring
> > > monitor
> > > > (20s) for cluster_fs on clustera
> > > > clustera pengine: warning: custom_action: Action
> > > cluster_vip_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera pengine: info: RecurringOp: Start recurring
> > > monitor
> > > > (10s) for cluster_vip on clustera
> > > > clustera pengine: warning: custom_action: Action
> > > cluster_sid_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera pengine: warning: custom_action: Action
> > > cluster_sid_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera pengine: warning: custom_action: Action
> > > cluster_listnr_stop_0
> > > > on clusterb is unrunnable (offline)
> > > > clustera pengine: warning: custom_action: Action
> > > cluster_listnr_stop_0
> > > > on clusterb is unrunnable (offline)
> > > > clustera pengine: warning: stage6: Scheduling Node clusterb
> > > for STONITH
> > > > clustera pengine: info: native_stop_constraints:
> > > > cluster_fs_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: native_stop_constraints:
> > > > cluster_vip_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: native_stop_constraints:
> > > > cluster_sid_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: native_stop_constraints:
> > > > cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: LogActions: Leave ipmi-
> > > fence-db01
> > > > (Started clustera)
> > > > clustera pengine: info: LogActions: Leave ipmi-
> > > fence-db02
> > > > (Started clustera)
> > > > clustera pengine: notice: LogActions: Move cluster_fs
> > > > (Started clusterb -> clustera)
> > > > clustera pengine: notice: LogActions: Move
> > > cluster_vip
> > > > (Started clusterb -> clustera)
> > > > clustera pengine: notice: LogActions: Stop
> > > cluster_sid
> > > > (clusterb)
> > > > clustera pengine: notice: LogActions: Stop
> > > cluster_listnr
> > > > (clusterb)
> > > > clustera pengine: warning: process_pe_message: Calculated
> > > > Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > clustera crmd: info: do_state_transition: State
> > > transition
> > > > S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> > > > cause=C_IPC_MESSAGE origin=handle_response ]
> > > > clustera crmd: info: do_te_invoke: Processing graph
> > > 26821
> > > > (ref=pe_calc-dc-1526868653-26882) derived from
> > > > /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > clustera crmd: notice: te_fence_node: Executing reboot
> > > fencing
> > > > operation (23) on clusterb (timeout=60000)
> > > >
> > > >
> > > > Thanks ~~~~
>
> Ken Gaillot <kgaillot at redhat.com>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
--
Kind regards,
Albert Weng