[ClusterLabs] Pengine always trying to start the resource on the standby node.

Albert Weng weng.albert at gmail.com
Wed Jun 13 05:09:06 EDT 2018


Hi All,

Thanks for reply.

Recently, I ran the following command:
(clustera) # crm_simulate --xml-file pe-warn.last

It returned the following results:
   error: crm_abort:    xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
   error: crm_element_value:    Couldn't find validate-with in NULL
   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
   error: crm_element_value:    Couldn't find validate-with in NULL
   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   error: crm_element_value:    Couldn't find ignore-dtd in NULL
   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
   error: write_xml_stream:     Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
   Could not create '/var/lib/pacemaker/cib/shadow.20008': Success

Could anyone help me interpret these messages and explain what is going on
with my server?

Thanks a lot..


On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > Hi Andrei,
> >
> > Thanks for your quick reply. I still need help, as below:
> >
> > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > > On 06.06.2018 04:27, Albert Weng wrote:
> > > >  Hi All,
> > > >
> > > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > >
> > > > Here is my environment:
> > > > clustera : 192.168.11.1 (passive)
> > > > clusterb : 192.168.11.2 (master)
> > > > clustera-ilo4 : 192.168.11.10
> > > > clusterb-ilo4 : 192.168.11.11
> > > >
> > > > cluster resource status :
> > > >      cluster_fs        started on clusterb
> > > >      cluster_vip       started on clusterb
> > > >      cluster_sid       started on clusterb
> > > >      cluster_listnr    started on clusterb
> > > >
> > > > Both cluster nodes show online status.
> > > >
> > > > I found that my corosync.log contains many records like the ones below:
> > > >
> > > > clustera        pengine:     info:
> > > determine_online_status_fencing:
> > > > Node clusterb is active
> > > > clustera        pengine:     info: determine_online_status:
> > >   Node
> > > > clusterb is online
> > > > clustera        pengine:     info:
> > > determine_online_status_fencing:
> > > > Node clustera is active
> > > > clustera        pengine:     info: determine_online_status:
> > >   Node
> > > > clustera is online
> > > >
> > > > *clustera        pengine:  warning: unpack_rsc_op_failure: Processing
> > > > failed op start for cluster_sid on clustera: unknown error (1)*
> > > > *=> Question: Why is pengine always trying to start cluster_sid on
> > > > the passive node? How do I fix it?*
> > > >
> > >
> > > Pacemaker does not have a concept of a "passive" or "master" node -
> > > it is up to you to decide that when you configure resource placement.
> > > By default pacemaker will attempt to spread resources across all
> > > eligible nodes. You can influence node selection by using constraints.
> > > See
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > > for details.
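> > >
> > > For example, with pcs a location preference for one node might look
> > > like this (group name from your configuration; the score is
> > > illustrative):
> > >
> > >   # pcs constraint location cluster prefers clusterb=100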
> > >
> > > But in any case - all your resources MUST be capable of running on
> > > both nodes, otherwise the cluster makes no sense. If one resource A
> > > depends on something that another resource B provides and can only be
> > > started together with resource B (and after it is ready), you must
> > > tell pacemaker that by using resource colocation and ordering
> > > constraints. See the same document for details.
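> > >
> > > A sketch of such rules with pcs (resource names taken from your
> > > status output; illustrative only, adjust to your real dependencies):
> > >
> > >   # pcs constraint colocation add cluster_sid with cluster_fs INFINITY
> > >   # pcs constraint order cluster_fs then cluster_sid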
> > >
> > > > clustera        pengine:     info: native_print:   ipmi-fence-
> > > clustera
> > > > (stonith:fence_ipmilan):        Started clustera
> > > > clustera        pengine:     info: native_print:   ipmi-fence-
> > > clusterb
> > > > (stonith:fence_ipmilan):        Started clustera
> > > > clustera        pengine:     info: group_print:     Resource
> > > Group: cluster
> > > > clustera        pengine:     info: native_print:
> > > cluster_fs
> > > > (ocf::heartbeat:Filesystem):    Started clusterb
> > > > clustera        pengine:     info: native_print:
> > > cluster_vip
> > > > (ocf::heartbeat:IPaddr2):       Started clusterb
> > > > clustera        pengine:     info: native_print:
> > > cluster_sid
> > > > (ocf::heartbeat:oracle):        Started clusterb
> > > > clustera        pengine:     info: native_print:
> > > > cluster_listnr       (ocf::heartbeat:oralsnr):       Started
> > > clusterb
> > > > clustera        pengine:     info: get_failcount_full:
> > >  cluster_sid has
> > > > failed INFINITY times on clustera
> > > >
> > > >
> > > > *clustera        pengine:  warning: common_apply_stickiness:   Forcing
> > > > cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > > > *=> Question: did too many failed attempts result in forbidding the
> > > > resource from starting on clustera?*
> > > >
> > >
> > > Yes.
> >
> > How can I find out the root cause of the 1000000 failures? Which log
> > will contain the error message?
>
> As an aside, 1,000,000 is "infinity" to pacemaker. It could mean
> 1,000,000 actual failures, or a "fatal" failure that causes pacemaker
> to set the fail count to infinity.
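>
> You can check the current fail count directly, e.g. (resource name from
> your logs):
>
>   # pcs resource failcount show cluster_sid
>   # crm_mon -1 --failcounts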
>
> The most recent failure of each resource will be shown in the status
> display (crm_mon, pcs status, etc.). They will have a basic exit code
> (which you can use to distinguish a timeout from an error received from
> the agent), and if the agent provided one, an "exit-reason". That's the
> first place to look.
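>
> For example, either of these will show the failed actions with their
> exit codes and any exit-reason:
>
>   # crm_mon -1
>   # pcs status --full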
>
> Failures will remain in the status display, and affect the placement of
> resources, until one of two things happens: you manually clean up the
> failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if
> you configured a failure-timeout for the resource, that much time has
> passed with no more failures.
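>
> A sketch of both options (resource/node names from your logs; the
> failure-timeout value is only an example):
>
>   # crm_resource --cleanup --resource cluster_sid --node clustera
>   # pcs resource cleanup cluster_sid
>   # pcs resource meta cluster_sid failure-timeout=10min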
>
> For deeper investigation, check the system log (wherever it's kept on
> your distro). You can use the timestamp from the failure in the status
> to know where to look.
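>
> For instance, on RHEL 7 something along these lines (the timestamp is a
> placeholder; use the one shown for the failure):
>
>   # journalctl --since "2018-06-01 00:00" | grep -i cluster_sid
>   # grep cluster_sid /var/log/messages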
>
> For even more detail, you can look at pacemaker's detail log (the one
> you posted excerpts from). This will have additional messages beyond
> the system log, but they are harder to follow and more intended for
> developers and advanced troubleshooting.
>
> >
> > > > A couple of days ago, clusterb was fenced (STONITH) for an unknown
> > > > reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> > > > successfully; "cluster_sid" and "cluster_listnr" went to "STOP"
> > > > status. As shown in the messages below, is this related to "op start
> > > > for cluster_sid on clustera..."?
> > > >
> > >
> > > Yes. Node clustera is now marked as being incapable of running the
> > > resource, so if node clusterb fails, the resource cannot be started
> > > anywhere.
> > >
> > >
> >
> > How could I fix it? I need some hints for troubleshooting.
> >
> > > > clustera    pengine:  warning: unpack_rsc_op_failure:  Processing
> > > failed op
> > > > start for cluster_sid on clustera: unknown error (1)
> > > > clustera    pengine:     info: native_print:   ipmi-fence-
> > > clustera
> > > > (stonith:fence_ipmilan):        Started clustera
> > > > clustera    pengine:     info: native_print:   ipmi-fence-
> > > clusterb
> > > > (stonith:fence_ipmilan):        Started clustera
> > > > clustera    pengine:     info: group_print:     Resource Group:
> > > cluster
> > > > clustera    pengine:     info: native_print:        cluster_fs
> > > > (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)
> > > > clustera    pengine:     info: native_print:        cluster_vip
> > > > (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)
> > > > clustera    pengine:     info: native_print:        cluster_sid
> > > > (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)
> > > > clustera    pengine:     info: native_print:
> > > cluster_listnr
> > > > (ocf::heartbeat:oralsnr):       Started clusterb (UNCLEAN)
> > > > clustera    pengine:     info: get_failcount_full:
> > >  cluster_sid has
> > > > failed INFINITY times on clustera
> > > > clustera    pengine:  warning: common_apply_stickiness:
> > > Forcing
> > > > cluster_sid away from clustera after 1000000 failures
> > > (max=1000000)
> > > > clustera    pengine:     info: rsc_merge_weights:
> > > cluster_fs: Rolling
> > > > back scores from cluster_sid
> > > > clustera    pengine:     info: rsc_merge_weights:
> > > cluster_vip: Rolling
> > > > back scores from cluster_sid
> > > > clustera    pengine:     info: rsc_merge_weights:
> > > cluster_sid: Rolling
> > > > back scores from cluster_listnr
> > > > clustera    pengine:     info: native_color:   Resource
> > > cluster_sid cannot
> > > > run anywhere
> > > > clustera    pengine:     info: native_color:   Resource
> > > cluster_listnr
> > > > cannot run anywhere
> > > > clustera    pengine:  warning: custom_action:  Action
> > > cluster_fs_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera    pengine:     info: RecurringOp:     Start recurring
> > > monitor
> > > > (20s) for cluster_fs on clustera
> > > > clustera    pengine:  warning: custom_action:  Action
> > > cluster_vip_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera    pengine:     info: RecurringOp:     Start recurring
> > > monitor
> > > > (10s) for cluster_vip on clustera
> > > > clustera    pengine:  warning: custom_action:  Action
> > > cluster_sid_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera    pengine:  warning: custom_action:  Action
> > > cluster_sid_stop_0 on
> > > > clusterb is unrunnable (offline)
> > > > clustera    pengine:  warning: custom_action:  Action
> > > cluster_listnr_stop_0
> > > > on clusterb is unrunnable (offline)
> > > > clustera    pengine:  warning: custom_action:  Action
> > > cluster_listnr_stop_0
> > > > on clusterb is unrunnable (offline)
> > > > clustera    pengine:  warning: stage6: Scheduling Node clusterb
> > > for STONITH
> > > > clustera    pengine:     info: native_stop_constraints:
> > > > cluster_fs_stop_0 is implicit after clusterb is fenced
> > > > clustera    pengine:     info: native_stop_constraints:
> > > > cluster_vip_stop_0 is implicit after clusterb is fenced
> > > > clustera    pengine:     info: native_stop_constraints:
> > > > cluster_sid_stop_0 is implicit after clusterb is fenced
> > > > clustera    pengine:     info: native_stop_constraints:
> > > > cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > > clustera    pengine:     info: LogActions:     Leave   ipmi-
> > > fence-db01
> > > > (Started clustera)
> > > > clustera    pengine:     info: LogActions:     Leave   ipmi-
> > > fence-db02
> > > > (Started clustera)
> > > > clustera    pengine:   notice: LogActions:     Move    cluster_fs
> > > > (Started clusterb -> clustera)
> > > > clustera    pengine:   notice: LogActions:     Move
> > > cluster_vip
> > > > (Started clusterb -> clustera)
> > > > clustera    pengine:   notice: LogActions:     Stop
> > > cluster_sid
> > > > (clusterb)
> > > > clustera    pengine:   notice: LogActions:     Stop
> > > cluster_listnr
> > > > (clusterb)
> > > > clustera    pengine:  warning: process_pe_message:     Calculated
> > > > Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > clustera       crmd:     info: do_state_transition:    State
> > > transition
> > > > S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> > > > cause=C_IPC_MESSAGE origin=handle_response ]
> > > > clustera       crmd:     info: do_te_invoke:   Processing graph
> > > 26821
> > > > (ref=pe_calc-dc-1526868653-26882) derived from
> > > > /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > clustera       crmd:   notice: te_fence_node:  Executing reboot
> > > fencing
> > > > operation (23) on clusterb (timeout=60000)
> > > >
> > > >
> > > > Thanks ~~~~
>
> Ken Gaillot <kgaillot at redhat.com>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Kind regards,
Albert Weng