[ClusterLabs] Pengine always trying to start the resource on the standby node.

Ken Gaillot kgaillot at redhat.com
Wed Jun 13 09:52:46 EDT 2018


On Wed, 2018-06-13 at 17:09 +0800, Albert Weng wrote:
> Hi All,
> 
> Thanks for the reply.
> 
> Recently, I ran the following command:
> (clustera) # crm_simulate --xml-file pe-warn.last
> 
> It returned the following results:
>    error: crm_abort:    xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
>    error: crm_element_value:    Couldn't find validate-with in NULL

It looks like pe-warn.last somehow got corrupted. It appears to not be
a full CIB file.

If the original was compressed (.gz/.bz2 extension), and you didn't
uncompress it, re-add the extension -- that's how pacemaker knows to
uncompress it.
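
For example, if the original was one of the pengine inputs such as
pe-warn-7.bz2, something like this (file names here are only an
example) should let crm_simulate read it again:

  # file pe-warn.last        (reports "bzip2 compressed data" if still compressed)
  # mv pe-warn.last pe-warn.last.bz2
  # crm_simulate --xml-file pe-warn.last.bz2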

>    error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
>    Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
>    error: crm_element_value:    Couldn't find validate-with in NULL
>    error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
>    error: crm_element_value:    Couldn't find ignore-dtd in NULL
>    error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
>    error: write_xml_stream:     Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
>    Could not create '/var/lib/pacemaker/cib/shadow.20008': Success
> 
> Could anyone help me interpret these messages and tell me what's going
> on with my server?
> 
> Thanks a lot..
> 
> 
> On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> > On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > > Hi Andrei,
> > > 
> > > Thanks for your quick reply. I still need help, as below:
> > > 
> > > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > > > 06.06.2018 04:27, Albert Weng wrote:
> > > > >  Hi All,
> > > > > 
> > > > > I have created an active/passive Pacemaker cluster on RHEL 7.
> > > > > 
> > > > > Here are my environment:
> > > > > clustera : 192.168.11.1 (passive)
> > > > > clusterb : 192.168.11.2 (master)
> > > > > clustera-ilo4 : 192.168.11.10
> > > > > clusterb-ilo4 : 192.168.11.11
> > > > > 
> > > > > cluster resource status :
> > > > >      cluster_fs        started on clusterb
> > > > >      cluster_vip       started on clusterb
> > > > >      cluster_sid       started on clusterb
> > > > >      cluster_listnr    started on clusterb
> > > > > 
> > > > > Both cluster nodes are online.
> > > > > 
> > > > > I found my corosync.log contains many records like the ones below:
> > > > > 
> > > > > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> > > > > clustera        pengine:     info: determine_online_status:      Node clusterb is online
> > > > > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> > > > > clustera        pengine:     info: determine_online_status:      Node clustera is online
> > > > > 
> > > > > *clustera        pengine:  warning: unpack_rsc_op_failure:  Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> > > > > 
> > > > 
> > > > Pacemaker does not have a concept of a "passive" or "master" node -
> > > > it is up to you to decide that when you configure resource
> > > > placement. By default, pacemaker will attempt to spread resources
> > > > across all eligible nodes. You can influence node selection by
> > > > using constraints. See
> > > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > > > for details.
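> > > > 
> > > > For example (only an illustration, using the group name from your
> > > > logs; the score is arbitrary), a location preference with pcs
> > > > could look like:
> > > > 
> > > >     pcs constraint location cluster prefers clusterb=100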
> > > > 
> > > > But in any case - all your resources MUST be capable of running on
> > > > both nodes, otherwise the cluster makes no sense. If one resource A
> > > > depends on something that another resource B provides, and can be
> > > > started only together with resource B (and after it is ready), you
> > > > must tell pacemaker that by using resource colocation and ordering
> > > > constraints. See the same document for details.
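> > > > 
> > > > With pcs, that would look something like the following (only an
> > > > example; resources that are members of the same group, as yours
> > > > are, already get implicit colocation and ordering):
> > > > 
> > > >     pcs constraint colocation add cluster_sid with cluster_fs INFINITY
> > > >     pcs constraint order cluster_fs then cluster_sid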
> > > > 
> > > > > clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> > > > > clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> > > > > clustera        pengine:     info: group_print:     Resource Group: cluster
> > > > > clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb
> > > > > clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb
> > > > > clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb
> > > > > clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb
> > > > > clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
> > > > > 
> > > > > 
> > > > > *clustera        pengine:  warning: common_apply_stickiness:      Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > > > > *=> Question: did too many failed attempts result in forbidding the resource from starting on clustera?*
> > > > > 
> > > > 
> > > > Yes.
> > > 
> > > How do I find out the root cause of the 1000000 failures? Which log
> > > will contain the error message?
> > 
> > As an aside, 1,000,000 is "infinity" to pacemaker. It could mean
> > 1,000,000 actual failures, or a "fatal" failure that causes pacemaker
> > to set the fail count to infinity.
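> > 
> > You can check the current value directly, for example (resource and
> > node names taken from your configuration):
> > 
> >   pcs resource failcount show cluster_sid
> >   # or, depending on the pacemaker version:
> >   crm_failcount --query --resource cluster_sid --node clustera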
> > 
> > The most recent failure of each resource will be shown in the status
> > display (crm_mon, pcs status, etc.). They will have a basic exit code
> > (which you can use to distinguish a timeout from an error received
> > from the agent), and if the agent provided one, an "exit-reason".
> > That's the first place to look.
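> > 
> > For example, something like:
> > 
> >   crm_mon --one-shot --failcounts
> >   # or: pcs status --full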
> > 
> > Failures will remain in the status display, and affect the placement
> > of resources, until one of two things happens: you manually clean up
> > the failure (crm_resource --cleanup, pcs resource cleanup, etc.), or,
> > if you configured a failure-timeout for the resource, that much time
> > has passed with no more failures.
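> > 
> > For example (resource name from your cluster; the failure-timeout
> > value is only illustrative):
> > 
> >   pcs resource cleanup cluster_sid
> >   # or: crm_resource --cleanup --resource cluster_sid
> > 
> >   # optionally let failures expire on their own after 10 minutes:
> >   pcs resource meta cluster_sid failure-timeout=10min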
> > 
> > For deeper investigation, check the system log (wherever it's kept on
> > your distro). You can use the timestamp from the failure in the status
> > to know where to look.
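> > 
> > On RHEL 7 that is typically /var/log/messages and/or the journal, for
> > example (the --since time is just a placeholder for your failure
> > timestamp):
> > 
> >   grep cluster_sid /var/log/messages
> >   journalctl --since "2018-06-01 00:00" | grep -i cluster_sid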
> > 
> > For even more detail, you can look at pacemaker's detail log (the one
> > you posted excerpts from). This will have additional messages beyond
> > the system log, but they are harder to follow and more intended for
> > developers and advanced troubleshooting.
> > 
> > >  
> > > > > A couple of days ago, clusterb was fenced (STONITH) for an unknown
> > > > > reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> > > > > successfully; "cluster_sid" and "cluster_listnr" went to "STOP"
> > > > > status. As in the messages below, is this related to "op start for
> > > > > cluster_sid on clustera..."?
> > > > > 
> > > > 
> > > > Yes. Node clustera is now marked as being incapable of running the
> > > > resource, so if node clusterb fails, the resource cannot be started
> > > > anywhere.
> > > > 
> > > > 
> > > 
> > > How can I fix it? I need some hints for troubleshooting.
> > >  
> > > > > clustera    pengine:  warning: unpack_rsc_op_failure:  Processing failed op start for cluster_sid on clustera: unknown error (1)
> > > > > clustera    pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> > > > > clustera    pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> > > > > clustera    pengine:     info: group_print:     Resource Group: cluster
> > > > > clustera    pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)
> > > > > clustera    pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)
> > > > > clustera    pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)
> > > > > clustera    pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb (UNCLEAN)
> > > > > clustera    pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
> > > > > clustera    pengine:  warning: common_apply_stickiness:       Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)
> > > > > clustera    pengine:     info: rsc_merge_weights:      cluster_fs: Rolling back scores from cluster_sid
> > > > > clustera    pengine:     info: rsc_merge_weights:      cluster_vip: Rolling back scores from cluster_sid
> > > > > clustera    pengine:     info: rsc_merge_weights:      cluster_sid: Rolling back scores from cluster_listnr
> > > > > clustera    pengine:     info: native_color:   Resource cluster_sid cannot run anywhere
> > > > > clustera    pengine:     info: native_color:   Resource cluster_listnr cannot run anywhere
> > > > > clustera    pengine:  warning: custom_action:  Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera    pengine:     info: RecurringOp:     Start recurring monitor (20s) for cluster_fs on clustera
> > > > > clustera    pengine:  warning: custom_action:  Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera    pengine:     info: RecurringOp:     Start recurring monitor (10s) for cluster_vip on clustera
> > > > > clustera    pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera    pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera    pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera    pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera    pengine:  warning: stage6: Scheduling Node clusterb for STONITH
> > > > > clustera    pengine:     info: native_stop_constraints: cluster_fs_stop_0 is implicit after clusterb is fenced
> > > > > clustera    pengine:     info: native_stop_constraints: cluster_vip_stop_0 is implicit after clusterb is fenced
> > > > > clustera    pengine:     info: native_stop_constraints: cluster_sid_stop_0 is implicit after clusterb is fenced
> > > > > clustera    pengine:     info: native_stop_constraints: cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > > > clustera    pengine:     info: LogActions:     Leave   ipmi-fence-db01 (Started clustera)
> > > > > clustera    pengine:     info: LogActions:     Leave   ipmi-fence-db02 (Started clustera)
> > > > > clustera    pengine:   notice: LogActions:     Move    cluster_fs (Started clusterb -> clustera)
> > > > > clustera    pengine:   notice: LogActions:     Move    cluster_vip (Started clusterb -> clustera)
> > > > > clustera    pengine:   notice: LogActions:     Stop    cluster_sid (clusterb)
> > > > > clustera    pengine:   notice: LogActions:     Stop    cluster_listnr (clusterb)
> > > > > clustera    pengine:  warning: process_pe_message:    Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > > clustera       crmd:     info: do_state_transition:    State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > > > clustera       crmd:     info: do_te_invoke:   Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > > clustera       crmd:   notice: te_fence_node:  Executing reboot fencing operation (23) on clusterb (timeout=60000)
> > > > > 
> > > > > 
> > > > > Thanks ~~~~
> > 
> > Ken Gaillot <kgaillot at redhat.com>
> 
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>


