<div dir="ltr"><div>Hi All,</div><div><br></div><div>Thanks for the reply.</div><div><br></div><div>Recently, I ran the following command:</div><div>(clustera) # crm_simulate --xml-file pe-warn.last</div><div><br></div><div>It returned the following results:</div><div>   error: crm_abort:    xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL<br>   error: crm_element_value:    Couldn't find validate-with in NULL<br>   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL<br>   Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.<br>   error: crm_element_value:    Couldn't find validate-with in NULL<br>   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL<br>   error: crm_element_value:    Couldn't find ignore-dtd in NULL<br>   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: 
Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL<br>   error: crm_abort:    crm_xml_add: Triggered assert at xml.c:2494 : node != NULL<br>   error: write_xml_stream:     Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008<br>   Could not create '/var/lib/pacemaker/cib/shadow.20008': Success<br></div><div><br></div><div>Could anyone help me understand how to read these messages and what is going on with my server?<br></div><div><br></div><div>Thanks a lot.<br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:<br>

> Hi Andrei,<br>

> <br>

> Thanks for your quick reply. I still need help with the points below:<br>

> <br>

> On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <<a href="mailto:arvidjaar@gmail.com">arvidjaar@gmail.com</a>> wrote:<br>

> > 06.06.2018 04:27, Albert Weng wrote:<br>

> > >  Hi All,<br>

> > > <br>

> > > I have created an active/passive Pacemaker cluster on RHEL 7.<br>

> > > <br>

> > > Here is my environment:<br>

> > > clustera : 192.168.11.1 (passive)<br>

> > > clusterb : 192.168.11.2 (master)<br>

> > > clustera-ilo4 : 192.168.11.10<br>

> > > clusterb-ilo4 : 192.168.11.11<br>

> > > <br>

> > > cluster resource status :<br>

> > >      cluster_fs        started on clusterb<br>

> > >      cluster_vip       started on clusterb<br>

> > >      cluster_sid       started on clusterb<br>

> > >      cluster_listnr    started on clusterb<br>

> > > <br>

> > > Both cluster nodes are online.<br>

> > > <br>

> > > I found that my corosync.log contains many records like the ones below:<br>

> > > <br>

> > > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active<br>

> > > clustera        pengine:     info: determine_online_status:       Node clusterb is online<br>

> > > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active<br>

> > > clustera        pengine:     info: determine_online_status:       Node clustera is online<br>

> > > <br>

> > > *clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*<br>

> > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How can I fix it?*<br>

> > > <br>

> > <br>

> > Pacemaker has no concept of a "passive" or "master" node; it is up to<br>

> > you to decide that when you configure resource placement. By default,<br>

> > Pacemaker will attempt to spread resources across all eligible nodes.<br>

> > You can influence node selection by using constraints. See<br>

> > <a href="https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html" rel="noreferrer" target="_blank">https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html</a><br>

> > for details.<br>

> > <br>

> > But in any case, all your resources MUST be capable of running on<br>

> > both nodes; otherwise the cluster makes no sense. If resource A depends<br>

> > on something that another resource B provides, and can be started only<br>

> > together with resource B (and after it is ready), you must tell<br>

> > Pacemaker so by using resource colocation and ordering constraints.<br>

> > See the same document for details.<br>

> > <br>

> > > clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera<br>

> > > clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera<br>

> > > clustera        pengine:     info: group_print:     Resource Group: cluster<br>

> > > clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb<br>

> > > clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb<br>

> > > clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb<br>

> > > clustera        pengine:     info: native_print:        cluster_listnr       (ocf::heartbeat:oralsnr):       Started clusterb<br>

> > > clustera        pengine:     info: get_failcount_full:     cluster_sid has failed INFINITY times on clustera<br>

> > > <br>

> > > <br>

> > > *clustera        pengine:  warning: common_apply_stickiness:       Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*<br>

> > > *=> Question: Do too many failed attempts forbid the resource from starting on clustera?*<br>

> > > <br>

> > <br>

> > Yes.<br>

> <br>

> How can I find the root cause of the 1000000 failures? Which log will<br>

> contain the error message?<br>

<br>

</div></div>As an aside, 1,000,000 is "infinity" to pacemaker. It could mean<br>

1,000,000 actual failures, or a "fatal" failure that causes pacemaker<br>

to set the fail count to infinity.<br>
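
A minimal Python sketch of the "infinity" behaviour described above, assuming the score rules given in Pacemaker Explained (scores are capped at +/-1,000,000, and -INFINITY wins over +INFINITY). The names add_scores and forced_away are made up for illustration; they are not Pacemaker APIs:<br>

```python
INFINITY = 1_000_000  # the value Pacemaker treats as "infinity"


def clamp(score: int) -> int:
    """Keep a score inside the range [-INFINITY, +INFINITY]."""
    return max(-INFINITY, min(INFINITY, score))


def add_scores(a: int, b: int) -> int:
    """Add two scores; infinities absorb ordinary values (illustrative only)."""
    a, b = clamp(a), clamp(b)
    # -INFINITY beats everything, including +INFINITY.
    if -INFINITY in (a, b):
        return -INFINITY
    if INFINITY in (a, b):
        return INFINITY
    return clamp(a + b)


def forced_away(fail_count: int, migration_threshold: int = INFINITY) -> bool:
    """True once the fail count reaches the threshold, which is the situation
    reported by 'Forcing cluster_sid away ... after 1000000 failures (max=1000000)'."""
    return clamp(fail_count) >= migration_threshold


print(add_scores(500, INFINITY))        # ordinary score plus infinity
print(add_scores(INFINITY, -INFINITY))  # -INFINITY wins
print(forced_away(INFINITY))            # a "fatal" failure sets the count to infinity
```

With the default threshold assumed here, a handful of ordinary failures does not ban the node, but a single failure that sets the fail count to infinity does.<br>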

<br>

The most recent failure of each resource will be shown in the status<br>

display (crm_mon, pcs status, etc.). They will have a basic exit code<br>

(which you can use to distinguish a timeout from an error received from<br>

the agent), and if the agent provided one, an "exit-reason". That's the<br>

first place to look.<br>

<br>

Failures will remain in the status display, and affect the placement of<br>

resources, until one of two things happens: you manually clean up the<br>

failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if<br>

you configured a failure-timeout for the resource, that much time has<br>

passed with no more failures.<br>

<br>

For deeper investigation, check the system log (wherever it's kept on<br>

your distro). You can use the timestamp from the failure in the status<br>

to know where to look.<br>

<br>

For even more detail, you can look at pacemaker's detail log (the one<br>

you posted excerpts from). This will have additional messages beyond<br>

the system log, but they are harder to follow and more intended for<br>

developers and advanced troubleshooting.<br>
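
As a rough aid for that kind of log reading, here is a small Python sketch that pulls the failed operations out of detail-log lines. The regular expression is only an assumption based on the wording of the excerpts quoted in this thread, and find_failed_ops is a hypothetical helper name; in practice you would feed it the lines of your own log file:<br>

```python
import re

# Matches the text of lines such as:
#   "... unpack_rsc_op_failure:  Processing failed op start for cluster_sid
#    on clustera: unknown error (1)"
# Groups: operation, resource, node, reason, numeric exit code.
FAILED_OP = re.compile(
    r"Processing\s+failed\s+op\s+(\S+)\s+for\s+(\S+)\s+on\s+([^:\s]+):\s+(.+?)\s+\((\d+)\)"
)


def find_failed_ops(lines):
    """Yield (operation, resource, node, reason, exit_code) for each match."""
    for line in lines:
        match = FAILED_OP.search(line)
        if match:
            op, rsc, node, reason, rc = match.groups()
            yield op, rsc, node, reason, int(rc)


# Two sample lines copied from the excerpts in this thread.
sample = [
    "clustera    pengine:  warning: unpack_rsc_op_failure:  Processing failed op "
    "start for cluster_sid on clustera: unknown error (1)",
    "clustera    pengine:     info: native_color:   Resource cluster_sid cannot run anywhere",
]

failures = list(find_failed_ops(sample))
for op, rsc, node, reason, rc in failures:
    print(f"{rsc}: {op} failed on {node} with exit code {rc} ({reason})")
```

The exit code and node name found this way tell you which node's system log, and which resource agent, to examine around the time of the failure.<br>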

<span class=""><br>

>  <br>

> > > A couple of days ago, clusterb was fenced (STONITH) for an unknown<br>

> > > reason. Only "cluster_fs" and "cluster_vip" moved to clustera<br>

> > > successfully; "cluster_sid" and "cluster_listnr" went to "STOP" status.<br>

> > > Given the messages below, is this related to "op start for cluster_sid<br>

> > > on clustera..."?<br>

> > > <br>

> > <br>

> > Yes. Node clustera is now marked as being incapable of running the<br>

> > resource, so if node clusterb fails, the resource cannot be started<br>

> > anywhere.<br>

> > <br>

> > <br>

> <br>

> How can I fix it? I need some hints for troubleshooting.<br>

>  <br>

> > > clustera    pengine:  warning: unpack_rsc_op_failure:  Processing failed op start for cluster_sid on clustera: unknown error (1)<br>

</span><span class="">> > > clustera    pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera<br>

</span><div><div class="h5">> > > clustera    pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera<br>

> > > clustera    pengine:     info: group_print:     Resource Group: cluster<br>

> > > clustera    pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)<br>

> > > clustera    pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)<br>

> > > clustera    pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)<br>

> > > clustera    pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb (UNCLEAN)<br>

> > > clustera    pengine:     info: get_failcount_full:     cluster_sid has failed INFINITY times on clustera<br>

> > > clustera    pengine:  warning: common_apply_stickiness:        Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)<br>

> > > clustera    pengine:     info: rsc_merge_weights:      cluster_fs: Rolling back scores from cluster_sid<br>

> > > clustera    pengine:     info: rsc_merge_weights:      cluster_vip: Rolling back scores from cluster_sid<br>

> > > clustera    pengine:     info: rsc_merge_weights:      cluster_sid: Rolling back scores from cluster_listnr<br>

> > > clustera    pengine:     info: native_color:   Resource cluster_sid cannot run anywhere<br>

> > > clustera    pengine:     info: native_color:   Resource cluster_listnr cannot run anywhere<br>

> > > clustera    pengine:  warning: custom_action:  Action cluster_fs_stop_0 on clusterb is unrunnable (offline)<br>

> > > clustera    pengine:     info: RecurringOp:     Start recurring monitor (20s) for cluster_fs on clustera<br>

> > > clustera    pengine:  warning: custom_action:  Action cluster_vip_stop_0 on clusterb is unrunnable (offline)<br>

> > > clustera    pengine:     info: RecurringOp:     Start recurring monitor (10s) for cluster_vip on clustera<br>

> > > clustera    pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)<br>

> > > clustera    pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)<br>

> > > clustera    pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)<br>

> > > clustera    pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)<br>

> > > clustera    pengine:  warning: stage6: Scheduling Node clusterb for STONITH<br>

> > > clustera    pengine:     info: native_stop_constraints: cluster_fs_stop_0 is implicit after clusterb is fenced<br>

> > > clustera    pengine:     info: native_stop_constraints: cluster_vip_stop_0 is implicit after clusterb is fenced<br>

> > > clustera    pengine:     info: native_stop_constraints: cluster_sid_stop_0 is implicit after clusterb is fenced<br>

> > > clustera    pengine:     info: native_stop_constraints: cluster_listnr_stop_0 is implicit after clusterb is fenced<br>

> > > clustera    pengine:     info: LogActions:     Leave   ipmi-fence-db01 (Started clustera)<br>

> > > clustera    pengine:     info: LogActions:     Leave   ipmi-fence-db02 (Started clustera)<br>

> > > clustera    pengine:   notice: LogActions:     Move    cluster_fs (Started clusterb -> clustera)<br>

> > > clustera    pengine:   notice: LogActions:     Move    cluster_vip (Started clusterb -> clustera)<br>

> > > clustera    pengine:   notice: LogActions:     Stop    cluster_sid (clusterb)<br>

> > > clustera    pengine:   notice: LogActions:     Stop    cluster_listnr (clusterb)<br>

> > > clustera    pengine:  warning: process_pe_message:     Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2<br>

> > > clustera       crmd:     info: do_state_transition:    State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]<br>

> > > clustera       crmd:     info: do_te_invoke:   Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2<br>

> > > clustera       crmd:   notice: te_fence_node:  Executing reboot fencing operation (23) on clusterb (timeout=60000)<br>

> > > <br>

> > > <br>

> > > Thanks ~~~~<br>

<br>

</div></div>Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>><br>

<div class="HOEnZb"><div class="h5">______________________________<wbr>_________________<br>

Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>

<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/<wbr>mailman/listinfo/users</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/<wbr>doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Kind regards,<br>Albert Weng</div>

</div>