[ClusterLabs] Pengine always trying to start the resource on the standby node.

Wed Jun 6 20:37:17 EDT 2018

Hi Andrei,

Thanks for your quick reply. I still need help with the following:

On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidjaar at gmail.com>
wrote:

> 06.06.2018 04:27, Albert Weng wrote:
> >  Hi All,
> >
> > I have created an active/passive Pacemaker cluster on RHEL 7.
> >
> > Here is my environment:
> > clustera : 192.168.11.1 (passive)
> > clusterb : 192.168.11.2 (master)
> > clustera-ilo4 : 192.168.11.10
> > clusterb-ilo4 : 192.168.11.11
> >
> > cluster resource status :
> >      cluster_fs        started on clusterb
> >      cluster_vip       started on clusterb
> >      cluster_sid       started on clusterb
> >      cluster_listnr    started on clusterb
> >
> > Both cluster nodes are online.
> >
> > I found that my corosync.log contains many records like the ones below:
> >
> > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> > clustera        pengine:     info: determine_online_status:        Node clusterb is online
> > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> > clustera        pengine:     info: determine_online_status:        Node clustera is online
> >
> > *clustera        pengine:  warning: unpack_rsc_op_failure:  Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> >
>
> Pacemaker does not have a concept of a "passive" or "master" node; it is
> up to you to define that when you configure resource placement. By
> default Pacemaker will attempt to spread resources across all eligible
> nodes. You can influence node selection by using constraints. See
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> for details.
>
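A minimal Python sketch of node selection by score, assuming a simplified model: each node's score is its location-constraint score (missing means 0), a negative score makes the node ineligible, and ties go to the node already running fewer resources as a rough stand-in for the default spreading. `choose_node` and its parameters are hypothetical illustrations, not Pacemaker code. In a real cluster the preference would come from a location constraint (for example `pcs constraint location cluster prefers clusterb=100`, with `cluster` being the resource group name).

```python
# Hypothetical model of location-constraint scoring (not Pacemaker source code).
from typing import Dict, List, Optional

INFINITY = 1_000_000  # Pacemaker's score ceiling, printed as "INFINITY"


def choose_node(nodes: List[str], location_scores: Dict[str, int],
                running_count: Dict[str, int]) -> Optional[str]:
    """Return the eligible node with the highest score, or None.

    nodes: online nodes, in the order they are considered.
    location_scores: {node: score} from location constraints; missing means 0.
    running_count: {node: resources already placed there}; used only to break
        ties, as a rough stand-in for the default "spread resources" behaviour.
    """
    best_key = None
    best_node = None
    for node in nodes:
        # Clamp to the +/-INFINITY range Pacemaker uses for scores.
        score = max(-INFINITY, min(INFINITY, location_scores.get(node, 0)))
        if score < 0:
            continue  # negative score: treated as "may not run here"
        key = (score, -running_count.get(node, 0))
        if best_key is None or key > best_key:
            best_key, best_node = key, node
    return best_node


nodes = ["clustera", "clusterb"]
# No constraints: equal scores, so the less loaded node wins.
print(choose_node(nodes, {}, {"clustera": 2, "clusterb": 0}))  # -> clusterb
# A location score of 100 on clusterb makes it the "active" node.
print(choose_node(nodes, {"clusterb": 100}, {"clusterb": 4}))  # -> clusterb
# Both nodes banned: nowhere to run.
print(choose_node(nodes, {"clustera": -INFINITY, "clusterb": -INFINITY}, {}))  # -> None
```

Without any location constraint there is no "active" node at all, which is the point made above.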
> But in any case, all your resources MUST be capable of running on both
> nodes; otherwise the cluster makes no sense. If resource A depends on
> something that resource B provides, and can be started only together
> with resource B (and after B is ready), you must tell Pacemaker so by
> using colocation and ordering constraints. See the same document for
> details.
>
> > clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> > clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> > clustera        pengine:     info: group_print:     Resource Group: cluster
> > clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb
> > clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb
> > clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb
> > clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb
> > clustera        pengine:     info: get_failcount_full:     cluster_sid has failed INFINITY times on clustera
> >
> >
> > *clustera        pengine:  warning: common_apply_stickiness:        Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > *=> Question: Did too many failed attempts cause the resource to be banned from starting on clustera?*
> >
>
> Yes.
>
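A minimal sketch of why the log can report "1000000 failures", assuming the Pacemaker 1.1 defaults start-failure-is-fatal=true and migration-threshold=INFINITY. `record_failure` and `is_banned` are hypothetical helpers, not Pacemaker functions. Under those defaults a single failed start on clustera is enough to produce the "Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)" line, so the number need not mean a million attempts.

```python
# Hypothetical model of fail-count handling (not Pacemaker source code).
INFINITY = 1_000_000  # Pacemaker's "INFINITY" score/fail count


def record_failure(failcount: int, op: str,
                   start_failure_is_fatal: bool = True) -> int:
    """Return the fail count after one more failed operation.

    With start-failure-is-fatal=true (assumed default), a failed start jumps
    straight to INFINITY; other failures add one, capped at INFINITY.
    """
    if op == "start" and start_failure_is_fatal:
        return INFINITY
    return min(INFINITY, failcount + 1)


def is_banned(failcount: int, migration_threshold: int = INFINITY) -> bool:
    """The node becomes ineligible once failcount reaches migration-threshold.

    migration-threshold is assumed to default to INFINITY; 0 disables the check.
    """
    if migration_threshold == 0:
        return False
    return failcount >= migration_threshold


fc = record_failure(0, "start")
print(fc, is_banned(fc))  # -> 1000000 True: one failed start is enough
fc = record_failure(0, "monitor")
print(fc, is_banned(fc))  # -> 1 False
```

Once the underlying cause is fixed, the fail count still has to be cleared before the cluster will consider clustera again (for example with `pcs resource cleanup cluster_sid`).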

How can I find the root cause of the 1000000 failures? Which log contains
the error message?

>
> > A couple of days ago, clusterb was fenced (STONITH) for an unknown
> > reason. Only "cluster_fs" and "cluster_vip" moved to clustera
> > successfully; "cluster_sid" and "cluster_listnr" ended up in "STOP"
> > status, as in the messages below. Is this related to the
> > "op start for cluster_sid on clustera..." failure?
> >
>
> Yes. Node clustera is now marked as incapable of running the resource,
> so if node clusterb fails, the resource cannot be started anywhere.
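A simplified Python sketch of why only part of the group moved, assuming each group member is colocated with the member before it, so it can only run where its predecessor runs. `place_group` is a hypothetical illustration. The real policy engine also merges later members' scores into the decision and rolls them back when they would leave no eligible node (the "Rolling back scores" lines in the log); this sketch skips that step.

```python
# Hypothetical, simplified model of resource-group placement
# (not Pacemaker's allocator).
from typing import Dict, List, Optional, Set


def place_group(members: List[str], online: Set[str],
                banned: Dict[str, Set[str]],
                preferred: List[str]) -> Dict[str, Optional[str]]:
    """Place group members in start order.

    Each member is colocated with the one before it, so it can only run on
    the node its predecessor got. None means "cannot run anywhere", and once
    a member gets None every later member gets None too.
    """
    placement: Dict[str, Optional[str]] = {}
    node: Optional[str] = None
    for index, rsc in enumerate(members):
        not_here = banned.get(rsc, set())
        if index == 0:
            candidates = [n for n in preferred if n in online and n not in not_here]
            node = candidates[0] if candidates else None
        elif node is not None and node in not_here:
            node = None  # must follow the previous member, but is banned there
        placement[rsc] = node
    return placement


group = ["cluster_fs", "cluster_vip", "cluster_sid", "cluster_listnr"]
result = place_group(group, online={"clustera"},  # clusterb is being fenced
                     banned={"cluster_sid": {"clustera"}},
                     preferred=["clusterb", "clustera"])
for rsc, where in result.items():
    print(rsc, "->", where or "cannot run anywhere")
# -> cluster_fs -> clustera
# -> cluster_vip -> clustera
# -> cluster_sid -> cannot run anywhere
# -> cluster_listnr -> cannot run anywhere
```

This mirrors the "Move cluster_fs", "Move cluster_vip" and "cannot run anywhere" lines in the log below: the ban on cluster_sid also takes out every member after it in the group.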
>

How can I fix it? I need some hints for troubleshooting.

> > clustera    pengine:  warning: unpack_rsc_op_failure:  Processing failed op start for cluster_sid on clustera: unknown error (1)
> > clustera    pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> > clustera    pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> > clustera    pengine:     info: group_print:     Resource Group: cluster
> > clustera    pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)
> > clustera    pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)
> > clustera    pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)
> > clustera    pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb (UNCLEAN)
> > clustera    pengine:     info: get_failcount_full:     cluster_sid has failed INFINITY times on clustera
> > clustera    pengine:  warning: common_apply_stickiness:        Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)
> > clustera    pengine:     info: rsc_merge_weights:      cluster_fs: Rolling back scores from cluster_sid
> > clustera    pengine:     info: rsc_merge_weights:      cluster_vip: Rolling back scores from cluster_sid
> > clustera    pengine:     info: rsc_merge_weights:      cluster_sid: Rolling back scores from cluster_listnr
> > clustera    pengine:     info: native_color:   Resource cluster_sid cannot run anywhere
> > clustera    pengine:     info: native_color:   Resource cluster_listnr cannot run anywhere
> > clustera    pengine:  warning: custom_action:  Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
> > clustera    pengine:     info: RecurringOp:     Start recurring monitor (20s) for cluster_fs on clustera
> > clustera    pengine:  warning: custom_action:  Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
> > clustera    pengine:     info: RecurringOp:     Start recurring monitor (10s) for cluster_vip on clustera
> > clustera    pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > clustera    pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > clustera    pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > clustera    pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > clustera    pengine:  warning: stage6: Scheduling Node clusterb for STONITH
> > clustera    pengine:     info: native_stop_constraints:        cluster_fs_stop_0 is implicit after clusterb is fenced
> > clustera    pengine:     info: native_stop_constraints:        cluster_vip_stop_0 is implicit after clusterb is fenced
> > clustera    pengine:     info: native_stop_constraints:        cluster_sid_stop_0 is implicit after clusterb is fenced
> > clustera    pengine:     info: native_stop_constraints:        cluster_listnr_stop_0 is implicit after clusterb is fenced
> > clustera    pengine:     info: LogActions:     Leave   ipmi-fence-db01 (Started clustera)
> > clustera    pengine:     info: LogActions:     Leave   ipmi-fence-db02 (Started clustera)
> > clustera    pengine:   notice: LogActions:     Move    cluster_fs (Started clusterb -> clustera)
> > clustera    pengine:   notice: LogActions:     Move    cluster_vip (Started clusterb -> clustera)
> > clustera    pengine:   notice: LogActions:     Stop    cluster_sid (clusterb)
> > clustera    pengine:   notice: LogActions:     Stop    cluster_listnr (clusterb)
> > clustera    pengine:  warning: process_pe_message:     Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > clustera       crmd:     info: do_state_transition:    State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > clustera       crmd:     info: do_te_invoke:   Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > clustera       crmd:   notice: te_fence_node:  Executing reboot fencing operation (23) on clusterb (timeout=60000)
> >
> >
> > Thanks ~~~~
> >
> >
> >
> >
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >

-- 
Kind regards,
Albert Weng
