[ClusterLabs] EXTERNAL: Re: Pacemaker not reacting as I would expect when two resources fail at the same time

Harvey Shepherd Harvey.Shepherd at Aviatnet.com
Sun Jun 9 08:44:50 EDT 2019


Thanks for your input Andrei - I appreciate you taking the time to look over my issue. I could be completely wrong, but I don't believe this issue is caused by the OCF script. Originally I had the migration-threshold for the resource set to zero. When I killed the master instance of the resource to test recovery, the monitor action was correctly triggered and returned OCF_ERR_GENERIC, so Pacemaker recognised that the resource had failed. However, it did not attempt to restart it, nor did it promote the resource on the other node; it left the resource in the "failed" state on one node and running as "slave" on the other. To try to overcome that, I changed the migration-threshold to 1 to force a failover, but whilst the logs (supplied previously) indicate that Pacemaker tried to migrate the resource, the transition was terminated by pacemaker-controld before recovery completed. If I clear the failure history, Pacemaker will then restart the resource, but I shouldn't need to intervene.
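
(For reference, when I do intervene I'm clearing the failure with something along the lines of the command below - resource and node names as per the crm_mon output further down:

    crm_resource --cleanup --resource main_system --node secondary

but my expectation is that Pacemaker should recover without that step.)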

Unfortunately the OCF script contains company-confidential code, so I can't post the whole thing, but I've posted the relevant parts that set the master scores below. First, these are the state-change logs that occurred for the resource on both nodes just after I killed the master instance, which was running on the node named "secondary". As you can see, Pacemaker triggered a DEMOTE action on both nodes, then "primary" received "PRE-PROMOTE" and "POST-PROMOTE" notifications, but Pacemaker didn't seem to trigger a "PROMOTE" action. It then sent a "PRE-STOP" notification, and that was the final operation before the bad state was reached:

2019 Jun  8 01:52:08.449 daemon.err VIRTUAL main_system-ocf(main_system) 18569 ERROR: secondary: The process has died unexpectedly
2019 Jun  8 01:52:08.670 daemon.info VIRTUAL main_system-ocf(main_system) 18692 INFO: secondary: NOTIFY state entered: pre-demote
2019 Jun  8 01:52:08.788 daemon.info VIRTUAL main_system-ocf(main_system) 18751 INFO: secondary: DEMOTE state entered
2019 Jun  8 01:52:09.100 daemon.info VIRTUAL main_system-ocf(main_system) 18898 INFO: secondary: NOTIFY state entered: post-demote
2019 Jun  8 01:52:09.256 daemon.info VIRTUAL main_system-ocf(main_system) 18939 INFO: secondary: NOTIFY state entered: pre-stop
2019 Jun  8 01:50:45.293 daemon.info VIRTUAL main_system-ocf(main_system) 572 INFO: primary: NOTIFY state entered: pre-demote
2019 Jun  8 01:50:45.396 daemon.info VIRTUAL main_system-ocf(main_system) 624 INFO: primary: DEMOTE state entered
2019 Jun  8 01:50:45.516 daemon.info VIRTUAL main_system-ocf(main_system) 624 INFO: primary: Starting the process in BACKUP mode
2019 Jun  8 01:50:45.776 daemon.info VIRTUAL main_system-ocf(main_system) 858 INFO: primary: NOTIFY state entered: post-demote
2019 Jun  8 01:50:45.963 daemon.info VIRTUAL metaswitch-cp-ocf(metaswitch_cp) 896 INFO: primary: NOTIFY state entered: pre-stop
2019 Jun  8 01:51:26.861 daemon.info VIRTUAL metaswitch-cp-ocf(metaswitch_cp) 1389 INFO: primary: NOTIFY state entered: pre-promote
2019 Jun  8 01:51:27.960 daemon.info VIRTUAL metaswitch-cp-ocf(metaswitch_cp) 1497 INFO: primary: NOTIFY state entered: post-promote
2019 Jun  8 01:52:09.093 daemon.info VIRTUAL metaswitch-cp-ocf(metaswitch_cp) 2545 INFO: primary: NOTIFY state entered: pre-demote
2019 Jun  8 01:52:09.491 daemon.info VIRTUAL metaswitch-cp-ocf(metaswitch_cp) 2742 INFO: primary: NOTIFY state entered: post-demote
2019 Jun  8 01:52:09.644 daemon.info VIRTUAL metaswitch-cp-ocf(metaswitch_cp) 2776 INFO: primary: NOTIFY state entered: pre-stop


main_status() {
    local proc status
                                                                             
    # Check for a running process.
    proc=$(ps -ef | grep '[m]aind')
    if [[ ! "$proc" =~ "maind" ]]; then
        # No process running.
        return $OCF_NOT_RUNNING
    fi
                                                                                                   
    status=$(crm_resource -W -r m_main_process | grep Master | grep $(ocf_local_nodename))
    if [ -z "$status" ]; then
        # Running as slave so set a low master preference. If the master fails
        # right now, and there is another slave that does not lag behind the
        # master, its higher master preference will win and that slave will become
        # the new master
        crm_master -l reboot -v 5
        return $OCF_SUCCESS
    fi

    # Running as master so set a high master preference.
    crm_master -l reboot -v 100
    return $OCF_RUNNING_MASTER
}

main_monitor() {
    local rc

    main_status
    rc=$?
    if [ $rc -eq $OCF_NOT_RUNNING ] && [[ "$(crm_resource -W -r m_main_process)" =~ "$(ocf_local_nodename)" ]]; then
        ocf_log err "$(ocf_local_nodename): The process has died unexpectedly"
        crm_master -l reboot -v 1
        return $OCF_ERR_GENERIC
    fi

    return $rc
}

main_start_primary() {
    local status timeout

    ocf_log info "$(ocf_local_nodename): Starting the process in PRIMARY mode"
    start-stop-daemon -S --background -q -n $DESC --exec $DAEMON -- primary || exit $OCF_ERR_GENERIC

    timeout=$((SECONDS+5))
    while [ $SECONDS -lt $timeout ]; do
        main_status
        status=$?
        if [ $status -eq $OCF_RUNNING_MASTER ]; then
            crm_master -l reboot -v 100
            return $OCF_SUCCESS
        fi
    done

    ocf_exit_reason "Failed to start process in primary mode"
    exit $OCF_ERR_GENERIC
}

main_start_backup() {
    local status timeout

    ocf_log info "$(ocf_local_nodename): Starting the process in BACKUP mode"
    ocf_run start-stop-daemon -S --background -q -n $DESC --exec $DAEMON -- backup || exit $OCF_ERR_GENERIC

    timeout=$((SECONDS+5))
    while [ $SECONDS -lt $timeout ]; do
        # In the single node case, Pacemaker will return that the process is running
        # as master. So to avoid a failure here, accept a status of OCF_SUCCESS
        # or OCF_RUNNING_MASTER.
        main_status
        status=$?
        if [ $status -ne $OCF_NOT_RUNNING ]; then
            crm_master -l reboot -v 5
            return $OCF_SUCCESS
        fi
    done

    ocf_exit_reason "Failed to start process in backup mode"
    exit $OCF_ERR_GENERIC
}

main_start() {
    # pacemaker resources must start as backup until promoted
    main_start_backup
    ocf_run crm_resource -C -r m_main_process
                                                                                                   
    return $OCF_SUCCESS                                                                            
}

main_promote() {
    local rc

    main_status
    rc=$?
    case "$rc" in
        "$OCF_SUCCESS")
            # Running as slave.
            main_start_primary
            ;;
        "$OCF_RUNNING_MASTER")
            # Already a master. Unexpected, but not a problem.
            ocf_log info "Resource is already running as Master"
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Shouldn't really happen but just start in primary mode.
            main_start_primary
            ;;
        *)
            # Failed resource. Let the cluster manager recover.
            ocf_log err "Unexpected error, cannot promote"
            exit $rc
            ;;
    esac
    return $OCF_SUCCESS
}

main_demote() {
    main_start_backup
    return $OCF_SUCCESS
}

Thanks again for any help you can provide.

Regards,
Harvey

________________________________________
From: Users <users-bounces at clusterlabs.org> on behalf of Andrei Borzenkov <arvidjaar at gmail.com>
Sent: Saturday, 8 June 2019 7:00 p.m.
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] EXTERNAL: Re: Pacemaker not reacting as I would expect when two resources fail at the same time

08.06.2019 5:12, Harvey Shepherd wrote:
> Thank you for your advice Ken. Sorry for the delayed reply - I was trying out a few things and trying to capture extra info. The changes that you suggested make sense, and I have incorporated them into my config. However, the original issue remains whereby Pacemaker does not attempt to restart the failed m_main_system process. I tried setting the migration-threshold of that resource to 1, to try to get Pacemaker to force it to be promoted on the other node, but this had no effect - the master instance remains "failed" and the slave instance remains "running" but is not promoted.

As far as I understand, for a clone instance to be promoted on a node, that
node must have an explicit master score or a location constraint for the
clone. The master score is normally set by the resource agent.
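
If it helps, you can usually see the current promotion scores with something
like "crm_simulate -sL" (show scores against the live CIB), to check whether
the surviving slave actually has a positive master score at that point.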

> Snipped output from crm_mon:
>
> Current DC: primary (version unknown) - partition with quorum
> Last updated: Sat Jun  8 02:04:05 2019
> Last change: Sat Jun  8 01:51:25 2019 by hacluster via crmd on primary
>
> 2 nodes configured
> 26 resources configured
>
> Online: [ primary secondary ]
>
> Active resources:
>
>  Clone Set: m_main_system [main_system] (promotable)
>      main_system      (ocf::main_system-ocf):    FAILED secondary
>      Slaves: [ primary ]
>
> Migration Summary:
> * Node secondary:
>    main_system: migration-threshold=1 fail-count=1 last-failure='Sat Jun  8 01:52:08 2019'
>
> Failed Resource Actions:
> * main_system_monitor_10000 on secondary 'unknown error' (1): call=214, status=complete, exitreason='',
>     last-rc-change='Sat Jun  8 01:52:08 2019', queued=0ms, exec=0ms
>
>
> From the logs I see:
>
> 2019 Jun  8 01:52:09.574 daemon.warning VIRTUAL pacemaker-schedulerd 1131  warning: Processing failed monitor of main_system:1 on secondary: unknown error
> 2019 Jun  8 01:52:09.586 daemon.warning VIRTUAL pacemaker-schedulerd 1131  warning: Forcing m_main_system away from secondary after 1 failures (max=1)
> 2019 Jun  8 01:52:09.586 daemon.warning VIRTUAL pacemaker-schedulerd 1131  warning: Forcing m_main_system away from secondary after 1 failures (max=1)
> 2019 Jun  8 01:52:10.692 daemon.warning VIRTUAL pacemaker-controld 1132  warning: Transition 35 (Complete=33, Pending=0, Fired=0, Skipped=0, Incomplete=67, Source=/var/lib/pacemaker/pengine/pe-input-47.bz2): Terminated

Making this file available may help to determine why it decided not to
promote the resource.
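
You may also be able to replay that file yourself with something like

  crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-47.bz2

to see the scores and actions the scheduler computed for that transition.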

> 2019 Jun  8 01:52:10.692 daemon.warning VIRTUAL pacemaker-controld 1132  warning: Transition failed: terminated
>
>
> Do you have any further suggestions? For your information I've upgraded Pacemaker to 2.0.2, but the behaviour is the same.
>
> Thanks,
> Harvey
> ________________________________________
> From: Users <users-bounces at clusterlabs.org> on behalf of Ken Gaillot <kgaillot at redhat.com>
> Sent: Saturday, 1 June 2019 5:40 a.m.
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: EXTERNAL: Re: [ClusterLabs] Pacemaker not reacting as I would expect when two resources fail at the same time
>
> On Thu, 2019-05-30 at 23:39 +0000, Harvey Shepherd wrote:
>> Hi All,
>>
>> I'm running Pacemaker 2.0.1 on a cluster containing two nodes; one
>> master and one slave. I have a main master/slave resource
>> (m_main_system), a group of resources that run in active-active mode
>> (active_active - i.e. run on both nodes), and a group that runs in
>> active-disabled mode (snmp_active_disabled - resources only run on
>> the current promoted master). The snmp_active_disabled group is
>> configured to be co-located with the master of m_main_system, so only
>> a failure of the master m_main_system resource can trigger a
>> failover. The constraints specify that m_main_system must be started
>> before snmp_active_disabled.
>>
>> The problem I'm having is that when a resource in the
>> snmp_active_disabled group fails and gets into a constant cycle where
>> Pacemaker tries to restart it, and I then kill m_main_system on the
>> master, then Pacemaker still constantly tries to restart the failed
>> snmp_active_disabled resource and ignores the more important
>> m_main_system process which should be triggering a failover. If I
>> stabilise the snmp_active_disabled resource then Pacemaker finally
>> acts on the m_main_system failure. I hope I've described this well
>> enough, but I've included a cut down form of my CIB config below if
>> it helps!
>>
>> Is this a bug or an error in my config? Perhaps the order in which
>> the groups are defined in the CIB matters despite the constraints?
>> Any help would be gratefully received.
>>
>> Thanks,
>> Harvey
>>
>> <configuration>
>>   <crm_config>
>>     <cluster_property_set id="cib-bootstrap-options">
>>       <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-
>> options-stonith-enabled"/>
>>       <nvpair name="no-quorum-policy" value="ignore" id="cib-
>> bootstrap-options-no-quorum-policy"/>
>>       <nvpair name="have-watchdog" value="false" id="cib-bootstrap-
>> options-have-watchdog"/>
>>       <nvpair name="cluster-name" value="lbcluster" id="cib-
>> bootstrap-options-cluster-name"/>
>>       <nvpair name="start-failure-is-fatal" value="false" id="cib-
>> bootstrap-options-start-failure-is-fatal"/>
>>       <nvpair name="cluster-recheck-interval" value="0s" id="cib-
>> bootstrap-options-cluster-recheck-interval"/>
>>     </cluster_property_set>
>>   </crm_config>
>>   <nodes>
>>     <node id="1" uname="primary"/>
>>     <node id="2" uname="secondary"/>
>>   </nodes>
>>   <resources>
>>     <group id="snmp_active_disabled">
>>         <primitive id="snmpd" class="lsb" type="snmpd">
>>           <operations>
>>             <op name="monitor" interval="10s" id="snmpd-monitor-
>> 10s"/>
>>             <op name="start" interval="0" timeout="30s" id="snmpd-
>> start-30s"/>
>>             <op name="stop" interval="0" timeout="30s" id="snmpd-
>> stop-30s"/>
>>           </operations>
>>         </primitive>
>>         <primitive id="snmp-auxiliaries" class="lsb" type="snmp-
>> auxiliaries">
>>           <operations>
>>             <op name="monitor" interval="10s" id="snmp-auxiliaries-
>> monitor-10s"/>
>>             <op name="start" interval="0" timeout="30s" id="snmp-
>> auxiliaries-start-30s"/>
>>             <op name="stop" interval="0" timeout="30s" id="snmp-
>> auxiliaries-stop-30s"/>
>>           </operations>
>>         </primitive>
>>     </group>
>>     <clone id="clone_active_active">
>>       <meta_attributes id="clone_active_active_meta_attributes">
>>         <nvpair id="group-unique" name="globally-unique"
>> value="false"/>
>>       </meta_attributes>
>>       <group id="active_active">
>>         <primitive id="logd" class="lsb" type="logd">
>>           <operations>
>>             <op name="monitor" interval="10s" id="logd-monitor-10s"/>
>>             <op name="start" interval="0" timeout="30s" id="logd-
>> start-30s"/>
>>             <op name="stop" interval="0" timeout="30s" id="logd-stop-
>> 30s"/>
>>           </operations>
>>         </primitive>
>>         <primitive id="serviced" class="lsb" type="serviced">
>>           <operations>
>>             <op name="monitor" interval="10s" id="serviced-monitor-
>> 10s"/>
>>             <op name="start" interval="0" timeout="30s" id="serviced-
>> start-30s"/>
>>             <op name="stop" interval="0" timeout="30s" id="serviced-
>> stop-30s"/>
>>           </operations>
>>         </primitive>
>>       </group>
>>     </clone>
>>     <master id="m_main_system">
>>       <meta_attributes id="m_main_system-meta_attributes">
>>         <nvpair name="notify" value="true" id="m_main_system-
>> meta_attributes-notify"/>
>>         <nvpair name="clone-max" value="2" id="m_main_system-
>> meta_attributes-clone-max"/>
>>         <nvpair name="promoted-max" value="1" id="m_main_system-
>> meta_attributes-promoted-max"/>
>>         <nvpair name="promoted-node-max" value="1" id="m_main_system-
>> meta_attributes-promoted-node-max"/>
>>       </meta_attributes>
>>       <primitive id="main_system" class="ocf" provider="acme"
>> type="main-system-ocf">
>>         <operations>
>>           <op name="start" interval="0" timeout="120s"
>> id="main_system-start-0"/>
>>           <op name="stop" interval="0" timeout="120s"
>> id="main_system-stop-0"/>
>>           <op name="promote" interval="0" timeout="120s"
>> id="main_system-promote-0"/>
>>           <op name="demote" interval="0" timeout="120s"
>> id="main_system-demote-0"/>
>>           <op name="monitor" interval="10s" timeout="10s"
>> role="Master" id="main_system-monitor-10s"/>
>>           <op name="monitor" interval="11s" timeout="10s"
>> role="Slave" id="main_system-monitor-11s"/>
>>           <op name="notify" interval="0" timeout="60s"
>> id="main_system-notify-0"/>
>>          </operations>
>>        </primitive>
>>     </master>
>>   </resources>
>>   <constraints>
>>     <rsc_colocation id="master_only_snmp_rscs_with_main_system"
>> score="INFINITY" rsc="snmp_active_disabled" with-rsc="m_main_system"
>> with-rsc-role="Master"/>
>>     <rsc_order id="snmp_active_disabled_after_main_system"
>> kind="Mandatory" first="m_main_system" then="snmp_active_disabled"/>
>
> You want first-action="promote" in the above constraint, otherwise the
> slave being started (or the master being started but not yet promoted)
> is sufficient to start snmp_active_disabled (even though the colocation
> ensures it will only be started on the same node where the master will
> be).
>
> I'm not sure if that's related to your issue, but it's worth trying
> first.
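>
> For example (untested sketch, reusing your existing constraint id):
>
>     <rsc_order id="snmp_active_disabled_after_main_system" kind="Mandatory"
>       first="m_main_system" first-action="promote" then="snmp_active_disabled"/>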
>
>>     <rsc_order id="active_active_after_main_system" kind="Mandatory"
>> first="m_main_system" then="clone_active_active"/>
>
> You may also want to set interleave to true on clone_active_active, if
> you want it to depend only on the local instance of m_main_system, and
> not both instances.
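>
> For example (again just a sketch), something like this inside
> clone_active_active's meta_attributes:
>
>     <nvpair id="clone_active_active-interleave" name="interleave" value="true"/>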
>
>>   </constraints>
>>   <rsc_defaults>
>>     <meta_attributes id="rsc-options">
>>       <nvpair name="resource-stickiness" value="1" id="rsc-options-
>> resource-stickiness"/>
>>       <nvpair name="migration-threshold" value="0" id="rsc-options-
>> migration-threshold"/>
>>       <nvpair name="requires" value="nothing" id="rsc-options-
>> requires"/>
>>     </meta_attributes>
>>   </rsc_defaults>
>> </configuration>
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

