[ClusterLabs] Current DC becomes None suddenly

Ken Gaillot kgaillot at redhat.com
Thu Oct 8 10:43:50 EDT 2015


On 10/02/2015 01:47 PM, Pritam Kharat wrote:
> Hi,
> 
> I have set up an ACTIVE/PASSIVE HA cluster.
> 
> *Issue 1) *
> 
> *corosync.conf*  file is
> 
> # Please read the openais.conf.5 manual page
> 
> totem {
> 
>         version: 2
> 
>         # How long before declaring a token lost (ms)
>         token: 10000
> 
>         # How many token retransmits before forming a new configuration
>         token_retransmits_before_loss_const: 20
> 
>         # How long to wait for join messages in the membership protocol (ms)
>         join: 10000
> 
>         # How long to wait for consensus to be achieved before starting a
> new round of membership configuration (ms)
>         consensus: 12000
> 
>         # Turn off the virtual synchrony filter
>         vsftype: none
> 
>         # Number of messages that may be sent by one processor on receipt
> of the token
>         max_messages: 20
> 
>         # Limit generated nodeids to 31-bits (positive signed integers)
>         clear_node_high_bit: yes
> 
>         # Disable encryption
>         secauth: off
> 
>         # How many threads to use for encryption/decryption
>         threads: 0
> 
>         # Optionally assign a fixed node id (integer)
>         # nodeid: 1234
> 
>         # This specifies the mode of redundant ring, which may be none,
> active, or passive.
>         rrp_mode: none
>         interface {
>                 # The following values need to be set based on your
> environment
>                 ringnumber: 0
>                 bindnetaddr: 192.168.101.0
>                 mcastport: 5405
>         }
> 
>         transport: udpu
> }
> 
> amf {
>         mode: disabled
> }
> 
> quorum {
>         # Quorum for the Pacemaker Cluster Resource Manager
>         provider: corosync_votequorum
>         expected_votes: 1

If you're using a recent version of corosync, use "two_node: 1" instead
of "expected_votes: 1", and get rid of "no-quorum-policy: ignore" in the
pacemaker cluster options.
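
For example, with corosync 2.x and votequorum, a two-node quorum section
might look like this (just a sketch; note that two_node implies
wait_for_all, so both nodes must be seen at least once before the cluster
will start resources):

quorum {
        provider: corosync_votequorum
        two_node: 1
}

With that in place, no-quorum-policy can be left at its default on the
pacemaker side.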

> }
> 
> 
> nodelist {
> 
>         node {
>                 ring0_addr: 192.168.101.73
>         }
> 
>         node {
>                 ring0_addr: 192.168.101.74
>         }
> }
> 
> aisexec {
>         user:   root
>         group:  root
> }
> 
> 
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: yes
>         to_syslog: yes
>         syslog_facility: daemon
>         logfile: /var/log/corosync/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
> 
> And I have added 5 resources - 1 is a VIP and 4 are upstart jobs.
> Node names are configured as -> sc-node-1 (ACTIVE) and sc-node-2 (PASSIVE).
> Resources are running on the ACTIVE node.
> 
> Default cluster properties -
> 
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
> value="1.1.10-42f2063"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
> name="cluster-infrastructure" value="corosync"/>
>         <nvpair name="no-quorum-policy" value="ignore"
> id="cib-bootstrap-options-no-quorum-policy"/>
>         <nvpair name="stonith-enabled" value="false"
> id="cib-bootstrap-options-stonith-enabled"/>
>         <nvpair name="cluster-recheck-interval" value="3min"
> id="cib-bootstrap-options-cluster-recheck-interval"/>
>         <nvpair name="default-action-timeout" value="120s"
> id="cib-bootstrap-options-default-action-timeout"/>
>       </cluster_property_set>
> 
> 
> But sometimes, after 2-3 migrations from ACTIVE to STANDBY and then from
> STANDBY to ACTIVE, both nodes become OFFLINE and Current DC becomes None.
> I have disabled the stonith property and even quorum is ignored.

Disabling stonith isn't helping you. The cluster needs stonith to
recover from difficult situations, so it's easier to get into weird
states like this without it.

> root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
> Last updated: Sat Oct  3 00:01:40 2015
> Last change: Fri Oct  2 23:38:28 2015 via crm_resource on sc-node-1
> Stack: corosync
> Current DC: NONE
> 2 Nodes configured
> 5 Resources configured
> 
> OFFLINE: [ sc-node-1 sc-node-2 ]
> 
> What is going wrong here? What is the reason for the Current DC becoming
> None suddenly? Is corosync.conf okay? Are the default cluster properties
> fine? Help will be appreciated.

I'd recommend seeing how the problem behaves with stonith enabled, but
in any case you'll need to dive into the logs to figure out what starts
the chain of events.
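
As a rough sketch (assuming your nodes have IPMI-capable management
controllers; the fence_ipmilan parameters below are placeholders and vary
by fence-agents version), fencing could be configured with something like:

crm configure primitive fence-sc-node-1 stonith:fence_ipmilan \
        params pcmk_host_list="sc-node-1" ipaddr="<bmc-ip>" \
        login="<user>" passwd="<password>"
crm configure location fence-sc-node-1-placement fence-sc-node-1 -inf: sc-node-1
crm configure property stonith-enabled=true

(and similarly for sc-node-2). When the DC drops to NONE again, look at
/var/log/corosync/corosync.log and syslog on both nodes around that
timestamp; they should show whether the token was lost, the membership
changed, or one of the pacemaker daemons exited.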

> 
> *Issue 2)*
> The command used to add an upstart job is:
> 
> crm configure primitive service upstart:service meta allow-migrate=true
> migration-threshold=5 failure-timeout=30s op monitor interval=15s
>  timeout=60s
> 
> But still, sometimes I see the fail count going to INFINITY. Why? How can
> we avoid it? The resource should have migrated as soon as it reached the
> migration threshold.
> 
> * Node sc-node-2:
>    service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct
>  2 23:38:53 2015'
>    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct
>  2 23:38:53 2015'
> 
> Failed actions:
>     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> last-rc-change=Fri Oct  2 23:38:53 2015
> , queued=0ms, exec=0ms
> ): unknown error
>     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> last-rc-change=Fri Oct  2 23:38:53 2015
> , queued=0ms, exec=0ms

migration-threshold is used for monitor failures, not (by default) start
or stop failures.

This is a start failure, which (by default) makes the fail-count go to
infinity. The rationale is that a monitor failure indicates some sort of
temporary error, but failing to start could well mean that something is
wrong with the installation or configuration.

You can tell the cluster to apply migration-threshold to start failures
too, by setting the start-failure-is-fatal=false cluster option.
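
For example:

crm configure property start-failure-is-fatal=false
crm resource cleanup service
crm resource cleanup service1

(the cleanup calls just clear the existing INFINITY fail-counts). With
start-failure-is-fatal=false, a failed start increments the fail-count by
one, so the resource is retried on the same node until migration-threshold
is reached.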




