[ClusterLabs] Current DC becomes None suddenly

Pritam Kharat pritam.kharat at oneconvergence.com
Fri Oct 2 18:47:01 UTC 2015


Hi,

I have set up an ACTIVE/PASSIVE HA cluster.

*Issue 1)*

*corosync.conf*  file is

# Please read the openais.conf.5 manual page

totem {

        version: 2

        # How long before declaring a token lost (ms)
        token: 10000

        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 20

        # How long to wait for join messages in the membership protocol (ms)
        join: 10000

        # How long to wait for consensus to be achieved before starting
        # a new round of membership configuration (ms)
        consensus: 12000

        # Turn off the virtual synchrony filter
        vsftype: none

        # Number of messages that may be sent by one processor on
        # receipt of the token
        max_messages: 20

        # Limit generated nodeids to 31-bits (positive signed integers)
        clear_node_high_bit: yes

        # Disable encryption
        secauth: off

        # How many threads to use for encryption/decryption
        threads: 0

        # Optionally assign a fixed node id (integer)
        # nodeid: 1234

        # This specifies the mode of redundant ring, which may be none,
        # active, or passive.
        rrp_mode: none
        interface {
                # The following values need to be set based on your
                # environment
                ringnumber: 0
                bindnetaddr: 192.168.101.0
                mcastport: 5405
        }

        transport: udpu
}

amf {
        mode: disabled
}

quorum {
        # Quorum for the Pacemaker Cluster Resource Manager
        provider: corosync_votequorum
        expected_votes: 1
}


nodelist {

        node {
                ring0_addr: 192.168.101.73
        }

        node {
                ring0_addr: 192.168.101.74
        }
}

aisexec {
        user:   root
        group:  root
}


logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        to_syslog: yes
        syslog_facility: daemon
        logfile: /var/log/corosync/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}
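For context, here is a minimal sketch of how I sanity-check membership after changing this file (standard corosync/pacemaker CLI tools, run on either node; the expected results are assumptions, not captured output):

```shell
# Ring status: ring 0 should report "no faults"
corosync-cfgtool -s

# Quorum/membership view: both nodes should be listed as members
corosync-quorumtool -l

# Pacemaker's one-shot view of the cluster
crm_mon -1
```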

I have added 5 resources: 1 VIP and 4 upstart jobs.
Node names are configured as -> sc-node-1(ACTIVE) and sc-node-2(PASSIVE)
Resources are running on ACTIVE node

Default cluster properties -

      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
value="1.1.10-42f2063"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure"
name="cluster-infrastructure" value="corosync"/>
        <nvpair name="no-quorum-policy" value="ignore"
id="cib-bootstrap-options-no-quorum-policy"/>
        <nvpair name="stonith-enabled" value="false"
id="cib-bootstrap-options-stonith-enabled"/>
        <nvpair name="cluster-recheck-interval" value="3min"
id="cib-bootstrap-options-cluster-recheck-interval"/>
        <nvpair name="default-action-timeout" value="120s"
id="cib-bootstrap-options-default-action-timeout"/>
      </cluster_property_set>
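For reference, a sketch of the equivalent crm shell commands for the non-default properties above (assuming the crm shell; values match the CIB fragment):

```shell
# Set the cluster-wide properties shown in the CIB fragment above
crm configure property no-quorum-policy=ignore
crm configure property stonith-enabled=false
crm configure property cluster-recheck-interval=3min
crm configure property default-action-timeout=120s
```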


But sometimes, after 2-3 migrations from ACTIVE to STANDBY and back from
STANDBY to ACTIVE, both nodes become OFFLINE and the Current DC becomes
None. I have disabled the stonith property, and quorum is ignored.

root at sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
Last updated: Sat Oct  3 00:01:40 2015
Last change: Fri Oct  2 23:38:28 2015 via crm_resource on sc-node-1
Stack: corosync
Current DC: NONE
2 Nodes configured
5 Resources configured

OFFLINE: [ sc-node-1 sc-node-2 ]

What is going wrong here? What causes the Current DC to suddenly become
None? Is corosync.conf okay? Are the default cluster properties fine?
Any help will be appreciated.


*Issue 2)*
Command used to add upstart job is

crm configure primitive service upstart:service meta allow-migrate=true \
  migration-threshold=5 failure-timeout=30s \
  op monitor interval=15s timeout=60s

But I still sometimes see the fail count going to INFINITY. Why? How can
we avoid it? The resource should have migrated as soon as it reached the
migration threshold.
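For what it's worth, this is how I inspect and reset the counters while debugging (crm shell syntax; resource and node names taken from the output that follows):

```shell
# Show the current fail count for resource "service" on sc-node-2
crm resource failcount service show sc-node-2

# Clear the failure history so Pacemaker will try the resource again
crm resource cleanup service sc-node-2
```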

* Node sc-node-2:
   service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct  2 23:38:53 2015'
   service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct  2 23:38:53 2015'

Failed actions:
    service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms): unknown error
    service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms): unknown error



-- 
Thanks and Regards,
Pritam Kharat.