[ClusterLabs] Current DC becomes None suddenly

Ken Gaillot kgaillot at redhat.com
Thu Oct 8 11:21:50 EDT 2015


On 10/08/2015 09:55 AM, Pritam Kharat wrote:
> Hi Ken,
> 
> Thanks for reply.
> 
> On Thu, Oct 8, 2015 at 8:13 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> On 10/02/2015 01:47 PM, Pritam Kharat wrote:
>>> Hi,
>>>
>>> I have set up an ACTIVE/PASSIVE HA cluster
>>>
>>> *Issue 1) *
>>>
>>> *corosync.conf*  file is
>>>
>>> # Please read the openais.conf.5 manual page
>>>
>>> totem {
>>>
>>>         version: 2
>>>
>>>         # How long before declaring a token lost (ms)
>>>         token: 10000
>>>
>>>         # How many token retransmits before forming a new configuration
>>>         token_retransmits_before_loss_const: 20
>>>
>>>         # How long to wait for join messages in the membership protocol (ms)
>>>         join: 10000
>>>
>>>         # How long to wait for consensus to be achieved before starting a
>>> new round of membership configuration (ms)
>>>         consensus: 12000
>>>
>>>         # Turn off the virtual synchrony filter
>>>         vsftype: none
>>>
>>>         # Number of messages that may be sent by one processor on receipt
>>> of the token
>>>         max_messages: 20
>>>
>>>         # Limit generated nodeids to 31-bits (positive signed integers)
>>>         clear_node_high_bit: yes
>>>
>>>         # Disable encryption
>>>         secauth: off
>>>
>>>         # How many threads to use for encryption/decryption
>>>         threads: 0
>>>
>>>         # Optionally assign a fixed node id (integer)
>>>         # nodeid: 1234
>>>
>>>         # This specifies the mode of redundant ring, which may be none,
>>> active, or passive.
>>>         rrp_mode: none
>>>         interface {
>>>                 # The following values need to be set based on your
>>> environment
>>>                 ringnumber: 0
>>>                 bindnetaddr: 192.168.101.0
>>>                 mcastport: 5405
>>>         }
>>>
>>>         transport: udpu
>>> }
>>>
>>> amf {
>>>         mode: disabled
>>> }
>>>
>>> quorum {
>>>         # Quorum for the Pacemaker Cluster Resource Manager
>>>         provider: corosync_votequorum
>>>         expected_votes: 1
>>
>> If you're using a recent version of corosync, use "two_node: 1" instead
>> of "expected_votes: 1", and get rid of "no-quorum-policy: ignore" in the
>> pacemaker cluster options.
>>
>    -> We are using corosync version 2.3.3. Do we need to make the above
> mentioned change for this version?

Yes, you can use two_node.

FYI, two_node automatically enables wait_for_all, which means that when
a node first starts up, it waits until it can see the other node before
forming the cluster. So once the cluster is running, it can handle the
failure of one node, and the other will continue. But to start, both
nodes need to be present.
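
For reference, a two-node quorum section would then look something like
this (a minimal sketch; adapt values to your environment):

quorum {
        provider: corosync_votequorum
        two_node: 1
}

and you can drop the no-quorum-policy=ignore nvpair from
cib-bootstrap-options, or set it back to its default, e.g.:

crm configure property no-quorum-policy=stop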

>>> }
>>>
>>>
>>> nodelist {
>>>
>>>         node {
>>>                 ring0_addr: 192.168.101.73
>>>         }
>>>
>>>         node {
>>>                 ring0_addr: 192.168.101.74
>>>         }
>>> }
>>>
>>> aisexec {
>>>         user:   root
>>>         group:  root
>>> }
>>>
>>>
>>> logging {
>>>         fileline: off
>>>         to_stderr: yes
>>>         to_logfile: yes
>>>         to_syslog: yes
>>>         syslog_facility: daemon
>>>         logfile: /var/log/corosync/corosync.log
>>>         debug: off
>>>         timestamp: on
>>>         logger_subsys {
>>>                 subsys: AMF
>>>                 debug: off
>>>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>>>         }
>>> }
>>>
>>> And I have added 5 resources - 1 is VIP and 4 are upstart jobs
>>> Node names are configured as -> sc-node-1(ACTIVE) and sc-node-2(PASSIVE)
>>> Resources are running on ACTIVE node
>>>
>>> Default cluster properties -
>>>
>>>       <cluster_property_set id="cib-bootstrap-options">
>>>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>>> value="1.1.10-42f2063"/>
>>>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>>> name="cluster-infrastructure" value="corosync"/>
>>>         <nvpair name="no-quorum-policy" value="ignore"
>>> id="cib-bootstrap-options-no-quorum-policy"/>
>>>         <nvpair name="stonith-enabled" value="false"
>>> id="cib-bootstrap-options-stonith-enabled"/>
>>>         <nvpair name="cluster-recheck-interval" value="3min"
>>> id="cib-bootstrap-options-cluster-recheck-interval"/>
>>>         <nvpair name="default-action-timeout" value="120s"
>>> id="cib-bootstrap-options-default-action-timeout"/>
>>>       </cluster_property_set>
>>>
>>>
>>> But sometimes after 2-3 migrations from ACTIVE to STANDBY and then from
>>> STANDBY to ACTIVE,
>>> both nodes become OFFLINE and Current DC becomes None. I have disabled
>>> the stonith property and even quorum is ignored.
>>
>> Disabling stonith isn't helping you. The cluster needs stonith to
>> recover from difficult situations, so it's easier to get into weird
>> states like this without it.
>>
>>> root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
>>> Last updated: Sat Oct  3 00:01:40 2015
>>> Last change: Fri Oct  2 23:38:28 2015 via crm_resource on sc-node-1
>>> Stack: corosync
>>> Current DC: NONE
>>> 2 Nodes configured
>>> 5 Resources configured
>>>
>>> OFFLINE: [ sc-node-1 sc-node-2 ]
>>>
>>> What is going wrong here? What is the reason for the Current DC becoming
>>> None suddenly? Is corosync.conf okay? Are the default cluster properties
>>> fine? Help will be appreciated.
>>
>> I'd recommend seeing how the problem behaves with stonith enabled, but
>> in any case you'll need to dive into the logs to figure what starts the
>> chain of events.
>>
>>
>    -> We are seeing this issue when we try rebooting the VMs.

For VMs, fence_virtd/fence_xvm are relatively easy to set up for
stonith. I'd get that going first, then try to reproduce the problem,
and show the cluster logs from around the time the problem starts.
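
As a rough sketch only (resource name, host map and key path are
illustrative, and assume fence_virtd is already running on the host with
the key distributed to both guests):

crm configure primitive fence-sc stonith:fence_xvm \
        params pcmk_host_map="sc-node-1:sc-node-1;sc-node-2:sc-node-2" \
        key_file=/etc/cluster/fence_xvm.key \
        op monitor interval=60s
crm configure property stonith-enabled=true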

>>
>>> *Issue 2)*
>>> Command used to add upstart job is
>>>
>>> crm configure primitive service upstart:service meta allow-migrate=true
>>> migration-threshold=5 failure-timeout=30s op monitor interval=15s
>>>  timeout=60s
>>>
>>> But still sometimes I see the fail-count going to INFINITY. Why? How can we
>>> avoid it? The resource should have migrated as soon as it reached the
>>> migration threshold.
>>>
>>> * Node sc-node-2:
>>>    service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct
>>>  2 23:38:53 2015'
>>>    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct
>>>  2 23:38:53 2015'
>>>
>>> Failed actions:
>>>     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>>> last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms): unknown error
>>>     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>>> last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms): unknown error
>>
>> migration-threshold is used for monitor failures, not (by default) start
>> or stop failures.
>>
>> This is a start failure, which (by default) makes the fail-count go to
>> infinity. The rationale is that a monitor failure indicates some sort of
>> temporary error, but failing to start could well mean that something is
>> wrong with the installation or configuration.
>>
>> You can tell the cluster to apply migration-threshold to start failures
>> too, by setting the start-failure-is-fatal=false cluster option.
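
For example (run once on either node; illustrative):

crm configure property start-failure-is-fatal=false

With that set, a failed start adds one to the fail-count instead of
jumping straight to INFINITY, so migration-threshold=5 should behave the
way you expected.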




