[ClusterLabs] Current DC becomes None suddenly

Pritam Kharat pritam.kharat at oneconvergence.com
Thu Oct 8 14:56:33 UTC 2015


Hi Ken,

Please see my inline comments below on the last mail.

On Thu, Oct 8, 2015 at 8:25 PM, Pritam Kharat <
pritam.kharat at oneconvergence.com> wrote:

> Hi Ken,
>
> Thanks for reply.
>
> On Thu, Oct 8, 2015 at 8:13 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On 10/02/2015 01:47 PM, Pritam Kharat wrote:
>> > Hi,
>> >
>> > I have set up an ACTIVE/PASSIVE HA cluster.
>> >
>> > *Issue 1) *
>> >
>> > *corosync.conf*  file is
>> >
>> > # Please read the openais.conf.5 manual page
>> >
>> > totem {
>> >
>> >         version: 2
>> >
>> >         # How long before declaring a token lost (ms)
>> >         token: 10000
>> >
>> >         # How many token retransmits before forming a new configuration
>> >         token_retransmits_before_loss_const: 20
>> >
>> >         # How long to wait for join messages in the membership protocol (ms)
>> >         join: 10000
>> >
>> >         # How long to wait for consensus to be achieved before starting
>> >         # a new round of membership configuration (ms)
>> >         consensus: 12000
>> >
>> >         # Turn off the virtual synchrony filter
>> >         vsftype: none
>> >
>> >         # Number of messages that may be sent by one processor on
>> >         # receipt of the token
>> >         max_messages: 20
>> >
>> >         # Limit generated nodeids to 31-bits (positive signed integers)
>> >         clear_node_high_bit: yes
>> >
>> >         # Disable encryption
>> >         secauth: off
>> >
>> >         # How many threads to use for encryption/decryption
>> >         threads: 0
>> >
>> >         # Optionally assign a fixed node id (integer)
>> >         # nodeid: 1234
>> >
>> >         # This specifies the mode of redundant ring, which may be none,
>> >         # active, or passive.
>> >         rrp_mode: none
>> >         interface {
>> >                 # The following values need to be set based on your
>> >                 # environment
>> >                 ringnumber: 0
>> >                 bindnetaddr: 192.168.101.0
>> >                 mcastport: 5405
>> >         }
>> >
>> >         transport: udpu
>> > }
>> >
>> > amf {
>> >         mode: disabled
>> > }
>> >
>> > quorum {
>> >         # Quorum for the Pacemaker Cluster Resource Manager
>> >         provider: corosync_votequorum
>> >         expected_votes: 1
>>
>> If you're using a recent version of corosync, use "two_node: 1" instead
>> of "expected_votes: 1", and get rid of "no-quorum-policy: ignore" in the
>> pacemaker cluster options.
>>
>    -> We are using corosync version 2.3.3. Do we need to make the above
>    mentioned change for this version?
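
   -> For reference, if I follow Ken's suggestion correctly, the quorum
   section on corosync 2.x would look roughly like the sketch below (I
   have not yet verified two_node on our 2.3.3 build), and
   no-quorum-policy: ignore would then be removed from the pacemaker
   properties. As I understand it, two_node: 1 also implies wait_for_all,
   so both nodes have to be seen once at startup.

   quorum {
           # Quorum for the Pacemaker Cluster Resource Manager
           provider: corosync_votequorum
           two_node: 1
   }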
>
>
>
>> > }
>> >
>> >
>> > nodelist {
>> >
>> >         node {
>> >                 ring0_addr: 192.168.101.73
>> >         }
>> >
>> >         node {
>> >                 ring0_addr: 192.168.101.74
>> >         }
>> > }
>> >
>> > aisexec {
>> >         user:   root
>> >         group:  root
>> > }
>> >
>> >
>> > logging {
>> >         fileline: off
>> >         to_stderr: yes
>> >         to_logfile: yes
>> >         to_syslog: yes
>> >         syslog_facility: daemon
>> >         logfile: /var/log/corosync/corosync.log
>> >         debug: off
>> >         timestamp: on
>> >         logger_subsys {
>> >                 subsys: AMF
>> >                 debug: off
>> >                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>> >         }
>> > }
>> >
>> > And I have added 5 resources - 1 is a VIP and 4 are upstart jobs.
>> > Node names are configured as -> sc-node-1 (ACTIVE) and sc-node-2 (PASSIVE).
>> > Resources are running on the ACTIVE node.
>> >
>> > Default cluster properties -
>> >
>> >       <cluster_property_set id="cib-bootstrap-options">
>> >         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>> > value="1.1.10-42f2063"/>
>> >         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>> > name="cluster-infrastructure" value="corosync"/>
>> >         <nvpair name="no-quorum-policy" value="ignore"
>> > id="cib-bootstrap-options-no-quorum-policy"/>
>> >         <nvpair name="stonith-enabled" value="false"
>> > id="cib-bootstrap-options-stonith-enabled"/>
>> >         <nvpair name="cluster-recheck-interval" value="3min"
>> > id="cib-bootstrap-options-cluster-recheck-interval"/>
>> >         <nvpair name="default-action-timeout" value="120s"
>> > id="cib-bootstrap-options-default-action-timeout"/>
>> >       </cluster_property_set>
>> >
>> >
>> > But sometimes, after 2-3 migrations from ACTIVE to STANDBY and then from
>> > STANDBY to ACTIVE, both nodes become OFFLINE and Current DC becomes None.
>> > I have disabled the stonith property and even quorum is ignored.
>>
>> Disabling stonith isn't helping you. The cluster needs stonith to
>> recover from difficult situations, so it's easier to get into weird
>> states like this without it.
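
   -> Understood. If we re-enable fencing, I assume the shape of it would
   be something like the sketch below -- the agent name and parameters are
   placeholders only, since the right fence agent depends on our
   hardware/hypervisor:

   # hypothetical example only; substitute a suitable fence agent and real
   # credentials for this environment
   crm configure primitive fence-sc-node-1 stonith:external/ipmi \
           params hostname=sc-node-1 ipaddr=<ipmi-ip> userid=<user> passwd=<pass> \
           op monitor interval=60s
   crm configure property stonith-enabled=true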
>>
>> > root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
>> > Last updated: Sat Oct  3 00:01:40 2015
>> > Last change: Fri Oct  2 23:38:28 2015 via crm_resource on sc-node-1
>> > Stack: corosync
>> > Current DC: NONE
>> > 2 Nodes configured
>> > 5 Resources configured
>> >
>> > OFFLINE: [ sc-node-1 sc-node-2 ]
>> >
>> > What is going wrong here? What is the reason for Current DC suddenly
>> > becoming None? Is corosync.conf okay? Are the default cluster properties
>> > fine? Help will be appreciated.
>>
>> I'd recommend seeing how the problem behaves with stonith enabled, but
>> in any case you'll need to dive into the logs to figure out what starts the
>> chain of events.
>>
>>
>    -> We are seeing this issue when we try rebooting the VMs.
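
   -> Based on the logging section of corosync.conf above, I take it the
   place to start is /var/log/corosync/corosync.log (plus syslog, since
   to_syslog is on), around the time of the reboot -- e.g. something like:

   grep -iE 'error|crit|fence|membership' /var/log/corosync/corosync.log
   grep -i pacemaker /var/log/syslog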
>
> >
>> > *Issue 2)*
>> > Command used to add upstart job is
>> >
>> > crm configure primitive service upstart:service meta allow-migrate=true
>> > migration-threshold=5 failure-timeout=30s op monitor interval=15s
>> >  timeout=60s
>> >
>> > But still, sometimes I see the fail count going to INFINITY. Why? How can
>> > we avoid it? The resource should have migrated as soon as it reached the
>> > migration threshold.
>> >
>> > * Node sc-node-2:
>> >    service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct  2 23:38:53 2015'
>> >    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct  2 23:38:53 2015'
>> >
>> > Failed actions:
>> >     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>> > last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms): unknown error
>> >     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>> > last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms
>>
>> migration-threshold is used for monitor failures, not (by default) start
>> or stop failures.
>>
>> This is a start failure, which (by default) makes the fail-count go to
>> infinity. The rationale is that a monitor failure indicates some sort of
>> temporary error, but failing to start could well mean that something is
>> wrong with the installation or configuration.
>>
>> You can tell the cluster to apply migration-threshold to start failures
>> too, by setting the start-failure-is-fatal=false cluster option.
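
   -> Thanks, that explains the INFINITY fail-count on the start failures.
   If I read this right, it would be a single property change, e.g.:

   crm configure property start-failure-is-fatal=false

   after which a failed start should count against migration-threshold=5
   the same way a failed monitor does.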
>>
>>
>
>
>
> --
> Thanks and Regards,
> Pritam Kharat.
>



-- 
Thanks and Regards,
Pritam Kharat.