[ClusterLabs] Current DC becomes None suddenly
Pritam Kharat
pritam.kharat at oneconvergence.com
Thu Oct 8 14:55:04 UTC 2015
Hi Ken,
Thanks for the reply.
On Thu, Oct 8, 2015 at 8:13 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 10/02/2015 01:47 PM, Pritam Kharat wrote:
> > Hi,
> >
> > I have set up a ACTIVE/PASSIVE HA
> >
> > *Issue 1) *
> >
> > *corosync.conf* file is
> >
> > # Please read the openais.conf.5 manual page
> >
> > totem {
> >
> > version: 2
> >
> > # How long before declaring a token lost (ms)
> > token: 10000
> >
> > # How many token retransmits before forming a new configuration
> > token_retransmits_before_loss_const: 20
> >
> > # How long to wait for join messages in the membership protocol (ms)
> > join: 10000
> >
> > # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
> > consensus: 12000
> >
> > # Turn off the virtual synchrony filter
> > vsftype: none
> >
> > # Number of messages that may be sent by one processor on receipt of the token
> > max_messages: 20
> >
> > # Limit generated nodeids to 31-bits (positive signed integers)
> > clear_node_high_bit: yes
> >
> > # Disable encryption
> > secauth: off
> >
> > # How many threads to use for encryption/decryption
> > threads: 0
> >
> > # Optionally assign a fixed node id (integer)
> > # nodeid: 1234
> >
> > # This specifies the mode of redundant ring, which may be none, active, or passive.
> > rrp_mode: none
> > interface {
> > # The following values need to be set based on your environment
> > ringnumber: 0
> > bindnetaddr: 192.168.101.0
> > mcastport: 5405
> > }
> >
> > transport: udpu
> > }
> >
> > amf {
> > mode: disabled
> > }
> >
> > quorum {
> > # Quorum for the Pacemaker Cluster Resource Manager
> > provider: corosync_votequorum
> > expected_votes: 1
>
> If you're using a recent version of corosync, use "two_node: 1" instead
> of "expected_votes: 1", and get rid of "no-quorum-policy: ignore" in the
> pacemaker cluster options.
>
-> We are using corosync version 2.3.3. Do we need the above-mentioned change for this version?
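
If I understand correctly, the quorum section would then look something like
this (my sketch, not yet tested on 2.3.3):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

and on the Pacemaker side we would drop the override, e.g. set the policy back
to its default:

    crm configure property no-quorum-policy=stop
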
> > }
> >
> >
> > nodelist {
> >
> > node {
> > ring0_addr: 192.168.101.73
> > }
> >
> > node {
> > ring0_addr: 192.168.101.74
> > }
> > }
> >
> > aisexec {
> > user: root
> > group: root
> > }
> >
> >
> > logging {
> > fileline: off
> > to_stderr: yes
> > to_logfile: yes
> > to_syslog: yes
> > syslog_facility: daemon
> > logfile: /var/log/corosync/corosync.log
> > debug: off
> > timestamp: on
> > logger_subsys {
> > subsys: AMF
> > debug: off
> > tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> > }
> > }
> >
> > And I have added 5 resources - 1 is VIP and 4 are upstart jobs
> > Node names are configured as -> sc-node-1(ACTIVE) and sc-node-2(PASSIVE)
> > Resources are running on ACTIVE node
> >
> > Default cluster properties -
> >
> > <cluster_property_set id="cib-bootstrap-options">
> > <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
> > value="1.1.10-42f2063"/>
> > <nvpair id="cib-bootstrap-options-cluster-infrastructure"
> > name="cluster-infrastructure" value="corosync"/>
> > <nvpair name="no-quorum-policy" value="ignore"
> > id="cib-bootstrap-options-no-quorum-policy"/>
> > <nvpair name="stonith-enabled" value="false"
> > id="cib-bootstrap-options-stonith-enabled"/>
> > <nvpair name="cluster-recheck-interval" value="3min"
> > id="cib-bootstrap-options-cluster-recheck-interval"/>
> > <nvpair name="default-action-timeout" value="120s"
> > id="cib-bootstrap-options-default-action-timeout"/>
> > </cluster_property_set>
> >
> >
> > But sometimes, after 2-3 migrations from ACTIVE to STANDBY and then from
> > STANDBY to ACTIVE, both nodes become OFFLINE and Current DC becomes None.
> > I have disabled the stonith property and even quorum is ignored.
>
> Disabling stonith isn't helping you. The cluster needs stonith to
> recover from difficult situations, so it's easier to get into weird
> states like this without it.
>
> > root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
> > Last updated: Sat Oct 3 00:01:40 2015
> > Last change: Fri Oct 2 23:38:28 2015 via crm_resource on sc-node-1
> > Stack: corosync
> > Current DC: NONE
> > 2 Nodes configured
> > 5 Resources configured
> >
> > OFFLINE: [ sc-node-1 sc-node-2 ]
> >
> > What is going wrong here? What is the reason for Current DC suddenly
> > becoming None? Is corosync.conf okay? Are the default cluster properties
> > fine? Help will be appreciated.
>
> I'd recommend seeing how the problem behaves with stonith enabled, but
> in any case you'll need to dive into the logs to figure what starts the
> chain of events.
>
>
-> We are seeing this issue when we try rebooting the VMs.
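
Since both nodes are VMs, would a libvirt-based fence agent be a reasonable
choice here? A rough sketch of what we could try (assuming the external/libvirt
agent from cluster-glue is installed; the host name and URI below are just
placeholders for our KVM host):

    crm configure primitive st-libvirt stonith:external/libvirt \
        params hostlist="sc-node-1 sc-node-2" \
               hypervisor_uri="qemu+ssh://kvm-host.example.com/system" \
        op monitor interval=60s
    crm configure property stonith-enabled=true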
>
> > *Issue 2)*
> > Command used to add upstart job is
> >
> > crm configure primitive service upstart:service meta allow-migrate=true
> > migration-threshold=5 failure-timeout=30s op monitor interval=15s
> > timeout=60s
> >
> > But still, sometimes I see the fail count going to INFINITY. Why? How can
> > we avoid it? The resource should have migrated as soon as it reached the
> > migration threshold.
> >
> > * Node sc-node-2:
> > service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'
> > service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'
> >
> > Failed actions:
> > service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> >     last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms): unknown error
> > service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> >     last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms
>
> migration-threshold is used for monitor failures, not (by default) start
> or stop failures.
>
> This is a start failure, which (by default) makes the fail-count go to
> infinity. The rationale is that a monitor failure indicates some sort of
> temporary error, but failing to start could well mean that something is
> wrong with the installation or configuration.
>
> You can tell the cluster to apply migration-threshold to start failures
> too, by setting the start-failure-is-fatal=false cluster option.
>
>
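-> Thanks, that explains the fail-count. We will try applying
migration-threshold to start failures as you suggest; my understanding is it
would be something like this with crmsh (a sketch, followed by clearing the
existing fail counts so the resources can be retried):

    crm configure property start-failure-is-fatal=false
    crm resource cleanup service
    crm resource cleanup service1
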
--
Thanks and Regards,
Pritam Kharat.