[ClusterLabs] Current DC becomes None suddenly

Pritam Kharat pritam.kharat at oneconvergence.com
Thu Oct 8 14:23:03 UTC 2015


Could someone please reply to this query?


On Sat, Oct 3, 2015 at 12:17 AM, Pritam Kharat <
pritam.kharat at oneconvergence.com> wrote:

>
> Hi,
>
> I have set up an ACTIVE/PASSIVE HA cluster.
>
> *Issue 1)*
>
> My *corosync.conf* file is:
>
> # Please read the openais.conf.5 manual page
>
> totem {
>
>         version: 2
>
>         # How long before declaring a token lost (ms)
>         token: 10000
>
>         # How many token retransmits before forming a new configuration
>         token_retransmits_before_loss_const: 20
>
>         # How long to wait for join messages in the membership protocol (ms)
>         join: 10000
>
>         # How long to wait for consensus to be achieved before starting a
>         # new round of membership configuration (ms)
>         consensus: 12000
>
>         # Turn off the virtual synchrony filter
>         vsftype: none
>
>         # Number of messages that may be sent by one processor on receipt
>         # of the token
>         max_messages: 20
>
>         # Limit generated nodeids to 31-bits (positive signed integers)
>         clear_node_high_bit: yes
>
>         # Disable encryption
>         secauth: off
>
>         # How many threads to use for encryption/decryption
>         threads: 0
>
>         # Optionally assign a fixed node id (integer)
>         # nodeid: 1234
>
>         # This specifies the mode of redundant ring, which may be none,
>         # active, or passive.
>         rrp_mode: none
>         interface {
>                 # The following values need to be set based on your environment
>                 ringnumber: 0
>                 bindnetaddr: 192.168.101.0
>                 mcastport: 5405
>         }
>
>         transport: udpu
> }
>
> amf {
>         mode: disabled
> }
>
> quorum {
>         # Quorum for the Pacemaker Cluster Resource Manager
>         provider: corosync_votequorum
>         expected_votes: 1
> }
>
>
> nodelist {
>
>         node {
>                 ring0_addr: 192.168.101.73
>         }
>
>         node {
>                 ring0_addr: 192.168.101.74
>         }
> }
>
> aisexec {
>         user:   root
>         group:  root
> }
>
>
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: yes
>         to_syslog: yes
>         syslog_facility: daemon
>         logfile: /var/log/corosync/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
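>
> A side question on the quorum section: for a two-node cluster, would it be
> more correct to use the votequorum two_node option instead of hard-coding
> expected_votes? A minimal sketch of what I had in mind (assuming corosync
> 2.x votequorum; not what I am running now):
>
> quorum {
>         provider: corosync_votequorum
>         # two_node implicitly sets expected_votes to 2 and enables
>         # wait_for_all, so the surviving node keeps quorum
>         two_node: 1
> }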
>
> I have added 5 resources: 1 VIP and 4 upstart jobs.
> The node names are configured as sc-node-1 (ACTIVE) and sc-node-2 (PASSIVE).
> Resources are running on the ACTIVE node.
>
> Default cluster properties:
>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
> value="1.1.10-42f2063"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
> name="cluster-infrastructure" value="corosync"/>
>         <nvpair name="no-quorum-policy" value="ignore"
> id="cib-bootstrap-options-no-quorum-policy"/>
>         <nvpair name="stonith-enabled" value="false"
> id="cib-bootstrap-options-stonith-enabled"/>
>         <nvpair name="cluster-recheck-interval" value="3min"
> id="cib-bootstrap-options-cluster-recheck-interval"/>
>         <nvpair name="default-action-timeout" value="120s"
> id="cib-bootstrap-options-default-action-timeout"/>
>       </cluster_property_set>
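>
> These correspond to crm shell commands roughly like the following (a sketch
> with the same values as above):
>
> crm configure property no-quorum-policy=ignore
> crm configure property stonith-enabled=false
> crm configure property cluster-recheck-interval=3min
> crm configure property default-action-timeout=120s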
>
>
> But sometimes, after 2-3 migrations from ACTIVE to STANDBY and back to
> ACTIVE, both nodes become OFFLINE and the Current DC becomes NONE. I have
> disabled stonith (stonith-enabled=false) and quorum is ignored
> (no-quorum-policy=ignore).
>
> root at sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
> Last updated: Sat Oct  3 00:01:40 2015
> Last change: Fri Oct  2 23:38:28 2015 via crm_resource on sc-node-1
> Stack: corosync
> Current DC: NONE
> 2 Nodes configured
> 5 Resources configured
>
> OFFLINE: [ sc-node-1 sc-node-2 ]
>
> What is going wrong here? Why does the Current DC suddenly become NONE?
> Is corosync.conf okay? Are the default cluster properties fine?
> Help will be appreciated.
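>
> For what it is worth, the next time the DC goes to NONE I plan to capture
> membership and quorum state directly, with checks along these lines (a
> sketch; standard corosync/pacemaker tools):
>
> # does corosync itself still see both members?
> corosync-cmapctl | grep members
> # quorum state as seen by votequorum
> corosync-quorumtool -s
> # one-shot snapshot of pacemaker's view of the cluster
> crm_mon -1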
>
>
> *Issue 2)*
> The command used to add an upstart job is:
>
> crm configure primitive service upstart:service \
>         meta allow-migrate=true migration-threshold=5 failure-timeout=30s \
>         op monitor interval=15s timeout=60s
>
> But still, sometimes I see the fail count going to INFINITY. Why? How can we
> avoid it? The resource should have migrated as soon as it reached the
> migration threshold.
>
> * Node sc-node-2:
>    service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct
>  2 23:38:53 2015'
>    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri
> Oct  2 23:38:53 2015'
>
> Failed actions:
>     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> last-rc-change=Fri Oct  2 23:38:53 2015
> , queued=0ms, exec=0ms
> ): unknown error
>     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> last-rc-change=Fri Oct  2 23:38:53 2015
> , queued=0ms, exec=0ms
>
>
>
>
> --
> Thanks and Regards,
> Pritam Kharat.
>



-- 
Thanks and Regards,
Pritam Kharat.