<div dir="ltr">Could someone please reply to this query?<div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Oct 3, 2015 at 12:17 AM, Pritam Kharat <span dir="ltr"><<a href="mailto:pritam.kharat@oneconvergence.com" target="_blank">pritam.kharat@oneconvergence.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><br></div>Hi,<div><br></div><div>I have set up an ACTIVE/PASSIVE HA cluster.</div><div><br></div><div><b>Issue 1) </b></div><div><b><br></b></div><div>The <b>corosync.conf</b> file is:</div><div><br></div><div><div># Please read the openais.conf.5 manual page</div><div><br></div><div>totem {</div><div><span style="white-space:pre-wrap"> </span> </div><div> version: 2</div><div><br></div><div> # How long before declaring a token lost (ms)</div><div> token: 10000</div><div><br></div><div> # How many token retransmits before forming a new configuration</div><div> token_retransmits_before_loss_const: 20</div><div><br></div><div> # How long to wait for join messages in the membership protocol (ms)</div><div> join: 10000</div><div><br></div><div> # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)</div><div> consensus: 12000</div><div><br></div><div> # Turn off the virtual synchrony filter</div><div> vsftype: none</div><div><br></div><div> # Number of messages that may be sent by one processor on receipt of the token</div><div> max_messages: 20</div><div><br></div><div> # Limit generated nodeids to 31-bits (positive signed integers)</div><div> clear_node_high_bit: yes</div><div><br></div><div> # Disable encryption</div><div> secauth: off</div><div><br></div><div> # How many threads to use for encryption/decryption</div><div> threads: 0</div><div><br></div><div> # Optionally assign a fixed node id (integer)</div><div> # nodeid: 1234</div><div><br></div><div> # This specifies the mode of redundant 
ring, which may be none, active, or passive.</div><div> rrp_mode: none</div><div> interface {</div><div> # The following values need to be set based on your environment </div><div> ringnumber: 0</div><div> bindnetaddr: 192.168.101.0</div><div><span style="white-space:pre-wrap"> </span>mcastport: 5405</div><div> }</div><div> </div><div> transport: udpu</div><div>}</div><div><br></div><div>amf {</div><div> mode: disabled</div><div>}</div><div><br></div><div>quorum {</div><div> # Quorum for the Pacemaker Cluster Resource Manager</div><div> provider: corosync_votequorum</div><div> expected_votes: 1</div><div>}</div><div><br></div><div><br></div><div>nodelist {</div><div><br></div><div> node {</div><div> ring0_addr: 192.168.101.73</div><div> }</div><div><br></div><div> node {</div><div> ring0_addr: 192.168.101.74</div><div> }</div><div>}</div><div><br></div><div>aisexec {</div><div> user: root</div><div> group: root</div><div>}</div><div><br></div><div><br></div><div>logging {</div><div> fileline: off</div><div> to_stderr: yes</div><div> to_logfile: yes</div><div> to_syslog: yes</div><div> syslog_facility: daemon</div><div> logfile: /var/log/corosync/corosync.log</div><div> debug: off</div><div> timestamp: on</div><div> logger_subsys {</div><div> subsys: AMF</div><div> debug: off</div><div> tags: enter|leave|trace1|trace2|trace3|trace4|trace6</div><div> }</div><div>}</div></div><div><br></div><div>And I have added 5 resources - 1 is VIP and 4 are upstart jobs<br></div><div>Node names are configured as -> sc-node-1(ACTIVE) and sc-node-2(PASSIVE)<br></div><div><div>Resources are running on ACTIVE node</div></div><div><br></div><div>Default cluster properties -<br></div><div><br></div><div><div> <cluster_property_set id="cib-bootstrap-options"></div><div> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-42f2063"/></div><div> <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/></div><div> 
<nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/></div><div> <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/></div><div> <nvpair name="cluster-recheck-interval" value="3min" id="cib-bootstrap-options-cluster-recheck-interval"/></div><div> <nvpair name="default-action-timeout" value="120s" id="cib-bootstrap-options-default-action-timeout"/></div><div> </cluster_property_set></div></div><div><br></div><div><br></div><div>But sometimes, after 2-3 migrations from ACTIVE to STANDBY and then from STANDBY to ACTIVE,<br></div><div>both nodes become OFFLINE and Current DC becomes NONE, even though I have disabled STONITH and set no-quorum-policy to ignore.<br></div><div><br></div><div><div>root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status</div><div>Last updated: Sat Oct 3 00:01:40 2015</div><div>Last change: Fri Oct 2 23:38:28 2015 via crm_resource on sc-node-1</div><div>Stack: corosync</div><div>Current DC: NONE</div><div>2 Nodes configured</div><div>5 Resources configured</div><div><br></div><div>OFFLINE: [ sc-node-1 sc-node-2 ]</div></div><div><br></div><div>What is going wrong here? Why does Current DC suddenly become NONE? Is corosync.conf okay? Are the default cluster properties fine? Any help will be appreciated.<br></div><div><br></div><div><br></div><div><b>Issue 2)</b></div><div>The command used to add an upstart job is:</div><div><br></div><div>crm configure primitive service upstart:service meta allow-migrate=true migration-threshold=5 failure-timeout=30s op monitor interval=15s timeout=60s<br></div><div><br></div><div>But I still sometimes see the fail count going to INFINITY. Why? How can we avoid it? 
The resource should have migrated as soon as it reached the migration threshold.</div><div><br></div><div><div>* Node sc-node-2: </div><div> service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'<br></div><div> service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'</div><div><br></div><div>Failed actions:</div><div> service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out, last-rc-change=Fri Oct 2 23:38:53 2015</div><div>, queued=0ms, exec=0ms</div><div>): unknown error</div><div> service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out, last-rc-change=Fri Oct 2 23:38:53 2015</div><div>, queued=0ms, exec=0ms</div></div><span class="HOEnZb"><font color="#888888"><div><br></div><div><br></div><div><br></div><div><br></div><div>-- <br><div>Thanks and Regards,<br>Pritam Kharat.<br></div>
</div></font></span></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Thanks and Regards,<br>Pritam Kharat.<br></div>
</div>