<div dir="ltr">Hi Ulrich,<div><br></div><div>Thanks for the response.</div><div><br></div><div> 30 sec is the time for detection only as confirmed by logs.</div><div><br></div><div>++++++++++++++++++++++++++++++++++++</div><div><div>Jan 10 11:06:18 [19261] orana crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)</div><div>Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]</div><div>Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED</div><div>Jan 10 11:06:18 [19260] orana pengine: info: process_pe_message: Input has not changed since last time, not saving to disk</div><div>Jan 10 11:06:18 [19260] orana pengine: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node tigana is active</div><div>Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node tigana is online</div><div>Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node orana is active</div><div>Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node orana is online</div><div>Jan 10 11:06:18 [19260] orana pengine: info: clone_print: Master/Slave Set: unicloud-master [unicloud]</div><div>Jan 10 11:06:18 [19260] orana pengine: info: short_print: Masters: [ tigana ]</div><div>Jan 10 11:06:18 [19260] orana pengine: info: short_print: Slaves: [ orana ]</div><div>Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-orana (stonith:fence_ilo4): Started tigana</div><div>Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-tigana (stonith:fence_ilo4): Started tigana</div><div>Jan 10 11:06:18 [19260] orana pengine: info: master_color: Promoting unicloud:0 (Master tigana)</div><div>Jan 10 11:06:18 [19260] orana pengine: info: master_color: unicloud-master: Promoted 1 instances of a possible 1 to master</div><div>Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:0 (Master tigana)</div><div>Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:1 (Slave orana)</div><div>Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-orana (Started tigana)</div><div>Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-tigana (Started tigana)</div><div>Jan 10 11:06:18 [19260] orana pengine: notice: process_pe_message: Calculated Transition 2390: /var/lib/pacemaker/pengine/pe-input-1655.bz2</div><div>Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]</div><div>Jan 10 11:06:18 [19261] orana crmd: info: do_te_invoke: Processing graph 2390 (ref=pe_calc-dc-1515562578-2650) derived from /var/lib/pacemaker/pengine/pe-input-1655.bz2</div><div>Jan 10 11:06:18 [19261] orana crmd: notice: run_graph: Transition 2390 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1655.bz2): Complete</div><div>Jan 10 11:06:18 [19261] orana crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE</div><div>Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]</div><div>Jan 10 11:06:31 corosync [TOTEM ] A processor failed, forming new configuration.</div><div>Jan 10 11:06:33 corosync [QUORUM] Members[1]: 1</div><div>Jan 10 11:06:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.</div><div>Jan 10 11:06:33 [19250] orana pacemakerd: info: cman_event_callback: Membership 2064: quorum retained</div><div>Jan 10 11:06:33 [19261] orana crmd: info: cman_event_callback: Membership 2064: quorum retained</div><div>Jan 10 11:06:33 [19250] orana pacemakerd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)</div><div>Jan 10 11:06:33 [19261] orana crmd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)</div><div>Jan 10 11:06:33 [19261] orana crmd: info: peer_update_callback: tigana is now lost (was member)</div><div>Jan 10 11:06:33 [19261] orana crmd: warning: match_down_event: No match for shutdown action on tigana</div><div>Jan 10 11:06:33 [19261] orana crmd: notice: peer_update_callback: Stonith/shutdown of tigana not matched</div><div>Jan 10 11:06:33 [19261] orana crmd: info: crm_update_peer_join: peer_update_callback: Node tigana[2] - join-2 phase 4 -> 0</div><div>Jan 10 11:06:33 [19261] orana crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)</div><div>Jan 10 11:06:33 corosync [CPG ] chosen downlist: sender r(0) ip(7.7.7.1) ; members(old:2 left:1)</div></div><div>++++++++++++++++++++++++++</div><div><br></div><div>this is the logs from standby node(new active).</div><div>kernel panic was triggered at 11:06:00 at the other node and here totem change is reported at 11:06:31.</div><div><br></div><div>30 secs is the cluster recheck timer.</div><div><br></div><div>Regards,</div><div>Ashutosh </div><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 10, 2018 at 3:12 PM, <span dir="ltr"><<a href="mailto:users-request@clusterlabs.org" target="_blank">users-request@clusterlabs.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Send Users mailing list submissions to<br>
<a href="mailto:users@clusterlabs.org">users@clusterlabs.org</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="http://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.clusterlabs.org/<wbr>mailman/listinfo/users</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:users-request@clusterlabs.org">users-request@clusterlabs.org</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:users-owner@clusterlabs.org">users-owner@clusterlabs.org</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of Users digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
1. corosync taking almost 30 secs to detect node failure in case<br>
of kernel panic (ashutosh tiwari)<br>
2. Antw: corosync taking almost 30 secs to detect node failure<br>
in case of kernel panic (Ulrich Windl)<br>
3. pacemaker reports monitor timeout while CPU is high (???)<br>
<br>
<br>
------------------------------<wbr>------------------------------<wbr>----------<br>
<br>
Message: 1<br>
Date: Wed, 10 Jan 2018 12:43:46 +0530<br>
From: ashutosh tiwari <<a href="mailto:ashutosh.kvas@gmail.com">ashutosh.kvas@gmail.com</a>><br>
To: <a href="mailto:users@clusterlabs.org">users@clusterlabs.org</a><br>
Subject: [ClusterLabs] corosync taking almost 30 secs to detect node<br>
failure in case of kernel panic<br>
Message-ID:<br>
<CA+vEgjiKG_VGegT7Q+<wbr>wCqn6merFNrvegiQs+RHRuxzE=<a href="mailto:muVb3Q@mail.gmail.com">muVb<wbr>3Q@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Hi,<br>
<br>
We have two node cluster running in active/standby mode and having IPMI<br>
fencing configured.<br>
<br>
In case of kernel panic at Active node, standby node is detecting node<br>
failure in around 30 secs which leads to delay in standby node taking the<br>
active role.<br>
<br>
we have totem token timeout as 10000 msecs.<br>
Please let us know in case there is any more configuration controlling<br>
membership detection.<br>
<br>
s/w versions.<br>
<br>
centos 6.7<br>
corosync-1.4.7-5.el6.x86_64<br>
pacemaker-1.1.14-8.el6.x86_64<br>
<br>
Thanks and Regards,<br>
Ashutosh Tiwari<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="http://lists.clusterlabs.org/pipermail/users/attachments/20180110/235f148d/attachment-0001.html" rel="noreferrer" target="_blank">http://lists.clusterlabs.org/<wbr>pipermail/users/attachments/<wbr>20180110/235f148d/attachment-<wbr>0001.html</a>><br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Wed, 10 Jan 2018 08:32:16 +0100<br>
From: "Ulrich Windl" <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de">Ulrich.Windl@rz.uni-<wbr>regensburg.de</a>><br>
To: <<a href="mailto:users@clusterlabs.org">users@clusterlabs.org</a>><br>
Subject: [ClusterLabs] Antw: corosync taking almost 30 secs to detect<br>
node failure in case of kernel panic<br>
Message-ID: <<a href="mailto:5A55C180020000A100029BD1@gwsmtp1.uni-regensburg.de">5A55C180020000A100029BD1@<wbr>gwsmtp1.uni-regensburg.de</a>><br>
Content-Type: text/plain; charset=US-ASCII<br>
<br>
Hi!<br>
<br>
Maybe define "detecting node failure". Culkd it be your 30 seconds are between detection and reaction? Logs would help here, too.<br>
<br>
Regards,<br>
Ulrich<br>
<br>
<br>
>>> ashutosh tiwari <<a href="mailto:ashutosh.kvas@gmail.com">ashutosh.kvas@gmail.com</a>> schrieb am 10.01.2018 um 08:13 in<br>
Nachricht<br>
<CA+vEgjiKG_VGegT7Q+<wbr>wCqn6merFNrvegiQs+RHRuxzE=<a href="mailto:muVb3Q@mail.gmail.com">muVb<wbr>3Q@mail.gmail.com</a>>:<br>
> Hi,<br>
><br>
> We have two node cluster running in active/standby mode and having IPMI<br>
> fencing configured.<br>
><br>
> In case of kernel panic at Active node, standby node is detecting node<br>
> failure in around 30 secs which leads to delay in standby node taking the<br>
> active role.<br>
><br>
> we have totem token timeout as 10000 msecs.<br>
> Please let us know in case there is any more configuration controlling<br>
> membership detection.<br>
><br>
> s/w versions.<br>
><br>
> centos 6.7<br>
> corosync-1.4.7-5.el6.x86_64<br>
> pacemaker-1.1.14-8.el6.x86_64<br>
><br>
> Thanks and Regards,<br>
> Ashutosh Tiwari<br>
<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 3<br>
Date: Wed, 10 Jan 2018 09:40:51 +0000<br>
From: ??? <<a href="mailto:fanguoteng@highgo.com">fanguoteng@highgo.com</a>><br>
To: Cluster Labs - All topics related to open-source clustering<br>
welcomed <<a href="mailto:users@clusterlabs.org">users@clusterlabs.org</a>><br>
Subject: [ClusterLabs] pacemaker reports monitor timeout while CPU is<br>
high<br>
Message-ID: <<a href="mailto:4dc98a5d9be144a78fb9a187217439ed@EX01.highgo.com">4dc98a5d9be144a78fb9a18721743<wbr>9ed@EX01.highgo.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Hello,<br>
<br>
This issue only appears when we run performance test and the CPU is high. The cluster and log is as below. The Pacemaker will restart the Slave Side pgsql-ha resource about every two minutes.<br>
<br>
Take the following scenario for example:?when the pgsqlms RA is called, we print the log ?execute the command start (command)?. When the command is returned, we print the log ?execute the command stop (Command) (result)??<br>
<br>
1. We could see that pacemaker call ?pgsqlms monitor? about every 15 seconds. And it return $OCF_SUCCESS<br>
<br>
2. In calls monitor command again at 13:56:16, and then it reports timeout error error 13:56:18. It is only 2 seconds but it reports ?timeout=10000ms?<br>
<br>
3. In other logs, sometimes after 15 minutes, there is no ?execute the command start monitor? printed and it reports timeout error directly.<br>
<br>
Could you please tell how to debug or resolve such issue?<br>
<br>
The log:<br>
<br>
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor<br>
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start<br>
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0<br>
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0<br>
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor<br>
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start<br>
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0<br>
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0<br>
Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.779999<br>
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor<br>
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out<br>
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms<br>
Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms<br>
Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]<br>
Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph<br>
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)<br>
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)<br>
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)<br>
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)<br>
Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)<br>
Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-<wbr>input-1251.bz2<br>
<br>
<br>
The Cluster Configuration:<br>
2 nodes and 13 resources configured<br>
<br>
Online: [ db1 db2 ]<br>
<br>
Full list of resources:<br>
<br>
Clone Set: dlm-clone [dlm]<br>
Started: [ db1 db2 ]<br>
Clone Set: clvmd-clone [clvmd]<br>
Started: [ db1 db2 ]<br>
ipmi_node1 (stonith:fence_ipmilan): Started db2<br>
ipmi_node2 (stonith:fence_ipmilan): Started db1<br>
Clone Set: clusterfs-clone [clusterfs]<br>
Started: [ db1 db2 ]<br>
Master/Slave Set: pgsql-ha [pgsqld]><br>
<br>
Masters: [ db1 ]<br>
<br>
Slaves: [ db2 ]<br>
Resource Group: mastergroup<br>
db1-vip (ocf::heartbeat:IPaddr2): Started<br>
rep-vip (ocf::heartbeat:IPaddr2): Started<br>
Resource Group: slavegroup<br>
db2-vip (ocf::heartbeat:IPaddr2): Started<br>
<br>
<br>
pcs resource show pgsql-ha<br>
Master: pgsql-ha<br>
Meta Attrs: interleave=true notify=true<br>
Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)<br>
Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data<br>
Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)<br>
stop interval=0s timeout=60s (pgsqld-stop-interval-0s)<br>
promote interval=0s timeout=130s (pgsqld-promote-interval-0s)<br>
demote interval=0s timeout=120s (pgsqld-demote-interval-0s)<br>
monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)<br>
monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)<br>
notify interval=0s timeout=60s (pgsqld-notify-interval-0s)<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="http://lists.clusterlabs.org/pipermail/users/attachments/20180110/88e7c872/attachment.html" rel="noreferrer" target="_blank">http://lists.clusterlabs.org/<wbr>pipermail/users/attachments/<wbr>20180110/88e7c872/attachment.<wbr>html</a>><br>
<br>
------------------------------<br>
<br>
______________________________<wbr>_________________<br>
Users mailing list<br>
<a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>
<a href="http://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.clusterlabs.org/<wbr>mailman/listinfo/users</a><br>
<br>
<br>
End of Users Digest, Vol 36, Issue 8<br>
******************************<wbr>******<br>
</blockquote></div><br></div></div></div>