[ClusterLabs] Antw: corosync taking almost 30 secs to detect node failure in case of kernel panic
ashutosh tiwari
ashutosh.kvas at gmail.com
Wed Jan 10 05:47:17 EST 2018
Hi Ulrich,
Thanks for the response.
The 30 seconds is the detection time itself, as confirmed by the logs below.
++++++++++++++++++++++++++++++++++++
Jan 10 11:06:18 [19261] orana crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jan 10 11:06:18 [19260] orana pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Jan 10 11:06:18 [19260] orana pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node tigana is active
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node tigana is online
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node orana is active
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node orana is online
Jan 10 11:06:18 [19260] orana pengine: info: clone_print: Master/Slave Set: unicloud-master [unicloud]
Jan 10 11:06:18 [19260] orana pengine: info: short_print: Masters: [ tigana ]
Jan 10 11:06:18 [19260] orana pengine: info: short_print: Slaves: [ orana ]
Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-orana (stonith:fence_ilo4): Started tigana
Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-tigana (stonith:fence_ilo4): Started tigana
Jan 10 11:06:18 [19260] orana pengine: info: master_color: Promoting unicloud:0 (Master tigana)
Jan 10 11:06:18 [19260] orana pengine: info: master_color: unicloud-master: Promoted 1 instances of a possible 1 to master
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:0 (Master tigana)
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:1 (Slave orana)
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-orana (Started tigana)
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-tigana (Started tigana)
Jan 10 11:06:18 [19260] orana pengine: notice: process_pe_message: Calculated Transition 2390: /var/lib/pacemaker/pengine/pe-input-1655.bz2
Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jan 10 11:06:18 [19261] orana crmd: info: do_te_invoke: Processing graph 2390 (ref=pe_calc-dc-1515562578-2650) derived from /var/lib/pacemaker/pengine/pe-input-1655.bz2
Jan 10 11:06:18 [19261] orana crmd: notice: run_graph: Transition 2390 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1655.bz2): Complete
Jan 10 11:06:18 [19261] orana crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 10 11:06:31 corosync [TOTEM ] A processor failed, forming new configuration.
Jan 10 11:06:33 corosync [QUORUM] Members[1]: 1
Jan 10 11:06:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 10 11:06:33 [19250] orana pacemakerd: info: cman_event_callback: Membership 2064: quorum retained
Jan 10 11:06:33 [19261] orana crmd: info: cman_event_callback: Membership 2064: quorum retained
Jan 10 11:06:33 [19250] orana pacemakerd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
Jan 10 11:06:33 [19261] orana crmd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
Jan 10 11:06:33 [19261] orana crmd: info: peer_update_callback: tigana is now lost (was member)
Jan 10 11:06:33 [19261] orana crmd: warning: match_down_event: No match for shutdown action on tigana
Jan 10 11:06:33 [19261] orana crmd: notice: peer_update_callback: Stonith/shutdown of tigana not matched
Jan 10 11:06:33 [19261] orana crmd: info: crm_update_peer_join: peer_update_callback: Node tigana[2] - join-2 phase 4 -> 0
Jan 10 11:06:33 [19261] orana crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
Jan 10 11:06:33 corosync [CPG ] chosen downlist: sender r(0) ip(7.7.7.1) ; members(old:2 left:1)
++++++++++++++++++++++++++
These are the logs from the standby node (the new active node).
The kernel panic was triggered on the other node at 11:06:00, yet the totem membership change is reported here only at 11:06:31.
The 30 seconds (30000 ms) seen in the first log line is the cluster recheck timer.
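
For reference, the corosync-side settings that bound the failure-detection window live in the totem section of corosync.conf. Below is a minimal sketch of that section for corosync 1.x; only the token value reflects the 10000 msecs we mentioned, the other two lines are illustrative assumptions rather than values taken from our file:

totem {
    version: 2
    # time (in ms) to wait for the token before declaring a processor failed
    token: 10000
    # number of token retransmits attempted before the token is declared lost
    token_retransmits_before_loss_const: 10
    # time (in ms) to wait for consensus before starting a new membership round;
    # defaults to 1.2 * token when not set explicitly
    consensus: 12000
}

The 30000 ms value in the first log line, by contrast, comes from Pacemaker's cluster-recheck-interval property, which can be inspected with, for example:

crm_attribute --type crm_config --name cluster-recheck-interval --query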
Regards,
Ashutosh
On Wed, Jan 10, 2018 at 3:12 PM, <users-request at clusterlabs.org> wrote:
>
> Today's Topics:
>
> 1. corosync taking almost 30 secs to detect node failure in case
> of kernel panic (ashutosh tiwari)
> 2. Antw: corosync taking almost 30 secs to detect node failure
> in case of kernel panic (Ulrich Windl)
> 3. pacemaker reports monitor timeout while CPU is high (???)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 10 Jan 2018 12:43:46 +0530
> From: ashutosh tiwari <ashutosh.kvas at gmail.com>
> To: users at clusterlabs.org
> Subject: [ClusterLabs] corosync taking almost 30 secs to detect node
> failure in case of kernel panic
> Message-ID:
> <CA+vEgjiKG_VGegT7Q+wCqn6merFNrvegiQs+RHRuxzE=muVb
> 3Q at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi,
>
> We have two node cluster running in active/standby mode and having IPMI
> fencing configured.
>
> In case of kernel panic at Active node, standby node is detecting node
> failure in around 30 secs which leads to delay in standby node taking the
> active role.
>
> we have totem token timeout as 10000 msecs.
> Please let us know in case there is any more configuration controlling
> membership detection.
>
> s/w versions.
>
> centos 6.7
> corosync-1.4.7-5.el6.x86_64
> pacemaker-1.1.14-8.el6.x86_64
>
> Thanks and Regards,
> Ashutosh Tiwari
>
> ------------------------------
>
> Message: 2
> Date: Wed, 10 Jan 2018 08:32:16 +0100
> From: "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de>
> To: <users at clusterlabs.org>
> Subject: [ClusterLabs] Antw: corosync taking almost 30 secs to detect
> node failure in case of kernel panic
> Message-ID: <5A55C180020000A100029BD1 at gwsmtp1.uni-regensburg.de>
> Content-Type: text/plain; charset=US-ASCII
>
> Hi!
>
> Maybe define "detecting node failure". Could it be your 30 seconds are
> between detection and reaction? Logs would help here, too.
>
> Regards,
> Ulrich
>
> ------------------------------
>
> Message: 3
> Date: Wed, 10 Jan 2018 09:40:51 +0000
> From: ??? <fanguoteng at highgo.com>
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> Subject: [ClusterLabs] pacemaker reports monitor timeout while CPU is
> high
> Message-ID: <4dc98a5d9be144a78fb9a187217439ed at EX01.highgo.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello,
>
> This issue only appears when we run a performance test and the CPU is high.
> The cluster configuration and log are below. Pacemaker restarts the Slave side
> pgsql-ha resource about every two minutes.
>
> Take the following scenario for example: when the pgsqlms RA is called, we
> print the log "execute the command start (command)". When the command is
> returned, we print the log "execute the command stop (Command) (result)".
>
> 1. We could see that pacemaker calls "pgsqlms monitor" about every 15
> seconds, and it returns $OCF_SUCCESS.
>
> 2. It calls the monitor command again at 13:56:16, and then it reports a
> timeout error at 13:56:18. That is only 2 seconds, yet it reports
> "timeout=10000ms".
>
> 3. In other logs, sometimes after 15 minutes, there is no "execute the
> command start monitor" printed and it reports a timeout error directly.
>
> Could you please tell us how to debug or resolve this issue?
>
> The log:
>
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command
> start monitor
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop
> monitor 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command
> start monitor
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop
> monitor 0
> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected:
> 426.779999
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command
> start monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process
> (PID 5606) timed out
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 -
> timed out after 10000ms
> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for
> pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [
> /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE ->
> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor
> for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start
> for pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from
> db1 after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from
> db1 after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave
> db2)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37,
> saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>
>
> The Cluster Configuration:
> 2 nodes and 13 resources configured
>
> Online: [ db1 db2 ]
>
> Full list of resources:
>
> Clone Set: dlm-clone [dlm]
> Started: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
> Started: [ db1 db2 ]
> ipmi_node1 (stonith:fence_ipmilan): Started db2
> ipmi_node2 (stonith:fence_ipmilan): Started db1
> Clone Set: clusterfs-clone [clusterfs]
> Started: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]
>
> Masters: [ db1 ]
>
> Slaves: [ db2 ]
> Resource Group: mastergroup
> db1-vip (ocf::heartbeat:IPaddr2): Started
> rep-vip (ocf::heartbeat:IPaddr2): Started
> Resource Group: slavegroup
> db2-vip (ocf::heartbeat:IPaddr2): Started
>
>
> pcs resource show pgsql-ha
> Master: pgsql-ha
> Meta Attrs: interleave=true notify=true
> Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
> Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
> Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
> stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
> promote interval=0s timeout=130s
> (pgsqld-promote-interval-0s)
> demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
> monitor interval=15s role=Master timeout=10s
> (pgsqld-monitor-interval-15s)
> monitor interval=16s role=Slave timeout=10s
> (pgsqld-monitor-interval-16s)
> notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
>
> ------------------------------
>