[ClusterLabs] Antw: corosync taking almost 30 secs to detect node failure in case of kernel panic
Ken Gaillot
kgaillot at redhat.com
Wed Jan 10 11:59:11 EST 2018
On Wed, 2018-01-10 at 16:17 +0530, ashutosh tiwari wrote:
> Hi Ulrich,
>
> Thanks for the response.
>
> 30 sec is the time for detection only, as confirmed by the logs.
>
> ++++++++++++++++++++++++++++++++++++
> Jan 10 11:06:18 [19261] orana crmd: info:
> crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped
> (30000ms)
> Jan 10 11:06:18 [19261] orana crmd: notice:
> do_state_transition: State transition S_IDLE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Jan 10 11:06:18 [19261] orana crmd: info:
> do_state_transition: Progressed to state S_POLICY_ENGINE after
> C_TIMER_POPPED
> Jan 10 11:06:18 [19260] orana pengine: info:
> process_pe_message: Input has not changed since last time, not saving
> to disk
> Jan 10 11:06:18 [19260] orana pengine: notice: unpack_config:
> On loss of CCM Quorum: Ignore
> Jan 10 11:06:18 [19260] orana pengine: info:
> determine_online_status_fencing: Node tigana is active
> Jan 10 11:06:18 [19260] orana pengine: info:
> determine_online_status: Node tigana is online
> Jan 10 11:06:18 [19260] orana pengine: info:
> determine_online_status_fencing: Node orana is active
> Jan 10 11:06:18 [19260] orana pengine: info:
> determine_online_status: Node orana is online
> Jan 10 11:06:18 [19260] orana pengine: info: clone_print:
> Master/Slave Set: unicloud-master [unicloud]
> Jan 10 11:06:18 [19260] orana pengine: info: short_print:
> Masters: [ tigana ]
> Jan 10 11:06:18 [19260] orana pengine: info: short_print:
> Slaves: [ orana ]
> Jan 10 11:06:18 [19260] orana pengine: info: native_print:
> fence-uc-orana (stonith:fence_ilo4): Started tigana
> Jan 10 11:06:18 [19260] orana pengine: info: native_print:
> fence-uc-tigana (stonith:fence_ilo4): Started tigana
> Jan 10 11:06:18 [19260] orana pengine: info: master_color:
> Promoting unicloud:0 (Master tigana)
> Jan 10 11:06:18 [19260] orana pengine: info: master_color:
> unicloud-master: Promoted 1 instances of a possible 1 to master
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions:
> Leave unicloud:0 (Master tigana)
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions:
> Leave unicloud:1 (Slave orana)
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions:
> Leave fence-uc-orana (Started tigana)
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions:
> Leave fence-uc-tigana (Started tigana)
> Jan 10 11:06:18 [19260] orana pengine: notice:
> process_pe_message: Calculated Transition 2390:
> /var/lib/pacemaker/pengine/pe-input-1655.bz2
> Jan 10 11:06:18 [19261] orana crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Jan 10 11:06:18 [19261] orana crmd: info: do_te_invoke:
> Processing graph 2390 (ref=pe_calc-dc-1515562578-2650) derived
> from /var/lib/pacemaker/pengine/pe-input-1655.bz2
> Jan 10 11:06:18 [19261] orana crmd: notice: run_graph:
> Transition 2390 (Complete=0, Pending=0, Fired=0, Skipped=0,
> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1655.bz2):
> Complete
> Jan 10 11:06:18 [19261] orana crmd: info: do_log: FSA:
> Input I_TE_SUCCESS from notify_crmd() received in state
> S_TRANSITION_ENGINE
> Jan 10 11:06:18 [19261] orana crmd: notice:
> do_state_transition: State transition S_TRANSITION_ENGINE ->
> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
The above logs are just a routine check pacemaker does periodically,
and are unrelated to the node failure.
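(For reference, the 30000ms "PEngine Recheck Timer" above comes from the
cluster-recheck-interval property. Purely as a sketch, assuming you manage
the cluster with pcs, it could be changed like this:

    # re-evaluate an idle cluster every 5 minutes instead of every 30 seconds
    pcs property set cluster-recheck-interval=5min

Tuning it only changes how often the policy engine re-runs on an otherwise
idle cluster; it does not make corosync notice a dead peer any faster.)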
> Jan 10 11:06:31 corosync [TOTEM ] A processor failed, forming new
> configuration.
This is the first sign of the node failure, so it appears this delay is
entirely within corosync.
The first thing I'd check is that the log timestamps are in sync
between the two nodes, to be sure about the time difference.
Beyond that, it's a question for the corosync folks :)
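A few things that might still be worth checking first. On a CentOS 6 / cman
stack the totem settings normally come from /etc/cluster/cluster.conf rather
than corosync.conf, and with a hard kernel panic the "processor failed"
message should show up roughly one token timeout after the panic (about
10 s with token=10000), so 30 s suggests either a different runtime token
value or skewed clocks. A rough sketch, assuming the standard corosync 1.x
tooling:

    # dump the totem values corosync actually loaded
    corosync-objctl | grep -i totem

    # quick clock-skew check: run on both nodes at (nearly) the same time
    date +%s

    # in cluster.conf the token is typically set like this, e.g.:
    #   <totem token="10000"/>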
> Jan 10 11:06:33 corosync [QUORUM] Members[1]: 1
> Jan 10 11:06:33 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Jan 10 11:06:33 [19250] orana pacemakerd: info:
> cman_event_callback: Membership 2064: quorum retained
> Jan 10 11:06:33 [19261] orana crmd: info:
> cman_event_callback: Membership 2064: quorum retained
> Jan 10 11:06:33 [19250] orana pacemakerd: notice:
> crm_update_peer_state_iter: cman_event_callback: Node tigana[2] -
> state is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: notice:
> crm_update_peer_state_iter: cman_event_callback: Node tigana[2] -
> state is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: info:
> peer_update_callback: tigana is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: warning:
> match_down_event: No match for shutdown action on tigana
> Jan 10 11:06:33 [19261] orana crmd: notice:
> peer_update_callback: Stonith/shutdown of tigana not matched
> Jan 10 11:06:33 [19261] orana crmd: info:
> crm_update_peer_join: peer_update_callback: Node tigana[2] -
> join-2 phase 4 -> 0
> Jan 10 11:06:33 [19261] orana crmd: info:
> abort_transition_graph: Transition aborted: Node failure
> (source=peer_update_callback:240, 1)
> Jan 10 11:06:33 corosync [CPG ] chosen downlist: sender r(0)
> ip(7.7.7.1) ; members(old:2 left:1)
> ++++++++++++++++++++++++++
>
> These are the logs from the standby node (the new active).
> The kernel panic was triggered at 11:06:00 on the other node, and here the
> totem change is reported at 11:06:31.
>
> 30 secs is the cluster recheck timer.
>
> Regards,
> Ashutosh
>
>
> On Wed, Jan 10, 2018 at 3:12 PM, <users-request at clusterlabs.org>
> wrote:
> >
> > Today's Topics:
> >
> > 1. corosync taking almost 30 secs to detect node failure in case
> > of kernel panic (ashutosh tiwari)
> > 2. Antw: corosync taking almost 30 secs to detect node failure
> > in case of kernel panic (Ulrich Windl)
> > 3. pacemaker reports monitor timeout while CPU is high (???)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Wed, 10 Jan 2018 12:43:46 +0530
> > From: ashutosh tiwari <ashutosh.kvas at gmail.com>
> > To: users at clusterlabs.org
> > Subject: [ClusterLabs] corosync taking almost 30 secs to detect node
> >         failure in case of kernel panic
> > Message-ID:
> > <CA+vEgjiKG_VGegT7Q+wCqn6merFNrvegiQs+RHRuxzE=muVb3Q at mail.gmail.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hi,
> >
> > We have a two-node cluster running in active/standby mode with IPMI
> > fencing configured.
> >
> > In case of a kernel panic on the active node, the standby node detects
> > the node failure in around 30 secs, which delays the standby node
> > taking over the active role.
> >
> > We have the totem token timeout set to 10000 msecs.
> > Please let us know if there is any more configuration controlling
> > membership detection.
> >
> > s/w versions.
> >
> > centos 6.7
> > corosync-1.4.7-5.el6.x86_64
> > pacemaker-1.1.14-8.el6.x86_64
> >
> > Thanks and Regards,
> > Ashutosh Tiwari
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Wed, 10 Jan 2018 08:32:16 +0100
> > From: "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de>
> > To: <users at clusterlabs.org>
> > Subject: [ClusterLabs] Antw: corosync taking almost 30 secs to detect
> >         node failure in case of kernel panic
> > Message-ID: <5A55C180020000A100029BD1 at gwsmtp1.uni-regensburg.de>
> > Content-Type: text/plain; charset=US-ASCII
> >
> > Hi!
> >
> > Maybe define "detecting node failure". Could it be your 30 seconds
> > are between detection and reaction? Logs would help here, too.
> >
> > Regards,
> > Ulrich
> >
> >
> > >>> ashutosh tiwari <ashutosh.kvas at gmail.com> wrote on 10.01.2018 at
> > 08:13 in message
> > <CA+vEgjiKG_VGegT7Q+wCqn6merFNrvegiQs+RHRuxzE=muVb3Q at mail.gmail.com>:
> > > Hi,
> > >
> > > We have a two-node cluster running in active/standby mode with IPMI
> > > fencing configured.
> > >
> > > In case of a kernel panic on the active node, the standby node detects
> > > the node failure in around 30 secs, which delays the standby node
> > > taking over the active role.
> > >
> > > We have the totem token timeout set to 10000 msecs.
> > > Please let us know if there is any more configuration controlling
> > > membership detection.
> > >
> > > s/w versions.
> > >
> > > centos 6.7
> > > corosync-1.4.7-5.el6.x86_64
> > > pacemaker-1.1.14-8.el6.x86_64
> > >
> > > Thanks and Regards,
> > > Ashutosh Tiwari
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Wed, 10 Jan 2018 09:40:51 +0000
> > From: ??? <fanguoteng at highgo.com>
> > To: Cluster Labs - All topics related to open-source clustering
> > welcomed <users at clusterlabs.org>
> > Subject: [ClusterLabs] pacemaker reports monitor timeout while CPU is
> >         high
> > Message-ID: <4dc98a5d9be144a78fb9a187217439ed at EX01.highgo.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hello,
> >
> > This issue only appears when we run a performance test and the CPU is
> > high. The cluster configuration and log are below. Pacemaker restarts
> > the slave-side pgsql-ha resource about every two minutes.
> >
> > Take the following scenario for example: when the pgsqlms RA is
> > called, we print the log "execute the command start (command)".
> > When the command returns, we print the log "execute the command
> > stop (Command) (result)".
> >
> > 1. We could see that pacemaker calls "pgsqlms monitor" about
> > every 15 seconds, and it returns $OCF_SUCCESS.
> >
> > 2. It calls the monitor command again at 13:56:16, and then it
> > reports a timeout error at 13:56:18. That is only 2 seconds, but it
> > reports "timeout=10000ms".
> >
> > 3. In other logs, sometimes after 15 minutes, there is no
> > "execute the command start monitor" printed and it reports the timeout
> > error directly.
> >
> > Could you please tell us how to debug or resolve such an issue?
> >
> > The log:
> >
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the
> > command start monitor
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role
> > start
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role
> > stop 0
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the
> > command stop monitor 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the
> > command start monitor
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role
> > start
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role
> > stop 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the
> > command stop monitor 0
> > Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected:
> > 426.779999
> > Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the
> > command start monitor
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000
> > process (PID 5606) timed out
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning:
> > pgsqld_monitor_16000:5606 - timed out after 10000ms
> > Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor
> > operation for pgsqld on db2: Timed Out | call=102
> > key=pgsqld_monitor_16000 timeout=10000ms
> > Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-
> > pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> > Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE
> > -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL
> > origin=abort_transition_graph
> > Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op
> > monitor for pgsqld:0 on db2: unknown error (1)
> > Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op
> > start for pgsqld:1 on db1: unknown error (1)
> > Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away
> > from db1 after 1000000 failures (max=1000000)
> > Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away
> > from db1 after 1000000 failures (max=1000000)
> > Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover
> > pgsqld:0#011(Slave db2)
> > Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition
> > 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
> >
> >
> > The Cluster Configuration:
> > 2 nodes and 13 resources configured
> >
> > Online: [ db1 db2 ]
> >
> > Full list of resources:
> >
> > Clone Set: dlm-clone [dlm]
> > Started: [ db1 db2 ]
> > Clone Set: clvmd-clone [clvmd]
> > Started: [ db1 db2 ]
> > ipmi_node1 (stonith:fence_ipmilan): Started db2
> > ipmi_node2 (stonith:fence_ipmilan): Started db1
> > Clone Set: clusterfs-clone [clusterfs]
> > Started: [ db1 db2 ]
> > Master/Slave Set: pgsql-ha [pgsqld]
> >     Masters: [ db1 ]
> >     Slaves: [ db2 ]
> > Resource Group: mastergroup
> > db1-vip (ocf::heartbeat:IPaddr2): Started
> > rep-vip (ocf::heartbeat:IPaddr2): Started
> > Resource Group: slavegroup
> > db2-vip (ocf::heartbeat:IPaddr2): Started
> >
> >
> > pcs resource show pgsql-ha
> >  Master: pgsql-ha
> >   Meta Attrs: interleave=true notify=true
> >   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
> >    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
> >    Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
> >                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
> >                promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
> >                demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
> >                monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
> >                monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
> >                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
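For what it's worth, the 10000ms in the timeout message above is simply the
timeout=10s configured on the monitor operations in that listing, which is
fairly tight on a node under the kind of load crmd is warning about. Purely
as a sketch, and assuming pcs is in use, the Slave monitor could be given
more headroom with something like:

    # hypothetical values; pick a timeout the RA can actually need under load
    pcs resource op remove pgsqld monitor interval=16s
    pcs resource op add pgsqld monitor interval=16s role=Slave timeout=60s

That does not explain why the operation was declared timed out only two
seconds after it was started, but it removes the tight 10 s limit as a
variable while that is investigated.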
> >
> > ------------------------------
> >
> >
>
--
Ken Gaillot <kgaillot at redhat.com>