[ClusterLabs] Cluster Stopped, No Messages?

Eric Robinson eric.robinson at psmnv.com
Fri May 28 15:08:40 EDT 2021


> -----Original Message-----
> From: Digimer <lists at alteeve.ca>
> Sent: Friday, May 28, 2021 12:43 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>; Eric Robinson <eric.robinson at psmnv.com>; Strahil
> Nikolov <hunter86_bg at yahoo.com>
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> Shared storage is not what triggers the need for fencing. Coordinating actions
> is what triggers the need. Specifically: if you can run a resource on both/all
> nodes at the same time, you don't need HA. If you can't, you need fencing.
>
> Digimer

Thanks. That said, there is no fencing, so any thoughts on why the node behaved the way it did?
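
For reference, a minimal sketch of what adding fencing could look like with pcs, assuming IPMI-capable hardware; the device names, BMC addresses, and credentials below are placeholders, not values from this cluster:

    # one fence device per node, pointed at that node's BMC (placeholder address/credentials)
    pcs stonith create fence_001store01a fence_ipmilan \
        pcmk_host_list="001store01a" ipaddr="192.0.2.10" login="admin" passwd="changeme" \
        op monitor interval=60s
    pcs stonith create fence_001store01b fence_ipmilan \
        pcmk_host_list="001store01b" ipaddr="192.0.2.11" login="admin" passwd="changeme" \
        op monitor interval=60s
    # only after a working device exists for every node
    pcs property set stonith-enabled=true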

>
> On 2021-05-28 1:19 p.m., Eric Robinson wrote:
> > There is no fencing agent on this cluster and no shared storage.
> >
> > -Eric
> >
> > *From:* Strahil Nikolov <hunter86_bg at yahoo.com>
> > *Sent:* Friday, May 28, 2021 10:08 AM
> > *To:* Cluster Labs - All topics related to open-source clustering
> > welcomed <users at clusterlabs.org>; Eric Robinson
> > <eric.robinson at psmnv.com>
> > *Subject:* Re: [ClusterLabs] Cluster Stopped, No Messages?
> >
> > What is your fencing agent?
> >
> > Best Regards,
> >
> > Strahil Nikolov
> >
> >     On Thu, May 27, 2021 at 20:52, Eric Robinson
> >     <eric.robinson at psmnv.com> wrote:
> >
> >     We found one of our cluster nodes down this morning. The server was
> >     up but cluster services were not running. Upon examination of the
> >     logs, we found that the cluster just stopped around 9:40:31 and then
> >     I started it up manually (pcs cluster start) at 11:49:48. I can’t
> >     imagine that Pacemaker just randomly terminates. Any thoughts why it
> >     would behave this way?
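
A sketch of where to look next, assuming a systemd-based EL7 node (log paths and unit names may differ on other distros): the useful evidence is whatever happened on the host between 09:40:31 and 11:49:48, outside Pacemaker's own log.

    # did systemd stop the units, did a daemon crash, or did the OOM killer fire?
    journalctl -u pacemaker -u corosync --since "2021-05-27 09:30" --until "2021-05-27 12:00"
    grep -iE "oom|out of memory|killed process|segfault" /var/log/messages
    # kernel messages for the same window (OOM kills show up here too)
    journalctl -k --since "2021-05-27 09:30" --until "2021-05-27 10:00"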
> >
> >
> >
> >
> >
> >     May 27 09:25:31 [92170] 001store01a    pengine:   notice:
> >     process_pe_message:   Calculated transition 91482, saving inputs in
> >     /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> >     May 27 09:25:31 [92171] 001store01a       crmd:     info:
> >     do_state_transition:  State transition S_POLICY_ENGINE ->
> >     S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> >     origin=handle_response
> >
> >     May 27 09:25:31 [92171] 001store01a       crmd:     info:
> >     do_te_invoke: Processing graph 91482
> >     (ref=pe_calc-dc-1622121931-124396) derived from
> >     /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> >     May 27 09:25:31 [92171] 001store01a       crmd:   notice:
> >     run_graph:    Transition 91482 (Complete=0, Pending=0, Fired=0,
> >     Skipped=0, Incomplete=0,
> >     Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
> >
> >     May 27 09:25:31 [92171] 001store01a       crmd:     info:
> >     do_log:       Input I_TE_SUCCESS received in state
> >     S_TRANSITION_ENGINE from notify_crmd
> >
> >     May 27 09:25:31 [92171] 001store01a       crmd:   notice:
> >     do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE
> >     | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:     info:
> >     crm_timer_popped:     PEngine Recheck Timer (I_PE_CALC) just popped
> >     (900000ms)
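
Aside: the 900000ms timer above is the cluster-recheck-interval, which defaults to 15 minutes; the fact that it popped at 09:40:31 only means the cluster was idle and re-ran the policy engine, not that anything failed. To confirm the configured value (a sketch; with the pcs 0.9.x shipped on EL7 the property may simply be unset, meaning the default applies):

    pcs property show cluster-recheck-interval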
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:   notice:
> >     do_state_transition:  State transition S_IDLE -> S_POLICY_ENGINE |
> >     input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:     info:
> >     do_state_transition:  Progressed to state S_POLICY_ENGINE after
> >     C_TIMER_POPPED
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     process_pe_message:   Input has not changed since last time, not
> >     saving to disk
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     determine_online_status:      Node 001store01a is online
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     determine_op_status:  Operation monitor found resource
> >     p_pure-ftpd-itls active on 001store01a
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:  warning:
> >     unpack_rsc_op_failure:        Processing failed op monitor for
> >     p_vip_ftpclust01 on 001store01a: unknown error (1)
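
Side note: the failed-op warning above appears to be an old monitor failure for p_vip_ftpclust01 being re-read from the CIB on every policy-engine run, not a new event at 09:40. A sketch of how to inspect and, once understood, clear it:

    crm_mon -1 --failcounts                  # one-shot status including fail counts
    pcs resource failcount show p_vip_ftpclust01
    pcs resource cleanup p_vip_ftpclust01    # clears the recorded failure history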
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     determine_op_status:  Operation monitor found resource
> >     p_pure-ftpd-etls active on 001store01a
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     unpack_node_loop:     Node 1 is already processed
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     unpack_node_loop:     Node 1 is already processed
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     common_print: p_vip_ftpclust01
> >     (ocf::heartbeat:IPaddr2):       Started 001store01a
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     common_print: p_replicator    (systemd:pure-replicator):
> >        Started 001store01a
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     common_print: p_pure-ftpd-etls
> >     (systemd:pure-ftpd-etls):       Started 001store01a
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     common_print: p_pure-ftpd-itls
> >     (systemd:pure-ftpd-itls):       Started 001store01a
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     LogActions:   Leave   p_vip_ftpclust01        (Started 001store01a)
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     LogActions:   Leave   p_replicator    (Started 001store01a)
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     LogActions:   Leave   p_pure-ftpd-etls        (Started 001store01a)
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:     info:
> >     LogActions:   Leave   p_pure-ftpd-itls        (Started 001store01a)
> >
> >     May 27 09:40:31 [92170] 001store01a    pengine:   notice:
> >     process_pe_message:   Calculated transition 91483, saving inputs in
> >     /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:     info:
> >     do_state_transition:  State transition S_POLICY_ENGINE ->
> >     S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> >     origin=handle_response
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:     info:
> >     do_te_invoke: Processing graph 91483
> >     (ref=pe_calc-dc-1622122831-124397) derived from
> >     /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:   notice:
> >     run_graph:    Transition 91483 (Complete=0, Pending=0, Fired=0,
> >     Skipped=0, Incomplete=0,
> >     Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:     info:
> >     do_log:       Input I_TE_SUCCESS received in state
> >     S_TRANSITION_ENGINE from notify_crmd
> >
> >     May 27 09:40:31 [92171] 001store01a       crmd:   notice:
> >     do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE
> >     | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [MAIN  ] Corosync
> >     Cluster Engine ('2.4.3'): started and ready to provide service.
> >
> >     [10667] 001store01a.ccnva.local corosyncinfo    [MAIN  ] Corosync
> >     built-in features: dbus systemd xmlconf qdevices qnetd snmp
> >     libcgroup pie relro bindnow
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ]
> >     Initializing transport (UDP/IP Unicast).
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ]
> >     Initializing transmit/receive security (NSS) crypto: none hash: none
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] The network
> >     interface [10.51.14.40] is now up.
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service
> >     engine loaded: corosync configuration map access [0]
> >
> >     [10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server
> >     name: cmap
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service
> >     engine loaded: corosync configuration service [1]
> >
> >     [10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server
> >     name: cfg
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service
> >     engine loaded: corosync cluster closed process group service v1.01 [2]
> >
> >     [10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server
> >     name: cpg
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service
> >     engine loaded: corosync profile loading service [4]
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [QUORUM] Using
> >     quorum provider corosync_votequorum
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for
> >     all cluster members. Current votes: 1 expected_votes: 2
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service
> >     engine loaded: corosync vote quorum service v1.0 [5]
> >
> >     [10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server
> >     name: votequorum
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service
> >     engine loaded: corosync cluster quorum service v0.1 [3]
> >
> >     [10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server
> >     name: quorum
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] adding new
> >     UDPU member {10.51.14.40}
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] adding new
> >     UDPU member {10.51.14.41}
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] A new
> >     membership (10.51.14.40:6412) was formed. Members joined: 1
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for
> >     all cluster members. Current votes: 1 expected_votes: 2
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for
> >     all cluster members. Current votes: 1 expected_votes: 2
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for
> >     all cluster members. Current votes: 1 expected_votes: 2
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [QUORUM] Members[1]: 1
> >
> >     [10667] 001store01a.ccnva.local corosyncnotice  [MAIN  ] Completed
> >     service synchronization, ready to provide service.
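
Aside on the quorum lines above: with expected_votes: 2 this looks like a two-node cluster, and corosync's votequorum is usually told that explicitly. A sketch of the relevant corosync.conf section, in case it is not already set this way (check the actual file before changing anything):

    quorum {
        provider: corosync_votequorum
        two_node: 1     # also implies wait_for_all
    }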
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     notice: main:     Starting Pacemaker 1.1.18-11.el7_5.3 |
> >     build=2b07d5c5a9 features: generated-manpages agent-manpages ncurses
> >     libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: main:     Maximum core file size is: 18446744073709551615
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: qb_ipcs_us_publish:       server name: pacemakerd
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: crm_get_peer:     Created entry
> >     05ad8b08-25a3-4a2d-84cb-1fc355fb697c/0x55d844a446b0 for node
> >     001store01a/1 (1 total)
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: crm_get_peer:     Node 1 is now known as 001store01a
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: crm_get_peer:     Node 1 has uuid 1
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: crm_update_peer_proc:     cluster_connect_cpg: Node
> >     001store01a[1] - corosync-cpg is now online
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     warning: cluster_connect_quorum:   Quorum lost
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: crm_get_peer:     Created entry
> >     2f1f038e-9cc1-4a43-bab9-e7c91ca0bf3f/0x55d844a45ee0 for node
> >     001store01b/2 (2 total)
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: crm_get_peer:     Node 2 is now known as 001store01b
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: crm_get_peer:     Node 2 has uuid 2
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: start_child:      Using uid=189 and group=189 for process cib
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: start_child:      Forked child 10682 for process cib
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: start_child:      Forked child 10683 for process stonith-ng
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: start_child:      Forked child 10684 for process lrmd
> >
> >     May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> >     info: start_child:      Using uid=189 and group=189 for process attrd
> >
> >
> >
> >
> >
> >
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/ "I am, somehow, less
> interested in the weight and convolutions of Einstein’s brain than in the near
> certainty that people of equal talent have lived and died in cotton fields and
> sweatshops." - Stephen Jay Gould

