[ClusterLabs] Cluster Stopped, No Messages?
Digimer
lists at alteeve.ca
Fri May 28 13:42:41 EDT 2021
Shared storage is not what triggers the need for fencing. Coordinating
actions is what triggers the need. Specifically: if you can run the
resources on both/all nodes at the same time, you don't need HA. If you
can't, you need fencing.
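
As a rough sketch of what adding fencing can look like with pcs (the
agent, addresses and credentials below are only placeholders for
illustration; the example assumes IPMI-capable hardware, and 'pcs
stonith describe <agent>' shows the parameter names your fence-agents
version actually expects):

   # See which fence agents are installed and what they accept
   pcs stonith list
   pcs stonith describe fence_ipmilan

   # Example: one IPMI-based fence device per node (placeholder values)
   pcs stonith create fence_001store01a fence_ipmilan \
       pcmk_host_list=001store01a ipaddr=192.0.2.10 login=admin passwd=secret
   pcs stonith create fence_001store01b fence_ipmilan \
       pcmk_host_list=001store01b ipaddr=192.0.2.11 login=admin passwd=secret

   # Make sure fencing is actually enforced cluster-wide
   pcs property set stonith-enabled=true

Test each device (for example, 'pcs stonith fence <node>' from its
peer) before trusting it in production.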
digimer
On 2021-05-28 1:19 p.m., Eric Robinson wrote:
> There is no fencing agent on this cluster and no shared storage.
>
> -Eric
>
> *From:* Strahil Nikolov <hunter86_bg at yahoo.com>
> *Sent:* Friday, May 28, 2021 10:08 AM
> *To:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>; Eric Robinson <eric.robinson at psmnv.com>
> *Subject:* Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> what is your fencing agent ?
>
> Best Regards,
>
> Strahil Nikolov
>
> On Thu, May 27, 2021 at 20:52, Eric Robinson
> <eric.robinson at psmnv.com> wrote:
>
> We found one of our cluster nodes down this morning. The server was
> up but cluster services were not running. Upon examination of the
> logs, we found that the cluster just stopped around 9:40:31 and then
> I started it up manually (pcs cluster start) at 11:49:48. I can’t
> imagine that Pacemaker just randomly terminates. Any thoughts why it
> would behave this way?
>
>
>
>
>
> May 27 09:25:31 [92170] 001store01a pengine: notice:
> process_pe_message: Calculated transition 91482, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-756.bz2
>
> May 27 09:25:31 [92171] 001store01a crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response
>
> May 27 09:25:31 [92171] 001store01a crmd: info:
> do_te_invoke: Processing graph 91482
> (ref=pe_calc-dc-1622121931-124396) derived from
> /var/lib/pacemaker/pengine/pe-input-756.bz2
>
> May 27 09:25:31 [92171] 001store01a crmd: notice:
> run_graph: Transition 91482 (Complete=0, Pending=0, Fired=0,
> Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
>
> May 27 09:25:31 [92171] 001store01a crmd: info:
> do_log: Input I_TE_SUCCESS received in state
> S_TRANSITION_ENGINE from notify_crmd
>
> May 27 09:25:31 [92171] 001store01a crmd: notice:
> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE
> | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
>
> May 27 09:40:31 [92171] 001store01a crmd: info:
> crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped
> (900000ms)
>
> May 27 09:40:31 [92171] 001store01a crmd: notice:
> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE |
> input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
>
> May 27 09:40:31 [92171] 001store01a crmd: info:
> do_state_transition: Progressed to state S_POLICY_ENGINE after
> C_TIMER_POPPED
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> process_pe_message: Input has not changed since last time, not
> saving to disk
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> determine_online_status: Node 001store01a is online
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> determine_op_status: Operation monitor found resource
> p_pure-ftpd-itls active on 001store01a
>
> May 27 09:40:31 [92170] 001store01a pengine: warning:
> unpack_rsc_op_failure: Processing failed op monitor for
> p_vip_ftpclust01 on 001store01a: unknown error (1)
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> determine_op_status: Operation monitor found resource
> p_pure-ftpd-etls active on 001store01a
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> unpack_node_loop: Node 1 is already processed
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> unpack_node_loop: Node 1 is already processed
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> common_print: p_vip_ftpclust01
> (ocf::heartbeat:IPaddr2): Started 001store01a
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> common_print: p_replicator (systemd:pure-replicator):
> Started 001store01a
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> common_print: p_pure-ftpd-etls
> (systemd:pure-ftpd-etls): Started 001store01a
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> common_print: p_pure-ftpd-itls
> (systemd:pure-ftpd-itls): Started 001store01a
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> LogActions: Leave p_vip_ftpclust01 (Started 001store01a)
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> LogActions: Leave p_replicator (Started 001store01a)
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> LogActions: Leave p_pure-ftpd-etls (Started 001store01a)
>
> May 27 09:40:31 [92170] 001store01a pengine: info:
> LogActions: Leave p_pure-ftpd-itls (Started 001store01a)
>
> May 27 09:40:31 [92170] 001store01a pengine: notice:
> process_pe_message: Calculated transition 91483, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-756.bz2
>
> May 27 09:40:31 [92171] 001store01a crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response
>
> May 27 09:40:31 [92171] 001store01a crmd: info:
> do_te_invoke: Processing graph 91483
> (ref=pe_calc-dc-1622122831-124397) derived from
> /var/lib/pacemaker/pengine/pe-input-756.bz2
>
> May 27 09:40:31 [92171] 001store01a crmd: notice:
> run_graph: Transition 91483 (Complete=0, Pending=0, Fired=0,
> Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
>
> May 27 09:40:31 [92171] 001store01a crmd: info:
> do_log: Input I_TE_SUCCESS received in state
> S_TRANSITION_ENGINE from notify_crmd
>
> May 27 09:40:31 [92171] 001store01a crmd: notice:
> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE
> | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
>
> [10667] 001store01a.ccnva.local corosync notice [MAIN ] Corosync
> Cluster Engine ('2.4.3'): started and ready to provide service.
>
> [10667] 001store01a.ccnva.local corosync info [MAIN ] Corosync
> built-in features: dbus systemd xmlconf qdevices qnetd snmp
> libcgroup pie relro bindnow
>
> [10667] 001store01a.ccnva.local corosync notice [TOTEM ]
> Initializing transport (UDP/IP Unicast).
>
> [10667] 001store01a.ccnva.local corosync notice [TOTEM ]
> Initializing transmit/receive security (NSS) crypto: none hash: none
>
> [10667] 001store01a.ccnva.local corosync notice [TOTEM ] The network
> interface [10.51.14.40] is now up.
>
> [10667] 001store01a.ccnva.local corosync notice [SERV ] Service
> engine loaded: corosync configuration map access [0]
>
> [10667] 001store01a.ccnva.local corosync info [QB ] server
> name: cmap
>
> [10667] 001store01a.ccnva.local corosync notice [SERV ] Service
> engine loaded: corosync configuration service [1]
>
> [10667] 001store01a.ccnva.local corosync info [QB ] server
> name: cfg
>
> [10667] 001store01a.ccnva.local corosync notice [SERV ] Service
> engine loaded: corosync cluster closed process group service v1.01 [2]
>
> [10667] 001store01a.ccnva.local corosync info [QB ] server
> name: cpg
>
> [10667] 001store01a.ccnva.local corosync notice [SERV ] Service
> engine loaded: corosync profile loading service [4]
>
> [10667] 001store01a.ccnva.local corosync notice [QUORUM] Using
> quorum provider corosync_votequorum
>
> [10667] 001store01a.ccnva.local corosync notice [VOTEQ ] Waiting for
> all cluster members. Current votes: 1 expected_votes: 2
>
> [10667] 001store01a.ccnva.local corosync notice [SERV ] Service
> engine loaded: corosync vote quorum service v1.0 [5]
>
> [10667] 001store01a.ccnva.local corosync info [QB ] server
> name: votequorum
>
> [10667] 001store01a.ccnva.local corosync notice [SERV ] Service
> engine loaded: corosync cluster quorum service v0.1 [3]
>
> [10667] 001store01a.ccnva.local corosync info [QB ] server
> name: quorum
>
> [10667] 001store01a.ccnva.local corosync notice [TOTEM ] adding new
> UDPU member {10.51.14.40}
>
> [10667] 001store01a.ccnva.local corosync notice [TOTEM ] adding new
> UDPU member {10.51.14.41}
>
> [10667] 001store01a.ccnva.local corosync notice [TOTEM ] A new
> membership (10.51.14.40:6412) was formed. Members joined: 1
>
> [10667] 001store01a.ccnva.local corosync notice [VOTEQ ] Waiting for
> all cluster members. Current votes: 1 expected_votes: 2
>
> [10667] 001store01a.ccnva.local corosync notice [VOTEQ ] Waiting for
> all cluster members. Current votes: 1 expected_votes: 2
>
> [10667] 001store01a.ccnva.local corosync notice [VOTEQ ] Waiting for
> all cluster members. Current votes: 1 expected_votes: 2
>
> [10667] 001store01a.ccnva.local corosync notice [QUORUM] Members[1]: 1
>
> [10667] 001store01a.ccnva.local corosync notice [MAIN ] Completed
> service synchronization, ready to provide service.
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> notice: main: Starting Pacemaker 1.1.18-11.el7_5.3 |
> build=2b07d5c5a9 features: generated-manpages agent-manpages ncurses
> libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd
> acls
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: main: Maximum core file size is: 18446744073709551615
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: qb_ipcs_us_publish: server name: pacemakerd
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: crm_get_peer: Created entry
> 05ad8b08-25a3-4a2d-84cb-1fc355fb697c/0x55d844a446b0 for node
> 001store01a/1 (1 total)
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: crm_get_peer: Node 1 is now known as 001store01a
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: crm_get_peer: Node 1 has uuid 1
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: crm_update_peer_proc: cluster_connect_cpg: Node
> 001store01a[1] - corosync-cpg is now online
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> warning: cluster_connect_quorum: Quorum lost
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: crm_get_peer: Created entry
> 2f1f038e-9cc1-4a43-bab9-e7c91ca0bf3f/0x55d844a45ee0 for node
> 001store01b/2 (2 total)
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: crm_get_peer: Node 2 is now known as 001store01b
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: crm_get_peer: Node 2 has uuid 2
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: start_child: Using uid=189 and group=189 for process cib
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: start_child: Forked child 10682 for process cib
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: start_child: Forked child 10683 for process stonith-ng
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: start_child: Forked child 10684 for process lrmd
>
> May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:
> info: start_child: Using uid=189 and group=189 for process attrd
>
>
>
>
>
>
>
>
> Disclaimer : This email and any files transmitted with it are
> confidential and intended solely for intended recipients. If you are not
> the named addressee you should not disseminate, distribute, copy or
> alter this email. Any views or opinions presented in this email are
> solely those of the author and might not represent those of Physician
> Select Management. Warning: Although Physician Select Management has
> taken reasonable precautions to ensure no viruses are present in this
> email, the company cannot accept responsibility for any loss or damage
> arising from the use of this email or attachments.
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
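As for why pacemaker stopped at 09:40 with nothing after the recheck
timer in its own log: a clean shutdown ('pcs cluster stop') would have
been logged, so the silence usually points at something outside the
cluster killing it (an admin, a package update, systemd, the OOM
killer, etc.). A rough sketch of where I would look, adjusting the time
window to the incident:

   # What systemd and the kernel recorded about the cluster services
   journalctl -u pacemaker -u corosync --since "2021-05-27 09:30" --until "2021-05-27 10:00"
   systemctl status pacemaker corosync

   # OOM kills, crashes or reboots around that time
   grep -i -e oom -e 'killed process' /var/log/messages
   last -x | head

It is also worth checking whether corosync exited first, since the
pacemaker daemons cannot keep running without it.
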
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould