[ClusterLabs] Antw: [EXT] Systemd resource started on node after reboot before cluster is stable ?

Thu Feb 16 14:54:18 EST 2023

On Thu, 2023-02-16 at 11:13 +0100, Adam Cecile wrote:
> 
> On 2/16/23 07:57, Ulrich Windl wrote:
> > > > > Adam Cecile <acecile at le-vert.net> wrote on 15.02.2023 at
> > > > > 10:49 in message
> > <b4f1f2f1-66fe-ca62-ff4f-708d781a507c at le-vert.net>:
> > > Hello,
> > > 
> > > Just had some issue with unexpected server behavior after reboot.
> > > This 
> > > node was powered off, so cluster was running fine with this
> > > tomcat9 
> > > resource running on a different machine.
> > > 
> > > After powering on this node again, it briefly started tomcat
> > > before 
> > > joining the cluster and decided to stop it again. I'm not sure
> > > why.
> > > 
> > > 
> > > Here is the systemctl status tomcat9 on this host:
> > > 
> > > tomcat9.service - Apache Tomcat 9 Web Application Server
> > >       Loaded: loaded (/lib/systemd/system/tomcat9.service;
> > > disabled; 
> > > vendor preset: enabled)
> > >      Drop-In: /etc/systemd/system/tomcat9.service.d
> > >               └─override.conf
> > >       Active: inactive (dead)
> > >         Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html
> > >  
> > > 
> > > Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
> > > Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine:
> > > [Apache 
> > > Tomcat/9.0.43 (Debian)]
> > > Feb 15 09:43:27 server tomcat9[1398]: [...]
> > > Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web 
> > > Application Server...
> > > Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
> > > Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web 
> > > Application Server.
> > > Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed
> > > 8.017s CPU 
> > > time.
> > > 
> > > You can see it is disabled and should NOT be started at boot;
> > > start/stop is under Corosync control.
> > > 
> > > 
> > > The systemd resource is defined like this:
> > > 
> > > primitive tomcat9 systemd:tomcat9.service \
> > >          op start interval=0 timeout=120 \
> > >          op stop interval=0 timeout=120 \
> > >          op monitor interval=60 timeout=100
> > > 
> > > 
> > > Any idea why this happened ?
> > 
> > Your journal (syslog) should tell you!
> 
> Indeed, I overlooked it yesterday... But it says it was Pacemaker
> that decided to start it:
> 
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync members[3]: 1
> 2 3
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync joined[2]: 1 2
> Feb 15 09:43:26 server3 corosync[568]:   [TOTEM ] A new membership
> (1.42d) was formed. Members joined: 1 2
> Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server1
> state is now member
> Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server1
> state is now member
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] This node is within
> the primary component and will provide service.
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Members[3]: 1 2 3
> Feb 15 09:43:26 server3 corosync[568]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Quorum
> acquired
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node
> server1 state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node
> server2 state is now member
> Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server2
> state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Transition
> 0 aborted: Peer Halt
> Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server1
> state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Another DC
> detected: server2 (op=noop)
> Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server2
> state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State
> transition S_ELECTION -> S_RELEASE_DC
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Cancelling
> timer for action 12 (src=67)
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: No need to
> invoke the TE (A_TE_HALT) in state S_RELEASE_DC
> Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server2
> state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State
> transition S_PENDING -> S_NOT_DC
> Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting
> #attrd-protocol[server1]: (unset) -> 2
> Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Detected
> another attribute writer (server2), starting new election
> Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting
> #attrd-protocol[server2]: (unset) -> 2
> Feb 15 09:43:27 server3 IPaddr2(Shared-IPv4)[1258]: INFO:
> Feb 15 09:43:27 server3 ntpd[602]: Listen normally on 8 eth0
> 10.13.68.12:123
> Feb 15 09:43:27 server3 ntpd[602]: new interface(s) found: waking up
> resolver
> => Feb 15 09:43:28 server3 pacemaker-controld[862]:  notice: Result
> of start operation for tomcat9 on server3: ok
> Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link
> change for host: 2 link: 0 from 485 to 1397
> Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link
> change for host: 1 link: 0 from 485 to 1397
> Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: Global data
> MTU changed to: 1397
> => Feb 15 09:43:29 server3 pacemaker-controld[862]:  notice:
> Requesting local execution of stop operation for tomcat9 on server3
> 
> Any idea ?

What do the logs on the other node say over the same time frame?
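
The log above names server2 as the DC, so its log is probably the most
useful one to compare. A minimal sketch for pulling the same window on
each of the other nodes, assuming they log to the systemd journal and
that the units are named "pacemaker" and "corosync" (both are
assumptions, and cluster_logs is just a hypothetical helper name):

```shell
#!/bin/sh
# Print Pacemaker and Corosync journal entries for a given time window.
# Assumptions: logs go to the systemd journal, and the units are named
# "pacemaker" and "corosync"; adjust both for your distribution.
cluster_logs() {
    start=$1   # e.g. "2023-02-15 09:43:20"
    end=$2     # e.g. "2023-02-15 09:43:35"
    journalctl --no-pager -u pacemaker -u corosync \
        --since "$start" --until "$end"
}
```

For example, on server1 and server2:
cluster_logs "2023-02-15 09:43:20" "2023-02-15 09:43:35"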
-- 
Ken Gaillot <kgaillot at redhat.com>