[ClusterLabs] Antw: [EXT] Systemd resource started on node after reboot before cluster is stable ?

Reid Wahl nwahl at redhat.com
Fri Feb 17 05:27:13 EST 2023


On Thu, Feb 16, 2023 at 2:14 AM Adam Cecile <acecile at le-vert.net> wrote:
>
>
> On 2/16/23 07:57, Ulrich Windl wrote:
>
> Adam Cecile <acecile at le-vert.net> wrote on 15.02.2023 at 10:49 in
> message
> <b4f1f2f1-66fe-ca62-ff4f-708d781a507c at le-vert.net>:
>
> Hello,
>
> Just had some issue with unexpected server behavior after a reboot. This
> node was powered off, so the cluster was running fine with this tomcat9
> resource running on a different machine.
>
> After powering this node on again, it briefly started tomcat before
> joining the cluster, then decided to stop it again. I'm not sure why.
>
>
> Here is the systemctl status tomcat9 on this host:
>
> tomcat9.service - Apache Tomcat 9 Web Application Server
>       Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled;
> vendor preset: enabled)
>      Drop-In: /etc/systemd/system/tomcat9.service.d
>               └─override.conf
>       Active: inactive (dead)
>         Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html
>
> Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
> Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache
> Tomcat/9.0.43 (Debian)]
> Feb 15 09:43:27 server tomcat9[1398]: [...]
> Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web
> Application Server...
> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
> Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web
> Application Server.
> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU
> time.
>
> You can see it is disabled and should NOT be started by systemd itself;
> start/stop is under Corosync/Pacemaker control.
>
>
> The systemd resource is defined like this:
>
> primitive tomcat9 systemd:tomcat9.service \
>          op start interval=0 timeout=120 \
>          op stop interval=0 timeout=120 \
>          op monitor interval=60 timeout=100
>
>
> Any idea why this happened?
>
> Your journal (syslog) should tell you!
>
> Indeed, I overlooked it yesterday... but it says it's Pacemaker that decided to start it:
>
>
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync members[3]: 1 2 3
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync joined[2]: 1 2
> Feb 15 09:43:26 server3 corosync[568]:   [TOTEM ] A new membership (1.42d) was formed. Members joined: 1 2
> Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server1 state is now member
> Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server1 state is now member
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Members[3]: 1 2 3
> Feb 15 09:43:26 server3 corosync[568]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Quorum acquired
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node server1 state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node server2 state is now member
> Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server2 state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Transition 0 aborted: Peer Halt
> Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server1 state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Another DC detected: server2 (op=noop)
> Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server2 state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State transition S_ELECTION -> S_RELEASE_DC
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Cancelling timer for action 12 (src=67)
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: No need to invoke the TE (A_TE_HALT) in state S_RELEASE_DC
> Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server2 state is now member
> Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State transition S_PENDING -> S_NOT_DC
> Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting #attrd-protocol[server1]: (unset) -> 2
> Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Detected another attribute writer (server2), starting new election
> Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting #attrd-protocol[server2]: (unset) -> 2
> Feb 15 09:43:27 server3 IPaddr2(Shared-IPv4)[1258]: INFO:
> Feb 15 09:43:27 server3 ntpd[602]: Listen normally on 8 eth0 10.13.68.12:123
> Feb 15 09:43:27 server3 ntpd[602]: new interface(s) found: waking up resolver
> => Feb 15 09:43:28 server3 pacemaker-controld[862]:  notice: Result of start operation for tomcat9 on server3: ok
> Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 485 to 1397
> Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 485 to 1397
> Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: Global data MTU changed to: 1397
> => Feb 15 09:43:29 server3 pacemaker-controld[862]:  notice: Requesting local execution of stop operation for tomcat9 on server3
>
>
> Any idea?
>

Can you share the full cib.xml file, as well as the section of
pacemaker.log prior to this section? It looks as if server3 began
starting tomcat9 before the other nodes joined its membership.
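
For reference, the usual way to capture both pieces of information on the affected node is roughly the following (a sketch, not verified against your setup; `cibadmin` and `crm_report` ship with Pacemaker, and the destination path and time window below are just examples taken from the journal excerpt):

```shell
# Dump the live CIB from the running cluster to a file.
cibadmin --query > cib.xml

# Bundle logs and cluster state covering the incident window
# (-f/-t bound the report; the last argument names the output archive).
crm_report -f "2023-02-15 09:40:00" -t "2023-02-15 09:50:00" /tmp/tomcat9-incident
```

Attaching the resulting archive is usually easier than copying log excerpts, since it also includes the CIB and node status from every reachable node.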

>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


