[ClusterLabs] Antw: [EXT] Systemd resource started on node after reboot before cluster is stable ?
Adam Cecile
acecile at le-vert.net
Mon Feb 20 05:42:33 EST 2023
On 2/16/23 20:54, Ken Gaillot wrote:
> On Thu, 2023-02-16 at 11:13 +0100, Adam Cecile wrote:
>> On 2/16/23 07:57, Ulrich Windl wrote:
>>> Adam Cecile <acecile at le-vert.net> wrote on 15.02.2023 at 10:49 in message
>>> <b4f1f2f1-66fe-ca62-ff4f-708d781a507c at le-vert.net>:
>>>> Hello,
>>>>
>>>> We just had an issue with unexpected server behavior after a reboot.
>>>> This node had been powered off, and the cluster was running fine with
>>>> the tomcat9 resource on a different machine.
>>>>
>>>> After powering this node on again, it briefly started Tomcat before
>>>> joining the cluster, then decided to stop it again. I'm not sure why.
>>>>
>>>>
>>>> Here is the systemctl status tomcat9 on this host:
>>>>
>>>> tomcat9.service - Apache Tomcat 9 Web Application Server
>>>>      Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled; vendor preset: enabled)
>>>>     Drop-In: /etc/systemd/system/tomcat9.service.d
>>>>              └─override.conf
>>>>      Active: inactive (dead)
>>>>        Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html
>>>>
>>>> Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
>>>> Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache Tomcat/9.0.43 (Debian)]
>>>> Feb 15 09:43:27 server tomcat9[1398]: [...]
>>>> Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web Application Server...
>>>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
>>>> Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web Application Server.
>>>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU time.
>>>>
>>>> You can see it is disabled and should NOT be started by the system
>>>> itself; start/stop is under Corosync control.
>>>>
>>>>
>>>> The systemd resource is defined like this:
>>>>
>>>> primitive tomcat9 systemd:tomcat9.service \
>>>> op start interval=0 timeout=120 \
>>>> op stop interval=0 timeout=120 \
>>>> op monitor interval=60 timeout=100
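
[Editor's note: a quick sanity check, not from the thread, for ruling out systemd starting a disabled unit by itself. The commands below are standard systemctl/journalctl invocations; the unit name is taken from the status output above, and the expected "disabled" result is what that output shows.]

```shell
# Confirm the unit itself is disabled (the status above shows "disabled")
systemctl is-enabled tomcat9.service
# List units that still pull tomcat9 in, e.g. a target that Wants= it
systemctl list-dependencies --reverse tomcat9.service
# On newer systemd, check for socket/timer/path units that could trigger it
systemctl show -p TriggeredBy tomcat9.service
# See exactly who requested the start during the boot in question
journalctl -b -u tomcat9.service
```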
>>>>
>>>>
>>>> Any idea why this happened?
>>> Your journal (syslog) should tell you!
>> Indeed, I overlooked that yesterday... But the journal says it was
>> Pacemaker that decided to start it:
>>
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Sync members[3]: 1 2 3
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Sync joined[2]: 1 2
>> Feb 15 09:43:26 server3 corosync[568]: [TOTEM ] A new membership (1.42d) was formed. Members joined: 1 2
>> Feb 15 09:43:26 server3 pacemaker-attrd[860]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 pacemaker-based[857]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] This node is within the primary component and will provide service.
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Members[3]: 1 2 3
>> Feb 15 09:43:26 server3 corosync[568]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Quorum acquired
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-based[857]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Transition 0 aborted: Peer Halt
>> Feb 15 09:43:26 server3 pacemaker-fenced[858]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: warning: Another DC detected: server2 (op=noop)
>> Feb 15 09:43:26 server3 pacemaker-fenced[858]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: State transition S_ELECTION -> S_RELEASE_DC
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: warning: Cancelling timer for action 12 (src=67)
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: No need to invoke the TE (A_TE_HALT) in state S_RELEASE_DC
>> Feb 15 09:43:26 server3 pacemaker-attrd[860]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: State transition S_PENDING -> S_NOT_DC
>> Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Setting #attrd-protocol[server1]: (unset) -> 2
>> Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Detected another attribute writer (server2), starting new election
>> Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Setting #attrd-protocol[server2]: (unset) -> 2
>> Feb 15 09:43:27 server3 IPaddr2(Shared-IPv4)[1258]: INFO:
>> Feb 15 09:43:27 server3 ntpd[602]: Listen normally on 8 eth0 10.13.68.12:123
>> Feb 15 09:43:27 server3 ntpd[602]: new interface(s) found: waking up resolver
>> => Feb 15 09:43:28 server3 pacemaker-controld[862]: notice: Result of start operation for tomcat9 on server3: ok
>> Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 485 to 1397
>> Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 485 to 1397
>> Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: Global data MTU changed to: 1397
>> => Feb 15 09:43:29 server3 pacemaker-controld[862]: notice: Requesting local execution of stop operation for tomcat9 on server3
>>
>> Any idea?
> What do the logs on the other node say over the same time frame?
Here are the logs from server2 for the same period:
Feb 15 09:43:26 server2 corosync[524]: [KNET ] rx: host: 3 link: 0 is up
Feb 15 09:43:26 server2 corosync[524]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 15 09:43:26 server2 corosync[524]: [QUORUM] Sync members[3]: 1 2 3
Feb 15 09:43:26 server2 corosync[524]: [QUORUM] Sync joined[1]: 3
Feb 15 09:43:26 server2 corosync[524]: [TOTEM ] A new membership (1.42d) was formed. Members joined: 3
Feb 15 09:43:26 server2 pacemaker-based[812]: notice: Node server3. state is now member
Feb 15 09:43:26 server2 pacemaker-fenced[813]: notice: Node server3. state is now member
Feb 15 09:43:26 server2 pacemaker-attrd[815]: notice: Node server3. state is now member
Feb 15 09:43:26 server2 corosync[524]: [QUORUM] Members[3]: 1 2 3
Feb 15 09:43:26 server2 corosync[524]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 15 09:43:26 server2 pacemaker-controld[817]: notice: Node server3. state is now member
Feb 15 09:43:26 server2 pacemaker-controld[817]: notice: State transition S_IDLE -> S_INTEGRATION
Feb 15 09:43:26 server2 pacemaker-controld[817]: warning: Another DC detected: server3. (op=noop)
Feb 15 09:43:26 server2 pacemaker-controld[817]: notice: State transition S_ELECTION -> S_INTEGRATION
Feb 15 09:43:27 server2 pacemaker-attrd[815]: notice: Detected another attribute writer (server3.), starting new election
Feb 15 09:43:27 server2 pacemaker-attrd[815]: notice: Setting #attrd-protocol[server3.]: (unset) -> 2
Feb 15 09:43:27 server2 pacemaker-attrd[815]: notice: Recorded local node as attribute writer (was unset)
Feb 15 09:43:27 server2 pacemaker-schedulerd[816]: notice: On loss of quorum: Ignore
Feb 15 09:43:27 server2 pacemaker-schedulerd[816]: error: Resource Shared-IPv4 is active on 2 nodes (attempting recovery)
Feb 15 09:43:27 server2 pacemaker-schedulerd[816]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
Feb 15 09:43:27 server2 pacemaker-schedulerd[816]: notice:  * Restart Shared-IPv4 ( server1. )
Feb 15 09:43:27 server2 pacemaker-schedulerd[816]: notice:  * Restart tomcat9 ( server1. ) due to required Shared-IPv4 start
Feb 15 09:43:27 server2 pacemaker-schedulerd[816]: error: Calculated transition 767 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-5.bz2
Feb 15 09:43:27 server2 pacemaker-controld[817]: notice: Initiating stop operation tomcat9_stop_0 on server1.
Feb 15 09:43:28 server2 pacemaker-controld[817]: notice: Transition 767 aborted by operation tomcat9_start_0 'modify' on server3.: Foreign event
Feb 15 09:43:29 server2 pacemaker-controld[817]: notice: Transition 767 (Complete=2, Pending=0, Fired=0, Skipped=2, Incomplete=9, Source=/var/lib/pacemaker/pengine/pe-error-5.bz2): Stopped
Feb 15 09:43:29 server2 pacemaker-schedulerd[816]: notice: On loss of quorum: Ignore
Feb 15 09:43:29 server2 pacemaker-schedulerd[816]: error: Resource Shared-IPv4 is active on 2 nodes (attempting recovery)
Feb 15 09:43:29 server2 pacemaker-schedulerd[816]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
Feb 15 09:43:29 server2 pacemaker-schedulerd[816]: notice:  * Restart Shared-IPv4 ( server1. )
Feb 15 09:43:29 server2 pacemaker-schedulerd[816]: notice:  * Move tomcat9 ( server3. -> server1. )
Feb 15 09:43:29 server2 pacemaker-schedulerd[816]: error: Calculated transition 768 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-6.bz2
Feb 15 09:43:29 server2 pacemaker-controld[817]: notice: Initiating stop operation tomcat9_stop_0 on server3.
Feb 15 09:43:31 server2 pacemaker-controld[817]: notice: Initiating stop operation Shared-IPv4_stop_0 on server3.
Feb 15 09:43:31 server2 pacemaker-controld[817]: notice: Initiating stop operation Shared-IPv4_stop_0 on server1.
Feb 15 09:43:31 server2 pacemaker-controld[817]: notice: Initiating start operation Shared-IPv4_start_0 on server1.
Feb 15 09:43:31 server2 pacemaker-controld[817]: notice: Initiating monitor operation Shared-IPv4_monitor_60000 on server1.
Feb 15 09:43:31 server2 pacemaker-controld[817]: notice: Initiating start operation tomcat9_start_0 on server1.
Feb 15 09:43:33 server2 pacemaker-controld[817]: notice: Initiating monitor operation tomcat9_monitor_60000 on server1.
Feb 15 09:43:33 server2 pacemaker-controld[817]: notice: Transition 768 (Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-6.bz2): Complete
Feb 15 09:43:33 server2 pacemaker-controld[817]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Feb 15 09:43:38 server2 pacemaker-based[812]: notice: Local CIB 0.206.53.01293e9ed5a687db873d1ddd384803e2 differs from server3.: 0.206.51.411962ae9850d59bc1bbf4728b9e89b2 0x55f23d164cb0
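
[Editor's note: a sketch for digging further, not from the thread. The scheduler saved its inputs to pe-error files (paths taken from the log above); crm_simulate can replay such a file offline to show which actions Pacemaker would schedule and why:]

```shell
# Replay the saved scheduler input for transition 767 and show the
# resulting actions plus per-node allocation scores. Run on any node
# that has the file; adjust the path if yours differs.
crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-error-5.bz2
```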