[ClusterLabs] [EXT] Re: Fedora 31 - systemd based resources don't start

Thu Feb 20 16:21:00 EST 2020

Hi,
it seems systemd is stopping Apache, so the cluster is surprised:
Feb 20 19:40:01 boss1 systemd[1]: Stopping The Apache HTTP Server...

IMHO it's best to dump systemd, but that's just my opinion...

>>> Maverick 20.02.2020, 21:30 >>>

> Hi Maverick,
>
>
> According this thread:
> https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html
>
> You have 'startup-fencing' is set to false.
>
> Check it out - maybe this is your reason.
>
> Best Regards,
> Strahil Nikolov
Yes, i have stonith disabled, because as soon as the resources startup
fail on boot, node was rebooted.

Anyway, i was checking the pacemaker logs and the journal log, and i see
that the service actually starts ok but for some reason pacemaker thinks
it has timeout and then because of that tries to stop and also thinks it
has timeout but actually stops it:
pacemaker.log:
Feb 20 19:39:52 boss1 pacemaker-execd [1499] (log_execute) info:
executing - rsc:apache action:start call_id:25
Feb 20 19:39:52 boss1 pacemaker-execd [1499] (systemd_unit_exec)
debug: Performing asynchronous start op on systemd unit httpd named 'apache'
Feb 20 19:39:52 boss1 pacemaker-execd [1499]
(systemd_unit_exec_with_unit) debug: Calling StartUnit for apache:
/org/freedesktop/systemd1/unit/httpd_2eservice
Feb 20 19:39:52 boss1 pacemaker-execd [1499] (action_complete)
notice: Giving up on apache start (rc=0): timeout (elapsed=248199ms,
remaining=-148199ms)
Feb 20 19:39:52 boss1 pacemaker-execd [1499] (log_finished)
debug: finished - rsc:apache action:monitor call_id:25 exit-code:198
exec-time:248205ms queue-time:216ms
Feb 20 19:40:00 boss1 pacemaker-execd [1499] (log_execute) info:
executing - rsc:apache action:stop call_id:81
Feb 20 19:40:00 boss1 pacemaker-execd [1499] (systemd_unit_exec)
debug: Performing asynchronous stop op on systemd unit httpd named 'apache'
Feb 20 19:40:00 boss1 pacemaker-execd [1499]
(systemd_unit_exec_with_unit) debug: Calling StopUnit for apache:
/org/freedesktop/systemd1/unit/httpd_2eservice
Feb 20 19:40:01 boss1 pacemaker-execd [1499] (action_complete)
notice: Giving up on apache stop (rc=0): timeout (elapsed=304539ms,
remaining=-204539ms)
Feb 20 19:40:01 boss1 pacemaker-execd [1499] (log_finished)
debug: finished - rsc:apache action:monitor call_id:81 exit-code:198
exec-time:304545ms queue-time:240ms

system journal:
Feb 20 19:39:52 boss1 systemd[1]: Starting Cluster Controlled httpd...
Feb 20 19:39:53 boss1 systemd[1]: Started Cluster Controlled httpd.
Feb 20 19:39:53 boss1 httpd[2145]: Server configured, listening on: port
443, port 80
Feb 20 19:40:01 boss1 systemd[1]: Stopping The Apache HTTP Server...
Feb 20 19:40:02 boss1 systemd[1]: httpd.service: Succeeded.
Feb 20 19:40:02 boss1 systemd[1]: Stopped The Apache HTTP Server.

On 20/02/2020 21:02, Strahil Nikolov wrote:
> On February 20, 2020 9:35:07 PM GMT+02:00, Maverick <mvrk at sapo.pt> wrote:
>> Manually it starts ok, no problems:
>>
>> pcs resource debug-start apache --full
>> (unpack_config) warning: Blind faith: not fencing unseen nodes
>> Operation start for apache (systemd::httpd) returned: 'ok' (0)
>>
>>
>> On 20/02/2020 16:46, Strahil Nikolov wrote:
>>> On February 20, 2020 12:49:43 PM GMT+02:00, Maverick <mvrk at sapo.pt>
>> wrote:
>>>>> You really need to debug the start & stop of tthe resource .
>>>>>
>>>>> Please try the debug procedure and provide the output:
>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>>>>
>>>>> Best Regards,
>>>>> Strahil Nikolov
>>>> Hi,
>>>>
>>>> Correct me if i'm wrong, but i think that procedure doesn't work for
>>>> systemd class resources, i don't know which OCF script is
>> responsible
>>>> for handling systemd class resources.
>>>>
>>>> Also crm command doesn't exist in RHEL/Fedora, i've seen the crm
>>>> command
>>>> only in SUSE.
>>>>
>>>>
>>>>
>>>> On 19/02/2020 19:23, Strahil Nikolov wrote:
>>>>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick <mvrk at sapo.pt>
>>>> wrote:
>>>>>> How is it possible that pacemaker is reporting that takes 4.2
>>>> minutes
>>>>>> (254930ms) to execute the start of httpd systemd unit?
>>>>>>
>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] (log_execute)
>>>>>> info:
>>>>>> executing - rsc:apache action:start call_id:25
>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514]
>> (systemd_unit_exec)
>>>>>>
>>>>>> debug: Performing asynchronous start op on systemd unit httpd
>> named
>>>>>> 'apache'
>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514]
>>>>>> (systemd_unit_exec_with_unit) debug: Calling StartUnit for
>>>> apache:
>>>>>> /org/freedesktop/systemd1/unit/httpd_2eservice
>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] (action_complete)
>>>>
>>>>>> notice: Giving up on apache start (rc=0): timeout
>> (elapsed=254930ms,
>>>>>> remaining=-154930ms)
>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] (log_finished)
>>
>>>>>> debug: finished - rsc:apache action:monitor call_id:25
>>>> exit-code:198
>>>>>> exec-time:254935ms queue-time:235ms
>>>>>>
>>>>>>
>>>>>> Starting manually works fine and fast:
>>>>>>
>>>>>> # time systemctl start httpd
>>>>>> real 0m0.144s
>>>>>> user 0m0.005s
>>>>>> sys 0m0.008s
>>>>>>
>>>>>>
>>>>>> On 17/02/2020 22:47, Mvrk wrote:
>>>>>>> In attachment the pacemaker.log. On the log i can see that the
>>>>>> cluster
>>>>>>> tries to start, the start fails, then tries to stop, and the stop
>>>>>> also
>>>>>>> fails also.
>>>>>>>
>>>>>>> One more thing, my cluster was working fine on Fedora 28, i
>> started
>>>>>>> having this problem after upgrade to Fedora 31.
>>>>>>>
>>>>>>> On 17/02/2020 21:30, Ricardo Esteves wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Yes, i also don't understand why is trying to stop them first.
>>>>>>>>
>>>>>>>> SELinux is disabled:
>>>>>>>>
>>>>>>>> # getenforce
>>>>>>>> Disabled
>>>>>>>>
>>>>>>>> All systemd services controlled by the cluster are disabled from
>>>>>>>> starting at boot:
>>>>>>>>
>>>>>>>> # systemctl is-enabled httpd
>>>>>>>> disabled
>>>>>>>>
>>>>>>>> # systemctl is-enabled openvpn-server at 01-server
>>>>>>>> disabled
>>>>>>>>
>>>>>>>>
>>>>>>>> On 17/02/2020 20:28, Ken Gaillot wrote:
>>>>>>>>> On Mon, 2020-02-17 at 17:35 +0000, Maverick wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> When i start my cluster, most of my systemd resources won't
>>>> start:
>>>>>>>>>> Failed Resource Actions:
>>>>>>>>>> * apache_stop_0 on boss1 'OCF_TIMEOUT' (198): call=82,
>>>>>>>>>> status='Timed Out', exitreason='', last-rc-change='1970-01-01
>>>>>>>>>> 01:00:54 +01:00', queued=29ms, exec=197799ms
>>>>>>>>>> * openvpn_stop_0 on boss1 'OCF_TIMEOUT' (198): call=61,
>>>>>>>>>> status='Timed Out', exitreason='', last-rc-change='1970-01-01
>>>>>>>>>> 01:00:54 +01:00', queued=1805ms, exec=198841ms
>>>>>>>>> These show that attempts to stop failed, rather than start.
>>>>>>>>>
>>>>>>>>>> So everytime i reboot my node, i need to start the resources
>>>>>> manually
>>>>>>>>>> using systemd, for example:
>>>>>>>>>>
>>>>>>>>>> systemd start apache
>>>>>>>>>>
>>>>>>>>>> and then pcs resource cleanup
>>>>>>>>>>
>>>>>>>>>> Resources configuration:
>>>>>>>>>>
>>>>>>>>>> Clone: apache-clone
>>>>>>>>>> Meta Attrs: maintenance=false
>>>>>>>>>> Resource: apache (class=systemd type=httpd)
>>>>>>>>>> Meta Attrs: maintenance=false
>>>>>>>>>> Operations: monitor interval=60 timeout=100
>> (apache-monitor-
>>>>>>>>>> interval-60)
>>>>>>>>>> start interval=0s timeout=100
>>>>>> (apache-start-interval-
>>>>>>>>>> 0s)
>>>>>>>>>> stop interval=0s timeout=100
>>>>>> (apache-stop-interval-0s)
>>>>>>>>>> Resource: openvpn (class=systemd
>> type=openvpn-server at 01-server)
>>>>>>>>>> Meta Attrs: maintenance=false
>>>>>>>>>> Operations: monitor interval=60 timeout=100
>> (openvpn-monitor-
>>>>>>>>>> interval-60)
>>>>>>>>>> start interval=0s timeout=100
>>>>>> (openvpn-start-interval-
>>>>>>>>>> 0s)
>>>>>>>>>> stop interval=0s timeout=100
>>>>>> (openvpn-stop-interval-
>>>>>>>>>> 0s)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Btw, if i try a debug-start / debug-stop the mentioned
>> resources
>>>>>>>>>> start and stop ok.
>>>>>>>>> Based on that, my first guess would be SELinux. Check the
>> SELinux
>>>>>> logs
>>>>>>>>> for denials.
>>>>>>>>>
>>>>>>>>> Also, make sure your systemd services are not enabled in
>> systemd
>>>>>> itself
>>>>>>>>> (e.g. via systemctl enable). Clustered systemd services should
>> be
>>>>>>>>> managed by the cluster only.
>>>>>> _______________________________________________
>>>>>> Manage your subscription:
>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>
>>>>>> ClusterLabs home: https://www.clusterlabs.org/
>>>>> You really need to debug the start & stop of tthe resource .
>>>>>
>>>>> Please try the debug procedure and provide the output:
>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>>>>
>>>>> Best Regards,
>>>>> Strahil Nikolov
>>> Hi Maverick,
>>>
>>>
>>> you can replace 'crm resource stop' with 'pcs resource disable'.
>>> The rest is working, but sadly not for systemd.
>>>
>>> You can try to:
>>> 'pcs resource debug-start <resource> --full'
>>> Another approach is to:
>>> 1. Copy service to /etc/systemd/system
>>> 2. In '[service]' section add this:
>>> Environment=SYSTEMD_LOG_LEVEL=debug
>>> 3. Reload systemd:
>>> systemctl daemon_reload
>>> Note: I assume you got downtime for debugging the issue
>>> 4. Use 'debug-start --full'
>>>
>>> Note: Don't forget to remove the debug, or your journal will get
>> full.
>>> Best Regards,
>>> Strahil Nikolov
> Hi Maverick,
>
>
> According this thread:
> https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html
>
> You have 'startup-fencing' is set to false.
>
> Check it out - maybe this is your reason.
>
> Best Regards,
> Strahil Nikolov
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/