[ClusterLabs] Fedora 31 - systemd based resources don't start

Thu Feb 20 16:29:45 EST 2020

On February 20, 2020 10:29:54 PM GMT+02:00, Maverick <mvrk at sapo.pt> wrote:
>
>> Hi Maverick,
>>
>>
>> According this thread:
>>
>https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html
>>
>> You have 'startup-fencing' is set  to false.
>>
>> Check it out - maybe this is your reason.
>>
>> Best Regards,
>> Strahil Nikolov
>
>Yes, i have stonith disabled, because as soon as the resources startup
>fail on boot, node was rebooted.
>
>
>Anyway, i was checking the pacemaker logs and the journal log, and i
>see
>that the service actually starts ok but for some reason pacemaker
>thinks
>it has timeout and then because of that tries to stop and also thinks
>it
>has timeout but actually stops it:
>
>pacemaker.log:
>
>Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (log_execute)  info:
>executing - rsc:apache action:start call_id:25
>Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (systemd_unit_exec)   
>debug: Performing asynchronous start op on systemd unit httpd named
>'apache'
>Feb 20 19:39:52 boss1 pacemaker-execd     [1499]
>(systemd_unit_exec_with_unit)  debug: Calling StartUnit for apache:
>/org/freedesktop/systemd1/unit/httpd_2eservice
>Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (action_complete)     
>notice: Giving up on apache start (rc=0): timeout (elapsed=248199ms,
>remaining=-148199ms)
>Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (log_finished)        
>debug: finished - rsc:apache action:monitor call_id:25  exit-code:198
>exec-time:248205ms queue-time:216ms
>
>Feb 20 19:40:00 boss1 pacemaker-execd     [1499] (log_execute)  info:
>executing - rsc:apache action:stop call_id:81
>Feb 20 19:40:00 boss1 pacemaker-execd     [1499] (systemd_unit_exec)   
>debug: Performing asynchronous stop op on systemd unit httpd named
>'apache'
>Feb 20 19:40:00 boss1 pacemaker-execd     [1499]
>(systemd_unit_exec_with_unit)  debug: Calling StopUnit for apache:
>/org/freedesktop/systemd1/unit/httpd_2eservice
>Feb 20 19:40:01 boss1 pacemaker-execd     [1499] (action_complete)     
>notice: Giving up on apache stop (rc=0): timeout (elapsed=304539ms,
>remaining=-204539ms)
>Feb 20 19:40:01 boss1 pacemaker-execd     [1499] (log_finished)        
>debug: finished - rsc:apache action:monitor call_id:81  exit-code:198
>exec-time:304545ms queue-time:240ms
>
>
>system journal:
>
>Feb 20 19:39:52 boss1 systemd[1]: Starting Cluster Controlled httpd...
>Feb 20 19:39:53 boss1 systemd[1]: Started Cluster Controlled httpd.
>Feb 20 19:39:53 boss1 httpd[2145]: Server configured, listening on:
>port
>443, port 80
>
>Feb 20 19:40:01 boss1 systemd[1]: Stopping The Apache HTTP Server...
>Feb 20 19:40:02 boss1 systemd[1]: httpd.service: Succeeded.
>Feb 20 19:40:02 boss1 systemd[1]: Stopped The Apache HTTP Server.
>
>
>
>
>On 20/02/2020 21:02, Strahil Nikolov wrote:
>> On February 20, 2020 9:35:07 PM GMT+02:00, Maverick <mvrk at sapo.pt>
>wrote:
>>> Manually it starts ok, no problems:
>>>
>>> pcs resource debug-start apache --full
>>> (unpack_config)     warning: Blind faith: not fencing unseen nodes
>>> Operation start for apache (systemd::httpd) returned: 'ok' (0)
>>>
>>>
>>> On 20/02/2020 16:46, Strahil Nikolov wrote:
>>>> On February 20, 2020 12:49:43 PM GMT+02:00, Maverick <mvrk at sapo.pt>
>>> wrote:
>>>>>> You really need to debug the start & stop of  tthe resource .
>>>>>>
>>>>>> Please try the debug procedure  and provide the output:
>>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>>>>>
>>>>>> Best Regards,
>>>>>> Strahil Nikolov
>>>>> Hi,
>>>>>
>>>>> Correct me if i'm wrong, but i think that procedure doesn't work
>for
>>>>> systemd class resources, i don't know which OCF script is
>>> responsible
>>>>> for handling systemd class resources.
>>>>>
>>>>> Also crm command doesn't exist in RHEL/Fedora, i've seen the crm
>>>>> command
>>>>> only in SUSE.
>>>>>
>>>>>
>>>>>
>>>>> On 19/02/2020 19:23, Strahil Nikolov wrote:
>>>>>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick
><mvrk at sapo.pt>
>>>>> wrote:
>>>>>>> How is it possible that pacemaker is reporting that takes 4.2
>>>>> minutes
>>>>>>> (254930ms) to execute the start of httpd systemd unit?
>>>>>>>
>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514] (log_execute)
>   
>>>>>>> info:
>>>>>>> executing - rsc:apache action:start call_id:25
>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514]
>>> (systemd_unit_exec)
>>>>>>>    
>>>>>>> debug: Performing asynchronous start op on systemd unit httpd
>>> named
>>>>>>> 'apache'
>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514]
>>>>>>> (systemd_unit_exec_with_unit)     debug: Calling StartUnit for
>>>>> apache:
>>>>>>> /org/freedesktop/systemd1/unit/httpd_2eservice
>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd     [1514]
>(action_complete)
>>>>>    
>>>>>>> notice: Giving up on apache start (rc=0): timeout
>>> (elapsed=254930ms,
>>>>>>> remaining=-154930ms)
>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd     [1514] (log_finished)
>>>    
>>>>>>> debug: finished - rsc:apache action:monitor call_id:25 
>>>>> exit-code:198
>>>>>>> exec-time:254935ms queue-time:235ms
>>>>>>>
>>>>>>>
>>>>>>> Starting manually works fine and fast:
>>>>>>>
>>>>>>> # time systemctl start httpd
>>>>>>> real    0m0.144s
>>>>>>> user    0m0.005s
>>>>>>> sys    0m0.008s
>>>>>>>
>>>>>>>
>>>>>>> On 17/02/2020 22:47, Mvrk wrote:
>>>>>>>> In attachment the pacemaker.log. On the log i can see that the
>>>>>>> cluster
>>>>>>>> tries to start, the start fails, then tries to stop, and the
>stop
>>>>>>> also
>>>>>>>> fails also.
>>>>>>>>
>>>>>>>> One more thing, my cluster was working fine on Fedora 28, i
>>> started
>>>>>>>> having this problem after upgrade to Fedora 31.
>>>>>>>>
>>>>>>>> On 17/02/2020 21:30, Ricardo Esteves wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Yes, i also don't understand why is trying to stop them first.
>>>>>>>>>
>>>>>>>>> SELinux is disabled:
>>>>>>>>>
>>>>>>>>> # getenforce
>>>>>>>>> Disabled
>>>>>>>>>
>>>>>>>>> All systemd services controlled by the cluster are disabled
>from
>>>>>>>>> starting at boot:
>>>>>>>>>
>>>>>>>>> # systemctl is-enabled httpd
>>>>>>>>> disabled
>>>>>>>>>
>>>>>>>>> # systemctl is-enabled openvpn-server at 01-server
>>>>>>>>> disabled
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 17/02/2020 20:28, Ken Gaillot wrote:
>>>>>>>>>> On Mon, 2020-02-17 at 17:35 +0000, Maverick wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> When i start my cluster, most of my systemd resources won't
>>>>> start:
>>>>>>>>>>> Failed Resource Actions:
>>>>>>>>>>>   * apache_stop_0 on boss1 'OCF_TIMEOUT' (198): call=82,
>>>>>>>>>>> status='Timed Out', exitreason='',
>last-rc-change='1970-01-01
>>>>>>>>>>> 01:00:54 +01:00', queued=29ms, exec=197799ms
>>>>>>>>>>>   * openvpn_stop_0 on boss1 'OCF_TIMEOUT' (198): call=61,
>>>>>>>>>>> status='Timed Out', exitreason='',
>last-rc-change='1970-01-01
>>>>>>>>>>> 01:00:54 +01:00', queued=1805ms, exec=198841ms
>>>>>>>>>> These show that attempts to stop failed, rather than start.
>>>>>>>>>>
>>>>>>>>>>> So everytime i reboot my node, i need to start the resources
>>>>>>> manually
>>>>>>>>>>> using systemd, for example:
>>>>>>>>>>>
>>>>>>>>>>> systemd start apache
>>>>>>>>>>>
>>>>>>>>>>> and then pcs resource cleanup
>>>>>>>>>>>
>>>>>>>>>>> Resources configuration:
>>>>>>>>>>>
>>>>>>>>>>> Clone: apache-clone
>>>>>>>>>>>   Meta Attrs: maintenance=false
>>>>>>>>>>>   Resource: apache (class=systemd type=httpd)
>>>>>>>>>>>    Meta Attrs: maintenance=false
>>>>>>>>>>>    Operations: monitor interval=60 timeout=100
>>> (apache-monitor-
>>>>>>>>>>> interval-60)
>>>>>>>>>>>                start interval=0s timeout=100
>>>>>>> (apache-start-interval-
>>>>>>>>>>> 0s)
>>>>>>>>>>>                stop interval=0s timeout=100
>>>>>>> (apache-stop-interval-0s)
>>>>>>>>>>> Resource: openvpn (class=systemd
>>> type=openvpn-server at 01-server)
>>>>>>>>>>>    Meta Attrs: maintenance=false
>>>>>>>>>>>    Operations: monitor interval=60 timeout=100
>>> (openvpn-monitor-
>>>>>>>>>>> interval-60)
>>>>>>>>>>>                start interval=0s timeout=100
>>>>>>> (openvpn-start-interval-
>>>>>>>>>>> 0s)
>>>>>>>>>>>                stop interval=0s timeout=100
>>>>>>> (openvpn-stop-interval-
>>>>>>>>>>> 0s)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Btw, if i try a debug-start / debug-stop the mentioned
>>> resources
>>>>>>>>>>> start and stop ok.
>>>>>>>>>> Based on that, my first guess would be SELinux. Check the
>>> SELinux
>>>>>>> logs
>>>>>>>>>> for denials.
>>>>>>>>>>
>>>>>>>>>> Also, make sure your systemd services are not enabled in
>>> systemd
>>>>>>> itself
>>>>>>>>>> (e.g. via systemctl enable). Clustered systemd services
>should
>>> be
>>>>>>>>>> managed by the cluster only.
>>>>>>> _______________________________________________
>>>>>>> Manage your subscription:
>>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>>
>>>>>>> ClusterLabs home: https://www.clusterlabs.org/
>>>>>> You really need to debug the start & stop of  tthe resource .
>>>>>>
>>>>>> Please try the debug procedure  and provide the output:
>>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>>>>>
>>>>>> Best Regards,
>>>>>> Strahil Nikolov
>>>> Hi Maverick,
>>>>
>>>>
>>>> you can replace 'crm resource stop' with 'pcs  resource disable'.
>>>> The rest is working, but sadly not for systemd.
>>>>
>>>> You can try to:
>>>> 'pcs resource debug-start <resource> --full'
>>>> Another approach is to:
>>>> 1. Copy service  to /etc/systemd/system
>>>> 2. In '[service]' section add this:
>>>> Environment=SYSTEMD_LOG_LEVEL=debug
>>>> 3. Reload  systemd:
>>>> systemctl daemon_reload
>>>> Note: I assume you got downtime for debugging the issue
>>>> 4. Use  'debug-start --full'
>>>>
>>>> Note: Don't forget to remove the debug, or your journal will get
>>> full.
>>>> Best Regards,
>>>> Strahil Nikolov
>> Hi Maverick,
>>
>>
>> According this thread:
>>
>https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html
>>
>> You have 'startup-fencing' is set  to false.
>>
>> Check it out - maybe this is your reason.
>>
>> Best Regards,
>> Strahil Nikolov

Hi Maverick,

Can you share your systemd service ?
What distribution are you using and what is the reason for using systemd instead of the ocf resource for apache ?

Could you enable the DEBUG for the systemd service ?

Best Regards,
Strahil Nikolov