[ClusterLabs] Fedora 31 - systemd based resources don't start

Maverick mvrk at sapo.pt
Thu Feb 20 17:10:32 EST 2020


Hi,

I'm using Fedora 31 (x86_64).

For apache i can use the ocf agent sure, but i have other resources for
who don't exist an ocf agent, so for them i need to use systemd.

All ocf and lsb type resources start ok on boot, only systemd resources
have this problem.

I already enabled debug for httpd and openvpn-server systemd units, but
i don't see any debug on /var/log/messages or journal about any of these
units.

Here some of the systemd units:

Apache:

[Unit]
Description=The Apache HTTP Server
Wants=httpd-init.service
After=network.target remote-fs.target nss-lookup.target httpd-init.service
Documentation=man:httpd.service(8)

[Service]
Type=notify
Environment=LANG=C
Environment=SYSTEMD_LOG_LEVEL=debug

ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND
ExecReload=/usr/sbin/httpd $OPTIONS -k graceful
# Send SIGWINCH for graceful stop
KillSignal=SIGWINCH
KillMode=mixed
PrivateTmp=true

[Install]
WantedBy=multi-user.target

-----------------

OpenVPN:

[Unit]
Description=OpenVPN service for %I
After=syslog.target network-online.target
Wants=network-online.target
Documentation=man:openvpn(8)
Documentation=https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
Documentation=https://community.openvpn.net/openvpn/wiki/HOWTO

[Service]
Type=notify
PrivateTmp=true
WorkingDirectory=/etc/openvpn/server
Environment=SYSTEMD_LOG_LEVEL=debug
ExecStart=/usr/sbin/openvpn --status %t/openvpn-server/status-%i.log
--status-version 2 --suppress-timestamps --cipher AES-256-GCM
--ncp-ciphers AES-256-GCM:AES-128-GCM:AES-256-CBC:AES-128-CBC:BF-CBC
--config %i.conf
CapabilityBoundingSet=CAP_IPC_LOCK CAP_NET_ADMIN CAP_NET_BIND_SERVICE
CAP_NET_RAW CAP_SETGID CAP_SETUID CAP_SYS_CHROOT CAP_DAC_OVERRIDE
CAP_AUDIT_WRITE
LimitNPROC=10
DeviceAllow=/dev/null rw
DeviceAllow=/dev/net/tun rw
ProtectSystem=true
ProtectHome=true
KillMode=process
RestartSec=5s
Restart=on-failure

[Install]
WantedBy=multi-user.target

---------------------------------

Zabbix Server:

[Unit]
Description=Zabbix Server with Oracle DB
After=syslog.target network.target

[Service]
Type=simple
Environment="LD_LIBRARY_PATH=/opt/oracle/lib"
ExecStart=/usr/sbin/zabbix_server -f
User=zabbixsrv

[Install]
WantedBy=multi-user.target



On 20/02/2020 22:29, Strahil Nikolov wrote:
> On February 20, 2020 10:29:54 PM GMT+02:00, Maverick <mvrk at sapo.pt> wrote:
>>> Hi Maverick,
>>>
>>>
>>> According this thread:
>>>
>> https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html
>>> You have 'startup-fencing' is set  to false.
>>>
>>> Check it out - maybe this is your reason.
>>>
>>> Best Regards,
>>> Strahil Nikolov
>> Yes, i have stonith disabled, because as soon as the resources startup
>> fail on boot, node was rebooted.
>>
>>
>> Anyway, i was checking the pacemaker logs and the journal log, and i
>> see
>> that the service actually starts ok but for some reason pacemaker
>> thinks
>> it has timeout and then because of that tries to stop and also thinks
>> it
>> has timeout but actually stops it:
>>
>> pacemaker.log:
>>
>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (log_execute)  info:
>> executing - rsc:apache action:start call_id:25
>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (systemd_unit_exec)   
>> debug: Performing asynchronous start op on systemd unit httpd named
>> 'apache'
>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499]
>> (systemd_unit_exec_with_unit)  debug: Calling StartUnit for apache:
>> /org/freedesktop/systemd1/unit/httpd_2eservice
>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (action_complete)     
>> notice: Giving up on apache start (rc=0): timeout (elapsed=248199ms,
>> remaining=-148199ms)
>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (log_finished)        
>> debug: finished - rsc:apache action:monitor call_id:25  exit-code:198
>> exec-time:248205ms queue-time:216ms
>>
>> Feb 20 19:40:00 boss1 pacemaker-execd     [1499] (log_execute)  info:
>> executing - rsc:apache action:stop call_id:81
>> Feb 20 19:40:00 boss1 pacemaker-execd     [1499] (systemd_unit_exec)   
>> debug: Performing asynchronous stop op on systemd unit httpd named
>> 'apache'
>> Feb 20 19:40:00 boss1 pacemaker-execd     [1499]
>> (systemd_unit_exec_with_unit)  debug: Calling StopUnit for apache:
>> /org/freedesktop/systemd1/unit/httpd_2eservice
>> Feb 20 19:40:01 boss1 pacemaker-execd     [1499] (action_complete)     
>> notice: Giving up on apache stop (rc=0): timeout (elapsed=304539ms,
>> remaining=-204539ms)
>> Feb 20 19:40:01 boss1 pacemaker-execd     [1499] (log_finished)        
>> debug: finished - rsc:apache action:monitor call_id:81  exit-code:198
>> exec-time:304545ms queue-time:240ms
>>
>>
>> system journal:
>>
>> Feb 20 19:39:52 boss1 systemd[1]: Starting Cluster Controlled httpd...
>> Feb 20 19:39:53 boss1 systemd[1]: Started Cluster Controlled httpd.
>> Feb 20 19:39:53 boss1 httpd[2145]: Server configured, listening on:
>> port
>> 443, port 80
>>
>> Feb 20 19:40:01 boss1 systemd[1]: Stopping The Apache HTTP Server...
>> Feb 20 19:40:02 boss1 systemd[1]: httpd.service: Succeeded.
>> Feb 20 19:40:02 boss1 systemd[1]: Stopped The Apache HTTP Server.
>>
>>
>>
>>
>> On 20/02/2020 21:02, Strahil Nikolov wrote:
>>> On February 20, 2020 9:35:07 PM GMT+02:00, Maverick <mvrk at sapo.pt>
>> wrote:
>>>> Manually it starts ok, no problems:
>>>>
>>>> pcs resource debug-start apache --full
>>>> (unpack_config)     warning: Blind faith: not fencing unseen nodes
>>>> Operation start for apache (systemd::httpd) returned: 'ok' (0)
>>>>
>>>>
>>>> On 20/02/2020 16:46, Strahil Nikolov wrote:
>>>>> On February 20, 2020 12:49:43 PM GMT+02:00, Maverick <mvrk at sapo.pt>
>>>> wrote:
>>>>>>> You really need to debug the start & stop of  tthe resource .
>>>>>>>
>>>>>>> Please try the debug procedure  and provide the output:
>>>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Strahil Nikolov
>>>>>> Hi,
>>>>>>
>>>>>> Correct me if i'm wrong, but i think that procedure doesn't work
>> for
>>>>>> systemd class resources, i don't know which OCF script is
>>>> responsible
>>>>>> for handling systemd class resources.
>>>>>>
>>>>>> Also crm command doesn't exist in RHEL/Fedora, i've seen the crm
>>>>>> command
>>>>>> only in SUSE.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19/02/2020 19:23, Strahil Nikolov wrote:
>>>>>>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick
>> <mvrk at sapo.pt>
>>>>>> wrote:
>>>>>>>> How is it possible that pacemaker is reporting that takes 4.2
>>>>>> minutes
>>>>>>>> (254930ms) to execute the start of httpd systemd unit?
>>>>>>>>
>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514] (log_execute)
>>    
>>>>>>>> info:
>>>>>>>> executing - rsc:apache action:start call_id:25
>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514]
>>>> (systemd_unit_exec)
>>>>>>>>    
>>>>>>>> debug: Performing asynchronous start op on systemd unit httpd
>>>> named
>>>>>>>> 'apache'
>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514]
>>>>>>>> (systemd_unit_exec_with_unit)     debug: Calling StartUnit for
>>>>>> apache:
>>>>>>>> /org/freedesktop/systemd1/unit/httpd_2eservice
>>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd     [1514]
>> (action_complete)
>>>>>>    
>>>>>>>> notice: Giving up on apache start (rc=0): timeout
>>>> (elapsed=254930ms,
>>>>>>>> remaining=-154930ms)
>>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd     [1514] (log_finished)
>>>>    
>>>>>>>> debug: finished - rsc:apache action:monitor call_id:25 
>>>>>> exit-code:198
>>>>>>>> exec-time:254935ms queue-time:235ms
>>>>>>>>
>>>>>>>>
>>>>>>>> Starting manually works fine and fast:
>>>>>>>>
>>>>>>>> # time systemctl start httpd
>>>>>>>> real    0m0.144s
>>>>>>>> user    0m0.005s
>>>>>>>> sys    0m0.008s
>>>>>>>>
>>>>>>>>
>>>>>>>> On 17/02/2020 22:47, Mvrk wrote:
>>>>>>>>> In attachment the pacemaker.log. On the log i can see that the
>>>>>>>> cluster
>>>>>>>>> tries to start, the start fails, then tries to stop, and the
>> stop
>>>>>>>> also
>>>>>>>>> fails also.
>>>>>>>>>
>>>>>>>>> One more thing, my cluster was working fine on Fedora 28, i
>>>> started
>>>>>>>>> having this problem after upgrade to Fedora 31.
>>>>>>>>>
>>>>>>>>> On 17/02/2020 21:30, Ricardo Esteves wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Yes, i also don't understand why is trying to stop them first.
>>>>>>>>>>
>>>>>>>>>> SELinux is disabled:
>>>>>>>>>>
>>>>>>>>>> # getenforce
>>>>>>>>>> Disabled
>>>>>>>>>>
>>>>>>>>>> All systemd services controlled by the cluster are disabled
>> from
>>>>>>>>>> starting at boot:
>>>>>>>>>>
>>>>>>>>>> # systemctl is-enabled httpd
>>>>>>>>>> disabled
>>>>>>>>>>
>>>>>>>>>> # systemctl is-enabled openvpn-server at 01-server
>>>>>>>>>> disabled
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 17/02/2020 20:28, Ken Gaillot wrote:
>>>>>>>>>>> On Mon, 2020-02-17 at 17:35 +0000, Maverick wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> When i start my cluster, most of my systemd resources won't
>>>>>> start:
>>>>>>>>>>>> Failed Resource Actions:
>>>>>>>>>>>>   * apache_stop_0 on boss1 'OCF_TIMEOUT' (198): call=82,
>>>>>>>>>>>> status='Timed Out', exitreason='',
>> last-rc-change='1970-01-01
>>>>>>>>>>>> 01:00:54 +01:00', queued=29ms, exec=197799ms
>>>>>>>>>>>>   * openvpn_stop_0 on boss1 'OCF_TIMEOUT' (198): call=61,
>>>>>>>>>>>> status='Timed Out', exitreason='',
>> last-rc-change='1970-01-01
>>>>>>>>>>>> 01:00:54 +01:00', queued=1805ms, exec=198841ms
>>>>>>>>>>> These show that attempts to stop failed, rather than start.
>>>>>>>>>>>
>>>>>>>>>>>> So everytime i reboot my node, i need to start the resources
>>>>>>>> manually
>>>>>>>>>>>> using systemd, for example:
>>>>>>>>>>>>
>>>>>>>>>>>> systemd start apache
>>>>>>>>>>>>
>>>>>>>>>>>> and then pcs resource cleanup
>>>>>>>>>>>>
>>>>>>>>>>>> Resources configuration:
>>>>>>>>>>>>
>>>>>>>>>>>> Clone: apache-clone
>>>>>>>>>>>>   Meta Attrs: maintenance=false
>>>>>>>>>>>>   Resource: apache (class=systemd type=httpd)
>>>>>>>>>>>>    Meta Attrs: maintenance=false
>>>>>>>>>>>>    Operations: monitor interval=60 timeout=100
>>>> (apache-monitor-
>>>>>>>>>>>> interval-60)
>>>>>>>>>>>>                start interval=0s timeout=100
>>>>>>>> (apache-start-interval-
>>>>>>>>>>>> 0s)
>>>>>>>>>>>>                stop interval=0s timeout=100
>>>>>>>> (apache-stop-interval-0s)
>>>>>>>>>>>> Resource: openvpn (class=systemd
>>>> type=openvpn-server at 01-server)
>>>>>>>>>>>>    Meta Attrs: maintenance=false
>>>>>>>>>>>>    Operations: monitor interval=60 timeout=100
>>>> (openvpn-monitor-
>>>>>>>>>>>> interval-60)
>>>>>>>>>>>>                start interval=0s timeout=100
>>>>>>>> (openvpn-start-interval-
>>>>>>>>>>>> 0s)
>>>>>>>>>>>>                stop interval=0s timeout=100
>>>>>>>> (openvpn-stop-interval-
>>>>>>>>>>>> 0s)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Btw, if i try a debug-start / debug-stop the mentioned
>>>> resources
>>>>>>>>>>>> start and stop ok.
>>>>>>>>>>> Based on that, my first guess would be SELinux. Check the
>>>> SELinux
>>>>>>>> logs
>>>>>>>>>>> for denials.
>>>>>>>>>>>
>>>>>>>>>>> Also, make sure your systemd services are not enabled in
>>>> systemd
>>>>>>>> itself
>>>>>>>>>>> (e.g. via systemctl enable). Clustered systemd services
>> should
>>>> be
>>>>>>>>>>> managed by the cluster only.
>>>>>>>> _______________________________________________
>>>>>>>> Manage your subscription:
>>>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>>>
>>>>>>>> ClusterLabs home: https://www.clusterlabs.org/
>>>>>>> You really need to debug the start & stop of  tthe resource .
>>>>>>>
>>>>>>> Please try the debug procedure  and provide the output:
>>>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Strahil Nikolov
>>>>> Hi Maverick,
>>>>>
>>>>>
>>>>> you can replace 'crm resource stop' with 'pcs  resource disable'.
>>>>> The rest is working, but sadly not for systemd.
>>>>>
>>>>> You can try to:
>>>>> 'pcs resource debug-start <resource> --full'
>>>>> Another approach is to:
>>>>> 1. Copy service  to /etc/systemd/system
>>>>> 2. In '[service]' section add this:
>>>>> Environment=SYSTEMD_LOG_LEVEL=debug
>>>>> 3. Reload  systemd:
>>>>> systemctl daemon_reload
>>>>> Note: I assume you got downtime for debugging the issue
>>>>> 4. Use  'debug-start --full'
>>>>>
>>>>> Note: Don't forget to remove the debug, or your journal will get
>>>> full.
>>>>> Best Regards,
>>>>> Strahil Nikolov
>>> Hi Maverick,
>>>
>>>
>>> According this thread:
>>>
>> https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html
>>> You have 'startup-fencing' is set  to false.
>>>
>>> Check it out - maybe this is your reason.
>>>
>>> Best Regards,
>>> Strahil Nikolov
> Hi Maverick,
>
> Can you share your systemd service ?
> What distribution are you using and what is the reason for using systemd instead of the ocf resource for apache ?
>
> Could you enable the DEBUG for the systemd service ?
>
>
> Best Regards,
> Strahil Nikolov



More information about the Users mailing list