[ClusterLabs] Fedora 31 - systemd based resources don't start

Sat Feb 22 10:26:06 EST 2020

Hi,

As I don't have much time to dig into this pacemaker vs systemd problem,
I decided to stop using systemd resources.

I replaced the apache resource with ocf::heartbeat:apache and the openvpn
resource with ocf::heartbeat:anything. For the other resources, which
need more elaborate start/stop handling, I created /etc/init.d/ scripts
and used the lsb resource type.

Everything is working perfectly now.
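
A rough sketch of what such a replacement could look like with pcs. The resource names, file paths, the reduced openvpn command line and the init script name zabbix-server are assumptions for illustration; they are not taken from the actual cluster configuration. The function name and the PCS dry-run variable are hypothetical helpers.

```shell
#!/bin/sh
# Sketch only: names, paths and the init script name below are assumptions.
# Set PCS=echo to print the commands instead of running them.
PCS="${PCS:-pcs}"

create_replacement_resources() {
    # Apache through the OCF agent, cloned like the original systemd resource.
    $PCS resource create apache ocf:heartbeat:apache \
        configfile=/etc/httpd/conf/httpd.conf \
        op monitor interval=60s timeout=100s \
        clone || return 1

    # OpenVPN through the generic "anything" agent. The original unit passes
    # more options (status file, ciphers); only --config is shown here.
    $PCS resource create openvpn ocf:heartbeat:anything \
        binfile=/usr/sbin/openvpn \
        cmdline_options="--config /etc/openvpn/server/01-server.conf" \
        op monitor interval=60s timeout=100s || return 1

    # Zabbix through a hand-written LSB script, assumed to live at
    # /etc/init.d/zabbix-server.
    $PCS resource create zabbix lsb:zabbix-server \
        op monitor interval=60s timeout=100s
}
```

Calling create_replacement_resources with PCS=echo prints the three commands for review before anything is changed on the cluster.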

On 20/02/2020 23:10, Maverick wrote:
> Hi,
>
> I'm using Fedora 31 (x86_64).
>
> For apache I can use the ocf agent, sure, but I have other resources for
> which no ocf agent exists, so for them I need to use systemd.
>
> All ocf and lsb type resources start OK on boot; only the systemd
> resources have this problem.
>
> I already enabled debug for the httpd and openvpn-server systemd units,
> but I don't see any debug output in /var/log/messages or the journal for
> either of these units.
>
> Here are some of the systemd units:
>
> Apache:
>
> [Unit]
> Description=The Apache HTTP Server
> Wants=httpd-init.service
> After=network.target remote-fs.target nss-lookup.target httpd-init.service
> Documentation=man:httpd.service(8)
>
> [Service]
> Type=notify
> Environment=LANG=C
> Environment=SYSTEMD_LOG_LEVEL=debug
>
> ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND
> ExecReload=/usr/sbin/httpd $OPTIONS -k graceful
> # Send SIGWINCH for graceful stop
> KillSignal=SIGWINCH
> KillMode=mixed
> PrivateTmp=true
>
> [Install]
> WantedBy=multi-user.target
>
> -----------------
>
> OpenVPN:
>
> [Unit]
> Description=OpenVPN service for %I
> After=syslog.target network-online.target
> Wants=network-online.target
> Documentation=man:openvpn(8)
> Documentation=https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
> Documentation=https://community.openvpn.net/openvpn/wiki/HOWTO
>
> [Service]
> Type=notify
> PrivateTmp=true
> WorkingDirectory=/etc/openvpn/server
> Environment=SYSTEMD_LOG_LEVEL=debug
> ExecStart=/usr/sbin/openvpn --status %t/openvpn-server/status-%i.log --status-version 2 --suppress-timestamps --cipher AES-256-GCM --ncp-ciphers AES-256-GCM:AES-128-GCM:AES-256-CBC:AES-128-CBC:BF-CBC --config %i.conf
> CapabilityBoundingSet=CAP_IPC_LOCK CAP_NET_ADMIN CAP_NET_BIND_SERVICE CAP_NET_RAW CAP_SETGID CAP_SETUID CAP_SYS_CHROOT CAP_DAC_OVERRIDE CAP_AUDIT_WRITE
> LimitNPROC=10
> DeviceAllow=/dev/null rw
> DeviceAllow=/dev/net/tun rw
> ProtectSystem=true
> ProtectHome=true
> KillMode=process
> RestartSec=5s
> Restart=on-failure
>
> [Install]
> WantedBy=multi-user.target
>
> ---------------------------------
>
> Zabbix Server:
>
> [Unit]
> Description=Zabbix Server with Oracle DB
> After=syslog.target network.target
>
> [Service]
> Type=simple
> Environment="LD_LIBRARY_PATH=/opt/oracle/lib"
> ExecStart=/usr/sbin/zabbix_server -f
> User=zabbixsrv
>
> [Install]
> WantedBy=multi-user.target
>
>
>
> On 20/02/2020 22:29, Strahil Nikolov wrote:
>> On February 20, 2020 10:29:54 PM GMT+02:00, Maverick <mvrk at sapo.pt> wrote:
>>>> Hi Maverick,
>>>>
>>>>
>>>> According to this thread:
>>>>
>>>> https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html
>>>> you have 'startup-fencing' set to false.
>>>>
>>>> Check it out - maybe this is your reason.
>>>>
>>>> Best Regards,
>>>> Strahil Nikolov
>>> Yes, I have stonith disabled, because as soon as the resources failed
>>> to start on boot, the node was rebooted.
>>>
>>>
>>> Anyway, I was checking the pacemaker logs and the journal, and I see
>>> that the service actually starts OK, but for some reason pacemaker
>>> thinks the start has timed out. Because of that it tries to stop the
>>> resource, and it thinks the stop has timed out too, although the
>>> service actually stops:
>>>
>>> pacemaker.log:
>>>
>>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (log_execute)  info: executing - rsc:apache action:start call_id:25
>>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (systemd_unit_exec)  debug: Performing asynchronous start op on systemd unit httpd named 'apache'
>>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (systemd_unit_exec_with_unit)  debug: Calling StartUnit for apache: /org/freedesktop/systemd1/unit/httpd_2eservice
>>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (action_complete)  notice: Giving up on apache start (rc=0): timeout (elapsed=248199ms, remaining=-148199ms)
>>> Feb 20 19:39:52 boss1 pacemaker-execd     [1499] (log_finished)  debug: finished - rsc:apache action:monitor call_id:25  exit-code:198 exec-time:248205ms queue-time:216ms
>>>
>>> Feb 20 19:40:00 boss1 pacemaker-execd     [1499] (log_execute)  info: executing - rsc:apache action:stop call_id:81
>>> Feb 20 19:40:00 boss1 pacemaker-execd     [1499] (systemd_unit_exec)  debug: Performing asynchronous stop op on systemd unit httpd named 'apache'
>>> Feb 20 19:40:00 boss1 pacemaker-execd     [1499] (systemd_unit_exec_with_unit)  debug: Calling StopUnit for apache: /org/freedesktop/systemd1/unit/httpd_2eservice
>>> Feb 20 19:40:01 boss1 pacemaker-execd     [1499] (action_complete)  notice: Giving up on apache stop (rc=0): timeout (elapsed=304539ms, remaining=-204539ms)
>>> Feb 20 19:40:01 boss1 pacemaker-execd     [1499] (log_finished)  debug: finished - rsc:apache action:monitor call_id:81  exit-code:198 exec-time:304545ms queue-time:240ms
>>>
>>>
>>> system journal:
>>>
>>> Feb 20 19:39:52 boss1 systemd[1]: Starting Cluster Controlled httpd...
>>> Feb 20 19:39:53 boss1 systemd[1]: Started Cluster Controlled httpd.
>>> Feb 20 19:39:53 boss1 httpd[2145]: Server configured, listening on: port 443, port 80
>>>
>>> Feb 20 19:40:01 boss1 systemd[1]: Stopping The Apache HTTP Server...
>>> Feb 20 19:40:02 boss1 systemd[1]: httpd.service: Succeeded.
>>> Feb 20 19:40:02 boss1 systemd[1]: Stopped The Apache HTTP Server.
>>>
>>>
>>>
>>>
>>> On 20/02/2020 21:02, Strahil Nikolov wrote:
>>>> On February 20, 2020 9:35:07 PM GMT+02:00, Maverick <mvrk at sapo.pt> wrote:
>>>>> Manually it starts ok, no problems:
>>>>>
>>>>> pcs resource debug-start apache --full
>>>>> (unpack_config)     warning: Blind faith: not fencing unseen nodes
>>>>> Operation start for apache (systemd::httpd) returned: 'ok' (0)
>>>>>
>>>>>
>>>>> On 20/02/2020 16:46, Strahil Nikolov wrote:
>>>>>> On February 20, 2020 12:49:43 PM GMT+02:00, Maverick <mvrk at sapo.pt> wrote:
>>>>>>>> You really need to debug the start & stop of the resource.
>>>>>>>>
>>>>>>>> Please try the debug procedure and provide the output:
>>>>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Strahil Nikolov
>>>>>>> Hi,
>>>>>>>
>>>>>>> Correct me if I'm wrong, but I think that procedure doesn't work for
>>>>>>> systemd class resources. I don't know which OCF script is
>>>>>>> responsible for handling systemd class resources.
>>>>>>>
>>>>>>> Also, the crm command doesn't exist in RHEL/Fedora; I've only seen
>>>>>>> it in SUSE.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 19/02/2020 19:23, Strahil Nikolov wrote:
>>>>>>>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick <mvrk at sapo.pt> wrote:
>>>>>>>>> How is it possible that pacemaker reports that it takes about 4.2
>>>>>>>>> minutes (254930ms) to start the httpd systemd unit?
>>>>>>>>>
>>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514] (log_execute)  info: executing - rsc:apache action:start call_id:25
>>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514] (systemd_unit_exec)  debug: Performing asynchronous start op on systemd unit httpd named 'apache'
>>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd     [1514] (systemd_unit_exec_with_unit)  debug: Calling StartUnit for apache: /org/freedesktop/systemd1/unit/httpd_2eservice
>>>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd     [1514] (action_complete)  notice: Giving up on apache start (rc=0): timeout (elapsed=254930ms, remaining=-154930ms)
>>>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd     [1514] (log_finished)  debug: finished - rsc:apache action:monitor call_id:25  exit-code:198 exec-time:254935ms queue-time:235ms
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Starting manually works fine and fast:
>>>>>>>>>
>>>>>>>>> # time systemctl start httpd
>>>>>>>>> real    0m0.144s
>>>>>>>>> user    0m0.005s
>>>>>>>>> sys    0m0.008s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 17/02/2020 22:47, Mvrk wrote:
>>>>>>>>>> The pacemaker.log is attached. In the log I can see that the
>>>>>>>>>> cluster tries to start the resource, the start fails, then it
>>>>>>>>>> tries to stop it, and the stop also fails.
>>>>>>>>>>
>>>>>>>>>> One more thing: my cluster was working fine on Fedora 28; I
>>>>>>>>>> started having this problem after upgrading to Fedora 31.
>>>>>>>>>>
>>>>>>>>>> On 17/02/2020 21:30, Ricardo Esteves wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Yes, I also don't understand why it is trying to stop them first.
>>>>>>>>>>>
>>>>>>>>>>> SELinux is disabled:
>>>>>>>>>>>
>>>>>>>>>>> # getenforce
>>>>>>>>>>> Disabled
>>>>>>>>>>>
>>>>>>>>>>> All systemd services controlled by the cluster are disabled from
>>>>>>>>>>> starting at boot:
>>>>>>>>>>>
>>>>>>>>>>> # systemctl is-enabled httpd
>>>>>>>>>>> disabled
>>>>>>>>>>>
>>>>>>>>>>> # systemctl is-enabled openvpn-server@01-server
>>>>>>>>>>> disabled
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 17/02/2020 20:28, Ken Gaillot wrote:
>>>>>>>>>>>> On Mon, 2020-02-17 at 17:35 +0000, Maverick wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I start my cluster, most of my systemd resources won't start:
>>>>>>>>>>>>> Failed Resource Actions:
>>>>>>>>>>>>>   * apache_stop_0 on boss1 'OCF_TIMEOUT' (198): call=82, status='Timed Out', exitreason='', last-rc-change='1970-01-01 01:00:54 +01:00', queued=29ms, exec=197799ms
>>>>>>>>>>>>>   * openvpn_stop_0 on boss1 'OCF_TIMEOUT' (198): call=61, status='Timed Out', exitreason='', last-rc-change='1970-01-01 01:00:54 +01:00', queued=1805ms, exec=198841ms
>>>>>>>>>>>> These show that attempts to stop failed, rather than start.
>>>>>>>>>>>>
>>>>>>>>>>>>> So every time I reboot my node, I need to start the resources
>>>>>>>>>>>>> manually using systemd, for example:
>>>>>>>>>>>>>
>>>>>>>>>>>>> systemctl start httpd
>>>>>>>>>>>>>
>>>>>>>>>>>>> and then pcs resource cleanup
>>>>>>>>>>>>>
>>>>>>>>>>>>> Resources configuration:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Clone: apache-clone
>>>>>>>>>>>>>   Meta Attrs: maintenance=false
>>>>>>>>>>>>>   Resource: apache (class=systemd type=httpd)
>>>>>>>>>>>>>    Meta Attrs: maintenance=false
>>>>>>>>>>>>>    Operations: monitor interval=60 timeout=100 (apache-monitor-interval-60)
>>>>>>>>>>>>>                start interval=0s timeout=100 (apache-start-interval-0s)
>>>>>>>>>>>>>                stop interval=0s timeout=100 (apache-stop-interval-0s)
>>>>>>>>>>>>> Resource: openvpn (class=systemd type=openvpn-server@01-server)
>>>>>>>>>>>>>    Meta Attrs: maintenance=false
>>>>>>>>>>>>>    Operations: monitor interval=60 timeout=100 (openvpn-monitor-interval-60)
>>>>>>>>>>>>>                start interval=0s timeout=100 (openvpn-start-interval-0s)
>>>>>>>>>>>>>                stop interval=0s timeout=100 (openvpn-stop-interval-0s)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Btw, if I try a debug-start / debug-stop, the mentioned resources
>>>>>>>>>>>>> start and stop OK.
>>>>>>>>>>>> Based on that, my first guess would be SELinux. Check the SELinux
>>>>>>>>>>>> logs for denials.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, make sure your systemd services are not enabled in systemd
>>>>>>>>>>>> itself (e.g. via systemctl enable). Clustered systemd services
>>>>>>>>>>>> should be managed by the cluster only.
>>>>>> Hi Maverick,
>>>>>>
>>>>>>
>>>>>> You can replace 'crm resource stop' with 'pcs resource disable'.
>>>>>> The rest works, but sadly not for systemd.
>>>>>>
>>>>>> You can try:
>>>>>> 'pcs resource debug-start <resource> --full'
>>>>>> Another approach is to:
>>>>>> 1. Copy the service unit file to /etc/systemd/system
>>>>>> 2. In the '[Service]' section add this:
>>>>>> Environment=SYSTEMD_LOG_LEVEL=debug
>>>>>> 3. Reload systemd:
>>>>>> systemctl daemon-reload
>>>>>> Note: I assume you have downtime for debugging the issue
>>>>>> 4. Use 'debug-start --full'
>>>>>>
>>>>>> Note: Don't forget to remove the debug setting, or your journal will
>>>>>> fill up.
>>>>>> Best Regards,
>>>>>> Strahil Nikolov
>> Hi Maverick,
>>
>> Can you share your systemd service?
>> What distribution are you using, and what is the reason for using systemd instead of the ocf resource for apache?
>>
>> Could you enable DEBUG for the systemd service?
>>
>>
>> Best Regards,
>> Strahil Nikolov
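
The "Giving up on apache start/stop" lines quoted in this thread can be cross-checked with a little arithmetic. In each of them, elapsed plus remaining comes to 100000 ms, which matches the timeout=100 configured on the resource operations if that value is read as 100 seconds. The sketch below pulls both fields out of such a line and adds them. The helper name implied_timeout_ms is hypothetical, and the relationship remaining = timeout - elapsed is inferred from the quoted numbers, not taken from Pacemaker documentation.

```shell
#!/bin/sh
# Hypothetical helper: given a pacemaker-execd "Giving up on ..." message,
# print elapsed + remaining in milliseconds (the timeout implied by the log,
# assuming remaining = timeout - elapsed).
implied_timeout_ms() {
    elapsed=$(printf '%s\n' "$1" |
        sed -n 's/.*elapsed=\([0-9][0-9]*\)ms.*/\1/p')
    remaining=$(printf '%s\n' "$1" |
        sed -n 's/.*remaining=\(-\{0,1\}[0-9][0-9]*\)ms.*/\1/p')
    # Fail if either field is missing from the line.
    [ -n "$elapsed" ] && [ -n "$remaining" ] || return 1
    echo $(( $elapsed + $remaining ))
}

# Start attempt from the Feb 20 log quoted in this thread:
implied_timeout_ms 'Giving up on apache start (rc=0): timeout (elapsed=248199ms, remaining=-148199ms)'
# prints 100000
```

The journal excerpt quoted alongside that log shows systemd reporting the unit as started about one second after the start was requested, so the elapsed value of roughly 248 seconds does not match the wall-clock timestamps. That points at the elapsed figure as the thing to investigate; this sketch does not establish why it is so large.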