[ClusterLabs] Node restart if service is disabled

Ken Gaillot <kgaillot@redhat.com>
Mon Dec 18 21:02:12 UTC 2017


On Mon, 2017-12-18 at 12:36 +0000, Enrico Bianchi wrote:
> Hi,
> 
> I have a cluster on CentOS 7 with this configuration:
> 
> # pcs property show
> Cluster Properties:
>   cluster-infrastructure: corosync
>   cluster-name: mail_cluster
>   dc-version: 1.1.15-11.el7_3.4-e174ec8
>   have-watchdog: false
>   last-lrm-refresh: 1493425399
>   stonith-enabled: true
>   symmetric-cluster: true
> # pcs cluster status
> Cluster Status:
>   Stack: corosync
>   Current DC: node001.host.local (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>   Last updated: Mon Dec 18 13:11:19 2017          Last change: Thu Dec 14 11:59:42 2017 by root via crm_resource on node005.host.local
>   5 nodes and 19 resources configured
> 
> PCSD Status:
>    node005.host.local: Online
>    node001.host.local: Online
>    node003.host.local: Online
>    node002.host.local: Online
>    node004.host.local: Online
> # pcs resource
>   mailstore1_fs  (ocf::heartbeat:Filesystem):    Started node001.host.local
>   mailstore2_fs  (ocf::heartbeat:Filesystem):    Started node002.host.local
>   mailstore3_fs  (ocf::heartbeat:Filesystem):    Started node003.host.local
>   mailstore4_fs  (ocf::heartbeat:Filesystem):    Started node004.host.local
>   logs_fs        (ocf::heartbeat:Filesystem):    Started node005.host.local
>   dsmail_fs      (ocf::heartbeat:Filesystem):    Started node005.host.local
>   dovecot1_vip   (ocf::heartbeat:IPaddr2):       Started node001.host.local
>   dovecot2_vip   (ocf::heartbeat:IPaddr2):       Started node002.host.local
>   dovecot3_vip   (ocf::heartbeat:IPaddr2):       Started node003.host.local
>   dovecot4_vip   (ocf::heartbeat:IPaddr2):       Started node004.host.local
>   dsmail_vip     (ocf::heartbeat:IPaddr2):       Started node005.host.local
>   dovecot1_svr   (systemd:dovecot@1):    Started node001.host.local
>   dovecot2_svr   (systemd:dovecot@2):    Started node002.host.local
>   dovecot3_svr   (systemd:dovecot@3):    Started node003.host.local
>   dovecot4_svr   (systemd:dovecot@4):    Started node004.host.local
>   haproxy_svr    (systemd:haproxy):      Started node005.host.local
>   service1_svr   (systemd:mwpec):        Started node005.host.local
>   service2_svr   (systemd:mwpecdaemons): Started node005.host.local
> # pcs stonith show
>   fence_vmware   (stonith:fence_vmware_soap):    Started node005.host.local
> # pcs constraint show
> Location Constraints:
>    Resource: dovecot1_vip
>      Enabled on: node001.host.local (score:50)
>    Resource: dovecot2_vip
>      Enabled on: node002.host.local (score:50)
>    Resource: dovecot3_vip
>      Enabled on: node003.host.local (score:50)
>    Resource: dovecot4_vip
>      Enabled on: node004.host.local (score:50)
>    Resource: dsmail_fs
>      Enabled on: node005.host.local (score:INFINITY) (role: Started)
>      Disabled on: node002.host.local (score:-INFINITY) (role: Started)
>    Resource: dsmail_vip
>      Enabled on: node005.host.local (score:50)
> Ordering Constraints:
>    start dovecot1_vip then start mailstore1_fs (kind:Mandatory)
>    start mailstore1_fs then start dovecot1_svr (kind:Mandatory)
>    start dovecot2_vip then start mailstore2_fs (kind:Mandatory)
>    start mailstore2_fs then start dovecot2_svr (kind:Mandatory)
>    start dovecot3_vip then start mailstore3_fs (kind:Mandatory)
>    start mailstore3_fs then start dovecot3_svr (kind:Mandatory)
>    start dovecot4_vip then start mailstore4_fs (kind:Mandatory)
>    start mailstore4_fs then start dovecot4_svr (kind:Mandatory)
>    start dsmail_vip then start dsmail_fs (kind:Mandatory)
>    start dsmail_vip then start haproxy_svr (kind:Mandatory)
>    start dsmail_fs then start logs_fs (kind:Mandatory)
>    start dsmail_vip then start service1_svr (kind:Mandatory)
>    start dsmail_vip then start service2_svr (kind:Mandatory)
> Colocation Constraints:
>    dovecot1_svr with dovecot1_vip (score:INFINITY)
>    mailstore1_fs with dovecot1_vip (score:INFINITY)
>    dovecot2_svr with dovecot2_vip (score:INFINITY)
>    mailstore2_fs with dovecot2_vip (score:INFINITY)
>    dovecot3_svr with dovecot3_vip (score:INFINITY)
>    mailstore3_fs with dovecot3_vip (score:INFINITY)
>    dovecot4_svr with dovecot4_vip (score:INFINITY)
>    mailstore4_fs with dovecot4_vip (score:INFINITY)
>    dsmail_fs with dsmail_vip (score:INFINITY)
>    haproxy_svr with dsmail_vip (score:INFINITY)
>    logs_fs with dsmail_fs (score:INFINITY)
>    service1_svr with dsmail_vip (score:INFINITY)
>    service2_svr with dsmail_vip (score:INFINITY)
> Ticket Constraints:
> # pcs status
> Cluster name: mail_cluster
> Stack: corosync
> Current DC: node001.host.local (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> Last updated: Mon Dec 18 13:28:27 2017          Last change: Thu Dec 14 11:59:42 2017 by root via crm_resource on node005.host.local
> 
> 5 nodes and 19 resources configured
> 
> Online: [ node001.host.local node002.host.local node003.host.local
> node004.host.local node005.host.local ]
> 
> Full list of resources:
> 
>   mailstore1_fs  (ocf::heartbeat:Filesystem):    Started node001.host.local
>   mailstore2_fs  (ocf::heartbeat:Filesystem):    Started node002.host.local
>   mailstore3_fs  (ocf::heartbeat:Filesystem):    Started node003.host.local
>   mailstore4_fs  (ocf::heartbeat:Filesystem):    Started node004.host.local
>   logs_fs        (ocf::heartbeat:Filesystem):    Started node005.host.local
>   dsmail_fs      (ocf::heartbeat:Filesystem):    Started node005.host.local
>   dovecot1_vip   (ocf::heartbeat:IPaddr2):       Started node001.host.local
>   dovecot2_vip   (ocf::heartbeat:IPaddr2):       Started node002.host.local
>   dovecot3_vip   (ocf::heartbeat:IPaddr2):       Started node003.host.local
>   dovecot4_vip   (ocf::heartbeat:IPaddr2):       Started node004.host.local
>   dsmail_vip     (ocf::heartbeat:IPaddr2):       Started node005.host.local
>   dovecot1_svr   (systemd:dovecot@1):    Started node001.host.local
>   dovecot2_svr   (systemd:dovecot@2):    Started node002.host.local
>   dovecot3_svr   (systemd:dovecot@3):    Started node003.host.local
>   dovecot4_svr   (systemd:dovecot@4):    Started node004.host.local
>   haproxy_svr    (systemd:haproxy):      Started node005.host.local
>   service1_svr   (systemd:mwpec):        Started node005.host.local
>   service2_svr   (systemd:mwpecdaemons): Started node005.host.local
>   fence_vmware   (stonith:fence_vmware_soap):    Started node005.host.local
> 
> Failed Actions:
> * fence_vmware_start_0 on node004.host.local 'unknown error' (1): call=803, status=Timed Out, exitreason='none',
>      last-rc-change='Mon Jun 26 18:21:29 2017', queued=0ms, exec=20961ms
> * fence_vmware_start_0 on node002.host.local 'unknown error' (1): call=906, status=Error, exitreason='none',
>      last-rc-change='Mon Jun 26 18:20:52 2017', queued=3ms, exec=13876ms
> * fence_vmware_start_0 on node001.host.local 'unknown error' (1): call=1194, status=Timed Out, exitreason='none',
>      last-rc-change='Mon Jun 26 18:20:31 2017', queued=0ms, exec=19676ms
> * fence_vmware_start_0 on node003.host.local 'unknown error' (1): call=783, status=Timed Out, exitreason='none',
>      last-rc-change='Mon Jun 26 18:21:07 2017', queued=0ms, exec=19954ms
> 
> 
> Daemon Status:
>    corosync: active/enabled
>    pacemaker: active/enabled
>    pcsd: active/enabled
> #
> 
> (please note, the "Failed Actions" entries refer to a problem that
> has already been resolved)
> 
> The cluster works, but I've noticed that if I disable a resource
> (e.g. service1_svr), stonith restarts the node (the default action).
> Is this normal? The resources service1_svr and service2_svr are Java
> applications started by systemd as forking-type services -- is that
> a problem?
> 
> Cheers,
> 
> Enrico
> 

It depends on what you mean by "disable".

If you mean "pcs resource disable" (which, under the hood, sets target-
role to 'Stopped' in the Pacemaker configuration), then there should be
no fencing.
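
For example, with the resource names from your output (just a sketch;
the crm_resource call is only one way to check the meta-attribute, and
it should report Stopped while the resource is disabled):

# pcs resource disable service1_svr
# crm_resource --resource service1_svr --meta --get-parameter target-role

"pcs resource enable service1_svr" reverses this. Neither operation
should trigger fencing by itself.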

If you mean causing the service to stop working, so that Pacemaker sees
a monitor failure, then it depends on how you have on-fail configured
for that monitor -- the default of "restart" will attempt to recover
the resource without fencing, while on-fail="fence" would of course
result in fencing.
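
The on-fail setting lives on the monitor operation itself. Something
along these lines should show and adjust it (a sketch; check the
operation's existing settings with "pcs resource show service1_svr"
first):

# pcs resource update service1_svr op monitor interval=60s on-fail=restart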

Also, Pacemaker tries to recover the resource by stopping it then
starting it again. If the stop fails, then that, too, causes the node
to be fenced (because there's no other way to stop the resource).
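
One practical consequence: for a Java service that can be slow to shut
down, a generous stop timeout reduces the chance that a stop failure
(and therefore fencing) happens in the first place. For example (the
120s value is only illustrative; size it to what the service really
needs):

# pcs resource update service1_svr op stop interval=0s timeout=120s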

The systemd service type (forking or otherwise) shouldn't affect
Pacemaker's ability to manage it. However, you do want to be sure that
any systemd unit managed by Pacemaker is not enabled in systemd itself,
because only Pacemaker should try to start or stop it.
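
A quick way to check and correct that on each node, using one of your
dovecot instances as the example:

# systemctl is-enabled dovecot@1.service
# systemctl disable dovecot@1.service

Disabling the unit only stops systemd from starting it on its own at
boot; Pacemaker still starts and stops it through systemd as before.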

You can investigate the Pacemaker logs to see why Pacemaker decided the
node needed to be fenced. They can be difficult to follow, but they
have lots of information.
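
On CentOS 7 the relevant messages usually land in the journal and in
/var/log/messages. As a rough starting point (a sketch; adjust the
filters and paths to your setup, and note that --history needs a
reasonably recent stonith_admin):

# journalctl -u pacemaker | grep -iE 'stonith|fence'
# grep -iE 'pengine|stonith|fence' /var/log/messages
# stonith_admin --history '*'

The pengine (scheduler) messages show why the fence was scheduled, and
stonith_admin shows recent fencing operations per node.
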
-- 
Ken Gaillot <kgaillot@redhat.com>



