[ClusterLabs] reset of sticking service in peer node's reboot in Active/Passive configuration

石井 俊直 i_j_e_x_a at yahoo.co.jp
Mon May 1 15:25:57 EDT 2017


Hi everyone, and thank you, Matsushima-san, for your response.

By researching the logs, I’ve found the reason for the restart. The systemd service registered
as a cluster resource was also enabled as a regular systemd service. Therefore, what happens is: (1) the service
starts automatically during the OS boot sequence, (2) the service is then detected as running on both nodes,
and (3) pacemaker stops the service on one node to make that node passive.
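
As a quick check (the unit here is httpd, as in our case), systemctl can report whether
the unit is still set to start at boot:

  # systemctl is-enabled httpd
  enabled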

Here’s the log of (2) and (3).

May  2 02:51:41 node-1 pengine[1111]:   error: Resource apache-httpd (systemd::httpd) is active on 2 nodes attempting recovery
May  2 02:51:41 node-1 pengine[1111]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
May  2 02:51:41 node-1 pengine[1111]:  notice: Restart apache-httpd#011(Started node-1)
May  2 02:51:41 node-1 pengine[1111]:   error: Calculated transition 48 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-53.bz2
May  2 02:51:41 node-1 crmd[1112]:  notice: Initiating stop operation apache-httpd_stop_0 on node-2
May  2 02:51:41 node-1 crmd[1112]:  notice: Initiating stop operation apache-httpd_stop_0 locally on node-1
May  2 02:51:41 node-1 systemd: Reloading.
May  2 02:51:41 node-1 systemd: Stopping The Apache HTTP Server...
May  2 02:51:42 node-1 systemd: Stopped The Apache HTTP Server.
May  2 02:51:43 node-1 crmd[1112]:  notice: Result of stop operation for apache-httpd on node-1: 0 (ok)
May  2 02:51:43 node-1 crmd[1112]:  notice: Initiating start operation apache-httpd_start_0 locally on node-1
May  2 02:51:43 node-1 systemd: Reloading.
May  2 02:51:44 node-1 systemd: Starting Cluster Controlled httpd...


The solution is obvious: the systemd service registered as a cluster resource should be disabled
as a systemd service (on both nodes) so that it is started by pacemaker only.

  # systemctl disable httpd
  Removed symlink /etc/systemd/system/multi-user.target.wants/httpd.service.
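
If the earlier "active on 2 nodes" error is still shown in the cluster status afterwards,
a resource cleanup should clear it (assuming pcs is in use; the resource id here is apache-httpd):

  # pcs resource cleanup apache-httpd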

Here is node-1’s log during node-2’s boot-up, after httpd was disabled as a systemd service.

May  2 04:08:51 node-1 corosync[1057]: [TOTEM ] A new membership (192.168.1.201:720) was formed. Members joined: 2
May  2 04:08:51 node-1 corosync[1057]: [QUORUM] Members[2]: 1 2
May  2 04:08:51 node-1 corosync[1057]: [MAIN  ] Completed service synchronization, ready to provide service.
May  2 04:08:51 node-1 pacemakerd[1064]:  notice: Node node-2 state is now member
May  2 04:08:51 node-1 crmd[1074]:  notice: Node node-2 state is now member
May  2 04:08:52 node-1 attrd[1072]:  notice: Node node-2 state is now member
May  2 04:08:52 node-1 stonith-ng[1070]:  notice: Node node-2 state is now member
May  2 04:08:53 node-1 cib[1069]:  notice: Node node-2 state is now member 
May  2 04:08:53 node-1 crmd[1074]:  notice: State transition S_IDLE -> S_INTEGRATION
May  2 04:08:56 node-1 pengine[1073]:  notice: On loss of CCM Quorum: Ignore
May  2 04:08:56 node-1 pengine[1073]:  notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-232.bz2
May  2 04:08:56 node-1 crmd[1074]:  notice: Initiating monitor operation ClusterIP_monitor_0 on node-2
May  2 04:08:56 node-1 crmd[1074]:  notice: Initiating monitor operation apache-httpd_monitor_0 on node-2
May  2 04:08:56 node-1 crmd[1074]:  notice: Transition 2 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-232.bz2): Complete
May  2 04:08:56 node-1 crmd[1074]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE


Have a nice day.


> On 2017/05/01, at 19:03, Takehiro Matsushima <takehiro.dreamizm at gmail.com> wrote:
> 
> Hello Ishii-san,
> 
> I could not reproduce the issue in my environment (CentOS7 with Pacemaker 1.1.15).
> The following configuration works fine when rebooting the passive node.
> (lighttpd is just an example of a systemd resource)
> 
> ---- %< ----
> primitive ipaddr IPaddr2 \
>        params nic=enp0s10 ip=172.22.23.254 cidr_netmask=24 \
>        op start interval=0 timeout=20 on-fail=restart \
>        op stop interval=0 timeout=20 on-fail=ignore \
>        op monitor interval=10 timeout=20 on-fail=restart
> primitive lighttpd systemd:lighttpd \
>        op start interval=0 timeout=20 on-fail=restart \
>        op stop interval=0 timeout=20 on-fail=ignore \
>        op monitor interval=10 timeout=20 on-fail=restart
> colocation vip-colocation inf: ipaddr lighttpd
> order web-order inf: lighttpd ipaddr
> property cib-bootstrap-options: \
>        have-watchdog=false \
>        dc-version=1.1.15-1.el7-e174ec8 \
>        cluster-infrastructure=corosync \
>        no-quorum-policy=ignore \
>        startup-fencing=no \
>        stonith-enabled=no \
>        cluster-recheck-interval=1m
> rsc_defaults rsc-options: \
>        resource-stickiness=infinity \
>        migration-threshold=1
> ---- %< ----
> 
> I made sure the resources did not restart and did not move while changing
> resource-stickiness to values such as 10, 100 and 0.
> It also works when the colocation and order constraints are replaced by a "group" constraint, as sketched below.
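> 
> As a sketch, the equivalent group (keeping the same start order, lighttpd before ipaddr)
> would look like:
> 
> ---- %< ----
> group web-group lighttpd ipaddr
> ---- %< ----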
> 
> If you are watching the cluster status with crm_mon, please run it with the "-t"
> option and watch the "last-run" time on the "start" operation line for each
> resource.
> If that time does not change when you reboot the passive node, the
> resource was not actually restarted.
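> 
> For example, a one-shot check like the following should be enough (assuming this crm_mon
> build supports the "-1" and "-t" options):
> 
>   # crm_mon -1 -t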
> 
> 
> Thanks,
> 
> Takehiro Matsushima
> 
> 2017-04-30 19:32 GMT+09:00 石井 俊直 <i_j_e_x_a at yahoo.co.jp>:
>> Hi.
>> 
>> We have a 2-node Active/Passive cluster, each node running CentOS7, with two cluster resources:
>> one is ocf:heartbeat:IPaddr2 and the other is a systemd-based service. They have a colocation constraint.
>> The configuration looks mostly correct, and the resources normally run without problems.
>> 
>> When one of the nodes reboots, something happens that we do not want, namely 5) in the sequence below.
>> Suppose the nodes are node-1 and node-2, the cluster resources are running on node-1, and we reboot node-2.
>> The following sequence of events happens:
>> 
>>  1) node-2 shuts down
>>  2) node-1 detects node-2 is OFFLINE
>>  3) node-2 boots up
>>  4) node-1 detects node-2 is Online, node-2 detects both are Online
>>  5) cluster services running on node-1 stop
>>  6) cluster services start on node-1
>> 
>> 6) follows from our setting resource-stickiness to something like 100. Given that the service
>> does not move to node-2, we do not want our service stopped even for a short while.
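>> 
>> For reference, a stickiness default like that can be set with pcs (assuming the pcs shell
>> on CentOS7), for example:
>> 
>>   # pcs resource defaults resource-stickiness=100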
>> 
>> If someone knows how to configure pacemaker so that 5) does not happen, please let us know.
>> 
>> Thank you.
>> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




