[ClusterLabs] Clone process appears running whereas not

Ken Gaillot kgaillot at redhat.com
Fri Jun 5 10:43:55 EDT 2015


On 06/05/2015 10:05 AM, Emmanuel Le Nohaïc wrote:
> Hi,
> 
> I am running an active/passive OpenVPN cluster with two nodes.
> 
> The cluster is configured to run cloned openvpn processes on both
> nodes, and it runs fine with the VIP.
> 
> For HA testing, I have to emulate a process start error on the
> nodes (at boot time). So when I deliberately write a bad config
> into my openvpn.conf and reboot one node, after the system reboot
> crm_mon says that the clone is started on both nodes, even though
> the openvpn process is not actually running in the background.
> 
> So when I reboot the first node, the VIP and openvpn move to the
> second node, where openvpn is not running.
> 
> Is there a way to make sure the process is really running in the
> background, to avoid that kind of error?
> 
> Otherwise, if my services end up on the bad node, is it possible to
> make them return to the good node once it comes back online after
> the reboot? I would not like to set a preference for one node or
> the other, just to migrate clients to wherever the service is
> really available.
> 
> The nodes are running Debian 8 with systemd. I think Pacemaker reads
> the output of something like "systemctl status openvpn" and thinks
> that the process is running, because it appears to be running here:
> 
> systemctl status openvpn (on CENTRAL1)
> ● openvpn.service - OpenVPN service
>    Loaded: loaded (/etc/systemd/system/openvpn.service; disabled)
>    Active: active (exited) since ven. 2015-06-05 15:45:07 CEST; 11min ago
>   Process: 732 ExecStart=/bin/true (code=exited, status=0/SUCCESS)

Pacemaker relies on the resource agent (systemd's openvpn service in
this case) to tell it whether the resource is running. If the agent
says yes when it's really no, Pacemaker can't do the right thing.
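You can see the discrepancy yourself by comparing what systemd reports
with what is actually in the process table, for example (plain shell,
just for illustration; this is roughly the kind of answer the cluster
is relying on):

    systemctl is-active openvpn   # what systemd (and thus Pacemaker) believes
    pgrep -a openvpn              # what is actually running

If the first says "active" while the second finds nothing, the unit
file is misleading everyone, Pacemaker included.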

Why is ExecStart=/bin/true? That doesn't look right.
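If that unit is a local override whose ExecStart only runs /bin/true,
the start will always "succeed", so systemd (and therefore Pacemaker)
reports the service as active even though no openvpn daemon exists.
Purely as a sketch (paths and options here are assumptions, adjust for
your installation), a unit that actually tracks the daemon might look
like:

    [Unit]
    Description=OpenVPN service
    After=network.target

    [Service]
    # Run openvpn in the foreground so systemd tracks the real process;
    # a broken config then makes the unit fail instead of staying
    # "active (exited)".
    Type=simple
    ExecStart=/usr/sbin/openvpn --config /etc/openvpn/openvpn.conf

    [Install]
    WantedBy=multi-user.target

With something along those lines, a bad openvpn.conf should leave the
unit in a failed state, and the next monitor will report the resource
as stopped.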

> Main PID: 732 (code=exited, status=0/SUCCESS)
> 
> crm_mon (on CENTRAL1)
> Last updated: Fri Jun  5 16:02:39 2015
> Stack: corosync
> Current DC: CENTRAL2 (version 1.1.12-a4478bd) - partition with quorum
> 2 nodes and 3 resources configured
> 
> Online: [ CENTRAL1 CENTRAL2 ]
> 
> VIP1    (ocf::heartbeat:IPaddr):        Started CENTRAL2
> Clone Set: openvpn-clone [openvpn]
>     Started: [ CENTRAL1 CENTRAL2 ]
> 
> 
> Here the openvpn process is not running on CENTRAL1:
> 
> systemctl list-units (on CENTRAL1)
> openvpn.service       loaded active exited    OpenVPN service
> 
> 
> Here is my cluster configuration:
> root@CENTRAL1:~# crm configure show
> node 167772161: CENTRAL1
> node 167772162: CENTRAL2
> primitive VIP1 IPaddr \
>     params ip=192.168.1.200 cidr_netmask=32 nic="eth0:vip" \
>     op monitor interval=10s timeout=20s
> primitive openvpn service:openvpn \
>     op monitor interval=1s

A monitor interval of 1s is probably too small. The cluster has to
fork a subprocess, call systemctl status, and wait for it to return
(possibly as long as the configured timeout, which is left to default
above). Figure out what's a realistic time for determining that the
VPN is not working (will the command return immediately, or hang for a
while?) and set the interval and timeout a bit longer than that.
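For example, something like this (the numbers are only illustrative;
pick values based on how long systemctl status actually takes on your
hosts):

    primitive openvpn service:openvpn \
        op monitor interval=30s timeout=60s

The point is that the timeout covers the worst realistic case, and the
interval is not so short that the cluster spends all its time
monitoring.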

> clone openvpn-clone openvpn \
>     meta target-role=Started
> colocation OVPN-VIP inf: ( openvpn-clone VIP1 )
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.12-a4478bd \
>     cluster-infrastructure=corosync \
>     stonith-enabled=false \
>     no-quorum-policy=ignore
> 
> If anyone has a solution, please let me know. If you need more
> details, or if my problem (or my English) is not clear, I can
> explain again :)
> 
> Thanks for your help.




