[ClusterLabs] Problems with corosync and pacemaker with error scenarios
Gerhard Wiesinger
lists at wiesinger.com
Mon Jan 16 09:56:18 EST 2017
Hello,
I'm new to corosync and pacemaker and I want to set up an nginx cluster
with quorum.
Requirements:
- 3 Linux machines
- On 2 machines, floating IPs should be handled and nginx should run as
a load-balancing proxy
- The 3rd machine is for quorum only; no services must run there
corosync/pacemaker are installed on all 3 nodes; the firewall ports
opened are 5404, 5405 and 5406 for UDP in both directions.
OS: Fedora 25
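Side note: I assume the same ports could also be opened on Fedora via
firewalld's bundled "high-availability" service instead of listing them
individually (untested here, just for reference):
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --reload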
Configuration of corosync (only the bindnetaddr is different on every
machine) and pacemaker below.
The configuration works so far, but the error test scenarios don't
behave as expected:
1.) In testing, I had cases where the cluster lost quorum and regained
it but remained in the Stopped state.
I had to restart the whole stack to get it online again (killall -9
corosync;systemctl restart corosync;systemctl restart pacemaker)
Any ideas?
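What I plan to collect the next time it is stuck in Stopped (assuming
these are the right tools for it; all of them are installed here):
pcs status --full
pcs resource cleanup
crm_simulate -sL
i.e. the failed actions, a cleanup of the fail counts and the allocation
scores.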
2.) Restarting pacemaker on the inactive node also restarts resources on
the other, active node:
a.) Everything up & OK
b.) lb01 handles all resources
c.) On lb02, which handles no resources: systemctl restart pacemaker:
All resources are also restarted with a short outage on lb01 (state
is Stopped, Started[ lb01 lb02 ] and then Started lb02)
How can this be avoided?
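A workaround I'm considering (assuming "pcs cluster standby" behaves as
I expect in this version) is to put the node into standby before the
restart so nothing should move elsewhere:
pcs cluster standby lb02
systemctl restart pacemaker
pcs cluster unstandby lb02
But I'd still like to understand why a plain restart on lb02 causes an
outage on lb01.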
3.) Stopping and starting corosync doesn't bring the node back up again:
systemctl stop corosync;sleep 10;systemctl restart corosync
Online: [ kvm01 lb01 ]
OFFLINE: [ lb02 ]
It stays in that state until pacemaker is restarted: systemctl restart
pacemaker
Bug?
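Maybe related: I restart corosync directly via systemctl while pacemaker
keeps running. I assume the pcs wrappers, which stop and start both
daemons in the proper order, would avoid the stale OFFLINE state, e.g.:
pcs cluster stop lb02
pcs cluster start lb02
Still, restarting corosync alone should arguably work as well.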
4.) "systemctl restart corosync" hangs sometimes (waiting 2 min)
needs a
killall -9 corosync;systemctl restart corosync;systemctl restart
pacemaker
sequence to get it up gain
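Next time it hangs I'll check what systemd is actually waiting for,
roughly:
systemctl list-jobs
systemctl status corosync pacemaker
(my assumption is that the dependency between the pacemaker and corosync
units is what blocks the restart).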
5.) Simulation of split brain: disabling/re-enabling the local firewall
(ports 5404, 5405, 5406) on nodes lb01 and lb02 doesn't bring corosync
up again after re-enabling the firewall on lb02:
partition WITHOUT quorum
Online: [ kvm01 ]
OFFLINE: [ lb01 lb02 ]
NOK: restart on lb02: systemctl restart corosync;systemctl restart
pacemaker
OK: restart on lb02 and kvm01 (quorum host): systemctl restart
corosync;systemctl restart pacemaker
I also see that resources are tried to be started on kvm01, even though
it is the quorum-only host where nothing is enabled to run:
Started[ kvm01 lb02 ]
Started lb02
Any ideas?
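I can provide more data for the split-brain case; after re-enabling the
firewall I would check membership and quorum state on each node with the
standard corosync tools (as far as I know):
corosync-quorumtool -s
corosync-cmapctl | grep members
to see whether the ring re-forms at all.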
I've also written a new resource agent, ocf:heartbeat:Iprule, to modify
"ip rule" accordingly.
Versions are:
corosync: 2.4.2
pacemaker: 1.1.16
Kernel: 4.9.3-200.fc25.x86_64
Thanks.
Ciao,
Gerhard
Corosync config:
================================================================================================================================================================
totem {
    version: 2
    cluster_name: lbcluster
    crypto_cipher: aes256
    crypto_hash: sha512
    interface {
        ringnumber: 0
        bindnetaddr: 1.2.3.35
        mcastport: 5405
    }
    transport: udpu
}

logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

nodelist {
    node {
        ring0_addr: lb01
        nodeid: 1
    }
    node {
        ring0_addr: lb02
        nodeid: 2
    }
    node {
        ring0_addr: kvm01
        nodeid: 3
    }
}

quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    # Only for 2 node setup!
    # two_node: 1
}
================================================================================================================================================================
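As a side note, I'm also wondering whether wait_for_all would make sense
for this 3-node setup; my assumption based on votequorum(5), not tested
here:
quorum {
    provider: corosync_votequorum
    wait_for_all: 1
}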
################################################################################################################################################################
# Default properties
################################################################################################################################################################
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=stop
pcs property set default-resource-stickiness=100
pcs property set symmetric-cluster=false
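To rule out a configuration problem on my side I can also validate the
live CIB and review the effective properties (standard tools, as far as
I know):
crm_verify -L -V
pcs property list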
################################################################################################################################################################
# Delete & cleanup resources
################################################################################################################################################################
pcs resource delete webserver
pcs resource cleanup webserver
pcs resource delete ClusterIP_01
pcs resource cleanup ClusterIP_01
pcs resource delete ClusterIPRoute_01
pcs resource cleanup ClusterIPRoute_01
pcs resource delete ClusterIPRule_01
pcs resource cleanup ClusterIPRule_01
pcs resource delete ClusterIP_02
pcs resource cleanup ClusterIP_02
pcs resource delete ClusterIPRoute_02
pcs resource cleanup ClusterIPRoute_02
pcs resource delete ClusterIPRule_02
pcs resource cleanup ClusterIPRule_02
################################################################################################################################################################
# Create resources
################################################################################################################################################################
pcs resource create ClusterIP_01 ocf:heartbeat:IPaddr2 \
    ip=1.2.3.81 nic=eth1 cidr_netmask=28 broadcast=1.2.3.95 iflabel=1 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking
pcs resource create ClusterIPRoute_01 ocf:heartbeat:Route \
    params device=eth1 source=1.2.3.81 destination=default gateway=1.2.3.94 table=125 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIP_01
pcs resource create ClusterIPRule_01 ocf:heartbeat:Iprule \
    params from=1.2.3.81 table=125 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIPRoute_01
pcs constraint location ClusterIP_01 prefers lb01=INFINITY
pcs constraint location ClusterIP_01 prefers lb02=INFINITY
pcs constraint location ClusterIPRoute_01 prefers lb01=INFINITY
pcs constraint location ClusterIPRoute_01 prefers lb02=INFINITY
pcs constraint location ClusterIPRule_01 prefers lb01=INFINITY
pcs constraint location ClusterIPRule_01 prefers lb02=INFINITY
pcs resource create ClusterIP_02 ocf:heartbeat:IPaddr2 \
    ip=1.2.3.82 nic=eth1 cidr_netmask=28 broadcast=1.2.3.95 iflabel=2 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking
pcs resource create ClusterIPRoute_02 ocf:heartbeat:Route \
    params device=eth1 source=1.2.3.82 destination=default gateway=1.2.3.94 table=126 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIP_02
pcs resource create ClusterIPRule_02 ocf:heartbeat:Iprule \
    params from=1.2.3.82 table=126 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIPRoute_02
pcs constraint location ClusterIP_02 prefers lb01=INFINITY
pcs constraint location ClusterIP_02 prefers lb02=INFINITY
pcs constraint location ClusterIPRoute_02 prefers lb01=INFINITY
pcs constraint location ClusterIPRoute_02 prefers lb02=INFINITY
pcs constraint location ClusterIPRule_02 prefers lb01=INFINITY
pcs constraint location ClusterIPRule_02 prefers lb02=INFINITY
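Regarding point 5 above (resources being tried on kvm01): since kvm01
should never run any of these resources, I'm wondering whether location
constraints with resource-discovery=never would suppress the probes
there. A sketch, assuming my pcs/pacemaker versions support that option:
pcs constraint location add ClusterIP_01-no-kvm01 ClusterIP_01 kvm01 -INFINITY resource-discovery=never
(and similarly for the other resources)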
################################################################################################################################################################
# NGINX
################################################################################################################################################################
pcs resource create webserver ocf:heartbeat:nginx \
    httpd=/usr/sbin/nginx configfile=/etc/nginx/nginx.conf \
    meta migration-threshold=2 \
    op monitor timeout=5s interval=5s on-fail=restart
pcs constraint colocation add webserver with ClusterNetworking INFINITY
pcs constraint order ClusterNetworking then webserver
pcs constraint location webserver prefers lb01=INFINITY
pcs constraint location webserver prefers lb02=INFINITY
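(The full constraint set as pcs sees it can be listed with "pcs
constraint list --full"; I can post that output as well if it helps.)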
================================================================================================================================================================