[ClusterLabs] Problems with corosync and pacemaker with error scenarios
Gerhard Wiesinger
lists at wiesinger.com
Mon Jan 16 09:56:18 EST 2017
Hello,
I'm new to corosync and pacemaker and I want to set up an nginx cluster
with quorum.
Requirements:
- 3 Linux machines
- On 2 machines, floating IPs should be handled and nginx should run as
a load-balancing proxy
- The 3rd machine is for quorum only; no services must run there
corosync/pacemaker are installed on all 3 nodes; the firewall ports
opened are 5404, 5405 and 5406 for UDP in both directions.
OS: Fedora 25
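Side note: I assume the same ports could also be opened on Fedora via
firewalld's bundled "high-availability" service instead of listing them
individually (untested here, just for reference):
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --reload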
Configuration of corosync (only the bindnetaddr is different on every
machine) and pacemaker below.
The configuration works so far, but the error test scenarios don't
behave as expected:
1.) In testing, I had cases where the cluster lost quorum and regained
it but remained in the Stopped state.
I had to restart the whole stack to get it online again (killall -9
corosync;systemctl restart corosync;systemctl restart pacemaker)
Any ideas?
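What I plan to collect the next time it is stuck in Stopped (assuming
these are the right tools for it; all of them are installed here):
pcs status --full
pcs resource cleanup
crm_simulate -sL
i.e. the failed actions, a cleanup of the fail counts and the allocation
scores.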
2.) Restarting pacemaker on the inactive node also restarts resources on
the other, active node:
a.) Everything up & OK
b.) lb01 handles all resources
c.) On lb02, which handles no resources: systemctl restart pacemaker:
All resources are also restarted with a short outage on lb01 (state
is Stopped, Started[ lb01 lb02 ] and then Started lb02)
How can this be avoided?
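A workaround I'm considering (assuming "pcs cluster standby" behaves as
I expect in this version) is to put the node into standby before the
restart so nothing should move elsewhere:
pcs cluster standby lb02
systemctl restart pacemaker
pcs cluster unstandby lb02
But I'd still like to understand why a plain restart on lb02 causes an
outage on lb01.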
3.) Stopping and starting corosync doesn't bring the node back up again:
systemctl stop corosync;sleep 10;systemctl restart corosync
Online: [ kvm01 lb01 ]
OFFLINE: [ lb02 ]
It stays in that state until pacemaker is restarted: systemctl restart
pacemaker
Bug?
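Maybe related: I restart corosync directly via systemctl while pacemaker
keeps running. I assume the pcs wrappers, which stop and start both
daemons in the proper order, would avoid the stale OFFLINE state, e.g.:
pcs cluster stop lb02
pcs cluster start lb02
Still, restarting corosync alone should arguably work as well.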
4.) "systemctl restart corosync" hangs sometimes (waiting 2 min)
needs a
killall -9 corosync;systemctl restart corosync;systemctl restart
pacemaker
sequence to get it up gain
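Next time it hangs I'll check what systemd is actually waiting for,
roughly:
systemctl list-jobs
systemctl status corosync pacemaker
(my assumption is that the dependency between the pacemaker and corosync
units is what blocks the restart).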
5.) Simulation of split brain: disabling/re-enabling the local firewall
(ports 5404, 5405, 5406) on nodes lb01 and lb02 doesn't bring corosync
up again after re-enabling the firewall on lb02:
partition WITHOUT quorum
Online: [ kvm01 ]
OFFLINE: [ lb01 lb02 ]
NOK: restart on lb02: systemctl restart corosync;systemctl restart
pacemaker
OK: restart on lb02 and kvm01 (quorum host): systemctl restart
corosync;systemctl restart pacemaker
I also see that resources are tried to be started on kvm01, even though
it is the quorum-only host where nothing is enabled to run:
Started[ kvm01 lb02 ]
Started lb02
Any ideas?
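I can provide more data for the split-brain case; after re-enabling the
firewall I would check membership and quorum state on each node with the
standard corosync tools (as far as I know):
corosync-quorumtool -s
corosync-cmapctl | grep members
to see whether the ring re-forms at all.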
I've also written a new resource agent, ocf:heartbeat:Iprule, to modify
"ip rule" accordingly.
Versions are:
corosync: 2.4.2
pacemaker: 1.1.16
Kernel: 4.9.3-200.fc25.x86_64
Thanks.
Ciao,
Gerhard
Corosync config:
================================================================================================================================================================
totem {
    version: 2
    cluster_name: lbcluster
    crypto_cipher: aes256
    crypto_hash: sha512
    interface {
        ringnumber: 0
        bindnetaddr: 1.2.3.35
        mcastport: 5405
    }
    transport: udpu
}

logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

nodelist {
    node {
        ring0_addr: lb01
        nodeid: 1
    }
    node {
        ring0_addr: lb02
        nodeid: 2
    }
    node {
        ring0_addr: kvm01
        nodeid: 3
    }
}

quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    # Only for 2 node setup!
    # two_node: 1
}
================================================================================================================================================================
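As a side note, I'm also wondering whether wait_for_all would make sense
for this 3-node setup; my assumption based on votequorum(5), not tested
here:
quorum {
    provider: corosync_votequorum
    wait_for_all: 1
}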
################################################################################################################################################################
# Default properties
################################################################################################################################################################
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=stop
pcs property set default-resource-stickiness=100
pcs property set symmetric-cluster=false
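To rule out a configuration problem on my side I can also validate the
live CIB and review the effective properties (standard tools, as far as
I know):
crm_verify -L -V
pcs property list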
################################################################################################################################################################
# Delete & cleanup resources
################################################################################################################################################################
pcs resource delete webserver
pcs resource cleanup webserver
pcs resource delete ClusterIP_01
pcs resource cleanup ClusterIP_01
pcs resource delete ClusterIPRoute_01
pcs resource cleanup ClusterIPRoute_01
pcs resource delete ClusterIPRule_01
pcs resource cleanup ClusterIPRule_01
pcs resource delete ClusterIP_02
pcs resource cleanup ClusterIP_02
pcs resource delete ClusterIPRoute_02
pcs resource cleanup ClusterIPRoute_02
pcs resource delete ClusterIPRule_02
pcs resource cleanup ClusterIPRule_02
################################################################################################################################################################
# Create resources
################################################################################################################################################################
pcs resource create ClusterIP_01 ocf:heartbeat:IPaddr2 \
    ip=1.2.3.81 nic=eth1 cidr_netmask=28 broadcast=1.2.3.95 iflabel=1 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking
pcs resource create ClusterIPRoute_01 ocf:heartbeat:Route \
    params device=eth1 source=1.2.3.81 destination=default gateway=1.2.3.94 table=125 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIP_01
pcs resource create ClusterIPRule_01 ocf:heartbeat:Iprule \
    params from=1.2.3.81 table=125 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIPRoute_01
pcs constraint location ClusterIP_01 prefers lb01=INFINITY
pcs constraint location ClusterIP_01 prefers lb02=INFINITY
pcs constraint location ClusterIPRoute_01 prefers lb01=INFINITY
pcs constraint location ClusterIPRoute_01 prefers lb02=INFINITY
pcs constraint location ClusterIPRule_01 prefers lb01=INFINITY
pcs constraint location ClusterIPRule_01 prefers lb02=INFINITY
pcs resource create ClusterIP_02 ocf:heartbeat:IPaddr2 \
    ip=1.2.3.82 nic=eth1 cidr_netmask=28 broadcast=1.2.3.95 iflabel=2 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking
pcs resource create ClusterIPRoute_02 ocf:heartbeat:Route \
    params device=eth1 source=1.2.3.82 destination=default gateway=1.2.3.94 table=126 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIP_02
pcs resource create ClusterIPRule_02 ocf:heartbeat:Iprule \
    params from=1.2.3.82 table=126 \
    meta migration-threshold=2 \
    op monitor timeout=20s interval=10s on-fail=restart \
    --group ClusterNetworking --after ClusterIPRoute_02
pcs constraint location ClusterIP_02 prefers lb01=INFINITY
pcs constraint location ClusterIP_02 prefers lb02=INFINITY
pcs constraint location ClusterIPRoute_02 prefers lb01=INFINITY
pcs constraint location ClusterIPRoute_02 prefers lb02=INFINITY
pcs constraint location ClusterIPRule_02 prefers lb01=INFINITY
pcs constraint location ClusterIPRule_02 prefers lb02=INFINITY
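Regarding point 5 above (resources being tried on kvm01): since kvm01
should never run any of these resources, I'm wondering whether location
constraints with resource-discovery=never would suppress the probes
there. A sketch, assuming my pcs/pacemaker versions support that option:
pcs constraint location add ClusterIP_01-no-kvm01 ClusterIP_01 kvm01 -INFINITY resource-discovery=never
(and similarly for the other resources)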
################################################################################################################################################################
# NGINX
################################################################################################################################################################
pcs resource create webserver ocf:heartbeat:nginx \
    httpd=/usr/sbin/nginx configfile=/etc/nginx/nginx.conf \
    meta migration-threshold=2 \
    op monitor timeout=5s interval=5s on-fail=restart
pcs constraint colocation add webserver with ClusterNetworking INFINITY
pcs constraint order ClusterNetworking then webserver
pcs constraint location webserver prefers lb01=INFINITY
pcs constraint location webserver prefers lb02=INFINITY
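(The full constraint set as pcs sees it can be listed with "pcs
constraint list --full"; I can post that output as well if it helps.)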
================================================================================================================================================================