[ClusterLabs] Upgrading Ubuntu 14.04 to 16.04 with corosync/pacemaker failed

Thu Feb 20 09:14:19 EST 2020

>>> we run a 2-system cluster for Samba with Ubuntu 14.04 and Samba,
>>> Corosync and Pacemaker from the Ubuntu repos. We wanted to update
>>> to Ubuntu 16.04 but it failed:

Quick question, perhaps off-topic for this list: since this is a
Samba HA setup, why upgrade to 16.04 and not to 18.04? I know Xenial
is still supported until 2024 but, since you are taking the chance to
migrate, why not go to the latest LTS version (Bionic)?

>>> I checked the versions before and because of just minor updates
>>> of corosync and pacemaker I thought it should be possible to
>>> update node by node.

You can check all versions with a tool called "rmadison" (the lines
marked "-" and "+" are the versions you come from and go to):

 corosync|1.4.2-2|precise
 corosync|1.4.2-2ubuntu0.2|precise-updates
 corosync|2.3.3-1ubuntu1|trusty
-corosync|2.3.3-1ubuntu4|trusty-updates
 corosync|2.3.5-3ubuntu1|xenial
 corosync|2.3.5-3ubuntu2.3|xenial-security
+corosync|2.3.5-3ubuntu2.3|xenial-updates
 corosync|2.4.3-0ubuntu1|bionic
 corosync|2.4.3-0ubuntu1.1|bionic-security
 corosync|2.4.3-0ubuntu1.1|bionic-updates
 corosync|2.4.4-3|disco
 corosync|3.0.1-2ubuntu1|eoan
 corosync|3.0.2-1ubuntu2|focal

 pacemaker|1.1.6-2ubuntu3|precise
 pacemaker|1.1.6-2ubuntu3.3|precise-updates
 pacemaker|1.1.10+git20130802-1ubuntu2|trusty
 pacemaker|1.1.10+git20130802-1ubuntu2.4|trusty-security
-pacemaker|1.1.10+git20130802-1ubuntu2.5|trusty-updates
 pacemaker|1.1.14-2ubuntu1|xenial
 pacemaker|1.1.14-2ubuntu1.6|xenial-security
+pacemaker|1.1.14-2ubuntu1.6|xenial-updates
 pacemaker|1.1.18-0ubuntu1|bionic
 pacemaker|1.1.18-0ubuntu1.1|bionic-security
 pacemaker|1.1.18-0ubuntu1.1|bionic-updates
 pacemaker|1.1.18-2ubuntu1|disco
 pacemaker|1.1.18-2ubuntu1.19.04.1|disco-security
 pacemaker|1.1.18-2ubuntu1.19.04.1|disco-updates
 pacemaker|2.0.1-4ubuntu2|eoan
 pacemaker|2.0.1-5ubuntu5|focal
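
To pull a single suite out of such a listing, a small filter is enough.
This is only a sketch: the function name suite_version is made up for
illustration, and it assumes input lines in the "package|version|suite"
shape shown above (padding spaces and the "-"/"+" markers are tolerated).

```shell
# Print the version a package has in one suite.
# Reads "package|version|suite[|...]" lines on stdin.
# suite_version is a hypothetical helper name, not part of any tool.
suite_version() {
    awk -F'|' -v pkg="$1" -v suite="$2" '
        {
            name = $1; ver = $2; rel = $3
            sub(/^[ +-]+/, "", name)      # drop "-"/"+" markers and padding
            gsub(/^ +| +$/, "", name)     # trim remaining spaces
            gsub(/^ +| +$/, "", ver)
            gsub(/^ +| +$/, "", rel)
            if (name == pkg && rel == suite) print ver
        }'
}

# Possible usage, assuming rmadison (from the devscripts package) is installed:
#   rmadison corosync | suite_version corosync xenial-updates
```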

>>> * Put srv2 into standby
>>> * Upgraded srv2 to Ubuntu 16.04 with reboot and so on
>>> * Added a nodelist to corosync.conf because it looked
>>>  like corosync on srv2 didn't know the names of the
>>>  node ids anymore

The Debian packaging upgrade path is probably off-topic for this list
(which is aimed at the cluster software itself) but, since we are
here...

You can check the packaging scripts under the "/var/lib/dpkg/info/"
directory. It holds the maintainer scripts dpkg runs when a package is
installed, upgraded, removed, purged, and so on, plus the list of each
package's configuration files (conffiles).

In my current environment, important files would be:

/var/lib/dpkg/info/pacemaker.conffiles
/var/lib/dpkg/info/pacemaker-common.conffiles
/var/lib/dpkg/info/pacemaker.postrm
/var/lib/dpkg/info/pacemaker-common.postrm
/var/lib/dpkg/info/pacemaker.prerm
/var/lib/dpkg/info/pacemaker-common.postinst
/var/lib/dpkg/info/pacemaker-cli-utils.postinst
/var/lib/dpkg/info/pacemaker.postinst

/var/lib/dpkg/info/corosync.prerm
/var/lib/dpkg/info/corosync.conffiles
/var/lib/dpkg/info/corosync.preinst
/var/lib/dpkg/info/corosync.postrm
/var/lib/dpkg/info/corosync.postinst
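
To see which of these files exist on a given node, something like the
following could be used. It is a sketch only: maintainer_files is a made-up
helper name, and the info directory is a parameter (defaulting to the path
above) so the function can be tried against a copy of the directory.

```shell
# List the maintainer scripts and the conffiles list kept for a package.
# maintainer_files is a hypothetical helper; $2 defaults to dpkg's info dir.
maintainer_files() {
    pkg=$1
    info_dir=${2:-/var/lib/dpkg/info}
    for kind in preinst postinst prerm postrm conffiles; do
        # Multi-arch packages are stored as "name:arch.kind", so check both.
        for f in "$info_dir/$pkg.$kind" "$info_dir/$pkg":*."$kind"; do
            [ -f "$f" ] && printf '%s\n' "$f"
        done
    done
    return 0
}

# Possible usage:
#   maintainer_files corosync
#   maintainer_files pacemaker-common
```

Reading the postrm and postinst scripts it prints is a quick way to see
whether anything in them touches the configuration during an upgrade.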

How they relate to each other is explained on the following wiki page:

https://wiki.debian.org/MaintainerScripts

Section "Upgrading":

https://wiki.debian.org/MaintainerScripts?action=AttachFile&do=get&target=upgrade.png

I haven't explored your upgrade execution path deeply, but it sounds
like your theory is that either important configuration files got
purged during the package upgrade OR the jump from:

-corosync|2.3.3-1ubuntu4|trusty-updates
to
+corosync|2.3.5-3ubuntu2.3|xenial-updates

and
-pacemaker|1.1.10+git20130802-1ubuntu2.5|trusty-updates
to
+pacemaker|1.1.14-2ubuntu1.6|xenial-updates

was not smooth with regard to the config options you were using and
their compatibility.

Checking corosync only, there were 26 commits related to config (at
least according to a simple grep):

$ git log v2.3.3..v2.3.5 --pretty=oneline --grep config

aabbace6 Log: Add logrotate configuration file
b9f5c290 votequorum: Fix auto_tie_breaker behaviour in odd-sized clusters
997074cc totemconfig: Check for duplicate nodeids
d77cec24 Handle adding and removing UDPU members atomically
8f284b26 Reset timer_problem_decrementer on fault
6449bea8 config: Ensure mcast address/port differs for rrp
70bd35fc config: Process broadcast option consistently
6c028d4d config: Make sure user doesn't mix IPv6 and IPv4
57539d1a man page: Improve description of token timeout
bb52fc27 Store configuration values used by totem to cmap
17488909 votequorum: Make qdev timeout in sync configurable
88dbb9f7 totemconfig: Make sure join timeout is less than consensus
3b8365e8 config: Fix typos
63bf0977 totemconfig: refactor nodelist_to_interface func
10c80f45 totemconfig: totem_config_get_ip_version
dc35bfae totemconfig: Free ifaddrs list
e3ffd4fe Implement config file testing mode
72cf15af votequorum: Do not process events during reload
c8e3f14f Make config.reload_in_progress key read only
d23ee6a3 upstart: Make job conf file configurable
7557fdec config: Allow dynamic change of token_coefficient
1f7e78ab init: Make init script configurable
9a8de87c totemconfig: Log errors on key change and reload
b95ebd64 totemconfig: Key change process dependencies
eeb23841 Really clear totemconfig nodes on reload
2f0cad20 config: Handle totem_set_volatile_defaults errors

and it does not look like there was any major refactoring of config
handling (from a very quick look).
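
One way to make that quick look a bit less subjective is to group the
matching commits by their subject prefix. The sketch below assumes
"hash subject" lines on stdin, as in the listing above; prefix_counts is a
made-up name, and a subject that merely contains ": " somewhere later in
the line would be misfiled, so treat the result as a rough guide.

```shell
# Count commits per subject prefix ("config", "totemconfig", ...).
# Reads "hash subject" lines on stdin; prefix_counts is a hypothetical helper.
prefix_counts() {
    awk '{
        sub(/^[^ ]+ /, "")                 # drop the abbreviated hash
        n = index($0, ": ")                # the prefix ends at the first ": "
        key = (n > 0) ? substr($0, 1, n - 1) : "(none)"
        count[key]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }' |
        sort -k1,1nr -k2                   # most frequent prefix first
}

# Possible usage inside a corosync checkout:
#   git log v2.3.3..v2.3.5 --oneline --grep config | prefix_counts
```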

Let's move on to other ideas...

>>> srv2____________________________________________________________
>>> Last updated: Wed Feb 19 17:25:14 2020
>>> Last change: Tue Feb 18 18:29:29 2020 by hacluster via crmd on srv2
>>> Stack: corosync
>>> Current DC: srv2 (version 1.1.14-70404b0) - partition with quorum
>>> 2 nodes and 9 resources configured
>>>
>>> Node srv2: standby
>>> OFFLINE: [ srv1 ]
> 
> Still don't understand the concept of corosync/pacemaker. Which part is
> responsible for this "OFFLINE" statement? I don't know where to
> look deeper about this mismatch (see some lines above, where it
> says "Online" about srv
> 
>>>
>>> Full list of resources:
>>>
>>> Resource Group: samba_daemons
>>>     samba-nmbd	(upstart:nmbd):	Stopped
>>> [..]
>>>
>>> Failed Actions:
>>> * samba-nmbd_monitor_0 on srv2 'not installed' (5): call=5,
>>>    status=Not installed, exitreason='none',
>>>    last-rc-change='Wed Feb 19 14:13:20 2020', queued=0ms, exec=1ms
>>> [..]
> 
> According to the logs it looks like the service (e.g. nmbd) is not
> available (maybe because of "upstart:nmbd"). How do I change this
> configuration in pacemaker? I want to change it to "service" instead
> of "upstart". I hope this will fix at least the service problems.
> 
>   crm configure primitive smbd ..
> gives me:
>   ERROR: smbd: id is already in use.
> 

A bunch of things to notice from these messages:

- Trusty used upstart as its init system.
- Xenial uses systemd as its init system.
- It looks to me like you are using the "upstart" resource agent, which
would explain the "not installed" monitor failures after the upgrade.
- In Xenial you would have to use the systemd resource agent.
- Before using the systemd resource agent you have to make sure your
services are disabled (systemctl disable xxx), as the cluster resource
manager will be the one starting them.
- Independent of the resource agents, your nodes should be ONLINE.
  - A better tool for checking membership and rings is corosync itself,
    not pacemaker:

$ sudo corosync-quorumtool -slai

Membership information
----------------------
    Nodeid      Votes Name
         1          1 10.250.3.10, 10.250.4.10 (local)
         2          1 10.250.3.11, 10.250.4.11
         3          1 10.250.3.12, 10.250.4.12

for example.
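
If you want to script that check, the membership table can be counted with
a few lines of awk. This is a sketch that assumes the output layout shown
above (a "Nodeid" header row followed by one row per member); member_count
is a made-up helper name.

```shell
# Count the members listed in corosync-quorumtool's membership table.
# Reads the tool's output on stdin; member_count is a hypothetical helper.
member_count() {
    awk '
        /^ *Nodeid/ { in_table = 1; next }     # header row of the table
        in_table && $1 ~ /^[0-9]+$/ { n++ }    # one row per node id
        END { print n + 0 }'
}

# Possible usage on a cluster node:
#   sudo corosync-quorumtool -slai | member_count
```

For your 2-node cluster the count should be 2 on both nodes before you
touch any resource configuration.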

>>>
>>> Any suggestions, ideas? Is there a nice HowTo for this upgrade situation?

Yes

1) Stop what you are doing and start again from the ground up.

2) Take one of the servers and configure it appropriately using the
proper resource agent. Before configuring the resources, make sure the
rings are in good shape and the cluster has the proper votes.

3) Do not run a 2-node cluster without fencing; use at least 3 nodes
and/or get an extra vote from somewhere.

4) The Ubuntu project has its own bug tracker, and you can open bugs here:

https://launchpad.net/ubuntu/+source/pacemaker/ -> report a bug
https://launchpad.net/ubuntu/+source/corosync/ -> report a bug

5) A good Debian-based tutorial for an HA cluster can be found here:

https://www.digitalocean.com/community/tutorials/how-to-create-a-high-availability-setup-with-corosync-pacemaker-and-floating-ips-on-ubuntu-14-04
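
For step 2, and for the "id is already in use" error: the resource ids
already exist, so the existing definitions have to be edited rather than
created again. A rough sketch of switching them from upstart to systemd
follows. The function name and the file paths are made up, it assumes the
systemd unit has the same name as the old upstart job (check that for each
service), and the crm commands in the comments are the workflow I would
expect with crmsh, not something I have run against your configuration.

```shell
# Rewrite "upstart:<job>" references to "systemd:<unit>" in crm
# configuration text read from stdin.
# upstart_to_systemd is a hypothetical helper; it assumes job and unit
# names are identical, which must be verified per service.
upstart_to_systemd() {
    sed -E 's/(^|[[:space:]])upstart:/\1systemd:/g'
}

# Intended workflow (not executed here; paths are placeholders):
#   crm configure save /tmp/cib.crm
#   upstart_to_systemd < /tmp/cib.crm > /tmp/cib-systemd.crm
#   systemctl disable xxx        # the cluster starts the service, not systemd
#   crm configure load update /tmp/cib-systemd.crm
```

The same change can also be made by hand with "crm configure edit".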

My 5 cents...