[Pacemaker] pacemaker node stuck offline

pacemaker at feystorm.net
Thu Mar 21 22:39:14 EDT 2013

On 03/21/2013 11:15 AM, Andreas Kurz wrote:
> On 2013-03-21 14:31, Patrick Hemmer wrote:
>> I've got a 2-node cluster where it seems last night one of the nodes
>> went offline, and I can't see any reason why.
>> Attached are the logs from the 2 nodes (the relevant timeframe seems to
>> be 2013-03-21 between 06:05 and 06:10).
>> This is on ubuntu 12.04
> Looks like your non-redundant cluster-communication was interrupted at
> around that time for whatever reason and your cluster split-brained.
> Does the drbd-replication use a different network-connection? If yes,
> why not using it for a redundant ring setup ... and you should use
> I also wonder why you have defined "expected_votes='1'" in your
> cluster.conf.
> Regards,
> Andreas
But shouldn't it have recovered? The node shows as "OFFLINE", even
though it's clearly communicating with the rest of the cluster. What is
the procedure for getting the node back online? Is there anything other
than bouncing pacemaker?

Unfortunately no to the different network connection for drbd. These are
2 EC2 instances, so redundant connections aren't available. Since it is
EC2, though, I could set up STONITH to whack the other instance. The
only problem there would be a race condition: the EC2 API call for
shutting down or rebooting an instance isn't instantaneous, so both
nodes could end up sending the signal to reboot the other node.
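
One way I could imagine narrowing that race is to stagger the fencing:
one node fires immediately and the other waits first, so the delayed
node is normally already going down before its timer expires. Below is
a rough sketch of just the delay decision. The function name, the node
names and the 15 second default are all made up for illustration, it
does not call the EC2 API at all, and it only helps if the delay is
longer than the time EC2 takes to act on the request.

```python
def fence_delay_seconds(local_node: str, peer_node: str,
                        base_delay: float = 15.0) -> float:
    """Return how long local_node should wait before fencing peer_node.

    Hypothetical helper: the node whose name sorts first fences
    immediately, the other one waits base_delay seconds. base_delay is
    an assumed value and must exceed the time the fencing request needs
    to take effect, otherwise both nodes can still fence each other.
    """
    if local_node == peer_node:
        raise ValueError("a node cannot fence itself")
    # Deterministic tie-break: both nodes compute the same ordering
    # without having to talk to each other.
    return 0.0 if local_node < peer_node else base_delay


if __name__ == "__main__":
    # Hypothetical node names.
    for local, peer in (("node-a", "node-b"), ("node-b", "node-a")):
        delay = fence_delay_seconds(local, peer)
        print(f"{local} waits {delay:.0f}s before fencing {peer}")
```

This narrows the window rather than closing it: if the API call takes
longer than the delay, both requests can still go out.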

As for expected_votes=1, it's because it's a two-node cluster. Though I
apparently forgot to set the `two_node` attribute :-(
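
For the arithmetic behind that choice, here is a small sketch. It
assumes the usual simple-majority rule (more than half of the expected
votes); the function names are mine and this is not taken from the
cluster stack's code.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum under an assumed simple-majority rule."""
    if expected_votes < 1:
        raise ValueError("expected_votes must be at least 1")
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the votes currently present reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # One surviving node, with expected_votes set to 2 and then to 1.
    for expected in (2, 1):
        print(expected, quorum_threshold(expected), is_quorate(1, expected))
```

With 2 expected votes the threshold is 2, so a lone surviving node
loses quorum; with 1 expected vote the threshold is 1 and the survivor
stays quorate.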

-Patrick
