[Pacemaker] Problems when quorum lost for a short period of time

Lev Sidorenko levs at securemedia.co.nz
Tue Oct 1 16:26:26 EDT 2013


Hello All!

I have a 4-node cluster setup.

It is actually two nodes for main+standby and another two nodes just to
provide quorum.

So all resources run on the main node, and only the DRBD slave runs on the
standby node.

I have no-quorum-policy="stop" set.

So, sometimes the main node loses its connection to the cluster and reports
"quorum lost", but after 1-2 seconds the connection is re-established and it
reports "quorum retained".
This causes a big problem: as soon as the main node loses quorum, it starts to
stop all resources. At the same time, the second node starts to start
resources. After a couple of seconds the main node rejoins the cluster, but it
has not yet managed to stop all of its resources, and some of the resources
have already started on the second node. So I end up with lots of conflicts
between resources on these two nodes.
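
For reference, here is a minimal Python sketch of the majority arithmetic
behind this behaviour. It assumes each of the four nodes carries one vote and
that quorum means a strict majority of all votes; the function names are
illustrative only and are not part of any Pacemaker or Corosync API.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(visible_votes: int, total_votes: int) -> bool:
    """True if a partition seeing `visible_votes` votes holds a majority."""
    return visible_votes >= quorum_threshold(total_votes)


# Assumption: main + standby + two quorum-only nodes, one vote each.
TOTAL_VOTES = 4

# The main node is cut off: it sees only itself,
# while the other three nodes still see each other.
main_has_quorum = has_quorum(1, TOTAL_VOTES)
rest_has_quorum = has_quorum(3, TOTAL_VOTES)

print("votes needed:", quorum_threshold(TOTAL_VOTES))
print("main node has quorum:", main_has_quorum)
print("remaining nodes have quorum:", rest_has_quorum)
```

Under these assumptions the isolated main node falls below the majority and
applies the no-quorum-policy, while the other three nodes keep quorum and are
free to start the resources on the standby.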

I tried setting no-quorum-policy="suicide", hoping that as soon as the main
node lost its connection to the cluster it would reboot itself, which would
give the second node enough time to start all of the processes and become the
main one.
But with no-quorum-policy="suicide" the main node just tries to STONITH all
of the other nodes and does not reboot itself.

So the question is: how can I set things up so that a node reboots instantly
when it detects that quorum is lost?

Thank you in advance!

With the best regards,
Lev Sidorenko.





