[ClusterLabs] Is corosync supposed to be restarted if it dies?

Wed Nov 29 16:42:56 EST 2017

On 29/11/17 22:00 +0100, Jan Pokorný wrote:
> On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
>> 28.11.2017 13:01, Jan Pokorný wrote:
>>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>>> Sent from iPhone
>>>> 
>>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner <wferi at niif.hu> wrote:
>>>>> 
>>>>> Andrei Borzenkov <arvidjaar at gmail.com> writes:
>>>>> 
>>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>>> 
>>>>>>> In one of the guides, the suggested procedure to simulate split brain was
>>>>>>> to kill the corosync process. It actually worked on one cluster, but on
>>>>>>> another the corosync process was restarted after being killed without the
>>>>>>> cluster noticing anything. Except that after several attempts pacemaker
>>>>>>> died, stopping resources ... :)
>>>>>>> 
>>>>>>> This is SLES12 SP2; I do not see any Restart in the service definition,
>>>>>>> so it is probably not systemd.
>>>>>>> 
>>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>>>>>> restarted by systemd.
>>>>> 
>>>>> And starting corosync via a Requires dependency?
>>>> 
>>>> Exactly.
>>> 
>>> From my testing it looks like we should change
>>> "Requires=corosync.service" to "BindsTo=corosync.service"
>>> in pacemaker.service.
>>> 
>>> Could you give it a try?
>>> 
>> 
>> I'm not sure what the expected outcome is, but pacemaker.service is still
>> restarted (due to Restart=on-failure).
> 
> The expected outcome is that pacemaker.service will become
> "inactive (dead)" after killing corosync (as a result of being
> "bound" to corosync.service).  Have you indeed issued "systemctl
> daemon-reload" after updating the pacemaker unit file?
> 
> (FTR, I tried with systemd 235).
> 
>> If the intention is to unconditionally stop it when corosync dies,
>> pacemaker should probably exit with a unique code and the unit file
>> should have RestartPreventExitStatus set to it.
> 
> That would be an elaborate way to reach the same result.
> 
> But good point in questioning what's the "best intention" around these
> scenarios -- normally, fencing would happen, but as you note, the node
> had actually survived by being fast enough to bring corosync back to
> life. From there, the question is whether it adds any value to have
> pacemaker restarted on non-clean terminations at all.  I don't know.
> 
> Would it make more sense to have FailureAction=reboot-immediate to
> at least in part emulate the fencing instead?

Although the restart may also be blazingly fast in some cases,
not making much difference except for taking all the previously
running resources forcibly down as an extra step, which may be
either good or bad.
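
To make the options discussed above concrete, here is a minimal sketch of
a systemd drop-in for pacemaker.service.  It is not a tested or recommended
configuration: the file name is hypothetical, the exit status 100 is only a
placeholder, and the two alternatives are left commented out.

```ini
# /etc/systemd/system/pacemaker.service.d/50-corosync.conf
# (hypothetical file name; systemd reads any *.conf in this drop-in directory)

[Unit]
# Proposed in this thread: stop pacemaker as soon as corosync.service
# goes away, instead of relying on Requires= alone.
BindsTo=corosync.service

[Service]
# Alternative raised in the thread, left commented out: do not restart
# when pacemaker exits with a dedicated status. The value 100 is only a
# placeholder assumption, not a documented pacemaker exit code.
#RestartPreventExitStatus=100

# Other alternative, also commented out: reboot when the unit fails, as
# a rough stand-in for fencing. Depending on the systemd version this
# setting is documented under [Service] or [Unit]; check the man pages.
#FailureAction=reboot-immediate
```

After adding or changing such a file, "systemctl daemon-reload" is needed
for it to take effect, and "systemctl cat pacemaker.service" shows the unit
together with its drop-ins.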

-- 
Jan (Poki)