[ClusterLabs] Antw: Antw: [EXT] Stopping a server failed and fenced, despite disabling stop timeout

Mon Jan 18 14:03:31 EST 2021

On 2021-01-18 3:31 a.m., Ulrich Windl wrote:
>>>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote on 18.01.2021 at
> 09:28 in message <60054697020000A10003E4DC at gwsmtp.uni-regensburg.de>:
>>>>> Digimer <lists at alteeve.ca> wrote on 18.01.2021 at 03:11 in message
>> <816a4d1e-a92d-2a4c-b1a0-cf4353e3fa41 at alteeve.ca>:
>>> Hi all,
>>>
>>>   Pardon the slew of questions; I'm well into testing now and finding lots
>>> of issues. This one is two questions... :)
>>>
>>>   I set a server to be unmanaged in pacemaker while the server was
>>> running. Then I tried to remove the resource, and it refused, saying it
>>> couldn't stop it, and to use '--force'. So I did, and the node got
>>> fenced. Now, the resource was set up with:
>>
>> My guess is you shouldn't do it that way: Why not stop the resource,
>> unconfigure it in the cluster, then start it manually?
>>
>>>
>>> pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
>>>  meta allow-migrate="true" target-role="started" \
>>>  op monitor interval="60" start timeout="INFINITY" \
>>>  on-fail="block" stop timeout="INFINITY" on-fail="block" \
>>>  migrate_to timeout="INFINITY"
>>>
>>>   I would have expected the 'stop timeout="INFINITY" on-fail="block"' to
>>> prevent fencing if the server failed to stop (question 1) and that if a
>>> resource was unmanaged, that the resource wouldn't even try to stop
>>> (question 2).
>>>
>>>   Can someone help me understand what happened here?
>>
>> Fencing reason was " srv01-test_stop_0 process (PID 113779) timed out".
>>
>> Did have a failure before your actions? The logs seem to indicate one:
> 
> Sorry: "Did you have a failure before your actions?"

I had, yes, but I cleared it.

I'm intentionally doing "weird things" to see how the system reacts, and
when things go bad (like this), what can be done to make the system more
resilient.

If I've learned anything in 10 years of HA, it's that people will do all
the things you think they shouldn't do. So I'm trying to do them before
they do and learn how to mitigate as much as possible.
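
For anyone following along, the order Ulrich suggests above (stop the
resource, remove it from the cluster, then start the server by hand) could be
sketched roughly as below. This is only an illustration, not what was run in
this test: `run`, `retire_resource` and `DRY_RUN` are made-up names, the
script only prints the commands unless DRY_RUN=0 is set, and it assumes the
resource is managed and in a clean state.

```shell
#!/bin/sh
# Sketch only: stop a resource through the cluster, then remove its
# configuration. DRY_RUN, run and retire_resource are hypothetical names.

# Print each command by default; set DRY_RUN=0 to really execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

retire_resource() {
    rsc="$1"
    # One-shot status check, using the flags Ulrich quotes in this thread.
    run crm_mon -1Arfj || return 1
    # Ask the cluster to stop the resource (sets target-role=Stopped).
    run pcs resource disable "$rsc" || return 1
    # Remove the resource from the cluster configuration.
    run pcs resource delete "$rsc" || return 1
    # Starting the server by hand afterwards is site-specific and is
    # deliberately left out of this sketch.
}

# Default to the resource name used in this thread.
retire_resource "${1:-srv07-el6}"
```

Run as-is, it just lists the three commands in order, which makes it easy to
review before trying it against a test cluster with DRY_RUN=0.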

>> "Clearing failure of srv01-test on el8-a01n02 because resource  parameters
>> have changed"
>>
>> Having the cluster in a clean state before configuring it is highly
>> desirable, IMHO. I use this command frequently to check: "crm_mon -1Arfj"
>>
>> The logs should help to explain!
>>
>> Regards,
>> Ulrich
> 
> 

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould