[ClusterLabs] Antw: Antw: [EXT] Stopping a server failed and fenced, despite disabling stop timeout

Mon Jan 18 03:31:37 EST 2021

>>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote on 18.01.2021 at
09:28 in message <60054697020000A10003E4DC at gwsmtp.uni-regensburg.de>:
>>>> Digimer <lists at alteeve.ca> wrote on 18.01.2021 at 03:11 in message
> <816a4d1e-a92d-2a4c-b1a0-cf4353e3fa41 at alteeve.ca>:
>> Hi all,
>> 
>>   Mind the slew of questions, well into testing now and finding lots of
>> issues. This one is two questions... :)
>> 
>>   I set a server to be unmanaged in pacemaker while the server was
>> running. Then I tried to remove the resource, and it refused, saying it
>> couldn't stop it, and to use '--force'. So I did, and the node got
>> fenced. Now, the resource was set up with:
> 
> My guess is you shouldn't do it that way: Why not stop the resource,
> unconfigure it in the cluster, then start it manually?
> 
>> 
>> pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
>>  meta allow-migrate="true" target-role="started" \
>>  op monitor interval="60" start timeout="INFINITY" \
>>  on-fail="block" stop timeout="INFINITY" on-fail="block" \
>>  migrate_to timeout="INFINITY"
>> 
>>   I would have expected the 'stop timeout="INFINITY" on-fail="block"' to
>> prevent fencing if the server failed to stop (question 1) and that if a
>> resource was unmanaged, that the resource wouldn't even try to stop
>> (question 2).
>> 
>>   Can someone help me understand what happened here?
> 
> Fencing reason was " srv01-test_stop_0 process (PID 113779) timed out".
> 
> Did have a failutre before your actions? The logs indicate such it seems:

Sorry: "Did you have a failure before your actions?"

> "Clearing failure of srv01-test on el8-a01n02 because resource  parameters
> have changed"
> 
> Having the cluster in a clean state before configuring it is highly desirable
> IMHO. I use this command frequently to check: "crm_mon -1Arfj"
> 
> The logs should help to explain!
> 
> Regards,
> Ulrich
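
For illustration, the order suggested above (check that the cluster is clean,
stop the resource, then unconfigure it) might look roughly like the sketch
below. It is not taken from the thread: the helper names run and
remove_resource and the DRY_RUN switch are made up for this example, and the
use of --wait is an assumption about what fits here. By default the script
only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: remove a resource in the suggested order.
remove_resource() {
    res="$1"
    # One-shot status (-1) with node attributes (-A), inactive resources (-r),
    # fail counts (-f) and pending operations (-j), to spot earlier failures.
    run crm_mon -1Arfj &&
    # Stop the resource while it is still managed; --wait (an assumption here)
    # makes pcs wait for the stop to finish before returning.
    run pcs resource disable "$res" --wait &&
    # Only then remove it from the cluster configuration.
    run pcs resource delete "$res"
}
```

Calling "remove_resource srv07-el6" prints the three commands in order; with
DRY_RUN=0 they would actually be executed, so the printed output should be
reviewed first. Whether the guest is then started by hand outside the cluster
is a separate step not covered here.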