[ClusterLabs] "Error: unable to fence '001db02a'" but It got fenced anyway

Sun Feb 28 10:59:21 EST 2021

On Sun, Feb 28, 2021 at 03:34:20PM +0000, Eric Robinson wrote:
> 001db02b rebooted. After it came back up, I tried it in the other direction.
> 
> On node 001db02b, the command...
> 
> # pcs stonith fence 001db02a
> 
> ...produced output...
> 
> Error: unable to fence '001db02a'.
> 
> However, node 001db02a did get restarted!
> 
> We also saw this error...
> 
> Failed Actions:
> * stonith-001db02ab_start_0 on 001db02a 'unknown error' (1): call=70, status=Timed Out, exitreason='',
>     last-rc-change='Sun Feb 28 10:11:10 2021', queued=0ms, exec=20014ms
> 
> When that happens, does Pacemaker take over the other node's resources, or not?

Cloud fencing usually needs a higher timeout than the one in effect here:
the start operation above timed out after about 20s (exec=20014ms).
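
To show where the 20s figure comes from, here is a small sketch that pulls the exec= value out of the failed-action line quoted above. The line is joined onto one row for convenience, and the variable names are only for illustration.

```shell
# Failed-action line copied from the report above (joined onto one line).
line="* stonith-001db02ab_start_0 on 001db02a 'unknown error' (1): call=70, status=Timed Out, exitreason='', last-rc-change='Sun Feb 28 10:11:10 2021', queued=0ms, exec=20014ms"

# Extract the number of milliseconds that follows "exec=".
exec_ms=$(printf '%s\n' "$line" | sed -n 's/.*exec=\([0-9][0-9]*\)ms.*/\1/p')

# Integer division is enough to compare against timeouts given in seconds.
exec_s=$((exec_ms / 1000))

echo "start ran for ${exec_s}s before timing out"
```

Any timeout at or below that value would give the same "Timed Out" status if the device takes this long to respond.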

Microsoft seems to suggest the following setup:

# pcs property set stonith-timeout=900
# pcs stonith create rsc_st_azure fence_azure_arm username="login ID" \
  password="password" resourceGroup="resource group" tenantId="tenant ID" \
  subscriptionId="subscription id" \
  pcmk_host_map="prod-cl1-0:prod-cl1-0-vm-name;prod-cl1-1:prod-cl1-1-vm-name" \
  power_timeout=240 pcmk_reboot_timeout=900 pcmk_monitor_timeout=120 \
  pcmk_monitor_retries=4 pcmk_action_limit=3 \
  op monitor interval=3600

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/high-availability-guide-rhel-pacemaker
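
If you would rather adjust the existing device than create a new one, something along these lines might work. This is an untested sketch: it assumes the device is named stonith-001db02ab (inferred from the failed action above) and simply reuses the timeout values from the Microsoft example, which may not suit your environment.

```shell
# Assumption: the stonith resource is called stonith-001db02ab,
# as suggested by the failed action stonith-001db02ab_start_0.

# Raise the cluster-wide fencing timeout (value from the Microsoft example).
pcs property set stonith-timeout=900

# Raise the per-device reboot and monitor timeouts on the existing device.
pcs stonith update stonith-001db02ab pcmk_reboot_timeout=900 pcmk_monitor_timeout=120

# Clear the recorded start failure so the cluster retries the device.
pcs resource cleanup stonith-001db02ab
```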

-- 
Valentin