[ClusterLabs] pacemaker: 1.1.23 20sec timeout on cluster with disc I/O write delays
Dmytro Poliarush
Dmytro_Poliarush at epam.com
Tue Mar 17 11:32:14 UTC 2026
Hi all,
I need some guidance on Pacemaker 1.1.23.
I'm chasing a stubborn issue in a 2-node, 2-disk SBD cluster.
When running a manual fencing test with the `pcs stonith fence` command, I observe this error:
```
Error: unable to fence '<nodehostname>'
```
The error manifests each time at around the 20-second mark, which I assume is a timeout.
I used the `time` command to track how long execution runs: `time pcs stonith fence`.
Here is an example:
```
[root@node1 ~]# time pcs stonith fence --debug node2
Running: /usr/sbin/stonith_admin -B node2
Return Value: 194
--Debug Output Start--
--Debug Output End--
Error: unable to fence 'node2'

real    0m20.791s
user    0m0.063s
sys     0m0.033s
[root@node1 ~]#
```
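As a side note: if CLI tools here wrap a negative errno into an 8-bit exit status (which I believe the pacemaker 1.1 tools do via crm_exit, though I am not certain), then return value 194 would decode to errno 62, which on Linux is ETIME ("Timer expired") and would fit the timeout theory. The decoding itself is just arithmetic:

```shell
# Decode exit status 194 under the ASSUMPTION that it is a wrapped
# negative errno: an exit status is the low 8 bits, so -62 shows up as 194.
RC=194
ERRNO=$((256 - RC))
echo "exit ${RC} -> errno ${ERRNO}"   # errno 62 is ETIME ("Timer expired") on Linux
```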
For investigation, I've set up a testing cluster with two VirtualBox VMs.
The behaviour was NOT observed on the testing cluster until I intentionally added disk write delays with the dmsetup tool on one of the nodes.
Here is an example of setting a 22-second write delay:
```
# Create: read delay = 0 ms, write delay = 22000 ms
# Table format: delay <dev> <start> <read_ms> <dev> <start> <write_ms>
SIZE=$(blockdev --getsize /dev/sdc)   # device size in 512-byte sectors
dmsetup --noudevsync create slow-sdc --table "0 ${SIZE} delay /dev/sdc 0 0 /dev/sdc 0 22000"
dmsetup mknodes
```
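A quick way to confirm the mapping is active and that the delay really applies (this check is my own addition; device and mapping names are from my test setup):

```shell
# Sanity check: show the active table, then time a single direct write
# through the delayed mapping. oflag=direct bypasses the page cache, so
# the write should take roughly the configured delay (~22 s here).
dmsetup table slow-sdc
time dd if=/dev/zero of=/dev/mapper/slow-sdc bs=512 count=1 oflag=direct
```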
NOTE that tests with delays up to and including 19 seconds pass:
```
[root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 20000
[root@node1 ~]# dmsetup table slow-sdc
0 262144 delay 8:32 0 0 8:32 0 20000
[root@node1 ~]# time pcs stonith fence --debug node2
Running: /usr/sbin/stonith_admin -B node2
Return Value: 194
--Debug Output Start--
--Debug Output End--
Error: unable to fence 'node2'

real    0m20.588s
user    0m0.088s
sys     0m0.021s
[root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 19000
++ blockdev --getsize /dev/sdc
+ SIZE=262144
++ lsblk -dn -o MAJ:MIN /dev/sdc
+ MAJMIN=' 8:32 '
+ dmsetup suspend slow-sdc
+ dmsetup reload slow-sdc --table '0 262144 delay /dev/sdc 0 0 /dev/sdc 0 19000'
+ dmsetup resume slow-sdc
+ dmsetup table slow-sdc
0 262144 delay 8:32 0 0 8:32 0 19000
[root@node1 ~]# pcs stonith history cleanup; pcs stonith cleanup
cleaning up fencing-history for node *
Cleaned up all resources on all nodes
[root@node1 ~]#
[root@node1 ~]# time pcs stonith fence --debug node2
Running: /usr/sbin/stonith_admin -B node2
Return Value: 0
--Debug Output Start--
--Debug Output End--
Node: node2 fenced

real    0m19.869s
user    0m0.098s
sys     0m0.035s
[root@node1 ~]#
```
So here is my question:
I assume there is a 20-second timeout value hardcoded somewhere in the pacemaker 1.1.23 sources.
This hardcoded value impacts manual fencing in the case of disk I/O delays (and maybe in some other cases).
I expect that increasing this timeout could mitigate the problem on clusters with disk I/O issues similar to the ones described above.
Please note this timeout is NOT stonith-timeout or stonith-watchdog-timeout.
Could you please comment on whether that is a meaningful assumption, and where the 20-second timeout comes from?
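In case it helps narrow things down, here is what I plan to check next (standard stonith_admin, sbd, and pcs options; whether any of these is actually related to the 20 s limit is exactly my question):

```shell
# Retry the fence with an explicit, larger request timeout; if the failure
# still hits at ~20 s, the limit is not stonith_admin's own -t/--timeout.
time stonith_admin -B node2 -t 120

# Dump the SBD on-disk header to see the configured watchdog/msgwait timeouts
# (device path from my test setup).
sbd -d /dev/sdc dump

# Show the cluster-level fencing properties I already ruled out.
pcs property list --all | grep -i stonith
```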
Regards, Dmytro