[ClusterLabs] pacemaker: 1.1.23 20sec timeout on cluster with disc I/O write delays

Dmytro Poliarush Dmytro_Poliarush at epam.com
Tue Mar 17 11:32:14 UTC 2026


Hi all,
I need some guidance on pacemaker 1.1.23.
I'm chasing a stubborn issue in a two-node, two-disk SBD cluster.

When I run a manual fencing test with the `pcs stonith fence` command, I observe an error:
```
    Error: unable to fence '<nodehostname>'
```
The error manifests each time at around 20 seconds (I assume this is a timeout).
I use the `time` command to track how long the execution runs: `time pcs stonith fence`.
Here is an example:
```
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
 >  Return Value: 194
    --Debug Output Start--
    --Debug Output End--

    Error: unable to fence 'node2'

 >  real    0m20.791s
    user    0m0.063s
    sys     0m0.033s
    [root@node1 ~]#
```

For the investigation, I've set up a testing cluster with two VirtualBox VMs.
The behaviour was NOT observed on the testing cluster until I intentionally added disk write delays with the dmsetup tool on one of the nodes.
Here is an example of setting a 22-second write delay:
```
    # Create: read delay = 0 ms, write delay = 22000 ms
    # Table format: <start> <size> delay <dev> <offset> <read_ms> <dev> <offset> <write_ms>
    dmsetup --noudevsync create slow-sdc --table "0 ${SIZE} delay /dev/sdc 0 0 /dev/sdc 0 22000"
    dmsetup mknodes
```
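As a sanity check on the table format, the line passed to `dmsetup` can be composed and inspected before creating the device. A minimal sketch; `make_delay_table` is just an illustrative helper, not part of dmsetup:

```shell
#!/bin/sh
# make_delay_table: illustrative helper that composes a dm-delay table line.
# dm-delay table: <start> <size> delay <dev> <offset> <read_ms> <dev> <offset> <write_ms>
make_delay_table() {
  dev=$1 size=$2 read_ms=$3 write_ms=$4
  printf '0 %s delay %s 0 %s %s 0 %s\n' "$size" "$dev" "$read_ms" "$dev" "$write_ms"
}

# 22 s write delay, no read delay, on a 262144-sector device:
make_delay_table /dev/sdc 262144 0 22000
# -> 0 262144 delay /dev/sdc 0 0 /dev/sdc 0 22000
```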

NOTE that tests with delays up to (and including) 19 seconds pass:
```
    [root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 20000
    [root@node1 ~]# dmsetup table slow-sdc
>   0 262144 delay 8:32 0 0 8:32 0 20000
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
    Return Value: 194
    --Debug Output Start--
    --Debug Output End--

```
>   Error: unable to fence 'node2'

>   real    0m20.588s
    user    0m0.088s
    sys     0m0.021s

>   [root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 19000
    ++ blockdev --getsize /dev/sdc
    + SIZE=262144
    ++ lsblk -dn -o MAJ:MIN /dev/sdc
    + MAJMIN='  8:32 '
    + dmsetup suspend slow-sdc
    + dmsetup reload slow-sdc --table '0 262144 delay /dev/sdc 0 0 /dev/sdc 0 19000'
    + dmsetup resume slow-sdc
    + dmsetup table slow-sdc
>   0 262144 delay 8:32 0 0 8:32 0 19000
    [root@node1 ~]# pcs stonith history cleanup; pcs stonith cleanup # pcs-cleanup-error-cleanup
    cleaning up fencing-history for node *

    Cleaned up all resources on all nodes
    [root@node1 ~]#
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
    Return Value: 0
    --Debug Output Start--
    --Debug Output End--

>   Node: node2 fenced

>   real    0m19.869s
    user    0m0.098s
    sys     0m0.035s
    [root@node1 ~]#
```
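To bracket the threshold more systematically, the two steps above can be repeated over a range of delay values. A sketch only: the commands are echoed rather than executed (drop the `echo` to run them on the disposable test cluster):

```shell
# Bracket the failure threshold: step the write delay, then fence each time.
# Echoed only; remove 'echo' to actually execute on the test cluster.
for d in 18000 19000 20000 21000; do
  echo "./suspend-resume-slow-sdc-delay-write.sh $d"
  echo "time pcs stonith fence --debug node2"
done
```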

So here is my question:
I assume there is a 20-second timeout value hardcoded somewhere in the pacemaker 1.1.23 sources.
This hardcoded value impacts manual fencing in the case of disk I/O delays (and maybe in some other cases).
I expect that increasing this timeout could mitigate the issue on clusters with disk I/O problems similar to those described above.
Please note this timeout is NOT stonith-timeout or stonith-watchdog-timeout.

Could you please comment on whether this is a meaningful assumption, and where the 20-second timeout comes from?
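One experiment that might help rule out a client-side default: pass an explicit `--timeout` to `stonith_admin` directly and see whether the 20-second ceiling moves. A sketch; the timeout values are arbitrary and the commands are only echoed here (drop the `echo` to run them against the test cluster):

```shell
# Sketch: does an explicit client-side timeout move the 20 s ceiling?
# Echoed only; remove 'echo' to execute on the test cluster.
for t in 30 60 120; do
  echo "time /usr/sbin/stonith_admin -B node2 --timeout $t"
done
```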

Regards, Dmytro