[ClusterLabs] A bug? (SLES15 SP2 with "crm resource refresh")
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Jan 8 05:46:16 EST 2021
Hi!
Trying to reproduce a problem that had occurred in the past after a "crm resource refresh" ("reprobe"), I noticed something on the DC that looks odd to me:
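(For reference, the reprobe was triggered from the crm shell without naming a resource, so that all resources on all nodes are re-probed:
h16:~ # crm resource refresh
If I remember correctly, crm_resource --refresh is the low-level equivalent.)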
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Forcing the status of all resources to be redetected
Jan 08 11:13:21 h16 pacemaker-controld[4478]: warning: new_event_notification (4478-26817-13): Broken pipe (32)
### We had that before, already...
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice: Watchdog will be used via SBD if fencing is required and stonith-watchdog-timeout is nonzero
Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice: * Start prm_stonith_sbd ( h16 )
Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice: * Start prm_DLM:0 ( h18 )
Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice: * Start prm_DLM:1 ( h19 )
Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice: * Start prm_DLM:2 ( h16 )
...
### So basically an announcement to START everything that's running (everything is already running); shouldn't that be "monitoring" (a probe) instead?
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Initiating monitor operation prm_stonith_sbd_monitor_0 on h19
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Initiating monitor operation prm_stonith_sbd_monitor_0 on h18
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Initiating monitor operation prm_stonith_sbd_monitor_0 locally on h16
...
### So _probes_ are started,
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 aborted by operation prm_testVG_testLV_activate_monitor_0 'modify' on h16: Event failed
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 action 7 (prm_testVG_testLV_activate_monitor_0 on h16): expected 'not running' but got 'ok'
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 action 19 (prm_testVG_testLV_activate_monitor_0 on h18): expected 'not running' but got 'ok'
Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 action 31 (prm_testVG_testLV_activate_monitor_0 on h19): expected 'not running' but got 'ok'
...
### That's odd, because the clone WAS running on each node. (Similar results were reported for other clones.)
Jan 08 11:13:43 h16 pacemaker-controld[4478]: notice: Transition 140 (Complete=34, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-79.bz2): Complete
Jan 08 11:13:43 h16 pacemaker-controld[4478]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
### So in the end nothing was actually started, but those messages are quite confusing.
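(For anyone who wants to see what the scheduler actually computed, the saved pe-input files can be replayed offline with crm_simulate, e.g.:
h16:~ # crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-79.bz2
pe-input-79.bz2 is the file logged for transition 140 above; the aborted transition 139 should be in the preceding pe-input file.)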
The Pacemaker version was "(version 2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a)" on all three nodes (the latest available for SLES).
For reference, here are the primitives that showed the odd results:
primitive prm_testVG_testLV_activate LVM-activate \
        params vgname=testVG lvname=testLV vg_access_mode=lvmlockd activation_mode=shared \
        op start timeout=90s interval=0 \
        op stop timeout=90s interval=0 \
        op monitor interval=60s timeout=90s \
        meta priority=9000
clone cln_testVG_activate prm_testVG_testLV_activate \
        meta interleave=true priority=9800 target-role=Started
primitive prm_lvmlockd lvmlockd \
        op start timeout=90 interval=0 \
        op stop timeout=100 interval=0 \
        op monitor interval=60 timeout=90 \
        meta priority=9800
clone cln_lvmlockd prm_lvmlockd \
        meta interleave=true priority=9800
order ord_lvmlockd__lvm_activate Mandatory: cln_lvmlockd ( cln_testVG_activate )
colocation col_lvm_activate__lvmlockd inf: ( cln_testVG_activate ) cln_lvmlockd
### lvmlockd similarly depends on DLM (order, colocation), so I don't see a problem there.
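(The DLM constraints are not shown here; they follow the same scheme as above, roughly like this (the names ord_DLM__lvmlockd, col_lvmlockd__DLM and cln_DLM are just placeholders; the real ones may differ):
order ord_DLM__lvmlockd Mandatory: cln_DLM ( cln_lvmlockd )
colocation col_lvmlockd__DLM inf: ( cln_lvmlockd ) cln_DLM
)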
Finally:
h16:~ # vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  sys      1   3   0 wz--n- 222.50g      0
  testVG   1   1   0 wz--ns 299.81g 289.81g
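(To double-check that the test LV really was active on every node, matching the probe results above, one can run something like this on each of h16/h18/h19:
h16:~ # lvs -o lv_name,lv_attr testVG
An 'a' in the fifth position of lv_attr means the LV is active on that node.)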
Regards,
Ulrich