[ClusterLabs] Pending Fencing Actions shown in pcs status
renayama19661014 at ybb.ne.jp
Mon Jan 11 21:45:01 EST 2021
Hi Steffen,
I've been experimenting with it since last weekend, but I haven't been able to reproduce the same situation.
It seems I cannot narrow down the steps needed to reproduce it.
Could you attach a log of the problem?
Best Regards,
Hideo Yamauchi.
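For reference, a minimal sketch of commands that should collect the relevant state for such a report (this assumes the Pacemaker 1.1 tooling on CentOS 7; the time window and output paths are illustrative, so adjust as needed):

```shell
# Sketch: gather fencing history, the CIB, and logs for analysis.
# Run on each cluster node.

# Fencing history as the fencer (stonith-ng) keeps it in memory;
# this is where a stale "pending" entry lives, not in the CIB.
stonith_admin --history '*' --verbose

# Dump the full CIB anyway, in case the configuration matters.
pcs cluster cib > "$(hostname).cib.xml"

# Collect logs from all nodes around the time the entry appeared.
crm_report --from "2021-01-07 12:00:00" --to "2021-01-07 13:00:00" /tmp/pending-fence-report
```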
----- Original Message -----
> From: Klaus Wenninger <kwenning at redhat.com>
> To: Steffen Vinther Sørensen <svinther at gmail.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2021/1/7, Thu 21:42
> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>
> On 1/7/21 1:13 PM, Steffen Vinther Sørensen wrote:
>> Hi Klaus,
>>
>> Yes, then the status does sync to the other nodes. It also looks like
>> there are some hostname resolution problems in play here, possibly causing
>> trouble; here are my notes from restarting pacemaker etc.
> I don't think there are hostname resolution problems.
> The messages you are seeing, which look as if there were,
> are caused by using -EHOSTUNREACH as the error code to fail
> a pending fence action when a node that is just coming up
> sees a pending action that is claimed to be handled by itself.
> Back then I chose that error code because none of the available
> ones really matched, and it was urgent for some reason, so
> introducing something new was too risky at that point.
> It would probably make sense to introduce something
> more descriptive.
> Back then the issue was triggered by fenced crashing and
> being restarted - so not a node restart, just fenced
> restarting.
> And it looks as if building the failed-message failed somehow,
> which could be the reason why the pending action persists.
> That would be something else than what we solved with Bug 5401,
> but what triggers the logs below might well just be a
> follow-up issue of the Bug 5401 problem.
> I will try to find time for a deeper look later today.
>
> Klaus
>>
>> pcs cluster standby kvm03-node02.avigol-gcs.dk
>> pcs cluster stop kvm03-node02.avigol-gcs.dk
>> pcs status
>>
>> Pending Fencing Actions:
>> * reboot of kvm03-node02.avigol-gcs.dk pending: client=crmd.37819,
>> origin=kvm03-node03.avigol-gcs.dk
>>
>> # From logs on all 3 nodes:
>> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]: warning: received
>> pending action we are supposed to be the owner but it's not in our
>> records -> fail it
>> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]: error: Operation
>> 'reboot' targeting kvm03-node02.avigol-gcs.dk on <no-one> for
>> crmd.37819 at kvm03-node03.avigol-gcs.dk.56a3018c: No route to host
>> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]: error:
>> stonith_construct_reply: Triggered assert at commands.c:2406 : request
>> != NULL
>> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]: warning: Can't create
>> a sane reply
>> Jan 07 12:48:18 kvm03-node03 crmd[37819]: notice: Peer
>> kvm03-node02.avigol-gcs.dk was not terminated (reboot) by <anyone> on
>> behalf of crmd.37819: No route to host
>>
>> pcs cluster start kvm03-node02.avigol-gcs.dk
>> pcs status (now outputs the same on all 3 nodes)
>>
>> Failed Fencing Actions:
>> * reboot of kvm03-node02.avigol-gcs.dk failed: delegate=,
>> client=crmd.37819, origin=kvm03-node03.avigol-gcs.dk,
>> last-failed='Thu Jan 7 12:48:18 2021'
>>
>>
>> pcs cluster unstandby kvm03-node02.avigol-gcs.dk
>>
>> # Now libvirtd refuses to start
>>
>> Jan 07 12:51:44 kvm03-node02 dnsmasq[20884]: read /etc/hosts - 8 addresses
>> Jan 07 12:51:44 kvm03-node02 dnsmasq[20884]: read
>> /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
>> Jan 07 12:51:44 kvm03-node02 dnsmasq-dhcp[20884]: read
>> /var/lib/libvirt/dnsmasq/default.hostsfile
>> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>> 11:51:44.729+0000: 24160: info : libvirt version: 4.5.0, package:
>> 36.el7_9.3 (CentOS BuildSystem <http://bugs.centos.org >,
>> 2020-11-16-16:25:20, x86-01.bsys.centos.org)
>> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>> 11:51:44.729+0000: 24160: info : hostname: kvm03-node02
>> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>> 11:51:44.729+0000: 24160: error : qemuMonitorOpenUnix:392 : failed to
>> connect to monitor socket: Connection refused
>> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>> 11:51:44.729+0000: 24159: error : qemuMonitorOpenUnix:392 : failed to
>> connect to monitor socket: Connection refused
>> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>> 11:51:44.730+0000: 24161: error : qemuMonitorOpenUnix:392 : failed to
>> connect to monitor socket: Connection refused
>> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>> 11:51:44.730+0000: 24162: error : qemuMonitorOpenUnix:392 : failed to
>> connect to monitor socket: Connection refused
>>
>> pcs status
>>
>> Failed Resource Actions:
>> * libvirtd_start_0 on kvm03-node02.avigol-gcs.dk 'unknown error' (1):
>>     call=142, status=complete, exitreason='',
>>     last-rc-change='Thu Jan 7 12:51:44 2021', queued=0ms, exec=2157ms
>>
>> Failed Fencing Actions:
>> * reboot of kvm03-node02.avigol-gcs.dk failed: delegate=,
>> client=crmd.37819, origin=kvm03-node03.avigol-gcs.dk,
>> last-failed='Thu Jan 7 12:48:18 2021'
>>
>>
>> # from /etc/hosts on all 3 nodes:
>>
>> 172.31.0.31 kvm03-node01 kvm03-node01.avigol-gcs.dk
>> 172.31.0.32 kvm03-node02 kvm03-node02.avigol-gcs.dk
>> 172.31.0.33 kvm03-node03 kvm03-node03.avigol-gcs.dk
>>
>> On Thu, Jan 7, 2021 at 11:15 AM Klaus Wenninger <kwenning at redhat.com> wrote:
>>> Hi Steffen,
>>>
>>> If you just see the leftover pending action on one node,
>>> it would be interesting to know whether restarting pacemaker on
>>> one of the other nodes syncs it to all of the nodes.
>>>
>>> Regards,
>>> Klaus
>>>
>>> On 1/7/21 9:54 AM, renayama19661014 at ybb.ne.jp wrote:
>>>> Hi Steffen,
>>>>
>>>>> Unfortunately I'm not sure about the exact scenario, but I have been doing
>>>>> some recent experiments with node standby/unstandby and stop/start,
>>>>> to get the procedures right for updating node rpms etc.
>>>>>
>>>>> Later I noticed the uncomforting "Pending Fencing Actions" status message.
>>>> Okay!
>>>>
>>>> I will repeat the standby and unstandby steps in the same way to check.
>>>> We will start checking after tomorrow, so I think it will take until sometime next week.
>>>>
>>>>
>>>> Many thanks,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>> To: Reid Wahl <nwahl at redhat.com>; Cluster Labs - All topics related to
>>>>> open-source clustering welcomed <users at clusterlabs.org>
>>>>> Cc:
>>>>> Date: 2021/1/7, Thu 17:51
>>>>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>>>>
>>>>> Hi Steffen,
>>>>> Hi Reid,
>>>>>
>>>>> The fencing history is kept inside stonith-ng and is not written to the CIB.
>>>>> However, getting the entire CIB and having it sent to us will help
>>>>> to reproduce the problem.
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: Reid Wahl <nwahl at redhat.com>
>>>>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to
>>>>> open-source clustering welcomed <users at clusterlabs.org>
>>>>>> Date: 2021/1/7, Thu 17:39
>>>>>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>>>>>
>>>>>>
>>>>>> Hi, Steffen. Those attachments don't contain the CIB. They contain the
>>>>>> `pcs config` output. You can get the CIB with
>>>>>> `pcs cluster cib > $(hostname).cib.xml`.
>>>>>> Granted, it's possible that this fence action information wouldn't
>>>>>> be in the CIB at all. It might be stored in fencer memory.
>>>>>> On Thu, Jan 7, 2021 at 12:26 AM <renayama19661014 at ybb.ne.jp> wrote:
>>>>>>
>>>>>> Hi Steffen,
>>>>>>>> Here are the CIB settings (pcs config show) attached for all 3 of my nodes
>>>>>>>> (all 3 seem 100% identical); node03 is the DC.
>>>>>>> Thank you for the attachment.
>>>>>>>
>>>>>>> What is the scenario when this situation occurs?
>>>>>>> In what steps did the problem appear when fencing was performed (or failed)?
>>>>>>> Best Regards,
>>>>>>> Hideo Yamauchi.
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: Steffen Vinther Sørensen <svinther at gmail.com>
>>>>>>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to
>>>>>>>> open-source clustering welcomed <users at clusterlabs.org>
>>>>>>>> Cc:
>>>>>>>> Date: 2021/1/7, Thu 17:05
>>>>>>>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>>>>>>> Hi Hideo,
>>>>>>>>
>>>>>>>> Here are the CIB settings (pcs config show) attached for all 3 of my nodes
>>>>>>>> (all 3 seem 100% identical); node03 is the DC.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Steffen
>>>>>>>>
>>>>>>>> On Thu, Jan 7, 2021 at 8:06 AM <renayama19661014 at ybb.ne.jp> wrote:
>>>>>>>>> Hi Steffen,
>>>>>>>>> Hi Reid,
>>>>>>>>>
>>>>>>>>> I also checked the CentOS source rpm, and it seems to include a fix for
>>>>>>>>> the problem.
>>>>>>>>> As Steffen suggested, if you share your CIB settings, I might know something.
>>>>>>>>> If this issue is the same as the one fixed, the entry will only be displayed
>>>>>>>>> on the DC node and will not affect operation.
>>>>>>>>> The pending actions shown will remain for a long time, but will not have a
>>>>>>>>> negative impact on the cluster.
>>>>>>>>> Best Regards,
>>>>>>>>> Hideo Yamauchi.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>> > From: Reid Wahl <nwahl at redhat.com>
>>>>>>>>> > To: Cluster Labs - All topics related to open-source clustering
>>>>>>>>> > welcomed <users at clusterlabs.org>
>>>>>>>>> > Cc:
>>>>>>>>> > Date: 2021/1/7, Thu 15:58
>>>>>>>>> > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>>>>>>>> >
>>>>>>>>> > It's supposedly fixed in that version.
>>>>>>>>> > - https://bugzilla.redhat.com/show_bug.cgi?id=1787749
>>>>>>>>> > - https://access.redhat.com/solutions/4713471
>>>>>>>>> >
>>>>>>>>> > So you may be hitting a different issue (unless there's a bug in the
>>>>>>>>> > pcmk 1.1 backport of the fix).
>>>>>>>>> >
>>>>>>>>> > I may be a little bit out of my area of knowledge here, but can you
>>>>>>>>> > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has some
>>>>>>>>> > insight.
>>>>>>>>> >
>>>>>>>>> > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
>>>>>>>>> > <svinther at gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Hi Hideo,
>>>>>>>>> >>
>>>>>>>>> >> If the fix is not going to make it into the CentOS7 pacemaker version,
>>>>>>>>> >> I guess the stable approach to take advantage of it is to build the
>>>>>>>>> >> cluster on another OS than CentOS7? A little late for that in this
>>>>>>>>> >> case though :)
>>>>>>>>> >>
>>>>>>>>> >> Regards
>>>>>>>>> >> Steffen
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> On Thu, Jan 7, 2021 at 7:27 AM <renayama19661014 at ybb.ne.jp> wrote:
>>>>>>>>> >> >
>>>>>>>>> >> > Hi Steffen,
>>>>>>>>> >> >
>>>>>>>>> >> > The fix pointed out by Reid is relevant here.
>>>>>>>>> >> >
>>>>>>>>> >> > Since the fencing action requested by the DC node exists only in the
>>>>>>>>> >> > DC node, such an event occurs.
>>>>>>>>> >> > You will need to use the fixed pacemaker to resolve the issue.
>>>>>>>>> >> >
>>>>>>>>> >> > Best Regards,
>>>>>>>>> >> > Hideo Yamauchi.
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> > ----- Original Message -----
>>>>>>>>> >> > > From: Reid Wahl <nwahl at redhat.com>
>>>>>>>>> >> > > To: Cluster Labs - All topics related to open-source clustering
>>>>>>>>> >> > > welcomed <users at clusterlabs.org>
>>>>>>>>> >> > > Cc:
>>>>>>>>> >> > > Date: 2021/1/7, Thu 15:07
>>>>>>>>> >> > > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>>>>>>>> >> > >
>>>>>>>>> >> > > Hi, Steffen. Are your cluster nodes all running the same Pacemaker
>>>>>>>>> >> > > versions? This looks like Bug 5401[1], which is fixed by upstream
>>>>>>>>> >> > > commit df71a07[2]. I'm a little bit confused about why it only shows
>>>>>>>>> >> > > up on one out of three nodes though.
>>>>>>>>> >> > >
>>>>>>>>> >> > > [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401
>>>>>>>>> >> > > [2] https://github.com/ClusterLabs/pacemaker/commit/df71a07
>>>>>>>>> >> > >
>>>>>>>>> >> > > On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
>>>>>>>>> >> > > <svinther at gmail.com> wrote:
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> Hello
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> node 1 is showing this in 'pcs status':
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> Pending Fencing Actions:
>>>>>>>>> >> > >> * reboot of kvm03-node02.avigol-gcs.dk pending: client=crmd.37819,
>>>>>>>>> >> > >> origin=kvm03-node03.avigol-gcs.dk
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> node 2 and node 3 output no such thing (node 3 is the DC).
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> Google is not much help; how can I investigate this further and
>>>>>>>>> >> > >> get rid of such a terrifying status message?
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> Regards
>>>>>>>>> >> > >> Steffen
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> _______________________________________________
>>>>>>>>> >> > >> Manage your subscription:
>>>>>>>>> >> > >> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>>>> >> > >>
>>>>>>>>> >> > >> ClusterLabs home: https://www.clusterlabs.org/
>>>>>>>>> >> > >
>>>>>>>>> >> > >
>>>>>>>>> >> > > --
>>>>>>>>> >> > > Regards,
>>>>>>>>> >> > >
>>>>>>>>> >> > > Reid Wahl, RHCA
>>>>>>>>> >> > > Senior Software
> Maintenance Engineer, Red
>>>>> Hat
>>>>>>>>> >> > > CEE - Platform Support
> Delivery -
>>>>> ClusterHA
>>>>>>>>> >> > >
>>>>>>>>> >> > >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >>
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Regards,
>>>>>>>>> >
>>>>>>>>> > Reid Wahl, RHCA
>>>>>>>>> > Senior Software Maintenance Engineer,
> Red Hat
>>>>>>>>> > CEE - Platform Support Delivery -
> ClusterHA
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Reid Wahl, RHCA
>>>>>>
>>>>>> Senior Software Maintenance Engineer, Red Hat
>>>>>> CEE - Platform Support Delivery - ClusterHA
>>>>>>
>>>>>>
>>>>>
>
>