[ClusterLabs] Pending Fencing Actions shown in pcs status

Klaus Wenninger kwenning at redhat.com
Thu Jan 7 07:41:35 EST 2021


On 1/7/21 1:13 PM, Steffen Vinther Sørensen wrote:
> Hi Klaus,
>
> Yes, then the status does sync to the other nodes. Also it looks like
> there are some hostname resolving problems in play here, maybe causing
> problems. Here are my notes from restarting pacemaker etc.
I don't think there are hostname resolving problems.
The messages you are seeing, which look as if there were,
are caused by using -EHOSTUNREACH as the error code to
fail a pending fence action when a node that is just
coming up sees a pending action claimed to be handled
by itself.
Back then I chose that error code because none of the
readily available ones really matched, and it was urgent
for some reason, so introducing something new was too
risky at that stage.
It would probably make sense to introduce something
more descriptive.
Back then the issue was triggered by fenced crashing and
being restarted - so not a node-restart but just fenced
restarting.
And it looks as if building the failure message itself failed somehow.
So that could be the reason why the pending action persists.
That would be something other than what we solved with Bug 5401.
But what triggers the logs below might as well just be a
follow-up issue after the Bug 5401 thing.
Will try to find time for a deeper look later today.

Klaus
>
> pcs cluster standby kvm03-node02.avigol-gcs.dk
> pcs cluster stop kvm03-node02.avigol-gcs.dk
> pcs status
>
> Pending Fencing Actions:
> * reboot of kvm03-node02.avigol-gcs.dk pending: client=crmd.37819,
> origin=kvm03-node03.avigol-gcs.dk
>
> # From logs on all 3 nodes:
> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:  warning: received
> pending action we are supposed to be the owner but it's not in our
> records -> fail it
> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:    error: Operation
> 'reboot' targeting kvm03-node02.avigol-gcs.dk on <no-one> for
> crmd.37819 at kvm03-node03.avigol-gcs.dk.56a3018c: No route to host
> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:    error:
> stonith_construct_reply: Triggered assert at commands.c:2406 : request
> != NULL
> Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:  warning: Can't create
> a sane reply
> Jan 07 12:48:18 kvm03-node03 crmd[37819]:   notice: Peer
> kvm03-node02.avigol-gcs.dk was not terminated (reboot) by <anyone> on
> behalf of crmd.37819: No route to host
>
> pcs cluster start kvm03-node02.avigol-gcs.dk
> pcs status (now outputs the same on all 3 nodes)
>
> Failed Fencing Actions:
> * reboot of kvm03-node02.avigol-gcs.dk failed: delegate=,
> client=crmd.37819, origin=kvm03-node03.avigol-gcs.dk,
>     last-failed='Thu Jan  7 12:48:18 2021'
>
>
> pcs cluster unstandby kvm03-node02.avigol-gcs.dk
>
> # Now libvirtd refuses to start
>
> Jan 07 12:51:44 kvm03-node02 dnsmasq[20884]: read /etc/hosts - 8 addresses
> Jan 07 12:51:44 kvm03-node02 dnsmasq[20884]: read
> /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
> Jan 07 12:51:44 kvm03-node02 dnsmasq-dhcp[20884]: read
> /var/lib/libvirt/dnsmasq/default.hostsfile
> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
> 11:51:44.729+0000: 24160: info : libvirt version: 4.5.0, package:
> 36.el7_9.3 (CentOS BuildSystem <http://bugs.centos.org>,
> 2020-11-16-16:25:20, x86-01.bsys.centos.org)
> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
> 11:51:44.729+0000: 24160: info : hostname: kvm03-node02
> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
> 11:51:44.729+0000: 24160: error : qemuMonitorOpenUnix:392 : failed to
> connect to monitor socket: Connection refused
> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
> 11:51:44.729+0000: 24159: error : qemuMonitorOpenUnix:392 : failed to
> connect to monitor socket: Connection refused
> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
> 11:51:44.730+0000: 24161: error : qemuMonitorOpenUnix:392 : failed to
> connect to monitor socket: Connection refused
> Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
> 11:51:44.730+0000: 24162: error : qemuMonitorOpenUnix:392 : failed to
> connect to monitor socket: Connection refused
>
> pcs status
>
> Failed Resource Actions:
> * libvirtd_start_0 on kvm03-node02.avigol-gcs.dk 'unknown error' (1):
> call=142, status=complete, exitreason='',
>     last-rc-change='Thu Jan  7 12:51:44 2021', queued=0ms, exec=2157ms
>
> Failed Fencing Actions:
> * reboot of kvm03-node02.avigol-gcs.dk failed: delegate=,
> client=crmd.37819, origin=kvm03-node03.avigol-gcs.dk,
>     last-failed='Thu Jan  7 12:48:18 2021'
>
>
> # from /etc/hosts on all 3 nodes:
>
> 172.31.0.31    kvm03-node01 kvm03-node01.avigol-gcs.dk
> 172.31.0.32    kvm03-node02 kvm03-node02.avigol-gcs.dk
> 172.31.0.33    kvm03-node03 kvm03-node03.avigol-gcs.dk
>
> On Thu, Jan 7, 2021 at 11:15 AM Klaus Wenninger <kwenning at redhat.com> wrote:
>> Hi Steffen,
>>
>> If you just see the leftover pending action on one node,
>> it would be interesting to know whether restarting
>> pacemaker on one of the other nodes syncs it to all of
>> the nodes.
>>
>> Regards,
>> Klaus
>>
>> On 1/7/21 9:54 AM, renayama19661014 at ybb.ne.jp wrote:
>>> Hi Steffen,
>>>
>>>> Unfortunately not sure about the exact scenario. But I have been doing
>>>> some recent experiments with node standby/unstandby stop/start. This
>>>> to get procedures right for updating node rpms etc.
>>>>
>>>> Later I noticed the uncomforting "pending fencing actions" status msg.
>>> Okay!
>>>
>>> We will repeat the standby and unstandby steps in the same way to check.
>>> We will start checking after tomorrow, so I think it may take until next week.
>>>
>>>
>>> Many thanks,
>>> Hideo Yamauchi.
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>> To: Reid Wahl <nwahl at redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>>> Cc:
>>>> Date: 2021/1/7, Thu 17:51
>>>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>>>
>>>> Hi Steffen,
>>>> Hi Reid,
>>>>
>>>> The fencing history is kept inside stonith-ng and is not written to the CIB.
>>>> However, capturing the entire CIB and sending it will help with reproducing
>>>> the problem.
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: Reid Wahl <nwahl at redhat.com>
>>>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to
>>>> open-source clustering welcomed <users at clusterlabs.org>
>>>>> Date: 2021/1/7, Thu 17:39
>>>>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>>>>
>>>>>
>>>>> Hi, Steffen. Those attachments don't contain the CIB. They contain the
>>>> `pcs config` output. You can get the cib with `pcs cluster cib >
>>>> $(hostname).cib.xml`.
>>>>> Granted, it's possible that this fence action information wouldn't
>>>> be in the CIB at all. It might be stored in fencer memory.
>>>>> On Thu, Jan 7, 2021 at 12:26 AM <renayama19661014 at ybb.ne.jp> wrote:
>>>>>
>>>>> Hi Steffen,
>>>>>>>  Here CIB settings attached (pcs config show) for all 3 of my nodes
>>>>>>>  (all 3 seems 100% identical), node03 is the DC.
>>>>>> Thank you for the attachment.
>>>>>>
>>>>>> What is the scenario when this situation occurs?
>>>>>> In what steps did the problem appear when fencing was performed (or
>>>> failed)?
>>>>>> Best Regards,
>>>>>> Hideo Yamauchi.
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>>  From: Steffen Vinther Sørensen <svinther at gmail.com>
>>>>>>>  To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related
>>>> to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>>  Cc:
>>>>>>>  Date: 2021/1/7, Thu 17:05
>>>>>>>  Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs
>>>> status
>>>>>>>  Hi Hideo,
>>>>>>>
>>>>>>>  Here CIB settings attached (pcs config show) for all 3 of my nodes
>>>>>>>  (all 3 seems 100% identical), node03 is the DC.
>>>>>>>
>>>>>>>  Regards
>>>>>>>  Steffen
>>>>>>>
>>>>>>>  On Thu, Jan 7, 2021 at 8:06 AM <renayama19661014 at ybb.ne.jp>
>>>> wrote:
>>>>>>>>   Hi Steffen,
>>>>>>>>   Hi Reid,
>>>>>>>>
>>>>>>>>   I also checked the Centos source rpm and it seems to include a
>>>> fix for the
>>>>>>>  problem.
>>>>>>>>   As Steffen suggested, if you share your CIB settings, I might
>>>> know
>>>>>>>  something.
>>>>>>>>   If this issue is the same as the fix, the display will only be
>>>> displayed on
>>>>>>>  the DC node and will not affect the operation.
>>>>>>>>   The pending actions shown will remain for a long time, but
>>>> will not have a
>>>>>>>  negative impact on the cluster.
>>>>>>>>   Best Regards,
>>>>>>>>   Hideo Yamauchi.
>>>>>>>>
>>>>>>>>
>>>>>>>>   ----- Original Message -----
>>>>>>>>   > From: Reid Wahl <nwahl at redhat.com>
>>>>>>>>   > To: Cluster Labs - All topics related to open-source
>>>> clustering
>>>>>>>  welcomed <users at clusterlabs.org>
>>>>>>>>   > Cc:
>>>>>>>>   > Date: 2021/1/7, Thu 15:58
>>>>>>>>   > Subject: Re: [ClusterLabs] Pending Fencing Actions shown
>>>> in pcs status
>>>>>>>>   >
>>>>>>>>   > It's supposedly fixed in that version.
>>>>>>>>   >   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749
>>>>>>>>   >   - https://access.redhat.com/solutions/4713471
>>>>>>>>   >
>>>>>>>>   > So you may be hitting a different issue (unless
>>>> there's a bug in
>>>>>>>  the
>>>>>>>>   > pcmk 1.1 backport of the fix).
>>>>>>>>   >
>>>>>>>>   > I may be a little bit out of my area of knowledge here,
>>>> but can you
>>>>>>>>   > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or
>>>> Ken has some
>>>>>>>>   > insight.
>>>>>>>>   >
>>>>>>>>   > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
>>>>>>>>   > <svinther at gmail.com> wrote:
>>>>>>>>   >>
>>>>>>>>   >>  Hi Hideo,
>>>>>>>>   >>
>>>>>>>>   >>  If the fix is not going to make it into the CentOS7
>>>> pacemaker
>>>>>>>  version,
>>>>>>>>   >>  I guess the stable approach to take advantage of it
>>>> is to build
>>>>>>>  the
>>>>>>>>   >>  cluster on another OS than CentOS7 ? A little late
>>>> for that in
>>>>>>>  this
>>>>>>>>   >>  case though :)
>>>>>>>>   >>
>>>>>>>>   >>  Regards
>>>>>>>>   >>  Steffen
>>>>>>>>   >>
>>>>>>>>   >>
>>>>>>>>   >>
>>>>>>>>   >>
>>>>>>>>   >>  On Thu, Jan 7, 2021 at 7:27 AM
>>>> <renayama19661014 at ybb.ne.jp>
>>>>>>>  wrote:
>>>>>>>>   >>  >
>>>>>>>>   >>  > Hi Steffen,
>>>>>>>>   >>  >
>>>>>>>>   >>  > The fix pointed out by Reid is affecting it.
>>>>>>>>   >>  >
>>>>>>>>   >>  > Since the fencing action requested by the DC
>>>> node exists
>>>>>>>  only in the
>>>>>>>>   > DC node, such an event occurs.
>>>>>>>>   >>  > You will need to take advantage of the modified
>>>> pacemaker to
>>>>>>>  resolve
>>>>>>>>   > the issue.
>>>>>>>>   >>  >
>>>>>>>>   >>  > Best Regards,
>>>>>>>>   >>  > Hideo Yamauchi.
>>>>>>>>   >>  >
>>>>>>>>   >>  >
>>>>>>>>   >>  >
>>>>>>>>   >>  > ----- Original Message -----
>>>>>>>>   >>  > > From: Reid Wahl <nwahl at redhat.com>
>>>>>>>>   >>  > > To: Cluster Labs - All topics related to
>>>> open-source
>>>>>>>  clustering
>>>>>>>>   > welcomed <users at clusterlabs.org>
>>>>>>>>   >>  > > Cc:
>>>>>>>>   >>  > > Date: 2021/1/7, Thu 15:07
>>>>>>>>   >>  > > Subject: Re: [ClusterLabs] Pending Fencing
>>>> Actions
>>>>>>>  shown in pcs
>>>>>>>>   > status
>>>>>>>>   >>  > >
>>>>>>>>   >>  > > Hi, Steffen. Are your cluster nodes all
>>>> running the
>>>>>>>  same
>>>>>>>>   > Pacemaker
>>>>>>>>   >>  > > versions? This looks like Bug 5401[1],
>>>> which is fixed
>>>>>>>  by upstream
>>>>>>>>   >>  > > commit df71a07[2]. I'm a little bit
>>>> confused about
>>>>>>>  why it
>>>>>>>>   > only shows
>>>>>>>>   >>  > > up on one out of three nodes though.
>>>>>>>>   >>  > >
>>>>>>>>   >>  > > [1]
>>>> https://bugs.clusterlabs.org/show_bug.cgi?id=5401
>>>>>>>>   >>  > > [2]
>>>>>>>  https://github.com/ClusterLabs/pacemaker/commit/df71a07
>>>>>>>>   >>  > >
>>>>>>>>   >>  > > On Tue, Jan 5, 2021 at 8:31 AM Steffen
>>>> Vinther Sørensen
>>>>>>>>   >>  > > <svinther at gmail.com> wrote:
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >>  Hello
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >>  node 1 is showing this in 'pcs
>>>> status'
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >>  Pending Fencing Actions:
>>>>>>>>   >>  > >>  * reboot of
>>>> kvm03-node02.avigol-gcs.dk pending:
>>>>>>>>   > client=crmd.37819,
>>>>>>>>   >>  > >>  origin=kvm03-node03.avigol-gcs.dk
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >>  node 2 and node 3 outputs no such
>>>> thing (node 3 is
>>>>>>>  DC)
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >>  Google is not much help, how to
>>>> investigate this
>>>>>>>  further and
>>>>>>>>   > get rid
>>>>>>>>   >>  > >>  of such terrifying status message ?
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >>  Regards
>>>>>>>>   >>  > >>  Steffen
>>>>>>>>   >>  > >>
>>>> _______________________________________________
>>>>>>>>   >>  > >>  Manage your subscription:
>>>>>>>>   >>  > >>
>>>>>>>  https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >>  ClusterLabs home:
>>>> https://www.clusterlabs.org/
>>>>>>>>   >>  > >>
>>>>>>>>   >>  > >
>>>>>>>>   >>  > >
>>>>>>>>   >>  > > --
>>>>>>>>   >>  > > Regards,
>>>>>>>>   >>  > >
>>>>>>>>   >>  > > Reid Wahl, RHCA
>>>>>>>>   >>  > > Senior Software Maintenance Engineer, Red
>>>> Hat
>>>>>>>>   >>  > > CEE - Platform Support Delivery -
>>>> ClusterHA
>>>>>>>>   >>  > >
>>>>>>>>   >>  > >
>>>>>>>>   >>  > >
>>>>>>>>   >>  >
>>>>>>>>   >
>>>>>>>>   >
>>>>>>>>   >
>>>>>>>>   > --
>>>>>>>>   > Regards,
>>>>>>>>   >
>>>>>>>>   > Reid Wahl, RHCA
>>>>>>>>   > Senior Software Maintenance Engineer, Red Hat
>>>>>>>>   > CEE - Platform Support Delivery - ClusterHA
>>>>>>>>   >
>>>>>>>>   >
>>>>>>>>
>>>>>>
>>>>> --
>>>>>
>>>>> Regards,
>>>>>
>>>>> Reid Wahl, RHCA
>>>>>
>>>>> Senior Software Maintenance Engineer, Red Hat
>>>>> CEE - Platform Support Delivery - ClusterHA
>>>>>
>>>>>
>>>>


