[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

Fri Apr 23 18:00:32 EDT 2021

Hi Ken,
Hi Klaus,

Thanks for your comment.

>We did not have time to get it into the RHEL 8.4 GA (general
>availability) release, which means for example it will not be in 8.4
>install images, but we did get a 0-day fix, which means that it will be
>available via "yum update" the same day that 8.4 is released.
>
>Thanks for testing the 8.4 build and finding the issue!

Okay!

Best Regards,
Hideo Yamauchi.

----- Original Message -----
>From: Ken Gaillot <kgaillot at redhat.com>
>To: renayama19661014 at ybb.ne.jp 
>Cc: kwenning <kwenning at redhat.com>
>Date: 2021/4/24, Sat 01:25
>Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
> 
>Hi Hideo,
>
>A private reply to follow up:
>
>The fix will be in the 2.1.0 upstream release.
>
>We did not have time to get it into the RHEL 8.4 GA (general
>availability) release, which means for example it will not be in 8.4
>install images, but we did get a 0-day fix, which means that it will be
>available via "yum update" the same day that 8.4 is released.
>
>Thanks for testing the 8.4 build and finding the issue!
>
>On Thu, 2021-04-15 at 11:45 +0900, renayama19661014 at ybb.ne.jp wrote:
>> Hi Klaus,
>> Hi Ken,
>> 
>> We have confirmed that the operation is improved by the test.
>> Thank you for your prompt response.
>> 
>> We look forward to including this fix in the release version of RHEL
>> 8.4.
>> 
>> Best Regards,
>> Hideo Yamauchi.
>> 
>> 
>> 
>> ----- Original Message -----
>> > From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>> > To: "kwenning at redhat.com" <kwenning at redhat.com>; Cluster Labs - All
>> > topics related to open-source clustering welcomed <
>> > users at clusterlabs.org>; Cluster Labs - All topics related to open-
>> > source clustering welcomed <users at clusterlabs.org>
>> > Cc: 
>> > Date: 2021/4/13, Tue 07:08
>> > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource
>> > control fails.
>> > 
>> > Hi Klaus,
>> > Hi Ken,
>> > 
>> > >  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>> > >  I guess the simplest possible solution to the immediate issue so
>> > >  that we can discuss it.
>> > 
>> > 
>> > Thank you for the fix.
>> > 
>> > 
>> > I have confirmed that the fixes have been merged.
>> > 
>> > I'll test this fix today just in case.
>> > 
>> > Many thanks,
>> > Hideo Yamauchi.
>> > 
>> > 
>> > ----- Original Message -----
>> > >  From: Klaus Wenninger <kwenning at redhat.com>
>> > >  To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics
>> > >  related to open-source clustering welcomed <users at clusterlabs.org>
>> > >  Cc: 
>> > >  Date: 2021/4/12, Mon 22:22
>> > >  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>> > >  resource control fails.
>> > > 
>> > >  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>> > > >   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>> > > > >   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>> > > > > >   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>> > > > > > >   On 4/9/21 2:37 PM, renayama19661014 at ybb.ne.jp wrote:
>> > > > > > > >   Hi Klaus,
>> > > > > > > > 
>> > > > > > > >   Thanks for your comment.
>> > > > > > > > 
>> > > > > > > > >   Hmm ... is that with selinux enabled?
>> > > > > > > > >   Respectively do you see any related avc messages?
>> > > > > > > > 
>> > > > > > > >   Selinux is not enabled.
>> > > > > > > >   Isn't this caused by crm_mon not getting a response when
>> > > > > > > >   pacemakerd prepares to stop?
>> > > > > > 
>> > > > > >   yep ... that doesn't look good.
>> > > > > >   While in pcmk_shutdown_worker ipc isn't handled.
>> > > > > 
>> > > > >   Stop ... that should actually work as pcmk_shutdown_worker
>> > > > >   should exit quite quickly and proceed after mainloop
>> > > > >   dispatching when called again.
>> > > > >   Don't see anything atm that might be blocking for longer ...
>> > > > >   but let me dig into it further ...
>> > > > 
>> > > >   What happens is clear (thanks Ken for the hint ;-) ).
>> > > >   When pacemakerd is shutting down - already when it
>> > > >   shuts down the resources and not just when it starts to
>> > > >   reap the subdaemons - crm_mon reads that state and
>> > > >   doesn't try to connect to the cib anymore.
>> > > 
>> > >  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>> > >  I guess the simplest possible solution to the immediate issue so
>> > >  that we can discuss it.
>> > > > > >   Question is why that didn't create issue earlier.
>> > > > > >   Probably I didn't test with resources that had crm_mon in
>> > > > > >   their stop/monitor-actions but sbd should have run into
>> > > > > >   issues.
>> > > > > > 
>> > > > > >   Klaus
>> > > > > > >   But when shutting down a node the resources should be
>> > > > > > >   shutdown before pacemakerd goes down.
>> > > > > > >   But let me have a look if it can happen that pacemakerd
>> > > > > > >   doesn't react to the ipc-pings before. That btw. might be
>> > > > > > >   lethal for sbd-scenarios (if the phase is too long and it
>> > > > > > >   might actually not be defined).
>> > > > > > > 
>> > > > > > >   My idea with selinux would have been that it might block
>> > > > > > >   the ipc if crm_mon is issued by execd. But well forget
>> > > > > > >   about it as it is not enabled ;-)
>> > > > > > > 
>> > > > > > > 
>> > > > > > >   Klaus
>> > > > > > > > 
>> > > > > > > >   pgsql needs the result of crm_mon in demote processing and stop
>> > > > > > > >   processing.
>> > > > > > > >   crm_mon should return a response even after pacemakerd goes into a
>> > > > > > > >   stop operation.
>> > > > > > > > 
>> > > > > > > >   Best Regards,
>> > > > > > > >   Hideo Yamauchi.
>> > > > > > > > 
>> > > > > > > > 
>> > > > > > > >   ----- Original Message -----
>> > > > > > > > >   From: Klaus Wenninger <kwenning at redhat.com>
>> > > > > > > > >   To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related
>> > > > > > > > >   to open-source clustering welcomed <users at clusterlabs.org>
>> > > > > > > > >   Cc:
>> > > > > > > > >   Date: 2021/4/9, Fri 21:12
>> > > > > > > > >   Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>> > > > > > > > >   resource control fails.
>> > > > > > > > > 
>> > > > > > > > >   On 4/8/21 11:21 PM, renayama19661014 at ybb.ne.jp wrote:
>> > > > > > > > > >     Hi Ken,
>> > > > > > > > > >     Hi All,
>> > > > > > > > > > 
>> > > > > > > > > >     In the pgsql resource, crm_mon is executed in the process of
>> > > > > > > > > >     demote and stop, and the result is processed.
>> > > > > > > > > >     However, pacemaker included in RHEL8.4beta fails to execute
>> > > > > > > > > >     this crm_mon.
>> > > > > > > > > >       - The problem also occurs on github
>> > > > > > > > > >         master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>> > > > > > > > > >     The problem can be easily reproduced in the following ways.
>> > > > > > > > > > 
>> > > > > > > > > >     Step1. Modify to execute crm_mon in the stop process of the
>> > > > > > > > > >     Dummy resource.
>> > > > > > > > > >     ----
>> > > > > > > > > > 
>> > > > > > > > > >     dummy_stop() {
>> > > > > > > > > >          mon=$(crm_mon -1)
>> > > > > > > > > >          ret=$?
>> > > > > > > > > >          ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>> > > > > > > > > >          dummy_monitor
>> > > > > > > > > >          if [ $? =  $OCF_SUCCESS ]; then
>> > > > > > > > > >              rm ${OCF_RESKEY_state}
>> > > > > > > > > >          fi
>> > > > > > > > > >          return $OCF_SUCCESS
>> > > > > > > > > >     }
>> > > > > > > > > >     ----
>> > > > > > > > > > 
>> > > > > > > > > >     Step2. Configure a cluster with two nodes.
>> > > > > > > > > >     ----
>> > > > > > > > > > 
>> > > > > > > > > >     [root at rh84-beta01 ~]# crm_mon -rfA1
>> > > > > > > > > >     Cluster Summary:
>> > > > > > > > > >        * Stack: corosync
>> > > > > > > > > >        * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>> > > > > > > > > >        * Last updated: Thu Apr  8 18:00:52 2021
>> > > > > > > > > >        * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
>> > > > > > > > > >        * 2 nodes configured
>> > > > > > > > > >        * 1 resource instance configured
>> > > > > > > > > > 
>> > > > > > > > > >     Node List:
>> > > > > > > > > >        * Online: [ rh84-beta01 rh84-beta02 ]
>> > > > > > > > > > 
>> > > > > > > > > >     Full List of Resources:
>> > > > > > > > > >        * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>> > > > > > > > > > 
>> > > > > > > > > >     Migration Summary:
>> > > > > > > > > >     ----
>> > > > > > > > > > 
>> > > > > > > > > >     Step3. Stop the node where the Dummy resource is running. The
>> > > > > > > > > >     resource will fail over.
>> > > > > > > > > >     ----
>> > > > > > > > > >     [root at rh84-beta02 ~]# crm_mon -rfA1
>> > > > > > > > > >     Cluster Summary:
>> > > > > > > > > >        * Stack: corosync
>> > > > > > > > > >        * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>> > > > > > > > > >        * Last updated: Thu Apr  8 18:08:56 2021
>> > > > > > > > > >        * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
>> > > > > > > > > >        * 2 nodes configured
>> > > > > > > > > >        * 1 resource instance configured
>> > > > > > > > > > 
>> > > > > > > > > >     Node List:
>> > > > > > > > > >        * Online: [ rh84-beta02 ]
>> > > > > > > > > >        * OFFLINE: [ rh84-beta01 ]
>> > > > > > > > > > 
>> > > > > > > > > >     Full List of Resources:
>> > > > > > > > > >        * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>> > > > > > > > > >     ----
>> > > > > > > > > > 
>> > > > > > > > > >     However, if you look at the log, you can see that the
>> > > > > > > > > >     execution of crm_mon in the stop processing of the Dummy
>> > > > > > > > > >     resource has failed.
>> > > > > > > > > >     ----
>> > > > > > > > > >     Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>> > > > > > > > > >     Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>> > > > > > > > >   Hmm ... is that with selinux enabled?
>> > > > > > > > >   Respectively do you see any related avc messages?
>> > > > > > > > > 
>> > > > > > > > >   Klaus
>> > > > > > > > > >     ----
>> > > > > > > > > > 
>> > > > > > > > > >     Similarly, pgsql also executes crm_mon with demote or stop, so
>> > > > > > > > > >     control fails.
>> > > > > > > > > >     The problem seems to be related to the following fix.
>> > > > > > > > > >       * Report pacemakerd in state waiting for sbd
>> > > > > > > > > >        - https://github.com/ClusterLabs/pacemaker/pull/2278
>> > > > > > > > > > 
>> > > > > > > > > >     The problem does not occur with the release version of
>> > > > > > > > > >     Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.
>> > > > > > > > > >     This issue has a huge impact on the user.
>> > > > > > > > > > 
>> > > > > > > > > >     Perhaps it also affects the control of other resources that
>> > > > > > > > > >     utilize crm_mon.
>> > > > > > > > > >     Please improve the release version of RHEL8.4 so that it
>> > > > > > > > > >     includes Pacemaker which does not cause this problem.
>> > > > > > > > > >       * Distributions other than RHEL may also be affected in
>> > > > > > > > > >         future releases.
>> > > > > > > > > > 
>> > > > > > > > > >     ----
>> > > > > > > > > >     This content is the same as the following Bugzilla.
>> > > > > > > > > >       - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>> > > > > > > > > >     ----
>> > > > > > > > > > 
>> > > > > > > > > >     Best Regards,
>> > > > > > > > > >     Hideo Yamauchi.
>> > > > > > > > > > 
>> > _______________________________________________
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users
>> > 
>> > ClusterLabs home: https://www.clusterlabs.org/
>> > 
>-- 
>Ken Gaillot <kgaillot at redhat.com>
>
>
>
>
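
For anyone who wants to repeat the check from Step1 outside the Dummy
agent, the idea in the quoted report (run crm_mon once, keep its output
and exit code apart, and log a failure instead of ignoring it) can be
sketched as a small shell helper. This is only a sketch under
assumptions: query_cluster_state, CRM_MON_CMD, CLUSTER_STATE_OUT and
CLUSTER_STATE_RC are hypothetical names made up for illustration and are
not part of the Dummy or pgsql resource agents. Only the "crm_mon -1"
call itself comes from the report.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical, not part of any
# existing resource agent.

# Command used to query cluster status. It is overridable so the helper
# can be exercised on a machine without Pacemaker installed.
CRM_MON_CMD="${CRM_MON_CMD:-crm_mon -1}"

# Run the status command once.
# Sets CLUSTER_STATE_OUT (combined stdout/stderr) and CLUSTER_STATE_RC
# (exit code of the command). Returns 0 only if the command succeeded.
query_cluster_state() {
    if CLUSTER_STATE_OUT=$($CRM_MON_CMD 2>&1); then
        CLUSTER_STATE_RC=0
    else
        CLUSTER_STATE_RC=$?
    fi

    if [ "$CLUSTER_STATE_RC" -ne 0 ]; then
        # Report the failure on stderr so the caller's log shows it.
        echo "status query failed rc=${CLUSTER_STATE_RC}: ${CLUSTER_STATE_OUT}" >&2
        return 1
    fi
    return 0
}
```

A stop or demote action could call query_cluster_state and decide what
to do when it returns non-zero, for example when crm_mon exits with 102
as in the quoted log.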