[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

Fri Apr 9 09:45:36 EDT 2021

On 4/9/21 3:36 PM, Klaus Wenninger wrote:
> On 4/9/21 2:37 PM, renayama19661014 at ybb.ne.jp wrote:
>> Hi Klaus,
>>
>> Thanks for your comment.
>>
>>> Hmm ... is that with selinux enabled?
>>> Respectively do you see any related avc messages?
>>
>> Selinux is not enabled.
>> Isn't the crm_mon failure caused by pacemakerd not returning a 
>> response while it prepares to stop?
Yep ... that doesn't look good.
While pacemakerd is in pcmk_shutdown_worker, IPC isn't handled.
The question is why that didn't cause issues earlier.
I probably didn't test with resources that call crm_mon in
their stop/monitor actions, but sbd should have run into
issues.
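
Until pacemakerd answers IPC during shutdown, an agent could at least
tell "cluster not reachable" apart from a usable crm_mon result. A
minimal sketch, assuming the exit status 102 logged in the report means
crm_mon could not connect; run_crm_mon and CRM_MON_DISCONNECTED are
hypothetical names, not part of the Dummy or pgsql agents:

```shell
#!/bin/sh
# Hypothetical helper, not part of any shipped resource agent.
# Assumption: 102 is the status seen as "crm_mon[102]" in the reported log.
CRM_MON_DISCONNECTED=102

# Runs the given command (default: crm_mon -1), prints its stdout, and
# returns 0 on success, 2 if the command exited with 102, 1 otherwise.
run_crm_mon() {
    if [ "$#" -eq 0 ]; then
        set -- crm_mon -1
    fi
    # The && / || chain records the status without tripping "set -e".
    out=$("$@" 2>/dev/null) && rc=0 || rc=$?
    printf '%s\n' "$out"
    case "$rc" in
        0) return 0 ;;
        "$CRM_MON_DISCONNECTED") return 2 ;;
        *) return 1 ;;
    esac
}
```

A stop action could call run_crm_mon and, on return value 2, log a
warning and skip the steps that depend on cluster state instead of
parsing "Pacemaker daemons shutting down ..." as if it were status
output. This only makes the failure visible; it does not replace a fix
in pacemakerd.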

Klaus
> But when shutting down a node the resources should be
> shutdown before pacemakerd goes down.
> But let me have a look if it can happen that pacemakerd
> doesn't react to the ipc-pings before. That btw. might be
> lethal for sbd-scenarios (if the phase is too long and it
> might actually not be defined).
>
> My idea with selinux would have been that it might block
> the ipc if crm_mon is issued by execd. But well forget
> about it as it is not enabled ;-)
>
>
> Klaus
>>
>> pgsql needs the result of crm_mon in demote processing and stop 
>> processing.
>> crm_mon should return a response even after pacemakerd goes into a 
>> stop operation.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> ----- Original Message -----
>>> From: Klaus Wenninger <kwenning at redhat.com>
>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to 
>>> open-source clustering welcomed <users at clusterlabs.org>
>>> Cc:
>>> Date: 2021/4/9, Fri 21:12
>>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource 
>>> control fails.
>>>
>>> On 4/8/21 11:21 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>   Hi Ken,
>>>>   Hi All,
>>>>
>>>>   In the pgsql resource, crm_mon is executed during demote and
>>>>   stop, and its result is processed.
>>>>   However, the pacemaker included in RHEL8.4beta fails to execute
>>>>   this crm_mon.
>>>>     - The problem also occurs on github
>>>>       master (c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>   The problem can be easily reproduced as follows.
>>>>
>>>>   Step1. Modify the Dummy resource agent to execute crm_mon in its
>>>>   stop action.
>>>>   ----
>>>>
>>>>   dummy_stop() {
>>>>        # Capture crm_mon output and exit status so both reach the log.
>>>>        mon=$(crm_mon -1)
>>>>        ret=$?
>>>>        ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>        dummy_monitor
>>>>        if [ $? -eq "$OCF_SUCCESS" ]; then
>>>>            rm "${OCF_RESKEY_state}"
>>>>        fi
>>>>        return $OCF_SUCCESS
>>>>   }
>>>>   ----
>>>>
>>>>   Step2. Configure a cluster with two nodes.
>>>>   ----
>>>>
>>>>   [root at rh84-beta01 ~]# crm_mon -rfA1
>>>>   Cluster Summary:
>>>>      * Stack: corosync
>>>>      * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>      * Last updated: Thu Apr  8 18:00:52 2021
>>>>      * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>      * 2 nodes configured
>>>>      * 1 resource instance configured
>>>>
>>>>   Node List:
>>>>      * Online: [ rh84-beta01 rh84-beta02 ]
>>>>
>>>>   Full List of Resources:
>>>>      * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>>>>
>>>>   Migration Summary:
>>>>   ----
>>>>
>>>>   Step3. Stop the node where the Dummy resource is running. The
>>>>   resource will fail over.
>>>>   ----
>>>>   [root at rh84-beta02 ~]# crm_mon -rfA1
>>>>   Cluster Summary:
>>>>      * Stack: corosync
>>>>      * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>      * Last updated: Thu Apr  8 18:08:56 2021
>>>>      * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>      * 2 nodes configured
>>>>      * 1 resource instance configured
>>>>
>>>>   Node List:
>>>>      * Online: [ rh84-beta02 ]
>>>>      * OFFLINE: [ rh84-beta01 ]
>>>>
>>>>   Full List of Resources:
>>>>      * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>>>>   ----
>>>>
>>>>   However, if you look at the log, you can see that the execution
>>>>   of crm_mon in the stop processing of the Dummy resource has failed.
>>>>   ----
>>>>   Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>   Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>> Hmm ... is that with selinux enabled?
>>> Respectively do you see any related avc messages?
>>>
>>> Klaus
>>>>   ----
>>>>
>>>>   Similarly, pgsql also executes crm_mon during demote and stop, so
>>>>   resource control fails.
>>>>   The problem seems to be related to the following fix.
>>>>     * Report pacemakerd in state waiting for sbd
>>>>      - https://github.com/ClusterLabs/pacemaker/pull/2278
>>>>
>>>>   The problem does not occur with the release version of Pacemaker
>>>>   2.0.5 or the Pacemaker included with RHEL8.3.
>>>>   This issue has a huge impact on users.
>>>>
>>>>   Perhaps it also affects the control of other resources that use
>>>>   crm_mon.
>>>>   Please make sure the release version of RHEL8.4 includes a
>>>>   Pacemaker that does not have this problem.
>>>>     * Distributions other than RHEL may also be affected in future
>>>>       releases.
>>>>
>>>>   ----
>>>>   This content is the same as the following Bugzilla.
>>>>     - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>>>>   ----
>>>>
>>>>   Best Regards,
>>>>   Hideo Yamauchi.
>>>>
>
>
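
To check whether a node hit this during a stop, the error text from the
quoted log excerpt can be searched for directly. A small sketch;
count_crm_mon_failures is a hypothetical helper, and the match string is
copied from the log above (which is a single line in the real log, not
wrapped as in the mail):

```shell
#!/bin/sh
# Hypothetical helper: count log lines carrying the crm_mon failure text.
# Reads the files given as arguments, or stdin when none are given.
# With several files, grep -c prints one "file:count" line per file.
count_crm_mon_failures() {
    # grep -c exits non-zero when nothing matches (and on read errors);
    # "|| true" masks that so the function always returns 0.
    grep -c -F 'crm_mon: Error: cluster is not available on this node' "$@" || true
}
```

For example, `count_crm_mon_failures /var/log/pacemaker/pacemaker.log`,
where the log path is an assumption and depends on how logging is
configured on the node.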