[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
renayama19661014 at ybb.ne.jp
Mon Apr 12 18:07:27 EDT 2021
Hi Klaus,
Hi Ken,
> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.
Thank you for the fix.
I have confirmed that the fixes have been merged.
I'll test this fix today just in case.
Many thanks,
Hideo Yamauchi.
----- Original Message -----
> From: Klaus Wenninger <kwenning at redhat.com>
> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2021/4/12, Mon 22:22
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>
> On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>> On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>> On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>> On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>> On 4/9/21 2:37 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>> Hi Klaus,
>>>>>>
>>>>>> Thanks for your comment.
>>>>>>
>>>>>>> Hmm ... is that with selinux enabled?
>>>>>>> Respectively do you see any related avc messages?
>>>>>>
>>>>>> Selinux is not enabled.
>>>>>> Isn't this caused by crm_mon not receiving a response while
>>>>>> pacemakerd prepares to stop?
>>>> yep ... that doesn't look good.
>>>> While in pcmk_shutdown_worker ipc isn't handled.
>>> Stop ... that should actually work as pcmk_shutdown_worker
>>> should exit quite quickly and proceed after mainloop
>>> dispatching when called again.
>>> Don't see anything atm that might be blocking for longer ...
>>> but let me dig into it further ...
>> What happens is clear (thanks Ken for the hint ;-) ).
>> When pacemakerd is shutting down - already when it
>> shuts down the resources and not just when it starts to
>> reap the subdaemons - crm_mon reads that state and
>> doesn't try to connect to the cib anymore.
> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.
>>>> Question is why that didn't create issue earlier.
>>>> Probably I didn't test with resources that had crm_mon in
>>>> their stop/monitor-actions but sbd should have run into
>>>> issues.
>>>>
>>>> Klaus
>>>>> But when shutting down a node the resources should be
>>>>> shutdown before pacemakerd goes down.
>>>>> But let me have a look if it can happen that pacemakerd
>>>>> doesn't react to the ipc-pings before. That btw. might be
>>>>> lethal for sbd-scenarios (if the phase is too long and it
>>>>> might actually not be defined).
>>>>>
>>>>> My idea with selinux would have been that it might block
>>>>> the ipc if crm_mon is issued by execd. But well forget
>>>>> about it as it is not enabled ;-)
>>>>>
>>>>>
>>>>> Klaus
>>>>>>
>>>>>> pgsql needs the result of crm_mon in demote processing and stop
>>>>>> processing.
>>>>>> crm_mon should return a response even after pacemakerd goes into a
>>>>>> stop operation.
>>>>>>
>>>>>> Best Regards,
>>>>>> Hideo Yamauchi.
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: Klaus Wenninger <kwenning at redhat.com>
>>>>>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>> Cc:
>>>>>>> Date: 2021/4/9, Fri 21:12
>>>>>>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>>>>>>>
>>>>>>> On 4/8/21 11:21 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>> Hi Ken,
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> In the pgsql resource, crm_mon is executed in the process of
>>>>>>>> demote and stop, and the result is processed.
>>>>>>>> However, pacemaker included in RHEL8.4beta fails to execute
>>>>>>>> this crm_mon.
>>>>>>>> - The problem also occurs on github master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>>>> The problem can be easily reproduced in the following ways.
>>>>>>>>
>>>>>>>> Step1. Modify to execute crm_mon in the stop process of the
>>>>>>>> Dummy resource.
>>>>>>>> ----
>>>>>>>>
>>>>>>>> dummy_stop() {
>>>>>>>>     mon=$(crm_mon -1)
>>>>>>>>     ret=$?
>>>>>>>>     ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>>>>>     dummy_monitor
>>>>>>>>     if [ $? = $OCF_SUCCESS ]; then
>>>>>>>>         rm ${OCF_RESKEY_state}
>>>>>>>>     fi
>>>>>>>>     return $OCF_SUCCESS
>>>>>>>> }
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Step2. Configure a cluster with two nodes.
>>>>>>>> ----
>>>>>>>>
>>>>>>>> [root at rh84-beta01 ~]# crm_mon -rfA1
>>>>>>>> Cluster Summary:
>>>>>>>> * Stack: corosync
>>>>>>>> * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>> * Last updated: Thu Apr 8 18:00:52 2021
>>>>>>>> * Last change: Thu Apr 8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>>>>> * 2 nodes configured
>>>>>>>> * 1 resource instance configured
>>>>>>>>
>>>>>>>> Node List:
>>>>>>>> * Online: [ rh84-beta01 rh84-beta02 ]
>>>>>>>>
>>>>>>>> Full List of Resources:
>>>>>>>> * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta01
>>>>>>>>
>>>>>>>> Migration Summary:
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Step3. Stop the node where the Dummy resource is running. The
>>>>>>>> resource will fail over.
>>>>>>>> ----
>>>>>>>> [root at rh84-beta02 ~]# crm_mon -rfA1
>>>>>>>> Cluster Summary:
>>>>>>>> * Stack: corosync
>>>>>>>> * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>> * Last updated: Thu Apr 8 18:08:56 2021
>>>>>>>> * Last change: Thu Apr 8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>>>>> * 2 nodes configured
>>>>>>>> * 1 resource instance configured
>>>>>>>>
>>>>>>>> Node List:
>>>>>>>> * Online: [ rh84-beta02 ]
>>>>>>>> * OFFLINE: [ rh84-beta01 ]
>>>>>>>>
>>>>>>>> Full List of Resources:
>>>>>>>> * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta02
>>>>>>>> ----
>>>>>>>>
>>>>>>>> However, if you look at the log, you can see that the
>>>>>>>> execution of crm_mon in the stop processing of the Dummy
>>>>>>>> resource has failed.
>>>>>>>> ----
>>>>>>>> Apr 08 18:05:17 Dummy(dummy-1)[2631]: INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>>>>> Apr 08 18:05:17 rh84-beta01 pacemaker-execd [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>>>>>> Hmm ... is that with selinux enabled?
>>>>>>> Respectively do you see any related avc messages?
>>>>>>>
>>>>>>> Klaus
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Similarly, pgsql also executes crm_mon with demote or stop, so
>>>>>>>> control fails.
>>>>>>>> The problem seems to be related to the next fix.
>>>>>>>> * Report pacemakerd in state waiting for sbd
>>>>>>>> - https://github.com/ClusterLabs/pacemaker/pull/2278
>>>>>>>>
>>>>>>>> The problem does not occur with the release version of
>>>>>>>> Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.
>>>>>>>> This issue has a huge impact on the user.
>>>>>>>>
>>>>>>>> Perhaps it also affects the control of other resources that
>>>>>>>> utilize crm_mon.
>>>>>>>> Please improve the release version of RHEL8.4 so that it
>>>>>>>> includes Pacemaker which does not cause this problem.
>>>>>>>> * Distributions other than RHEL may also be affected in
>>>>>>>> future releases.
>>>>>>>>
>>>>>>>> ----
>>>>>>>> This content is the same as the following Bugzilla.
>>>>>>>> - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Hideo Yamauchi.
>>>>>>>>
>>
>
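
For agents that call crm_mon during stop or demote, as the dummy_stop reproducer quoted above does, the following is a minimal sketch of checking the exit status before trusting the output. The helper name crm_mon_checked and the CRM_MON_CMD variable are hypothetical, introduced only for this illustration; they are not part of Pacemaker or of any shipped resource agent. The sketch only makes the failure visible and does not work around the behaviour discussed in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker or resource-agents).
# CRM_MON_CMD is an assumption made here so that the command can be
# replaced, for example when trying the helper outside a cluster.
: "${CRM_MON_CMD:=crm_mon -1}"

crm_mon_checked() {
    # Capture output and exit status separately, as dummy_stop does.
    # CRM_MON_CMD is left unquoted on purpose so that it splits into
    # the command and its arguments.
    cmc_out=$($CRM_MON_CMD 2>&1)
    cmc_ret=$?
    if [ "$cmc_ret" -ne 0 ]; then
        # In the log quoted in this thread the failing call returned 102.
        echo "crm_mon failed with exit status ${cmc_ret}: ${cmc_out}" >&2
        return 1
    fi
    # Only a successful run is passed on to the caller.
    printf '%s\n' "$cmc_out"
    return 0
}
```

A stop action could then use `if mon=$(crm_mon_checked); then ... fi` and decide explicitly what to do when no cluster status is available, instead of parsing the shutdown message as if it were status output.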