[ClusterLabs] 'pcs stonith update' takes, then reverts
kgaillot at redhat.com
Mon Jul 26 12:50:48 EDT 2021
On Mon, 2021-07-26 at 12:25 -0400, Digimer wrote:
> On 2021-07-26 9:54 a.m., kgaillot at redhat.com wrote:
> > On Fri, 2021-07-23 at 21:46 -0400, Digimer wrote:
> > > After a LOT of hassle, I finally got it updated, but OMG it was
> > > painful.
> > >
> > > I degraded the cluster (unsure if needed), set maintenance mode,
> > > deleted the stonith levels, deleted the stonith devices, recreated
> > > them with the updated values, recreated the stonith levels, and
> > > finally disabled maintenance mode.
> > >
> > > It should not have been this hard, right? Why the heck would it be
> > > that pacemaker kept "rolling back" to old configs? I'd delete the
> > > stonith
> >
> > That is bizarre. It sounds like the CIB changes were taking effect
> > locally, then being rejected by the rest of the cluster, which would
> > send the "correct" CIB back to the originator.
> >
> > The logs of interest would be pacemaker.log from both nodes at the
> > time you made the first configuration change that failed. I'm
> > guessing the logs you posted were from after that point?
>
> Below are the logs. The change appears to have first been attempted at
> 'Jul 23 16:22:27', made on an-a02n01. I included logs from a few
> minutes before in case they are relevant.
> * an-a02n01:
> https://www.alteeve.com/an-repo/files/an-a02n01.pacemaker.log
> * an-a02n02:
> https://www.alteeve.com/an-repo/files/an-a02n02.pacemaker.log
> Note that the PDUs as originally configured (10.201.2.1/2) were not
> available, so I had to disable and clean up the stonith resources.
> They seemed to keep getting re-enabled, so I got into the habit of
> doing this cycle of disable -> cleanup -> disable -> cleanup before I
> could reliably get the resources to be 'stopped (disabled)' in 'pcs
> stonith status'.
> digimer
The initial change happened here:
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: --- 0.337.112 2
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: +++ 0.338.0 6a24af66df3d9f825cc2681222f8f5d6
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib: @epoch=338, @num_updates=0
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='apc_snmp_node1_an-pdu03']/instance_attributes[@id='apc_snmp_node1_an-pdu03-instance_attributes']/nvpair[@id='apc_snmp_node1_an-pdu03-instance_attributes-ip']: @value=10.201.2.3
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_replace_notify) info: Replaced: 0.337.112 -> 0.338.0 from an-a02n02
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_process_request) info: Completed cib_replace operation for section configuration: OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.338.0)
origin=an-a02n02/cibadmin/2 means that someone or something ran the
cibadmin tool on an-a02n02. Presumably this was your interactive pcs
command.
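
For anyone who wants to pull the origin out of these lines mechanically,
here is a minimal sketch in Python. It assumes the "Completed ...
operation" line format exactly as it appears in the excerpts in this
message. The regular expression, the function name and the field labels
(node, client, call) are my own illustration and are not part of
Pacemaker. In particular, treating the third origin field as a call
identifier is an assumption.

```python
import re

# Matches the trailer of a "Completed ... operation" line as seen in the
# excerpts in this message. Group names are illustrative labels only.
COMPLETED = re.compile(
    r"Completed (?P<op>\S+) operation for section '?(?P<section>[^':]+)'?: "
    r"(?P<status>.+?) \(rc=(?P<rc>-?\d+), "
    r"origin=(?P<node>[^/]+)/(?P<client>[^/]+)/(?P<call>[^,]+), "
    r"version=(?P<version>[\d.]+)\)"
)


def parse_completed(line):
    """Return a dict of fields from a cib_process_request line, or None."""
    match = COMPLETED.search(line)
    return match.groupdict() if match else None


# The first "Completed" line quoted in this message.
sample = (
    "Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] "
    "(cib_process_request) info: Completed cib_replace operation for section "
    "configuration: OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.338.0)"
)

fields = parse_completed(sample)
print(fields["node"], fields["client"], fields["version"])
```

Running the same function over every "Completed" line in pacemaker.log
and printing the node and client fields is one way to see which host and
which tool submitted each change.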
It was then reverted by:
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: --- 0.343.3 2
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: +++ 0.344.0 (null)
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib: @epoch=344, @num_updates=0
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ /cib/configuration/resources: <primitive class="stonith" id="apc_snmp_node1_an-pdu03" type="fence_apc_snmp"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <instance_attributes id="apc_snmp_node1_an-pdu03-instance_attributes">
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-ip" name="ip" value="10.201.2.1"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-a02n01"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_off_action" name="pcmk_off_action" value="reboot"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-port" name="port" value="5"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </instance_attributes>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <op id="apc_snmp_node1_an-pdu03-monitor-interval-60" interval="60" name="monitor"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </primitive>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_process_request) info: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.344.0)
Notice the origin is still cibadmin on an-a02n02. So this was either
you, or a script or cron on that node. I don't see any additional
details on that node.
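
The "Diff: ---" and "Diff: +++" lines in both excerpts carry the CIB
version before and after each change, in the form
admin_epoch.epoch.num_updates. The sketch below, again in Python and
again only an illustration, pairs those lines up and compares the
versions as integer tuples. The function names are hypothetical, and the
sample lines are the ones quoted in this message with the timestamp and
host prefix trimmed for brevity.

```python
import re

# Captures the sign and the three-part version from a "Diff:" log line.
DIFF = re.compile(r"Diff: (?P<sign>---|\+\+\+) (?P<version>\d+\.\d+\.\d+)")


def cib_version(text):
    """Split 'admin_epoch.epoch.num_updates' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def diff_versions(lines):
    """Yield (old, new) version tuples for each ---/+++ pair of Diff lines."""
    old = None
    for line in lines:
        match = DIFF.search(line)
        if not match:
            continue
        if match.group("sign") == "---":
            old = cib_version(match.group("version"))
        elif old is not None:
            yield old, cib_version(match.group("version"))
            old = None


# Diff lines from the two excerpts in this message (prefix trimmed).
log = [
    "(cib_perform_op) info: Diff: --- 0.337.112 2",
    "(cib_perform_op) info: Diff: +++ 0.338.0 6a24af66df3d9f825cc2681222f8f5d6",
    "(cib_perform_op) info: Diff: --- 0.343.3 2",
    "(cib_perform_op) info: Diff: +++ 0.344.0 (null)",
]

changes = list(diff_versions(log))
for old, new in changes:
    print(old, "->", new, "newer" if new > old else "not newer")
```

In both excerpts the epoch goes up by one (337 to 338, then 343 to 344),
so the old values came back as part of a new configuration change and
not as a return to an earlier version number.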
--
Ken Gaillot <kgaillot at redhat.com>