[ClusterLabs] cleanup of a resource leads to restart of Virtual Domains
Lentes, Bernd
bernd.lentes at helmholtz-muenchen.de
Fri Sep 27 12:12:37 EDT 2019
----- On Sep 26, 2019, at 5:19 PM, Yan Gao YGao at suse.com wrote:
> Hi,
>
> On 9/26/19 3:25 PM, Lentes, Bernd wrote:
>> Hi,
>>
>> I had two errors with a GFS2 partition several days ago:
>> gfs2_share_monitor_30000 on ha-idg-2 'unknown error' (1): call=103, status=Timed Out, exitreason='',
>> last-rc-change='Thu Sep 19 13:44:22 2019', queued=0ms, exec=0ms
>>
>> gfs2_share_monitor_30000 on ha-idg-1 'unknown error' (1): call=103, status=Timed Out, exitreason='',
>> last-rc-change='Thu Sep 19 13:44:12 2019', queued=0ms, exec=0ms
>>
>> Now I wanted to get rid of these messages and did a "resource cleanup".
>> I had to run it several times until both messages disappeared.
>>
>> But then all VirtualDomain resources restarted.
>>
>> The config for the GFS2 filesystem is:
>> primitive gfs2_share Filesystem \
>> params device="/dev/vg_san/lv_share" directory="/mnt/share" fstype=gfs2 options=acl \
>> op monitor interval=30 timeout=20 \
>> op start timeout=60 interval=0 \
>> op stop timeout=60 interval=0 \
>> meta is-managed=true
>>
>> /mnt/share holds the config files for the VirtualDomain resources.
>>
>> Here is one VirtualDomain config (the others are the same):
>> primitive vm_crispor VirtualDomain \
>> params config="/mnt/share/crispor.xml" \
>> params hypervisor="qemu:///system" \
>> params migration_transport=ssh \
>> params migrate_options="--p2p --tunnelled" \
>> op start interval=0 timeout=120 \
>> op stop interval=0 timeout=180 \
>> op monitor interval=30 timeout=25 \
>> op migrate_from interval=0 timeout=300 \
>> op migrate_to interval=0 timeout=300 \
>> meta allow-migrate=true target-role=Started is-managed=true maintenance=false \
>> utilization cpu=2 hv_memory=8192
>>
>> The GFS2 share is part of a group, and the group is cloned:
>> group gr_share dlm clvmd gfs2_share gfs2_snap fs_ocfs2
>> clone cl_share gr_share \
>> meta target-role=Started interleave=true
>>
>> And for each VirtualDomain I have an order constraint:
>> order or_vm_crispor_after_gfs2 Mandatory: cl_share vm_crispor symmetrical=true
>>
>> Why are the domains restarted? I thought a cleanup would just delete the error
>> messages.
> It could potentially be fixed by this:
> https://github.com/ClusterLabs/pacemaker/pull/1765
>
> Regards,
> Yan
Hi Yan,
thanks for that information. I saw that this patch is included in the SUSE Pacemaker updates for SLES 12 SP4.
I will install it soon and let you know.
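Roughly how I plan to check whether the installed package really contains that change (package name as on SLES; the changelog grep is just a guess at the wording):

rpm -q pacemaker
# look for a cleanup-related entry in the package changelog (wording is a guess)
rpm -q --changelog pacemaker | grep -i cleanup | head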
I had a look at the logs. What happened when I issued a "resource cleanup" of the GFS2 resource is
that the cluster deleted an entry from the status section (the cleanup command itself is sketched after the log excerpt below):
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: --- 2.9157.0 2
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 2.9157.1 (null)
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: -- /cib/status/node_state[@id='1084777482']/lrm[@id='1084777482']/lrm_resources/lrm_resource[@id='dlm']
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: + /cib: @num_updates=1
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='ha-idg-1']//lrm_resource[@id='dlm']: OK (rc=0, origin=ha-idg-1/crmd/113, version=2.9157.0)
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: --- 2.9157.0 2
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 2.9157.1 (null)
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: -- /cib/status/node_state[@id='1084777482']/lrm[@id='1084777482']/lrm_resources/lrm_resource[@id='dlm']
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: + /cib: @num_updates=1
Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='ha-idg-1']//lrm_resource[@id='dlm']: OK (rc=0, origin=ha-idg-1/crmd/114, version=2.9157.1)
Sep 26 14:52:52 [9322] ha-idg-2 crmd: info: abort_transition_graph: Transition 1028 aborted by deletion of lrm_resource[@id='dlm']: Resource state removal | cib=2.9157.1 source=abort_unless_down:344 path=/cib/status/node_state[@id='1084777482']/lrm[@id='1084777482']/lrm_resources/lrm_resource[@id='dlm'] complete=true
Sep 26 14:52:52 [9322] ha-idg-2 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
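For reference, the cleanup that produced the deletions above was issued via crmsh, roughly like this (I don't remember exactly whether I gave the primitive, the group or the clone as the target):

crm resource cleanup gfs2_share
# low-level equivalent, as far as I know:
crm_resource --cleanup --resource gfs2_share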
Shortly afterwards it recognized dlm on ha-idg-1 as stopped (or stopped it):
Sep 26 14:52:54 [9321] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed monitor of gfs2_share:1 on ha-idg-2: unknown error | rc=1
Sep 26 14:52:54 [9321] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed monitor of vm_severin on ha-idg-2: not running | rc=7
Sep 26 14:52:54 [9321] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed monitor of vm_geneious on ha-idg-2: not running | rc=7
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777482 is already processed
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777492 is already processed
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777482 is already processed
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777492 is already processed
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: fence_ilo_ha-idg-2 (stonith:fence_ilo2): Started ha-idg-1
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: fence_ilo_ha-idg-1 (stonith:fence_ilo4): Started ha-idg-2
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: clone_print: Clone Set: cl_share [gr_share]
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: group_print: Resource Group: gr_share:0
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: dlm (ocf::pacemaker:controld): Stopped <===============================
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: clvmd (ocf::heartbeat:clvm): Started ha-idg-1
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: gfs2_share (ocf::heartbeat:Filesystem): Started ha-idg-1
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: gfs2_snap (ocf::heartbeat:Filesystem): Started ha-idg-1
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: fs_ocfs2 (ocf::heartbeat:Filesystem): Started ha-idg-1
Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: short_print: Started: [ ha-idg-2 ]
According to the logs, dlm was running before. Does the deletion of that entry lead to the stop of the dlm resource?
Is that expected behaviour?
I simulated the deletion of that entry with crm_simulate, and the same thing happened again.
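Roughly what I did for the simulation (file names are just examples; I removed the dlm entry from the saved copy by hand):

cibadmin --query > /tmp/cib-before.xml
# edit the copy and drop the <lrm_resource id="dlm"> block under
# node_state[@id='1084777482'], save it as /tmp/cib-edited.xml, then:
crm_simulate --simulate --xml-file /tmp/cib-edited.xml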
Bernd
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671