[Pacemaker] inconsistence in crm_mon and crm resource show

Wed Mar 21 06:47:19 EDT 2012

> 
> On 2012-03-21T09:42:26, "Janec, Jozef" <jozef.janec at hp.com> wrote:
> 
> > Node b300ple0: UNCLEAN (offline)
> >         rs_nw_dbjj7     (ocf::heartbeat:IPaddr) Started
> >         rs_nw_cijj7     (ocf::heartbeat:IPaddr) Started
> > Node b400ple0: online
> >         sbd_fense_SHARED2       (stonith:external/sbd) Started
> >
> > Inactive resources:
> >
> > rs_nw_cijj7    (ocf::heartbeat:IPaddr):        Started b300ple0
> > rs_nw_dbjj7    (ocf::heartbeat:IPaddr):        Started b300ple0
> >
> > b400ple0:(/root/home/root)(root)#crm resource show
> > rs_nw_cijj7    (ocf::heartbeat:IPaddr) Started
> > sbd_fense_SHARED2      (stonith:external/sbd) Started
> > rs_nw_dbjj7    (ocf::heartbeat:IPaddr) Started
> > b400ple0:(/root/home/root)(root)#
> >
> > b400ple0:(/root/home/root)(root)#/usr/sbin/crm_resource -W -r
> > rs_nw_cijj7 resource rs_nw_cijj7 is running on: b300ple0
> > b400ple0:(/root/home/root)(root)#
> >
> > but b300ple0 is down
> 
> Resources are still considered owned because the node wasn't fenced yet.
> 

[Jozef Janec] 
Yes, I can see that in the logs:

Mar 21 06:18:00 b400ple0 stonith-ng: [8603]: ERROR: log_operation: Operation 'reboot' [3159] for host 'b300ple0' with device 'sbd_fense_SHARED2' returned: 1 (call 0 from (null))
Mar 21 06:18:00 b400ple0 stonith-ng: [8603]: info: process_remote_stonith_execExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="5cb46419-bfdb-4115-85d9-6ec447b38823" st_callid="0" st_callopt="0" st_rc="1" st_output="Performing: stonith -t external/sbd -T reset b300ple0 failed: b300ple0 0.05859375" src="b400ple0" seq="172" />
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: ERROR: remote_op_timeout: Action reboot (5cb46419-bfdb-4115-85d9-6ec447b38823) for b300ple0 timed out
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: info: remote_op_done: Notifing clients of 5cb46419-bfdb-4115-85d9-6ec447b38823 (reboot of b300ple0 from a8125881-30df-4bd4-a5b1-666020a29eba by (null)): 1, rc=-7
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: tengine_stonith_callbackStonithOp <remote-op state="1" st_target="b300ple0" st_op="reboot" />
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: info: stonith_notify_client: Sending st_fence-notification to client 8608/bc1b0c7d-2cec-4e96-9523-5f6c51b52508
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: tengine_stonith_callback: Stonith operation 44/15:49:0:44f2b175-7292-473a-a4e8-f9abda5b3ef6: Operation timed out (-7)
Mar 21 06:18:06 b400ple0 crmd: [8608]: ERROR: tengine_stonith_callback: Stonith of b300ple0 failed (-7)... aborting transition.
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: abort_transition_graph: tengine_stonith_callback:401 - Triggered transition abort (complete=0) : Stonith failed

Because I rebooted the node manually to simulate an outage and haven't started rcopenais yet, the sbd daemon isn't running on it either.

b400ple0:(/var/log/ha)(root)#/usr/sbin/sbd -d /dev/mapper/SHARED1_part1 list
0       b400ple0        clear
1       b300ple0        reset   b400ple0
b400ple0:(/var/log/ha)(root)#/usr/sbin/sbd -d /dev/mapper/SHARED2_part1  list
0       b300ple0        reset   b400ple0
1       b400ple0        clear

It is waiting until sbd on b300ple0 picks up the command and resets the node.
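
For checking this from a script, the list output above could be filtered for slots that still hold a message. A minimal Python sketch, assuming the whitespace-separated column layout shown above (slot, node, message, optional sender); the function name pending_slots is made up for illustration and is not part of sbd:

```python
def pending_slots(list_output):
    """Return (slot, node, message, sender) for every slot that is not 'clear'.

    Assumes the column layout of the sbd list output shown above:
    slot number, node name, message, and an optional sender.
    """
    pending = []
    for line in list_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank lines, shell prompts, anything unexpected
        slot, node, message = int(fields[0]), fields[1], fields[2]
        sender = fields[3] if len(fields) > 3 else None
        if message != "clear":
            pending.append((slot, node, message, sender))
    return pending


# Output of the first list command above, pasted in as sample input.
sample = """\
0       b400ple0        clear
1       b300ple0        reset   b400ple0
"""

for slot, node, message, sender in pending_slots(sample):
    print(f"slot {slot}: {node} has pending '{message}' from {sender}")
```

An empty result would mean every slot is clear, i.e. no fencing message is waiting to be picked up.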

My question is where the information that the resource is still up is stored. Is it in the lrm part? I found that I can use crm node clearstate, which should set the node's state to offline and probably release the resources, but I want to find out exactly where it is hidden. All the information is, or should be, in the CIB, and I would like to know exactly which part is responsible for this behavior so that I can understand it better.
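
To make the question concrete: if the state does live in the lrm part, it would be expected under status/node_state/lrm in the CIB (for example in the output of cibadmin -Q -o status). A minimal Python sketch that lists what is recorded there, under that assumption. The XML is a hand-written, simplified sample and not taken from this cluster; the function name recorded_ops is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample of a CIB status section (an assumption,
# NOT real output from this cluster; real entries carry many more attributes).
SAMPLE_STATUS = """\
<status>
  <node_state id="b300ple0" uname="b300ple0">
    <lrm id="b300ple0">
      <lrm_resources>
        <lrm_resource id="rs_nw_cijj7" class="ocf" provider="heartbeat" type="IPaddr">
          <lrm_rsc_op id="rs_nw_cijj7_start_0" operation="start" rc-code="0"/>
        </lrm_resource>
        <lrm_resource id="rs_nw_dbjj7" class="ocf" provider="heartbeat" type="IPaddr">
          <lrm_rsc_op id="rs_nw_dbjj7_start_0" operation="start" rc-code="0"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
</status>
"""


def recorded_ops(status_xml):
    """Map each node name to the (resource, operation, rc-code) entries
    found in its lrm part."""
    root = ET.fromstring(status_xml)
    result = {}
    for node in root.iter("node_state"):
        ops = []
        for rsc in node.iter("lrm_resource"):
            for op in rsc.iter("lrm_rsc_op"):
                ops.append((rsc.get("id"), op.get("operation"), op.get("rc-code")))
        result[node.get("uname")] = ops
    return result


for node, ops in recorded_ops(SAMPLE_STATUS).items():
    for rsc, operation, rc in ops:
        print(f"{node}: {rsc} last recorded {operation} (rc={rc})")
```

If the entries for b300ple0 are still present there while the node is UNCLEAN, that would at least be consistent with the resources still being shown as started on it.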

Best regards

Jozef