[Pacemaker] [Problem]Time-out(action lost) of completed monitor occurs.

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Mon Sep 5 22:53:41 EDT 2011


Hi All, 

We came across a mysterious phenomenon on a test of the drbd environment.

It is the following procedure.

Step 1) Start two nodes.

Step 2) Cause the hang of the kernel in an active node.

Step 3) In a standby node, the cancellation of the monitor of drbd is carried
out.

The cancellation of the monitor of drbd is completed, but a timer occurs.

Because it was completed, the cancellation of the monitor of drbd should stop
the timer.


(snip)
Jul 25 23:06:16 de02 crmd: [18190]: info: te_rsc_command: Initiating action 23:
cancel prmDrPostgreSQLDB1:1_monitor_20000 on de02 (local) 
Jul 25 23:06:16 de02 lrmd: [18187]: info: cancel_op: operation monitor[53] on
ocf::drbd::prmDrPostgreSQLDB1:1 for client 18190, its parameters:
CRM_meta_interval=[20000] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[
] CRM_meta_notify_active_resource=[ ] CRM_meta_notify_slave_resource=[ ]
CRM_meta_notify_start_resource=[prmDrPostgreSQLDB1:0 prmDrPostgreSQLDB1:1
prmDrCRM_meta_notify_active_uname=[ ] CRM_meta_notify_promote_resource=[ ]
CRM_meta_notify_stop_uname=[ ] drbd_resource=[r0]
CRM_meta_notify_master_uname=[ ] CRM_meta_master_max=[1] CRM_meta_no cancelled
Jul 25 23:06:16 de02 crmd: [18190]: info: send_direct_ack: ACK'ing resource op
prmDrPostgreSQLDB1:1_monitor_20000 from
23:10:0:beda4e4c-4386-45f6-b984-26fb71ae0fdf: lrm_invoke-lrmd-1311602776-148
Jul 25 23:06:16 de02 crmd: [18190]: info: process_te_message: Processing (N)ACK
lrm_invoke-lrmd-1311602776-148 from de02
Jul 25 23:06:16 de02 crmd: [18190]: info: match_graph_event: Action
prmDrPostgreSQLDB1:1_monitor_20000 (23) confirmed on de02 (rc=0)
Jul 25 23:06:16 de02 crmd: [18190]: info: te_rsc_command: Initiating action 24:
cancel prmDrPostgreSQLDB2:1_monitor_20000 on de02 (local)
(snip)
Jul 25 23:06:16 de02 crmd: [18190]: info: match_graph_event: Action
prmDrPostgreSQLDB1:1_monitor_20000 (23) confirmed on de02 (rc=0)
Jul 25 23:06:16 de02 crmd: [18190]: info: te_rsc_command: Initiating action 24:
cancel prmDrPostgreSQLDB2:1_monitor_20000 on de02 (local)
Jul 25 23:06:16 de02 lrmd: [18187]: info: cancel_op: operation monitor[54] on
ocf::drbd::prmDrPostgreSQLDB2:1 for client 18190, its parameters:
CRM_meta_interval=[20000] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[
] CRM_meta_notify_active_resource=[ ]
CRM_meta_notify_slave_resource=[prmDrPostgreSQLDB1:0 prmDrPostgreSQLDB1:1 ]
CRM_meta_notify_start_resource=[prmDrPostgreSQLDB2:0 prmDrPostgreSQLDB2:1 ]
CRM_meta_notify_active_uname=[ ]
CRM_meta_notify_promote_resource=[prmDrPostgreSQLDB1:1 prmDrPostgreSQLDB2:1 ]
CRM_meta_notify_stop_uname=[ ] drbd_res cancelled
Jul 25 23:06:16 de02 crmd: [18190]: info: send_direct_ack: ACK'ing resource op
prmDrPostgreSQLDB2:1_monitor_20000 from
24:10:0:beda4e4c-4386-45f6-b984-26fb71ae0fdf: lrm_invoke-lrmd-1311602776-150
Jul 25 23:06:16 de02 crmd: [18190]: info: process_te_message: Processing (N)ACK
lrm_invoke-lrmd-1311602776-150 from de02
Jul 25 23:06:16 de02 crmd: [18190]: info: match_graph_event: Action
prmDrPostgreSQLDB2:1_monitor_20000 (24) confirmed on de02 (rc=0)
(snip)
Jul 25 23:07:36 de02 crmd: [18190]: WARN: action_timer_callback: Timer popped
(timeout=20000, abort_level=0, complete=false)
Jul 25 23:07:36 de02 crmd: [18190]: ERROR: print_elem: Aborting transition,
action lost: [Action 23]: Completed (id: prmDrPostgreSQLDB1:1_monitor_20000,
loc: de02, priority: 0) 
Jul 25 23:07:36 de02 crmd: [18190]: info: abort_transition_graph:
action_timer_callback:493 - Triggered transition abort (complete=0) : Action
lost
Jul 25 23:07:36 de02 crmd: [18190]: info: update_abort_priority: Abort priority
upgraded from 0 to 1000000
Jul 25 23:07:36 de02 crmd: [18190]: info: update_abort_priority: Abort action
done superceeded by restart
Jul 25 23:07:36 de02 crmd: [18190]: WARN: action_timer_callback: Timer popped
(timeout=20000, abort_level=1000000, complete=false)
Jul 25 23:07:36 de02 crmd: [18190]: ERROR: print_elem: Aborting transition,
action lost: [Action 24]: Completed (id: prmDrPostgreSQLDB2:1_monitor_20000,
loc: de02, priority: 0)
(snip)


crmd fails in stop_te_timer processing, and a timer seems to work.

Only this one time of phenomenon occurred.
It did not occur on several times of other tests.


I registered this problem with Bugzilla.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2639

Best Regards,
Hideo Yamauchi.




More information about the Pacemaker mailing list