[ClusterLabs Developers] commit abcdaa8 breaks compatibility with older pacemaker

Mon May 23 12:23:25 EDT 2016

Hello Andrew,

We have a system in the field running a MASTER-SLAVE resource on two nodes. 
We are trying to upgrade Pacemaker on these two nodes. First we upgrade 
the SLAVE node. Then we move the resource so that it becomes MASTER on the 
upgraded SLAVE node (“crm node standby” on the old MASTER). This move involves 
cancelling a monitor operation on the SLAVE node.

Commit
https://github.com/ClusterLabs/pacemaker/commit/abcdaa8893d6071574986af6abc85ae558473735
changes how the “cancel” action is confirmed.

Previously, send_direct_ack was always used to confirm the cancel action. 
Now the cancel action is confirmed not by a direct ACK but by parsing the 
XML.

So the new node receives the cancel action, but doesn’t call 
send_direct_ack. As a result, the old node sends the cancel action and then 
waits for an ACK that never arrives:
May 23 18:05:49 vsa-000001be-vc-0 crmd: [3089]: info: te_rsc_command: 
Initiating action 4: cancel VAM:1_monitor_5000 on vsa-000001be-vc-1

Only after 3 minutes does it move forward, when the action timer pops:
May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: WARN: action_timer_callback: 
Timer popped (timeout=120000, abort_level=1000000, complete=false)
May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: ERROR: print_elem: Aborting 
transition, action lost: [Action 4]: In-flight (id: VAM:1_monitor_5000, loc: 
vsa-000001be-vc-1, priority: 0)
May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: info: 
abort_transition_graph: action_timer_callback:512 - Triggered transition 
abort (complete=0) : Action lost
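
The length of the stall can be read off the timestamps in the two log 
excerpts above. A minimal Python sketch, assuming the syslog-style 
"Mon DD HH:MM:SS" prefix shown there; the helper name parse_stamp is only 
for illustration:

```python
from datetime import datetime

# Timestamps copied from the two crmd log excerpts above.
CANCEL_SENT = "May 23 18:05:49"   # te_rsc_command: Initiating action 4
TIMER_POPPED = "May 23 18:08:49"  # action_timer_callback: Timer popped


def parse_stamp(stamp: str) -> datetime:
    """Parse a syslog-style timestamp (no year in the log, so strptime uses its default)."""
    return datetime.strptime(stamp, "%b %d %H:%M:%S")


# Elapsed time between sending the cancel and the timer popping.
stall_seconds = (parse_stamp(TIMER_POPPED) - parse_stamp(CANCEL_SENT)).total_seconds()
print(f"transition stalled for {stall_seconds:.0f} s")  # prints: transition stalled for 180 s
```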

However, a 3-minute delay is unacceptable for our customers.

What would you recommend to fix this backward compatibility issue?

As a test only, I also called send_direct_ack in the “in_progress==TRUE” 
case. This fixed the problem, since the older node received the ACK it 
needed. But I don’t know what else this change might break.
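
To make the test change concrete, here is a toy Python model of the 
decision. It is not Pacemaker's C code: only send_direct_ack and in_progress 
come from the description above, while handle_cancel, ack_when_in_progress, 
old_node_sees_confirmation and the string labels are hypothetical names. It 
also assumes that the direct ACK is skipped only when in_progress is true, 
which is inferred from the test described above rather than confirmed 
against the commit.

```python
from typing import List


def handle_cancel(in_progress: bool, ack_when_in_progress: bool) -> List[str]:
    """Model which confirmations the upgraded node emits for a cancel request."""
    confirmations = []
    if in_progress:
        # Assumed post-commit path: the confirmation travels in the XML.
        confirmations.append("xml_update")
        if ack_when_in_progress:
            # The test change: additionally call send_direct_ack here.
            confirmations.append("direct_ack")
    else:
        # Assumed unchanged path: send_direct_ack is still called.
        confirmations.append("direct_ack")
    return confirmations


def old_node_sees_confirmation(confirmations: List[str]) -> bool:
    """Model of the older peer, which only recognises a direct ACK."""
    return "direct_ack" in confirmations


# Without the test change the older node keeps waiting; with it, the ACK arrives.
print(old_node_sees_confirmation(handle_cancel(True, False)))
print(old_node_sees_confirmation(handle_cancel(True, True)))
```

In this model the extra ACK is harmless only if a peer that already 
understands the XML confirmation tolerates receiving both, which is exactly 
the open question.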

Thanks,
Alex.