[ClusterLabs Developers] commit abcdaa8 breaks compatibility with older pacemaker

Fri Jun 10 21:35:27 UTC 2016

On 06/07/2016 05:52 AM, Alex Lyakas wrote:
> Hello Andrew,
>  
> Thank you for your response.
>  
> We have a two-node cluster, and we need to upgrade pacemaker on both nodes.
>  
> We ended up applying the patch[1] locally, which sends an explicit ACK if
> the request comes from the old version of pacemaker.

Thanks very much for the patch! A modified version has been merged, and
is included in 1.1.15-rc4.
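
For anyone reading this thread later, the version gate at the heart of the
patch quoted below[1] can be looked at in isolation. What follows is only a
minimal sketch under stated assumptions, not the code that was merged:
needs_direct_ack() is a hypothetical helper name, the "3.0.5" feature set is
taken from the patch as posted, and plain strcmp() stands in for pacemaker's
own string helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of pacemaker: decide whether the node that
 * requested the cancel still expects an explicit (direct) ACK.
 *
 * Assumption taken from the posted patch: peers advertising feature set
 * "3.0.5" are the old ones that wait for send_direct_ack().
 */
bool needs_direct_ack(const char *feature_set)
{
    /* A missing feature set never matches, so no extra ACK is sent. */
    if (feature_set == NULL) {
        return false;
    }
    return strcmp(feature_set, "3.0.5") == 0;
}
```

With this gate, a request carrying "3.0.5" gets the extra direct ACK, while
any other value (or none at all) follows the newer confirmation path only.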

> Thanks,
> Alex.
>  
>  
> [1]
> --- a/pacemaker/pacemaker-1.1.13/crmd/lrm.c
> +++ b/pacemaker/pacemaker-1.1.13/crmd/lrm.c
> @@ -1541,20 +1541,44 @@ do_lrm_invoke(long long action,
>                  op->rc = PCMK_OCF_OK;
>                  op->op_status = PCMK_LRM_OP_DONE;
>                  send_direct_ack(from_host, from_sys, rsc, op, rsc->id);
>                  lrmd_free_event(op);
>  
>                  /* needed?? surely not otherwise the cancel_op_(_key) wouldn't
>                   * have failed in the first place
>                   */
>                  g_hash_table_remove(lrm_state->pending_ops, op_key);
>              }
> +            else {
> +                const char *feature_set = NULL;
> +                gboolean need_direct_ack = FALSE;
> +
> +                /*
> +                 * For upgrading from older versions, we need to send explicit ACK.
> +                 * See:
> +                 * https://github.com/ClusterLabs/pacemaker/commit/abcdaa8893d6071574986af6abc85ae558473735
> +                 * http://clusterlabs.org/pipermail/developers/2016-May/000219.html
> +                 */
> +                feature_set = crm_element_value(params, XML_ATTR_CRM_VERSION);
> +                need_direct_ack = safe_str_eq(feature_set, "3.0.5");
> +                crm_notice("PE requested op %s (call=%s) be cancelled in_progress==TRUE feature_set=%s need_direct_ack=%d",
> +                          op_key, call_id ? call_id : "NA", feature_set, need_direct_ack);
> +                if (need_direct_ack) {
> +                    lrmd_event_data_t *op = construct_op(lrm_state, input->xml, rsc->id, op_task);
> +
> +                    CRM_ASSERT(op != NULL);
> +                    op->rc = PCMK_OCF_OK;
> +                    op->op_status = PCMK_LRM_OP_DONE;
> +                    send_direct_ack(from_host, from_sys, rsc, op, rsc->id);
> +                    lrmd_free_event(op);
> +                }
> +            }
>  
>              free(op_key);
>  
>          } else if (rsc != NULL && safe_str_eq(operation, CRMD_ACTION_DELETE)) {
>              gboolean unregister = TRUE;
>  
> #if ENABLE_ACL
>              int cib_rc = delete_rsc_status(lrm_state, rsc->id, cib_dryrun | cib_sync_call, user_name);
>              if (cib_rc != pcmk_ok) {
>                  lrmd_event_data_t *op = NULL;
>  
>  
>  
> *From:* Andrew Beekhof <mailto:andrew at beekhof.net>
> *Sent:* Friday, June 03, 2016 3:36 AM
> *To:* Alex Lyakas <mailto:alex at zadarastorage.com>
> *Cc:* Developers at clusterlabs.org <mailto:Developers at clusterlabs.org> ;
> Yair Hershko <mailto:yair at zadarastorage.com> ; Shyam Kaushik
> <mailto:shyam at zadarastorage.com> ; Yaron Presente
> <mailto:yaron at zadarastorage.com> ; Lev Vainblat
> <mailto:lev at zadarastorage.com>
> *Subject:* Re: commit abcdaa8 breaks compatibility with older pacemaker
>  
>  
>> On 24 May 2016, at 2:23 AM, Alex Lyakas <alex at zadarastorage.com
>> <mailto:alex at zadarastorage.com>> wrote:
>>  
>> Hello Andrew,
>>
>> We have a system in the field running a MASTER-SLAVE resource on two
>> nodes. We are trying to upgrade pacemaker on these two nodes.
>> First we upgrade the SLAVE node. Then we move the resource so that it
>> becomes MASTER on the upgraded SLAVE node (“crm node standby” on the old
>> MASTER). This move involves cancelling a monitor operation on the
>> SLAVE node.
>>
>> With commit
>> https://github.com/ClusterLabs/pacemaker/commit/abcdaa8893d6071574986af6abc85ae558473735
>> there is a change in how the “cancel” action is confirmed.
>>
>> Previously, send_direct_ack was always used to confirm the cancel
>> action. But now, the cancel action is being confirmed not by direct
>> ACK but by parsing the XML.
>  
> Oh, and you’re mixing pacemaker versions.
> I can see how that would be a problem.
>  
> Are you seeing this in the process of upgrading the entire cluster, or is
> the plan just to update one node?
> 
>>
>> So the new node receives the cancel action, but doesn’t call
>> send_direct_ack. As a result, the old node sends the cancel action:
>> May 23 18:05:49 vsa-000001be-vc-0 crmd: [3089]: info: te_rsc_command:
>> Initiating action 4: cancel VAM:1_monitor_5000 on vsa-000001be-vc-1
>>
>> And only after 3 minutes does it move forward, due to the timeout:
>> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: WARN:
>> action_timer_callback: Timer popped (timeout=120000,
>> abort_level=1000000, complete=false)
>> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: ERROR: print_elem:
>> Aborting transition, action lost: [Action 4]: In-flight (id:
>> VAM:1_monitor_5000, loc: vsa-000001be-vc-1, priority: 0)
>> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: info:
>> abort_transition_graph: action_timer_callback:512 - Triggered
>> transition abort (complete=0) : Action lost
>>
>> However, the 3-minute timeout is unacceptable for our customers.
>>
>> What would you recommend to fix this backward compatibility issue?
>  
> Unfortunately, you might need to resort to the detach+upgrade
> everything+reattach method of upgrading as described here:
>  
>     
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/_disconnect_and_reattach.html
> 
>>
>> Purely as a test, I also called send_direct_ack in the
>> “in_progress==TRUE” case. This fixed the problem, as the older node
>> received the needed ACK. But I don’t know what this change might break.
>  
> It would probably be fine as a transition plan.
> I.e., first do a rolling update to the patched version, then another to
> the unpatched version.
> 
>>
>> Thanks,
>> Alex.