[Pacemaker] Preventing Automatic Failback

Tue Jan 21 11:12:28 EST 2014

Hi David, 

Thanks for your reply. Just to clear it up:

If everything is running on node-1 and I do a "crm node standby node-1", everything goes to node-2. When I "crm node online node-1", everything is perfectly fine and nothing gets disrupted on node-2. The services remain on node-2 until I move them manually, which is exactly what I want.

I think you are on the right track, because this only happens after a reboot. If I shut down the pacemaker and corosync services on node-1, everything fails over to node-2. When I start the services back up on node-1, nothing gets interrupted on node-2; node-1 just comes online. I think this does in fact have something to do with the reboot (it's just not so graceful).

The reason I am testing by hard rebooting the entire server is that I want to test the behavior of pacemaker/drbd/corosync in the event of a power failure, a frozen system, or a kernel panic (a hard reboot seemed like a good way to cover all three).
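
For anyone who wants to pull the interesting lines out of a log like the one below, here is a minimal sketch that scans log lines for the pengine "is active on N nodes" error. The function name find_too_active and the regular expression are my own illustration, written against the message format seen in this log; they are not part of Pacemaker, and other versions may word the message differently.

```python
import re

# Matches the pengine error logged when one resource is found running on
# more than one node. The pattern is an assumption based on the message
# format in this thread's log excerpt, not an official format definition.
TOO_ACTIVE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+).*?"
    r"Resource (?P<resource>\S+) \((?P<agent>[^)]*)\) "
    r"is active on (?P<count>\d+) nodes"
)


def find_too_active(lines):
    """Return (timestamp, resource, node_count) for each matching error line."""
    hits = []
    for line in lines:
        match = TOO_ACTIVE.search(line)
        if match:
            hits.append((match["stamp"], match["resource"], int(match["count"])))
    return hits
```

The function accepts any iterable of lines, so an open log file can be passed in directly. Each hit gives the timestamp at which the policy engine decided a resource was running on more than one node, which is the point where the restart on the surviving node gets scheduled.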

###
### Here is the Corosync log from node-2 right after I hard reset node-1: (Scroll down for the log when node-1 comes back up)
###

Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-2.mycompany.com is online
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: group_print:         Resource Group: jira_services
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd1_opt-atlassian        (ocf::heartbeat:Filesystem):    Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd2_var-atlassian        (ocf::heartbeat:Filesystem):    Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            failover-ip        (ocf::heartbeat:IPaddr2):       Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            atlassian_jira     (lsb:jira):     Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: clone_print:         Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: short_print:             Slaves: [ node-2.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: short_print:             Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_color:       Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: master_color:       Promoting drbd_data:0 (Slave node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: master_color:       ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:        Cancelling action drbd_data:0_monitor_31000 (Slave vs. Master)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:        Cancelling action drbd_data:0_monitor_31000 (Slave vs. Master)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd1_opt-atlassian     (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd2_var-atlassian     (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   failover-ip     (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   atlassian_jira  (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Promote drbd_data:0     (Slave -> Master node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:1     (Stopped)
Jan 21 07:05:12 [1963] node-2.mycompany.com       crmd:     info: do_te_invoke:       Processing graph 18 (ref=pe_calc-dc-1390305912-111) derived from /var/lib/pacemaker/pengine/pe-input-835.bz2
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: process_pe_message:         Calculated Transition 18: /var/lib/pacemaker/pengine/pe-input-835.bz2
Jan 21 07:05:12 [1963] node-2.mycompany.com       crmd:   notice: run_graph:  Transition 18 (Complete=3, Pending=0, Fired=0, Skipped=10, Incomplete=4, Source=/var/lib/pacemaker/pengine/pe-input-835.bz2): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-2.mycompany.com is online
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: group_print:         Resource Group: jira_services
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd1_opt-atlassian        (ocf::heartbeat:Filesystem):    Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd2_var-atlassian        (ocf::heartbeat:Filesystem):    Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            failover-ip        (ocf::heartbeat:IPaddr2):       Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_print:            atlassian_jira     (lsb:jira):     Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: clone_print:         Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: short_print:             Slaves: [ node-2.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: short_print:             Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: native_color:       Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: master_color:       Promoting drbd_data:0 (Slave node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: master_color:       ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd1_opt-atlassian     (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd2_var-atlassian     (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   failover-ip     (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   atlassian_jira  (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Promote drbd_data:0     (Slave -> Master node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:1     (Stopped)
Jan 21 07:05:12 [1963] node-2.mycompany.com       crmd:     info: do_te_invoke:       Processing graph 19 (ref=pe_calc-dc-1390305912-116) derived from /var/lib/pacemaker/pengine/pe-input-836.bz2
Jan 21 07:05:12 [1962] node-2.mycompany.com    pengine:   notice: process_pe_message:         Calculated Transition 19: /var/lib/pacemaker/pengine/pe-input-836.bz2
Jan 21 07:05:14 [1963] node-2.mycompany.com       crmd:   notice: run_graph:  Transition 19 (Complete=9, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-836.bz2): Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-2.mycompany.com is online
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: group_print:         Resource Group: jira_services
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd1_opt-atlassian        (ocf::heartbeat:Filesystem):    Stopped

####
####  When node-1 has recovered:
####

Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd2_var-atlassian        (ocf::heartbeat:Filesystem):    Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: native_print:            failover-ip        (ocf::heartbeat:IPaddr2):       Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: native_print:            atlassian_jira     (lsb:jira):     Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: clone_print:         Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: short_print:             Masters: [ node-2.mycompany.com ]
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: short_print:             Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: native_color:       Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: master_color:       Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: master_color:       ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd1_opt-atlassian     (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd2_var-atlassian     (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   failover-ip     (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   atlassian_jira  (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:0     (Master node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:1     (Stopped)
Jan 21 07:05:14 [1963] node-2.mycompany.com       crmd:     info: do_te_invoke:       Processing graph 20 (ref=pe_calc-dc-1390305914-122) derived from /var/lib/pacemaker/pengine/pe-input-837.bz2
Jan 21 07:05:14 [1962] node-2.mycompany.com    pengine:   notice: process_pe_message:         Calculated Transition 20: /var/lib/pacemaker/pengine/pe-input-837.bz2
Jan 21 07:05:16 [1963] node-2.mycompany.com       crmd:   notice: run_graph:  Transition 20 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-837.bz2): Complete
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-2.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-1.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: group_print:         Resource Group: jira_services
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd1_opt-atlassian        (ocf::heartbeat:Filesystem):    Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd2_var-atlassian        (ocf::heartbeat:Filesystem):    Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            failover-ip        (ocf::heartbeat:IPaddr2):       Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            atlassian_jira     (lsb:jira):     Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: clone_print:         Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: short_print:             Masters: [ node-2.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: short_print:             Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: master_color:       Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: master_color:       ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd1_opt-atlassian     (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd2_var-atlassian     (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   failover-ip     (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   atlassian_jira  (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:0     (Master node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd_data:1     (node-1.mycompany.com)
Jan 21 07:05:49 [1963] node-2.mycompany.com       crmd:     info: do_te_invoke:       Processing graph 21 (ref=pe_calc-dc-1390305949-137) derived from /var/lib/pacemaker/pengine/pe-input-838.bz2
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: process_pe_message:         Calculated Transition 21: /var/lib/pacemaker/pengine/pe-input-838.bz2
Jan 21 07:05:49 [1963] node-2.mycompany.com       crmd:   notice: run_graph:  Transition 21 (Complete=10, Pending=0, Fired=0, Skipped=3, Incomplete=5, Source=/var/lib/pacemaker/pengine/pe-input-838.bz2): Stopped
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-2.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-1.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: unpack_rsc_op:      Operation monitor found resource drbd2_var-atlassian active on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: unpack_rsc_op:      Operation monitor found resource drbd1_opt-atlassian active on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: group_print:         Resource Group: jira_services
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd1_opt-atlassian        (ocf::heartbeat:Filesystem):    Started
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:               1 : node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:               2 : node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd2_var-atlassian        (ocf::heartbeat:Filesystem):    Started
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:               1 : node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:               2 : node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            failover-ip        (ocf::heartbeat:IPaddr2):       Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: native_print:            atlassian_jira     (lsb:jira):     Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: clone_print:         Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: short_print:             Masters: [ node-2.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: short_print:             Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: master_color:       Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: master_color:       ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:    error: native_create_actions:      Resource drbd1_opt-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:  warning: native_create_actions:      See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:    error: native_create_actions:      Resource drbd2_var-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:  warning: native_create_actions:      See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: RecurringOp:         Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Restart drbd1_opt-atlassian     (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Restart drbd2_var-atlassian     (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Restart failover-ip     (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Restart atlassian_jira  (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:0     (Master node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   drbd_data:1     (node-1.mycompany.com)
Jan 21 07:05:49 [1963] node-2.mycompany.com       crmd:     info: do_te_invoke:       Processing graph 22 (ref=pe_calc-dc-1390305949-146) derived from /var/lib/pacemaker/pengine/pe-error-168.bz2
Jan 21 07:05:49 [1962] node-2.mycompany.com    pengine:    error: process_pe_message:         Calculated Transition 22: /var/lib/pacemaker/pengine/pe-error-168.bz2
Jan 21 07:06:11 [1963] node-2.mycompany.com       crmd:   notice: run_graph:  Transition 22 (Complete=14, Pending=0, Fired=0, Skipped=13, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-168.bz2): Stopped
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-2.mycompany.com is online
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-1.mycompany.com is online
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: unpack_rsc_op:      Operation monitor found resource drbd2_var-atlassian active on node-1.mycompany.com
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: unpack_rsc_op:      Operation monitor found resource drbd1_opt-atlassian active on node-1.mycompany.com
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: group_print:         Resource Group: jira_services
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd1_opt-atlassian        (ocf::heartbeat:Filesystem):    Started
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:               1 : node-2.mycompany.com
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:               2 : node-1.mycompany.com
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd2_var-atlassian        (ocf::heartbeat:Filesystem):    Started
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:               1 : node-2.mycompany.com
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:               2 : node-1.mycompany.com
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:            failover-ip        (ocf::heartbeat:IPaddr2):       Started node-2.mycompany.com
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: native_print:            atlassian_jira     (lsb:jira):     Stopped
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: clone_print:         Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: short_print:             Masters: [ node-2.mycompany.com ]
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: short_print:             Slaves: [ node-1.mycompany.com ]
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: master_color:       Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: master_color:       ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:    error: native_create_actions:      Resource drbd1_opt-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:  warning: native_create_actions:      See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:    error: native_create_actions:      Resource drbd2_var-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:  warning: native_create_actions:      See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Restart drbd1_opt-atlassian     (Started node-2.mycompany.com)
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Restart drbd2_var-atlassian     (Started node-2.mycompany.com)
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Restart failover-ip     (Started node-2.mycompany.com)
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:   notice: LogActions:         Start   atlassian_jira  (node-2.mycompany.com)
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:0     (Master node-2.mycompany.com)
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:1     (Slave node-1.mycompany.com)
Jan 21 07:06:11 [1963] node-2.mycompany.com       crmd:     info: do_te_invoke:       Processing graph 23 (ref=pe_calc-dc-1390305971-156) derived from /var/lib/pacemaker/pengine/pe-error-169.bz2
Jan 21 07:06:11 [1962] node-2.mycompany.com    pengine:    error: process_pe_message:         Calculated Transition 23: /var/lib/pacemaker/pengine/pe-error-169.bz2
Jan 21 07:06:15 [1963] node-2.mycompany.com       crmd:   notice: run_graph:  Transition 23 (Complete=14, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-169.bz2): Complete
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-2.mycompany.com is online
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: determine_online_status:    Node node-1.mycompany.com is online
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: unpack_rsc_op:      Operation monitor found resource drbd2_var-atlassian active on node-1.mycompany.com
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: unpack_rsc_op:      Operation monitor found resource drbd1_opt-atlassian active on node-1.mycompany.com
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: group_print:         Resource Group: jira_services
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd1_opt-atlassian        (ocf::heartbeat:Filesystem):    Started node-2.mycompany.com
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: native_print:            drbd2_var-atlassian        (ocf::heartbeat:Filesystem):    Started node-2.mycompany.com
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: native_print:            failover-ip        (ocf::heartbeat:IPaddr2):       Started node-2.mycompany.com
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: native_print:            atlassian_jira     (lsb:jira):     Started node-2.mycompany.com
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: clone_print:         Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: short_print:             Masters: [ node-2.mycompany.com ]
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: short_print:             Slaves: [ node-1.mycompany.com ]
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: master_color:       Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: master_color:       ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd1_opt-atlassian     (Started node-2.mycompany.com)
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd2_var-atlassian     (Started node-2.mycompany.com)
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   failover-ip     (Started node-2.mycompany.com)
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   atlassian_jira  (Started node-2.mycompany.com)
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:0     (Master node-2.mycompany.com)
Jan 21 07:06:21 [1962] node-2.mycompany.com    pengine:     info: LogActions:         Leave   drbd_data:1     (Slave node-1.mycompany.com)
Jan 21 07:06:21 [1963] node-2.mycompany.com       crmd:     info: do_te_invoke:       Processing graph 24 (ref=pe_calc-dc
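
For anyone skimming the log above, the two things that matter are the "is active on 2 nodes" errors and the LogActions lines that are not "Leave". Below is a rough Python sketch for pulling those out of a saved log. The function name summarize and both regular expressions are only illustrative and are written against the line format shown in this post, so they may need adjusting for other Pacemaker versions.

```python
import re

# pengine error, e.g. "Resource drbd1_opt-atlassian (ocf::Filesystem) is active on 2 nodes ..."
TOO_ACTIVE = re.compile(r"Resource (\S+) \(\S+\) is active on (\d+) nodes")
# Scheduled action, e.g. "LogActions:   Restart failover-ip   (Started node-2.mycompany.com)"
ACTION = re.compile(r"LogActions:\s+(\w+)\s+(\S+)\s+\((.+)\)")


def summarize(lines):
    """Return (too_active, disruptive) extracted from pengine log lines.

    too_active: list of (resource, node_count) for "active on N nodes" errors.
    disruptive: list of (action, resource, detail) for every LogActions
                entry whose action is anything other than "Leave".
    """
    too_active = []
    disruptive = []
    for line in lines:
        match = TOO_ACTIVE.search(line)
        if match:
            too_active.append((match.group(1), int(match.group(2))))
            continue
        match = ACTION.search(line)
        if match and match.group(1) != "Leave":
            disruptive.append((match.group(1), match.group(2), match.group(3)))
    return too_active, disruptive
```

Run over the 07:06:11 transition, this should list both Filesystem resources as active on 2 nodes, followed by the Restart/Start actions for the jira_services group. The 07:06:21 transition should come back empty, since everything there is "Leave".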

Thanks again, David.

Mike.

----- Original Message -----
From: "David Vossel" <dvossel at redhat.com>
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
Sent: Tuesday, January 21, 2014 10:26:45 AM
Subject: Re: [Pacemaker] Preventing Automatic Failback

----- Original Message -----
> From: "Michael Monette" <mmonette at 2keys.ca>
> To: pacemaker at oss.clusterlabs.org
> Sent: Monday, January 20, 2014 8:22:25 AM
> Subject: [Pacemaker] Preventing Automatic Failback
> 
> Hi,
> 
> I posted this question before but my question was a bit unclear.
> 
> I have 2 nodes running DRBD with PostgreSQL.
> 
> When node-1 fails, everything fails over to node-2. But when node-1 recovers,
> things try to fail back to node-1 and all the services running on node-2 get
> disrupted (things don't ACTUALLY fail back to node-1; they try, fail, and
> then all services on node-2 are simply restarted, which is very annoying).
> This does not happen if I perform the same tests on node-2! I can reboot
> node-2, things fail over to node-1, and node-2 comes back online and waits
> until it is needed (this is what I want!). It seems to only affect my node-1s.
> 
> I have tried to set resource stickiness, I have tried everything I can really
> think of, but whenever the Primary has recovered, it will always disrupt
> services running on node-2.
> 
> Also I tried removing things from this config to try and isolate this. At one
> point I removed the atlassian_jira and drbd2_var primitives and only had a
> failover-ip and drbd1_opt, but still had the same problem. Hopefully someone
> can pinpoint the cause for me. If I can't really avoid this, I would at least
> like to make this "bug" or whatever happen on node-2 instead of on the active
> node.

I bet this is due to the drbd resource's master score value on node1 being higher than on node2.  When you recover node1, are you actually rebooting that node?  If node1 doesn't lose membership from the cluster (reboot), those transient attributes that the drbd agent uses to specify which node will be the master instance will stick around.  If you are just putting node1 in standby and then bringing the node back online, then I believe the resources will come back if the drbd master was originally on node1.

If you provide a policy engine file that shows the unwanted transition from node2 back to node1, we'll be able to tell you exactly why it is occurring.
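
As a reference for reading such a file yourself: the pe-error-169.bz2 named in the log (Transition 23) can be replayed offline with crm_simulate, which also prints the allocation and master scores. What follows is only a sketch wrapping that call in Python. The helper names replay_command and replay are made up for illustration, and it assumes crm_simulate is installed and on the PATH of the machine where the file is replayed.

```python
import shutil
import subprocess


def replay_command(pe_file, show_scores=True):
    """Build the crm_simulate argv that replays one saved policy engine input."""
    # -S: simulate the transition, -x: read the CIB from the given file
    cmd = ["crm_simulate", "-S", "-x", pe_file]
    if show_scores:
        # -s: print allocation scores, including the drbd master scores
        cmd.append("-s")
    return cmd


def replay(pe_file):
    """Run the replay and return its text output, or None if crm_simulate is missing."""
    if shutil.which("crm_simulate") is None:
        return None
    result = subprocess.run(replay_command(pe_file), capture_output=True, text=True)
    return result.stdout
```

Calling replay("/var/lib/pacemaker/pengine/pe-error-169.bz2") on a cluster node should show the per-node scores for drbd_data, so the two nodes' master scores can be compared.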

-- Vossel

> 
> Here is my config:
> 
> node node-1.comp.com \
>         attributes standby="off"
> node node-2.comp.com \
>         attributes standby="off"
> primitive atlassian_jira lsb:jira \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="240"
> primitive drbd1_opt ocf:heartbeat:Filesystem \
>         params device="/dev/drbd1" directory="/opt/atlassian" fstype="ext4"
> primitive drbd2_var ocf:heartbeat:Filesystem \
>         params device="/dev/drbd2" directory="/var/atlassian" fstype="ext4"
> primitive drbd_data ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="29s" role="Master" \
>         op monitor interval="31s" role="Slave"
> primitive failover-ip ocf:heartbeat:IPaddr2 \
>         params ip="10.199.0.13"
> group jira_services drbd1_opt drbd2_var failover-ip atlassian_jira
> ms ms_drbd_data drbd_data \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> colocation jira_services_on_drbd inf: atlassian_jira ms_drbd_data:Master
> order jira_services_after_drbd inf: ms_drbd_data:promote jira_services:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-14.el6_5.1-368c726" \
>         cluster-infrastructure="classic openais (with plugin)" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1390183165" \
>         default-resource-stickiness="INFINITY"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="INFINITY"
> 
> Thanks
> 
> Mike
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
