[ClusterLabs] Issues with DB2 HADR Resource Agent
Dileep V Nair
dilenair at in.ibm.com
Mon Feb 19 09:25:10 EST 2018
Hello Ondrej,
I am still having issues with my DB2 HADR on Pacemaker. When I run
db2_kill on the Primary for testing, Pacemaker initially restarts DB2 on
the same node. But if I let the cluster run for some days and then repeat
the same test, it goes into fencing and reboots the Primary node.
I am not sure how exactly it should behave when DB2 crashes on the
Primary.
Also, if I crash Node 1 itself (the whole node, not only DB2), it
promotes Node 2 to Primary; but once Pacemaker is started again on
Node 1, the DB on Node 1 is also promoted to Primary. Is that expected
behaviour?
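To see whether both nodes really report the Primary role after Node 1 rejoins, the HADR role can be compared on each node. As a hypothetical illustration ("SAMPLE" is a placeholder database name), the live command is shown in a comment, and a captured sample of its output is filtered the same way:

```shell
# On a live node you would run, as the DB2 instance owner:
#   db2pd -db SAMPLE -hadr
# Here we filter a captured sample of that output instead; the field
# names follow the db2pd HADR section.
sample_output='HADR_ROLE = PRIMARY
HADR_STATE = PEER
HADR_CONNECT_STATUS = CONNECTED'
printf '%s\n' "$sample_output" | grep -E 'HADR_ROLE |HADR_STATE '
```

If both nodes show HADR_ROLE = PRIMARY, both databases consider themselves Primary, which is the situation described above.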
Regards,
Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilenair at in.ibm.com
Outer Ring Road, Embassy Manya
Bangalore, KA 560045
India
From: Ondrej Famera <ofamera at redhat.com>
To: Dileep V Nair <dilenair at in.ibm.com>
Cc: Cluster Labs - All topics related to open-source clustering
welcomed <users at clusterlabs.org>
Date: 02/12/2018 11:46 AM
Subject: Re: [ClusterLabs] Issues with DB2 HADR Resource Agent
On 02/01/2018 07:24 PM, Dileep V Nair wrote:
> Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000, which
> I guess is a reasonable value. What I am noticing is that it does not wait
> for the PEER_WINDOW. Before that, the DB goes into a
> REMOTE_CATCHUP_PENDING state and Pacemaker gives an error saying a DB in
> STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.
>
>
> Regards,
>
> *Dileep V Nair*
Hi Dileep,
sorry for the late response. DB2 should not get into the
'REMOTE_CATCHUP' phase; if it does, the DB2 resource agent will indeed
not promote it. From my experience it usually gets into that state when
DB2 on the standby was restarted during or after the PEER_WINDOW timeout.
When the primary DB2 fails, the standby should end up in a state that
matches the one on line 770 of the DB2 resource agent, and the promote
operation is then attempted.
770 STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/db2#L770
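The decision on line 770 can be sketched as a small shell case statement (simplified; the real agent builds this ROLE/STATE/CONNECT_STATUS string from DB2's snapshot output, and the function name here is hypothetical):

```shell
# Decide, from an assumed "HADR_ROLE/HADR_STATE/HADR_CONNECT_STATUS"
# string, whether the agent would attempt a promote.
can_promote() {
    case "$1" in
        STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)
            echo "promote attempted" ;;
        STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED)
            echo "can never be promoted" ;;
        *)
            echo "no promote" ;;
    esac
}

can_promote "STANDBY/PEER/DISCONNECTED"                    # promote attempted
can_promote "STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED"  # can never be promoted
```

This makes visible why the REMOTE_CATCHUP_PENDING state reported earlier in the thread can never lead to a promotion.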
DB2 on the standby can get restarted when the 'promote' operation times
out, so you can try increasing the 'promote' timeout to something higher
if this was the case.
So if you see that DB2 was restarted after the Primary failed, increase
the promote timeout. If DB2 was not restarted, then the question is why
DB2 decided to change its status in this way.
Let me know if the above helped.
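As a hedged sketch of how that timeout could be raised with pcs (the resource name "db2_HADR" is a placeholder for your DB2 primitive; pick a value comfortably larger than your PEER_WINDOW):

```shell
# Raise the promote timeout on the DB2 primitive; 900s is an example value.
pcs resource update db2_HADR op promote timeout=900s
# Inspect the configured operations afterwards:
pcs resource show db2_HADR
```

The crm shell offers an equivalent via `crm configure edit` if pcs is not in use.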
--
Ondrej Faměra
@Red Hat