[ClusterLabs] Cluster resources migration from CMAN to Pacemaker

Jan Pokorný jpokorny at redhat.com
Wed Feb 10 18:00:03 UTC 2016


On 09/02/16 15:34 +0530, jaspal singla wrote:
> Hi Jan/Digiman,

(as a matter of fact, Digimer, from Digital Mermaid :-)

> Thanks for your replies. Based on your inputs, I managed to configure these
> values and the results were fine, but I still have some doubts for which I
> would seek your help. I also tried to dig into some of the issues on the
> internet, but due to the lack of cman -> pacemaker documentation, I couldn't
> find anything.

That's not exactly CMAN -> Pacemaker; a better conceptual expression is
  (CMAN,rgmanager) -> (Corosync v2,Pacemaker)
or
  (CMAN,rgmanager) -> (Corosync/CMAN,Pacemaker)
depending on the exact target (these are the expressions "clufter -h"
uses to provide a hint about the facilitated conversions).

And yes, such documentation is so scarce that I decided to put some bits
of non-code knowledge into the docs accompanying clufter:
    https://pagure.io/clufter/blob/master/f/__root__/doc/rgmanager-pacemaker
to at least partially fill the vacuum (and to lay some common ground for
talking about cluster properties in as implementation-agnostic a way as
possible <-- I am not aware of a similar effort, but I didn't search
extensively).

Any help with extending/refining it is welcome.

> I have configured the 8 scripts under one resource group as you recommended,
> but 2 of those scripts are not being executed by the cluster itself.
> When I try to execute the same scripts manually, I can do so, but not
> through the Pacemaker command.
> 
> For example:
> 
> This is the output of crm_mon command:
> 
> ###############################################################################################################
> Last updated: Mon Feb  8 17:30:57 2016          Last change: Mon Feb  8 17:03:29 2016 by hacluster via crmd on ha1-103.cisco.com
> Stack: corosync
> Current DC: ha1-103.cisco.com (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> 1 node and 10 resources configured
> 
> Online: [ ha1-103.cisco.com ]
> 
>  Resource Group: ctm_service
>      FSCheck            (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FsCheckAgent.py):        Started ha1-103.cisco.com
>      NTW_IF             (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/NtwIFAgent.py):          Started ha1-103.cisco.com
>      CTM_RSYNC          (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/RsyncAgent.py):          Started ha1-103.cisco.com
>      REPL_IF            (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_IFAgent.py):         Started ha1-103.cisco.com
>      ORACLE_REPLICATOR  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_ReplicatorAgent.py): Started ha1-103.cisco.com
>      CTM_SID            (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/OracleAgent.py):         Started ha1-103.cisco.com
>      CTM_SRV            (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/CtmAgent.py):            Stopped
>      CTM_APACHE         (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ApacheAgent.py):         Stopped
>  Resource Group: ctm_heartbeat
>      CTM_HEARTBEAT      (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/HeartBeat.py):           Started ha1-103.cisco.com
>  Resource Group: ctm_monitoring
>      FLASHBACK          (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FlashBackMonitor.py):    Started ha1-103.cisco.com
> 
> Failed Actions:
> * CTM_SRV_start_0 on ha1-103.cisco.com 'unknown error' (1): call=577, status=complete, exitreason='none',
>     last-rc-change='Mon Feb  8 17:12:33 2016', queued=0ms, exec=74ms
> 
> #################################################################################################################
> 
> 
> CTM_SRV and CTM_APACHE are in the stopped state. These services are either
> not being executed by the cluster or are being failed somehow by it; I am
> not sure why. When I manually execute the CTM_SRV script, it runs without
> issues.
> 
> -> For manual execution of this script, I ran the command below:
> 
> # /cisco/PrimeOpticalServer/HA/bin/OracleAgent.py status
> 
> Output:
> 
> _________________________________________________________________________________________________________________
> 2016-02-08 17:48:41,888 INFO MainThread CtmAgent
> =========================================================
> Executing preliminary checks...
>  Check Oracle and Listener availability
>   => Oracle and listener are up.
>  Migration check
>   => Migration check completed successfully.
>  Check the status of the DB archivelog
>   => DB archivelog check completed successfully.
>  Check of Oracle scheduler...
>   => Check of Oracle scheduler completed successfully
>  Initializing database tables
>   => Database tables initialized successfully.
>  Install in cache the store procedure
>   => Installing store procedures completed successfully
>  Gather the oracle system stats
>   => Oracle stats completed successfully
> Preliminary checks completed.
> =========================================================
> Starting base services...
> Starting Zookeeper...
> JMX enabled by default
> Using config: /opt/CiscoTransportManagerServer/zookeeper/bin/../conf/zoo.cfg
> Starting zookeeper ... STARTED
>  Retrieving name service port...
>  Starting name service...
> Base services started.
> =========================================================
> Starting Prime Optical services...
> Prime Optical services started.
> =========================================================
> Cisco Prime Optical Server Version: 10.5.0.0.214 / Oracle Embedded
> -------------------------------------------------------------------------------------
>       USER       PID      %CPU      %MEM     START      TIME   PROCESS
> -------------------------------------------------------------------------------------
>       root     16282       0.0       0.0  17:48:11      0:00   CTM Server
>       root     16308       0.0       0.1  17:48:16      0:00   CTM Server
>       root     16172       0.1       0.1  17:48:10      0:00   NameService
>       root     16701      24.8       7.5  17:48:27      0:27   TOMCAT
>       root     16104       0.2       0.2  17:48:09      0:00   Zookeeper
> -------------------------------------------------------------------------------------
> For startup details, see:
> /opt/CiscoTransportManagerServer/log/ctms-start.log
> 2016-02-08 17:48:41,888 WARNING MainThread CtmAgent CTM restartd at attempt 1
> _________________________________________________________________________________________________________________
> 
> 
> The script gets executed and I can see that the service was started, but the
> crm_mon output still shows this CTM_SRV resource in the stopped state. Why?

I guess we would need more complete logs to investigate this, but it
is possible that when you run it manually, the execution environment
differs from the one Pacemaker uses (environment variables, access
permissions, programs running concurrently, including the other scripts
from that very resource group, ...).
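
One quick way to approximate what Pacemaker sees is to run the script
with a stripped-down environment and compare the exit codes against
what an LSB-style init script is expected to return (just a sketch;
"env -i" only crudely approximates the environment the cluster
provides, and the path is the one your CTM_SRV definition resolves to):

  # env -i PATH=/sbin:/usr/sbin:/bin:/usr/bin \
      /cisco/PrimeOpticalServer/HA/bin/CtmAgent.py start; echo "start rc=$?"
  # env -i PATH=/sbin:/usr/sbin:/bin:/usr/bin \
      /cisco/PrimeOpticalServer/HA/bin/CtmAgent.py status; echo "status rc=$?"

(start should return 0 on success; status should return 0 when the
service is running and 3 when it is stopped)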

> -> When I try to start the script through the pcs command, I get the errors
> below in the logs. I tried to debug, but couldn't manage to rectify it. I'd
> really appreciate it if any help could be provided to get this resolved.
> 
> # pcs resource enable CTM_SRV
> 
> 
> Output:
> _________________________________________________________________________________________________________________
> 
> Feb 08 17:12:42 [12877] ha1-103.cisco.com    pengine:    debug: determine_op_status:    CTM_SRV_start_0 on ha1-103.cisco.com returned 'unknown error' (1) instead of the expected value: 'ok' (0)
> Feb 08 17:12:42 [12877] ha1-103.cisco.com    pengine:  warning: unpack_rsc_op_failure:  Processing failed op start for CTM_SRV on ha1-103.cisco.com: unknown error (1)
> Feb 08 17:12:42 [12877] ha1-103.cisco.com    pengine:    debug: determine_op_status:    CTM_SRV_start_0 on ha1-103.cisco.com returned 'unknown error' (1) instead of the expected value: 'ok' (0)
> Feb 08 17:12:42 [12877] ha1-103.cisco.com    pengine:  warning: unpack_rsc_op_failure:  Processing failed op start for CTM_SRV on ha1-103.cisco.com: unknown error (1)
> Feb 08 17:12:42 [12877] ha1-103.cisco.com    pengine:     info: native_print:        CTM_SRV (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/CtmAgent.py):    FAILED ha1-103.cisco.com
> Feb 08 17:12:42 [12877] ha1-103.cisco.com    pengine:     info: get_failcount_full:     CTM_SRV has failed INFINITY times on ha1-103.cisco.com

This is because the first failure to start a resource on a particular
node will mark that node as ineligible to run the resource, unless the
start-failure-is-fatal cluster option is explicitly set to false:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_cluster_options

This actually looks like a bug in clufter's answer for the suggested
conversion path, as it has, so far, assumed that migration-threshold
and friends will work without tweaking any cluster-wide options.

I have to investigate this in greater detail.

In the meantime, the following forces a change in the behaviour:

# pcs property set start-failure-is-fatal=false
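
Note that once the fail-count has reached INFINITY (as in the log
excerpt above), you will likely also want to clear it so the node
becomes eligible for the resource again; something along these lines
(resource name taken from your configuration):

# pcs resource cleanup CTM_SRV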

[rest of the log snipped]

-- 
Jan (Poki)

