[Pacemaker] rsc_op: Hard error - res_Nagios_monitor_0 failed with rc=6: Preventing res_Nagios from re-starting anywhere in the cluster

Fri Jun 25 10:41:56 EDT 2010

On Thu, Jun 24, 2010 at 1:54 PM, Koch, Sebastian
<Sebastian.Koch at netzwerk.de> wrote:
> Hi,
>
> thanks for your reply. It wasn't clear to me that pacemaker is issuing status commands in the background even on the passive node.

We run a single monitor op for each resource on each node when it
joins the cluster.
This is the only way to be sure what the current state of the resource is.
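
As a rough illustration of that probe for an lsb: resource: the cluster runs the init script's status action and interprets the exit code. The helper below is only a hypothetical sketch (the name classify_status_rc is made up for illustration, and this is not Pacemaker's actual code). It mirrors the outcomes seen in this thread, where rc=6 was reported as "not configured" and treated as a hard error.

```shell
#!/bin/sh
# Hypothetical helper: classify the exit code of "<init script> status"
# roughly the way the probe result in this thread was treated.
classify_status_rc() {
    case "$1" in
        0) echo "running" ;;                      # service is active on this node
        3) echo "not running" ;;                  # the expected answer on a passive node
        6) echo "not configured (hard error)" ;;  # what res_Nagios_monitor_0 returned
        *) echo "error" ;;                        # anything else counts as a failure
    esac
}

# Possible usage on each node (script path as used in this thread):
#   /etc/init.d/nagios3 status; classify_status_rc $?
```

On a passive node the status action should end up in the "not running" branch; any other non-zero result makes the cluster consider the resource failed there.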

> The problem was that on the passive node the symlinks to the nagios configuration were broken because the DRBD device was mounted on the other node. Therefore I just copied all needed configs to my /mnt/cluster/ dir, so if the node is passive it can use the configs from there. If it becomes active, the DRBD device will be mounted on /mnt/cluster.
>
> Do you have a better idea

Not really. Unless you want to relax the checks in the RA.
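
One way to relax the check, sketched under assumptions: have the status action report "not running" (exit code 3) whenever no live nagios process is found, even if the configuration under /etc/nagios3 is unreachable because DRBD is mounted on the other node. The function name relaxed_status and the PIDFILE path are hypothetical; the real init script may be structured differently.

```shell
#!/bin/sh
# Hypothetical relaxed status check for an LSB-style init script.
# PIDFILE is an assumed location; adjust it to the real script.
PIDFILE="${PIDFILE:-/var/run/nagios3/nagios3.pid}"

relaxed_status() {
    # Running only if the pid file exists and names a live process.
    if [ -r "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return 0
    fi
    # No live process found. Report "not running" (3) without looking at
    # the configuration at all, so a missing config on the passive node
    # no longer turns into "not configured" (6).
    return 3
}
```

The start action can still insist on a valid configuration; only the status path is relaxed here.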

> because it seems to me like a roundabout (from-behind-through-the-chest-into-the-eye) solution, and I would like to solve it in a more elegant way. Currently I have the same issue with ClusterMonitor because it is trying to write the HTML to /var/www, but I symlinked this to the cluster dir and therefore the status command fails on the passive node.
>
> Thanks in advance.
>
> Sebastian Koch
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Thursday, 24 June 2010 09:19
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] rsc_op: Hard error - res_Nagios_monitor_0 failed with rc=6: Preventing res_Nagios from re-starting anywhere in the cluster
>
> On Wed, Jun 23, 2010 at 5:19 PM, Koch, Sebastian
> <Sebastian.Koch at netzwerk.de> wrote:
>> Hi,
>>
>> I have a 2-node cluster up and running, and right now I am trying to
>> configure a Nagios3 resource. I already fixed the nagios init
>> script, as it didn't pass the LSB compatibility checks described here:
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
>>
>> I just needed to make sure the pid file gets removed when the stop function is
>> called. After this small change the script passed all the LSB checks. Below you
>> will find the error message:
>>
>> root at pilot01-node2:/var/run/nagios3# crm_verify -LV
>> crm_verify[7094]: 2010/06/23_16:37:27 ERROR: unpack_rsc_op: Hard error -
>> res_Nagios_monitor_0 failed with rc=6: Preventing res_Nagios from
>> re-starting anywhere in the cluster
>
> Looks like it's still failing the fifth LSB check from the above URL.
> "Did the command print result: 3"
>
>>
>> crm_verify[7094]: 2010/06/23_16:37:27 WARN: native_color: Resource
>> res_Nagios cannot run anywhere
>> Warnings found during check: config may not be valid
>>
>> I tried to find out what an init script must provide to be usable
>> in Pacemaker, but I only found the LSB compatibility hints on the
>> Pacemaker website. I think I configured the primitive wrong, or maybe the
>> init script is still wrong? Even if I configure it with an op monitor action
>> it fails. And even a crm resource cleanup res_Nagios doesn't help me
>> start the resource.
>>
>> I can run Nagios manually on the active node. I linked all shared
>> directories to my cluster storage device like this:
>>
>> root at pilot01-node2:/etc# ll /var/lib/nagios3* /etc/nagios*
>> lrwxrwxrwx 1 root   root    25 23. Jun 13:54 /etc/nagios3 -> /mnt/cluster/etc/nagios3/
>> lrwxrwxrwx 1 root   root    29 23. Jun 14:04 /var/lib/nagios3 -> /mnt/cluster/var/lib/nagios3/
>>
>> /etc/nagios3_bak:
>> total 88K
>> drwxr-xr-x  4 root root    146 23. Jun 13:54 .
>> drwxr-xr-x 75 root root   4,0K 23. Jun 17:08 ..
>> -rw-r--r--  1 root root   1,9K 30. Jun 2009  apache2.conf
>> -rw-r--r--  1 root root    11K 23. Jun 13:49 cgi.cfg
>> -rw-r--r--  1 root root   2,4K  2. Jul 2009  commands.cfg
>> drwxr-xr-x  2 root root   4,0K  7. Jun 19:16 conf.d
>> -rw-r--r--  1 root root     20 23. Jun 13:49 htpasswd.users
>> -rw-r--r--  1 root root    42K  2. Jul 2009  nagios.cfg
>> -rw-r-----  1 root nagios 1,3K 30. Jun 2009  resource.cfg
>> drwxr-xr-x  2 root root   4,0K  7. Jun 19:16 stylesheets
>>
>> /etc/nagios-plugins:
>> total 12K
>> drwxr-xr-x  3 root root   19  7. Jun 19:16 .
>> drwxr-xr-x 75 root root 4,0K 23. Jun 17:08 ..
>> drwxr-xr-x  2 root root 4,0K  7. Jun 19:16 config
>>
>> /var/lib/nagios3_bak:
>> total 20K
>> drwxr-x---  4 nagios nagios     47 23. Jun 14:02 .
>> drwxr-xr-x 33 root   root     4,0K 23. Jun 14:04 ..
>> -rw-------  1 nagios www-data  14K 23. Jun 14:02 retention.dat
>> drwx------  2 nagios www-data    6  2. Jul 2009  rw
>> drwxr-x---  3 nagios nagios     25  7. Jun 19:16 spool
>>
>> Here is my Config.
>>
>> ########################
>> ### 3. Cluster State ###
>> ########################
>>
>> ============
>> Last updated: Wed Jun 23 17:16:33 2010
>> Stack: openais
>> Current DC: pilot01-node2 - partition with quorum
>> Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
>> 2 Nodes configured, 2 expected votes
>> 4 Resources configured.
>> ============
>>
>> Node pilot01-node1: standby
>> Online: [ pilot01-node2 ]
>>
>> Full list of resources:
>>
>>  Resource Group: grp_MySQL
>>      res_Filesystem     (ocf::heartbeat:Filesystem):    Started pilot01-node2
>>      res_ClusterIP      (ocf::heartbeat:IPaddr2):       Started pilot01-node2
>>      res_MySQL  (lsb:mysql):    Started pilot01-node2
>>      res_Apache (lsb:apache2):  Started pilot01-node2
>>      res_ClusterMonitor (ocf::pacemaker:ClusterMon):    Started pilot01-node2
>>      res_Nagios (lsb:nagios3):  Stopped
>>  Master/Slave Set: ms_drbd_mysql0
>>      Masters: [ pilot01-node2 ]
>>      Stopped: [ drbd_pilot0:0 ]
>>  Clone Set: cl-pinggw
>>      Started: [ pilot01-node2 ]
>>      Stopped: [ pinggw:0 ]
>> Monitor-Cluster (ocf::pacemaker:ClusterMon):    Started pilot01-node1 (unmanaged) FAILED
>>
>> Failed actions:
>>     Monitor-Cluster_stop_0 (node=pilot01-node1, call=34, rc=1, status=complete): unknown error
>>     res_Nagios_monitor_0 (node=pilot01-node1, call=84, rc=6, status=complete): not configured
>>
>> #########################
>> ### 4. Cluster Config ###
>> #########################
>>
>> node pilot01-node1 \
>>         attributes standby="on"
>> node pilot01-node2 \
>>         attributes standby="off"
>> primitive Monitor-Cluster ocf:pacemaker:ClusterMon \
>>         params htmlfile="/mnt/cluster/var/www/cluster-monitor.html" \
>>         params pidfile="/var/run/rlb-cluster-monitor.pid" \
>>         op start interval="0" timeout="90s" \
>>         op stop interval="0" timeout="100s"
>> primitive drbd_pilot0 ocf:linbit:drbd \
>>         params drbd_resource="pilot0" \
>>         operations $id="drbd_pilot0-operations" \
>>         op monitor interval="15s"
>> primitive pinggw ocf:pacemaker:pingd \
>>         params host_list="10.1.1.162" multiplier="200" \
>>         op monitor interval="10s"
>> primitive res_Apache lsb:apache2 \
>>         operations $id="res_Apache-operations" \
>>         op monitor interval="15s" timeout="20s" start-delay="15s"
>> primitive res_ClusterIP ocf:heartbeat:IPaddr2 \
>>         params iflabel="ClusterIP" ip="10.1.1.12" nic="eth0" cidr_netmask="24" \
>>         operations $id="res_ClusterIP_1-operations" \
>>         op monitor start-delay="0" interval="10s"
>> primitive res_ClusterMonitor ocf:pacemaker:ClusterMon \
>>         params htmlfile="/mnt/cluster/var/www/cluster-monitor.html" \
>>         params pidfile="/var/run/rlb-cluster-monitor.pid" \
>>         op start interval="0" timeout="90s" \
>>         op stop interval="0" timeout="100s" \
>>         meta target-role="Started"
>> primitive res_Filesystem ocf:heartbeat:Filesystem \
>>         params fstype="xfs" directory="/mnt/cluster" device="/dev/drbd0" options="noatime,nodiratime,barrier=0"
>> primitive res_MySQL lsb:mysql
>> primitive res_Nagios lsb:nagios3 \
>>         operations $id="res_Nagios-operations" \
>>         op monitor interval="15s" timeout="20s" \
>>         meta target-role="Started"
>> group grp_MySQL res_Filesystem res_ClusterIP res_MySQL res_Apache res_ClusterMonitor res_Nagios
>> ms ms_drbd_mysql0 drbd_pilot0 \
>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> clone cl-pinggw pinggw \
>>         meta globally-unique="false"
>> location drbd-fence-by-handler-ms_drbd_mysql0 ms_drbd_mysql0 \
>>         rule $id="drbd-fence-by-handler-rule-ms_drbd_mysql0" $role="Master" -inf: #uname ne pilot01-node2
>> location grp_MySQL-with-pinggw grp_MySQL \
>>         rule $id="grp_MySQL-with-pinggw-rule-1" -inf: not_defined pingd or pingd lte 0
>> colocation col_drbd_on_mysql inf: grp_MySQL ms_drbd_mysql0:Master
>> order mysql_after_drbd inf: ms_drbd_mysql0:promote grp_MySQL:start
>> property $id="cib-bootstrap-options" \
>>         expected-quorum-votes="2" \
>>         stonith-enabled="false" \
>>         no-quorum-policy="ignore" \
>>         dc-version="1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75" \
>>         cluster-infrastructure="openais" \
>>         last-lrm-refresh="1277306106" \
>>         symmetric-cluster="true" \
>>         migration-threshold="1" \
>>         default-action-timeout="240s"
>>
>> Thanks for your help in advance.
>>
>> Sebastian
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs:
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>>
>