[ClusterLabs] Nodes see each other as OFFLINE - fence agent (fence_pcmk) may not be working properly on RHEL 6.5

Fri Dec 16 15:44:04 EST 2016

On 12/16/2016 07:46 AM, avinash shankar wrote:
> 
> Hello team,
> 
> I am new to pacemaker and corosync clustering.
> I am having trouble with the fence agent on RHEL 6.5.
> I have installed pcs, pacemaker, corosync and cman on a cluster of two
> virtual (libvirt) nodes running RHEL 6.5.
> SELinux and the firewall are completely disabled.
> 
> # yum list installed | egrep 'pacemaker|corosync|cman|fence'
> cman.x86_64                      3.0.12.1-78.el6        @rhel-ha-for-rhel-6-server-rpms
> corosync.x86_64                  1.4.7-5.el6            @rhel-ha-for-rhel-6-server-rpms
> corosynclib.x86_64               1.4.7-5.el6            @rhel-ha-for-rhel-6-server-rpms
> fence-agents.x86_64              4.0.15-12.el6          @rhel-6-server-rpms
> fence-virt.x86_64                0.2.3-19.el6           @rhel-ha-for-rhel-6-server-eus-rpms
> pacemaker.x86_64                 1.1.14-8.el6_8.2       @rhel-ha-for-rhel-6-server-rpms
> pacemaker-cli.x86_64             1.1.14-8.el6_8.2       @rhel-ha-for-rhel-6-server-rpms
> pacemaker-cluster-libs.x86_64    1.1.14-8.el6_8.2       @rhel-ha-for-rhel-6-server-rpms
> pacemaker-libs.x86_64            1.1.14-8.el6_8.2       @rhel-ha-for-rhel-6-server-rpms
> 
> I bring up the cluster using pcs cluster start --all
> and have also run pcs property set stonith-enabled=false

fence_pcmk simply redirects CMAN's fencing requests to pacemaker's
fencing, so it can't work if pacemaker's fencing is disabled
(stonith-enabled=false) or if pacemaker has no fence device configured.
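
As a rough illustration of that dependency, the sketch below (Python,
standard library only) reads a CIB and reports whether stonith-enabled
is false and whether any fence device is defined. The function name
fencing_status and the set of "false" spellings are assumptions made
for this example, not part of pacemaker; the sample CIB is trimmed from
the one quoted further down.

```python
import xml.etree.ElementTree as ET

# Spellings treated as "false" in this sketch (an assumption for illustration).
FALSE_VALUES = {"false", "off", "no", "n", "0"}

# Trimmed copy of the cib.xml from the original post.
SAMPLE_CIB = """
<cib>
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-cluster-infrastructure"
                name="cluster-infrastructure" value="cman"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled"
                name="stonith-enabled" value="false"/>
      </cluster_property_set>
    </crm_config>
    <resources/>
  </configuration>
</cib>
"""


def fencing_status(cib_xml):
    """Summarize whether pacemaker fencing looks usable in a CIB XML string."""
    root = ET.fromstring(cib_xml)

    # stonith-enabled is treated as true when the property is absent.
    enabled = True
    for pset in root.iter("cluster_property_set"):
        for nvpair in pset.iter("nvpair"):
            if nvpair.get("name") == "stonith-enabled":
                value = nvpair.get("value", "").strip().lower()
                enabled = value not in FALSE_VALUES

    # Fence devices show up in the CIB as primitives with class="stonith".
    devices = [p.get("id") for p in root.iter("primitive")
               if p.get("class") == "stonith"]

    return {
        "stonith_enabled": enabled,
        "stonith_devices": devices,
        # fence_pcmk needs both: fencing enabled and at least one device.
        "fence_pcmk_can_work": enabled and bool(devices),
    }


if __name__ == "__main__":
    print(fencing_status(SAMPLE_CIB))
```

With the posted CIB, both conditions fail: fencing is disabled and the
resources section is empty, which matches the "No such device" errors
from stonith-ng in the log.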

> Below is the status
> ---------------------------
> # pcs status
> Cluster name: roamclus
> Last updated: Fri Dec 16 18:54:40 2016        Last change: Fri Dec 16
> 17:44:50 2016 by root via cibadmin on cnode1
> Stack: cman
> Current DC: NONE
> 2 nodes and 2 resources configured
> 
> Online: [ cnode1 ]
> OFFLINE: [ cnode2 ]
> 
> Full list of resources:
> 
> PCSD Status:
>   cnode1: Online
>   cnode2: Online
> ---------------------------
> The same kind of output is observed on the other node, cnode2,
> so the nodes see each other as OFFLINE.
> The expected result is Online: [ cnode1 cnode2 ]
> I installed the same packages on RHEL 6.8, and when I start the
> cluster there,
> both nodes show each other as ONLINE.
> 
> I need to resolve this so that on RHEL 6.5, when we start the cluster,
> both nodes display each other's status as online by default.
> ----------------------------------------------
> Below is /etc/cluster/cluster.conf:
> 
> <cluster config_version="9" name="roamclus">
>   <fence_daemon/>
>   <clusternodes>
>     <clusternode name="cnode1" nodeid="1" votes="1">
>       <fence>
>         <method name="pcmk-method">
>           <device name="pcmk-redirect" port="cnode1"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="cnode2" nodeid="2" votes="1">
>       <fence>
>         <method name="pcmk-method">
>           <device name="pcmk-redirect" port="cnode2"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <cman broadcast="no" expected_votes="1" transport="udp" two_node="1"/>
>   <fencedevices>
>     <fencedevice agent="fence_pcmk" name="pcmk-redirect"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains/>
>     <resources/>
>   </rm>
> </cluster>
> ----------------------------------------------
> # cat /var/lib/pacemaker/cib/cib.xml
> <cib crm_feature_set="3.0.10" validate-with="pacemaker-2.4" epoch="15"
> num_updates="0" admin_epoch="0" cib-last-written="Fri Dec 16 18:57:10
> 2016" update-origin="cnode1" update-client="cibadmin" update-user="root"
> have-quorum="1" dc-uuid="cnode1">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-have-watchdog"
> name="have-watchdog" value="false"/>
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
> value="1.1.14-8.el6_8.2-70404b0"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
> name="cluster-infrastructure" value="cman"/>
>         <nvpair id="cib-bootstrap-options-stonith-enabled"
> name="stonith-enabled" value="false"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="cnode1" uname="cnode1"/>
>       <node id="cnode2" uname="cnode2"/>
>     </nodes>
>     <resources/>
>     <constraints/>
>   </configuration>
> </cib>
> ------------------------------------------------
> /var/log/messages has the following contents:
> 
> Dec 15 20:29:43 cnode2 kernel: DLM (built Oct 26 2016 10:26:08) installed
> Dec 15 20:29:46 cnode2 corosync[2464]:   [MAIN  ] Corosync Cluster
> Engine ('1.4.7'): started and ready to provide service.
> Dec 15 20:29:46 cnode2 corosync[2464]:   [MAIN  ] Corosync built-in
> features: nss dbus rdma snmp
> Dec 15 20:29:46 cnode2 corosync[2464]:   [MAIN  ] Successfully read
> config from /etc/cluster/cluster.conf
> Dec 15 20:29:46 cnode2 corosync[2464]:   [MAIN  ] Successfully parsed
> cman config
> Dec 15 20:29:46 cnode2 corosync[2464]:   [TOTEM ] Initializing transport
> (UDP/IP Multicast).
> Dec 15 20:29:46 cnode2 corosync[2464]:   [TOTEM ] Initializing
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Dec 15 20:29:46 cnode2 corosync[2464]:   [TOTEM ] The network interface
> [10.10.18.138] is now up.
> Dec 15 20:29:46 cnode2 corosync[2464]:   [QUORUM] Using quorum provider
> quorum_cman
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync cluster quorum service v0.1
> Dec 15 20:29:46 cnode2 corosync[2464]:   [CMAN  ] CMAN 3.0.12.1 (built
> Feb  1 2016 07:06:19) started
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync CMAN membership service 2.90
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> openais checkpoint service B.01.01
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync extended virtual synchrony service
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync configuration service
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync cluster closed process group service v1.01
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync cluster config database access v1.01
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync profile loading service
> Dec 15 20:29:46 cnode2 corosync[2464]:   [QUORUM] Using quorum provider
> quorum_cman
> Dec 15 20:29:46 cnode2 corosync[2464]:   [SERV  ] Service engine loaded:
> corosync cluster quorum service v0.1
> Dec 15 20:29:46 cnode2 corosync[2464]:   [MAIN  ] Compatibility mode set
> to whitetank.  Using V1 and V2 of the synchronization engine.
> Dec 15 20:29:46 cnode2 corosync[2464]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Dec 15 20:29:46 cnode2 corosync[2464]:   [CMAN  ] quorum regained,
> resuming activity
> Dec 15 20:29:46 cnode2 corosync[2464]:   [QUORUM] This node is within
> the primary component and will provide service.
> Dec 15 20:29:46 cnode2 corosync[2464]:   [QUORUM] Members[1]: 2
> Dec 15 20:29:46 cnode2 corosync[2464]:   [QUORUM] Members[1]: 2
> Dec 15 20:29:46 cnode2 corosync[2464]:   [CPG   ] chosen downlist:
> sender r(0) ip(10.10.18.138) ; members(old:0 left:0)
> Dec 15 20:29:46 cnode2 corosync[2464]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Dec 15 20:29:50 cnode2 fenced[2529]: fenced 3.0.12.1 started
> Dec 15 20:29:50 cnode2 dlm_controld[2543]: dlm_controld 3.0.12.1 started
> Dec 15 20:29:51 cnode2 gfs_controld[2606]: gfs_controld 3.0.12.1 started
> Dec 15 20:30:36 cnode2 pacemaker: Starting Pacemaker Cluster Manager
> Dec 15 20:30:36 cnode2 pacemakerd[2767]:   notice: Additional logging
> available in /var/log/pacemaker.log
> Dec 15 20:30:36 cnode2 pacemakerd[2767]:   notice: Switching to
> /var/log/cluster/corosync.log
> Dec 15 20:30:36 cnode2 pacemakerd[2767]:   notice: Additional logging
> available in /var/log/cluster/corosync.log
> Dec 15 20:30:36 cnode2 pacemakerd[2767]:   notice: Starting Pacemaker
> 1.1.14-8.el6_8.2 (Build: 70404b0):  generated-manpages agent-manpages
> ascii-docs ncurses libqb-logging libqb-ipc nagios  corosync-plugin cman acls
> 
> Dec 15 20:30:36 cnode2 pacemakerd[2767]:   notice: Membership 4: quorum
> acquired
> Dec 15 20:30:36 cnode2 pacemakerd[2767]:   notice: cman_event_callback:
> Node cnode2[2] - state is now member (was (null))
> 
> Dec 15 20:30:36 cnode2 cib[2773]:   notice: Additional logging available
> in /var/log/cluster/corosync.log
> 
> Dec 15 20:30:36 cnode2 cib[2773]:   notice: Using new config location:
> /var/lib/pacemaker/cib
> Dec 15 20:30:36 cnode2 cib[2773]:  warning: Could not verify cluster
> configuration file /var/lib/pacemaker/cib/cib.xml: No such file or
> directory (2)
> Dec 15 20:30:36 cnode2 cib[2773]:  warning: Primary configuration
> corrupt or unusable, trying backups in /var/lib/pacemaker/cib
> Dec 15 20:30:36 cnode2 cib[2773]:  warning: Continuing with an empty
> configuration.

The above is the problem. Your configuration may have a syntax error or
may have been written for a different version of pacemaker. Try running
"pcs cluster verify -V" to see what the issue is.
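
Before running the verify, it can help to tell a missing file apart
from a broken one, since the warnings above report "No such file or
directory". The sketch below is a hypothetical helper (check_cib_file
is a name made up for this example) that only tests whether the file
exists and parses as XML. It does not validate against pacemaker's
schema, so it is no substitute for "pcs cluster verify -V".

```python
import os
import xml.etree.ElementTree as ET


def check_cib_file(path):
    """Classify a CIB file as 'missing', 'malformed' or 'well-formed'."""
    if not os.path.exists(path):
        # Corresponds to the "No such file or directory" warning in the log.
        return "missing"
    try:
        ET.parse(path)
    except ET.ParseError:
        # The file exists but is not well-formed XML (syntax error).
        return "malformed"
    # Well-formed XML only; schema validity is not checked here.
    return "well-formed"


if __name__ == "__main__":
    # Path taken from the log messages in the original post.
    print(check_cib_file("/var/lib/pacemaker/cib/cib.xml"))
```

A result of "missing" on a node points at the file never having been
written or synced there, while "malformed" points at a syntax problem
in the file itself.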

Also, feel free to open a support case with Red Hat.

> Dec 15 20:30:36 cnode2 stonith-ng[2774]:   notice: Additional logging
> available in /var/log/cluster/corosync.log
> Dec 15 20:30:36 cnode2 stonith-ng[2774]:   notice: Connecting to cluster
> infrastructure: cman
> Dec 15 20:30:36 cnode2 attrd[2776]:   notice: Additional logging
> available in /var/log/cluster/corosync.log
> Dec 15 20:30:36 cnode2 attrd[2776]:   notice: Connecting to cluster
> infrastructure: cman
> Dec 15 20:30:36 cnode2 stonith-ng[2774]:   notice: crm_update_peer_proc:
> Node cnode2[2] - state is now member (was (null))
> Dec 15 20:30:36 cnode2 pengine[2777]:   notice: Additional logging
> available in /var/log/cluster/corosync.log
> Dec 15 20:30:36 cnode2 lrmd[2775]:   notice: Additional logging
> available in /var/log/cluster/corosync.log
> Dec 15 20:30:36 cnode2 attrd[2776]:   notice: crm_update_peer_proc: Node
> cnode2[2] - state is now member (was (null))
> Dec 15 20:30:36 cnode2 crmd[2778]:   notice: Additional logging
> available in /var/log/cluster/corosync.log
> Dec 15 20:30:36 cnode2 crmd[2778]:   notice: CRM Git Version:
> 1.1.14-8.el6_8.2 (70404b0)
> Dec 15 20:30:36 cnode2 cib[2773]:   notice: Connecting to cluster
> infrastructure: cman
> Dec 15 20:30:36 cnode2 attrd[2776]:   notice: Starting mainloop...
> Dec 15 20:30:36 cnode2 cib[2773]:   notice: crm_update_peer_proc: Node
> cnode2[2] - state is now member (was (null))
> Dec 15 20:30:36 cnode2 cib[2782]:  warning: Could not verify cluster
> configuration file /var/lib/pacemaker/cib/cib.xml: No such file or
> directory (2)
> Dec 15 20:30:37 cnode2 stonith-ng[2774]:   notice: Watching for stonith
> topology changes
> Dec 15 20:30:37 cnode2 crmd[2778]:   notice: Connecting to cluster
> infrastructure: cman
> Dec 15 20:30:37 cnode2 crmd[2778]:   notice: Membership 4: quorum acquired
> Dec 15 20:30:37 cnode2 crmd[2778]:   notice: cman_event_callback: Node
> cnode2[2] - state is now member (was (null))
> Dec 15 20:30:37 cnode2 crmd[2778]:   notice: The local CRM is operational
> Dec 15 20:30:37 cnode2 crmd[2778]:   notice: State transition S_STARTING
> -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> Dec 15 20:30:42 cnode2 fenced[2529]: fencing node cnode1
> Dec 15 20:30:42 cnode2 fence_pcmk[2805]: Requesting Pacemaker fence
> cnode1 (reset)
> Dec 15 20:30:42 cnode2 stonith-ng[2774]:   notice: Client
> stonith_admin.cman.2806.6d791bd8 wants to fence (reboot) 'cnode1' with
> device '(any)'
> Dec 15 20:30:42 cnode2 stonith-ng[2774]:   notice: Initiating remote
> operation reboot for cnode1: c398b8b7-6ba1-4068-a174-547bac72476d (0)
> Dec 15 20:30:42 cnode2 stonith-ng[2774]:   notice: Couldn't find anyone
> to fence (reboot) cnode1 with any device
> Dec 15 20:30:42 cnode2 stonith-ng[2774]:    error: Operation reboot of
> cnode1 by <no-one> for stonith_admin.cman.2806 at cnode2.c398b8b7: No such
> device
> Dec 15 20:30:42 cnode2 crmd[2778]:   notice: Peer cnode1 was not
> terminated (reboot) by <anyone> for cnode2: No such device
> (ref=c398b8b7-6ba1-4068-a174-547bac72476d) by client stonith_admin.cman.2806
> Dec 15 20:30:42 cnode2 fence_pcmk[2805]: Call to fence cnode1 (reset)
> failed with rc=237
> Dec 15 20:30:42 cnode2 fenced[2529]: fence cnode1 dev 0.0 agent
> fence_pcmk result: error from agent
> Dec 15 20:30:42 cnode2 fenced[2529]: fence cnode1 failed
> Dec 15 20:30:45 cnode2 fenced[2529]: fencing node cnode1
> Dec 15 20:30:45 cnode2 fence_pcmk[2825]: Requesting Pacemaker fence
> cnode1 (reset)
> Dec 15 20:30:45 cnode2 stonith-ng[2774]:   notice: Client
> stonith_admin.cman.2826.f2c208fe wants to fence (reboot) 'cnode1' with
> device '(any)'
> Dec 15 20:30:45 cnode2 stonith-ng[2774]:   notice: Initiating remote
> operation reboot for cnode1: b5df8517-d8a7-4f33-8cd2-d41c512d13ae (0)
> Dec 15 20:30:45 cnode2 stonith-ng[2774]:   notice: Couldn't find anyone
> to fence (reboot) cnode1 with any device
> Dec 15 20:30:45 cnode2 stonith-ng[2774]:    error: Operation reboot of
> cnode1 by <no-one> for stonith_admin.cman.2826 at cnode2.b5df8517: No such
> device
> Dec 15 20:30:48 cnode2 crmd[2778]:   notice: Peer cnode1 was not
> terminated (reboot) by <anyone> for cnode2: No such device
> (ref=aff3eb58-4777-4fca-9802-eb084dc56ad4) by client stonith_admin.cman.2846
> Dec 15 20:30:48 cnode2 fence_pcmk[2845]: Call to fence cnode1 (reset)
> failed with rc=237
> Dec 15 20:30:48 cnode2 fenced[2529]: fence cnode1 dev 0.0 agent
> fence_pcmk result: error from agent
> Dec 15 20:30:48 cnode2 fenced[2529]: fence cnode1 failed
> Dec 15 20:30:51 cnode2 fence_pcmk[2869]: Requesting Pacemaker fence
> cnode1 (reset)
> Dec 15 20:30:51 cnode2 stonith-ng[2774]:   notice: Client
> stonith_admin.cman.2870.1c9e3d98 wants to fence (reboot) 'cnode1' with
> device '(any)'
> Dec 15 20:30:51 cnode2 stonith-ng[2774]:   notice: Initiating remote
> operation reboot for cnode1: b2435128-3702-44a0-a42e-52b642278686 (0)
> Dec 15 20:30:51 cnode2 stonith-ng[2774]:   notice: Couldn't find anyone
> to fence (reboot) cnode1 with any device
> Dec 15 20:30:51 cnode2 stonith-ng[2774]:    error: Operation reboot of
> cnode1 by <no-one> for stonith_admin.cman.2870 at cnode2.b2435128: No such
> device
> 
> ================================================================
> 
> Please help me solve this problem.