[Pacemaker] Issues with fence and corosync crash

Simone Felici s.felici at mclink.eu
Mon Dec 27 04:32:39 EST 2010


Hello,

I know my mail is really long, but could someone at least help me with the error '[22670]: ERROR: ais_dispatch: Receiving 
message body failed: (2) Library error: Resource temporarily unavailable (11) Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: 
ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)' and point me 
to the right place to understand how the unfence procedure should work (i.e. automatically)? For now I have to manually remove 
the 'location' directive every time.
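
For reference, the automatic unfence path here would be the after-resync-target handler already present in the drbd.conf quoted below: crm-fence-peer.sh adds the drbd-fence-by-handler-* location constraint when the peer is lost, and crm-unfence-peer.sh is only invoked once the fenced node has reconnected and finished a resync, at which point it removes that constraint again. A minimal excerpt of that handler pair (taken from the config further down):

  handlers {
    # adds the location constraint when the peer disappears
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # removes it again, but only after a successful resync on this node
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }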

Thanks a lot!

Simon

On 24/12/2010 12:05, Simone Felici wrote:
>
> Hi to all!
>
> I have an issue with my cluster environment. First of all, my setup:
>
> A two-node CentOS 5.5 active/standby cluster with one DRBD partition backing a Nagios service, an IP, and the storage.
> The config files are at the bottom.
>
> I'm testing the fence option to prevent split brain and concurrent access to the DRBD partition.
> Starting from a sane situation, manually switching the resources or simulating a kernel panic, a process crash, and so on all
> works well. If I shut down eth1 (the 192.168.100.0 network, which is also the crossover cable used for DRBD mirroring), the active
> node stays as it is and calls the fence handler, which adds this entry to the crm config:
> location drbd-fence-by-handler-ServerData ServerData \
> rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf: #uname ne opsview-core01-tn
>
> But on the standby node the corosync process dies:
>
> *** STANDBY NODE LOG ***
> Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14158 iface 192.168.100.12 to [1 of 10]
> Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14160 iface 192.168.100.12 to [2 of 10]
> Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14162 iface 192.168.100.12 to [3 of 10]
> Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14164 iface 192.168.100.12 to [4 of 10]
> Dec 24 11:00:06 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [3 of 10]
> Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14166 iface 192.168.100.12 to [4 of 10]
> Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14168 iface 192.168.100.12 to [5 of 10]
> Dec 24 11:00:07 corosync [TOTEM ] Incrementing problem counter for seqid 14170 iface 192.168.100.12 to [6 of 10]
> Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14172 iface 192.168.100.12 to [7 of 10]
> Dec 24 11:00:08 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [6 of 10]
> Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14174 iface 192.168.100.12 to [7 of 10]
> Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14176 iface 192.168.100.12 to [8 of 10]
> Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14178 iface 192.168.100.12 to [9 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [8 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14180 iface 192.168.100.12 to [9 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14182 iface 192.168.100.12 to [10 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Marking seqid 14182 ringid 0 interface 192.168.100.12 FAULTY - adminisrtative intervention
> required.
> Dec 24 11:00:11 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: No such
> file or directory (2)
> Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: AIS connection failed
> Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
> temporarily unavailable (11)
> Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: AIS connection terminated
> Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: AIS connection failed
> Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: crm_ais_destroy: AIS connection terminated
> Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
> temporarily unavailable (11)
> Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
> temporarily unavailable (11)
> Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: AIS connection failed
> Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: AIS connection failed
> Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: cib_ais_destroy: AIS connection terminated
> Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
> Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: info: main: Exiting...
> Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
> *** STANDBY NODE LOG ***
>
> The issues don't end there.
> If I bring interface eth1 back up, start corosync again, and check that both rings are online (corosync-cfgtool -r), the
> standby node tries to take over the services even though resource-stickiness is set. It then goes into an error state, maybe
> because of the fence script.
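>
> (For reference, a minimal check/recovery sequence for the redundant ring with the corosync 1.x tooling used here, shown only
> as a sketch:)
>
> # show the current status of both rings on this node
> corosync-cfgtool -s
> # after eth1 is back, clear the FAULTY state so ring 0 is used again
> corosync-cfgtool -r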
>
> crm status:
> ============
> Last updated: Fri Dec 24 11:06:40 2010
> Stack: openais
> Current DC: opsview-core01-tn - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ opsview-core01-tn opsview-core02-tn ]
>
> Master/Slave Set: ServerData
> drbd_data:0 (ocf::linbit:drbd): Slave opsview-core02-tn (unmanaged) FAILED
> Stopped: [ drbd_data:1 ]
>
> Failed actions:
> drbd_data:0_stop_0 (node=opsview-core02-tn, call=9, rc=6, status=complete): not configured
>
> LOGS on slave:
> ****************************************
> Dec 24 11:06:13 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
> Dec 24 11:06:13 corosync [MAIN ] Corosync built-in features: nss rdma
> Dec 24 11:06:13 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
> Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
> Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Dec 24 11:06:13 corosync [TOTEM ] The network interface [192.168.100.12] is now up.
> Dec 24 11:06:13 corosync [pcmk ] info: process_ais_conf: Reading configure
> Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
> Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465986 for logging
> Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional logging options...
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_logfile
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found '/var/log/cluster/corosync.log' for option: logfile
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle: 7739444317642555395 for service
> Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional service options...
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
> Dec 24 11:06:13 corosync [pcmk ] Logging: Initialized pcmk_startup
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Service: 9
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Local hostname: opsview-core02-tn
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 207923392
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node 207923392 born on 0
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node 207923392 now known as opsview-core02-tn (was: (null))
> Dec 24 11:06:13 opsview-core02-tn lrmd: [5153]: info: lrmd is shutting down
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: G_main_add_SignalHandler: Added signal handler for signal 10
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: Invoked: /usr/lib64/heartbeat/attrd
> Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Signal sent to pid=5153, waiting for process to exit
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn now has 1 quorum votes (was 0)
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: G_main_add_SignalHandler: Added signal handler for signal 12
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting up
> Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 15
> Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: info: Invoked: /usr/lib64/heartbeat/pengine
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 207923392/opsview-core02-tn is now: member
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: Invoked: /usr/lib64/heartbeat/cib
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_cluster_connect: Connecting to OpenAIS
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_cluster_connect: Connecting to OpenAIS
> Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: Invoked: /usr/lib64/heartbeat/crmd
> Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6762 for process stonithd
> Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: WARN: main: Terminating previous PE instance
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_TriggerHandler: Added signal manual handler
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: init_ais_connection_once: Creating connection to our AIS plugin
> Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: init_ais_connection_once: Creating connection to our AIS plugin
> Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: main: CRM Hg Version: da7075976b5ff0bee71074385f8fd02f296ec8a3
>
> Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6763 for process cib
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Dec 24 11:06:13 opsview-core02-tn pengine: [5155]: WARN: process_pe_message: Received quit message, terminating
> Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: enabling coredumps
> Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6764 for process lrmd
> Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting crmd
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: retrieveCib: Reading cluster configuration from:
> /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
> Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 10
> Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6765 for process attrd
> Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 12
> Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6766 for process pengine
> Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Started.
> Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6767 for process crmd
> Dec 24 11:06:13 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.0.9
> Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
> Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync configuration service
> Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
> Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync profile loading service
> Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Dec 24 11:06:13 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Dec 24 11:06:13 corosync [TOTEM ] The network interface [172.18.17.12] is now up.
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: init_ais_connection_once: AIS connection established
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: init_ais_connection_once: AIS connection established
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x868c90 for attrd/6765
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn
> cname=pcmk
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x86d0a0 for stonithd/6762
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Cluster connection active
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn
> cname=pcmk
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Accepting attribute updates
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
> Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting mainloop...
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: notice: /usr/lib64/heartbeat/stonithd start up successfully.
> Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: startCib: CIB Initialization completed successfully
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_cluster_connect: Connecting to OpenAIS
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once: Creating connection to our AIS plugin
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once: AIS connection established
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x872fa0 for cib/6763
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn now has process list:
> 00000000000000000000000000013312 (78610)
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Sending membership update 0 to cib
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn cname=pcmk
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_init: Starting cib mainloop
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership 0: quorum still lost
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
> addr=(null) votes=1 (new) born=0 seen=0 proc=00000000000000000000000000013312 (new)
> Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Archived previous version as
> /var/lib/heartbeat/crm/cib-26.raw
> Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Wrote version 0.473.0 of the CIB to disk (digest:
> 3c7be90920e86222ad6102a0f01d9efd)
> Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: retrieveCib: Reading cluster configuration from:
> /var/lib/heartbeat/crm/cib.UxVZY6 (digest: /var/lib/heartbeat/crm/cib.76RIND)
> Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 1 iface 172.18.17.12 to [1 of 10]
> Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 13032: memb=0, new=0, lost=0
> Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 13032: memb=1, new=1, lost=0
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: opsview-core02-tn 207923392
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: opsview-core02-tn 207923392
> Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 2 iface 192.168.100.12 to [1 of 10]
> Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 13036: memb=1, new=0, lost=0
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: memb: opsview-core02-tn 207923392
> Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 13036: memb=2, new=1, lost=0
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node 191146176 born on 13036
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 191146176/unknown is now: member
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 191146176
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 191146176
> Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: opsview-core02-tn 207923392
> Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending membership update 13036 to 1 children
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node 207923392 ((null)) born on: 13036
> Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership 13036: quorum still lost
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node <null> now has id: 191146176
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node (null): id=191146176 state=member (new) addr=r(0)
> ip(192.168.100.11) r(1) ip(172.18.17.11) votes=0 born=0 seen=13036 proc=00000000000000000000000000000000
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member addr=r(0)
> ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 born=0 seen=13036 proc=00000000000000000000000000013312
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176 (opsview-core01-tn) born on: 13028
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: notice: ais_dispatch: Membership 13036: quorum acquired
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176 now known as opsview-core01-tn (was: (null))
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_get_peer: Node 191146176 is now known as opsview-core01-tn
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn now has process list:
> 00000000000000000000000000013312 (78610)
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member addr=r(0)
> ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 (new) born=13028 seen=13036 proc=00000000000000000000000000013312 (new)
> Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn now has 1 quorum votes (was 0)
> Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending membership update 13036 to 1 children
> Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
> Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_process_diff: Diff 0.475.1 -> 0.475.2 not applied to 0.473.0: current
> "epoch" is less than required
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_server_process_diff: Requesting re-sync from peer
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_diff_notify: Local-only Change (client:crmd, call: 105): -1.-1.-1
> (Application of an update diff failed, requesting a full refresh)
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not applying diff 0.475.2 -> 0.475.3 (sync in progress)
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not applying diff 0.475.3 -> 0.475.4 (sync in progress)
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not applying diff 0.475.4 -> 0.476.1 (sync in progress)
> Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
> Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_replace_notify: Local-only Replace: -1.-1.-1 from opsview-core01-tn
> Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Archived previous version as
> /var/lib/heartbeat/crm/cib-27.raw
> Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Wrote version 0.476.0 of the CIB to disk (digest:
> c348ac643cfe3b370e5eca03ff7f180c)
> Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: retrieveCib: Reading cluster configuration from:
> /var/lib/heartbeat/crm/cib.FYgzJ8 (digest: /var/lib/heartbeat/crm/cib.VrDRiH)
> Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_cib_control: CIB connection established
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_cluster_connect: Connecting to OpenAIS
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once: Creating connection to our AIS plugin
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once: AIS connection established
> Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x878020 for crmd/6767
> Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Sending membership update 13036 to crmd
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn cname=pcmk
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_ha_control: Connected to the cluster
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: Delaying start, CCM (0000000000100000) not connected
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting crmd's mainloop
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback: Checking for expired actions every 900000ms
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback: Sending expected-votes=2 to corosync
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: notice: ais_dispatch: Membership 13036: quorum acquired
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node opsview-core01-tn now has id: 191146176
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 191146176 is now known as opsview-core01-tn
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member (new)
> addr=r(0) ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 born=13028 seen=13036 proc=00000000000000000000000000013312
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
> addr=r(0) ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 (new) born=13036 seen=13036 proc=00000000000000000000000000013312
> (new)
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: The local CRM is operational
> Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_STARTING -> S_PENDING [
> input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> Dec 24 11:06:15 opsview-core02-tn pengine: [6766]: info: main: Starting pengine
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: ais_dispatch: Membership 13036: quorum retained
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_dc: Set DC to opsview-core01-tn (3.0.1)
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_attrd: Connecting to attrd...
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC
> cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for terminate
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for shutdown
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_local_callback: Sending full refresh (origin=crmd)
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation terminate=<null>: cib not connected
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation terminate=<null>: cib not connected
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: erase_xpath_callback: Deletion of
> "//node_state[@uname='opsview-core02-tn']/transient_attributes": ok (rc=0)
> Dec 24 11:06:15 corosync [TOTEM ] ring 0 active with no faults
> Dec 24 11:06:15 corosync [TOTEM ] ring 1 active with no faults
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node opsview-core01-tn now has id: 191146176
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 191146176 is now known as opsview-core01-tn
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for master-drbd_data:0
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=<null>: cib not
> connected
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for probe_complete
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation probe_complete=<null>: cib not
> connected
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=9:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=drbd_data:0_monitor_0 )
> Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:2: probe
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=10:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=ServerFS_monitor_0 )
> Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ServerFS:3: probe
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=11:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=ClusterIP01_monitor_0 )
> Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ClusterIP01:4: probe
> Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=12:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=opsview-core_lsb_monitor_0 )
> Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-core_lsb:5: probe
> Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=opsview-web_lsb_monitor_0 )
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=14:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=WebSite_monitor_0 )
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for master-drbd_data:1
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:1=<null>: cib not
> connected
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation terminate=<null>: cib not connected
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ClusterIP01_monitor_0 (call=4, rc=7,
> cib-update=7, confirmed=true) not running
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ServerFS_monitor_0 (call=3, rc=7,
> cib-update=8, confirmed=true) not running
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
> (1000)
> Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=1000: cib not
> connected
> Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_monitor_0 (call=2, rc=0,
> cib-update=9, confirmed=true) ok
> Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-web_lsb:6: probe
> Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:WebSite:7: probe
> Dec 24 11:06:16 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation WebSite_monitor_0 (call=7, rc=7,
> cib-update=10, confirmed=true) not running
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Connected to the CIB after 1 signon attempts
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Sending full refresh
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
> (1000)
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 4: master-drbd_data:0=1000
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete
> (<null>)
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:1
> (<null>)
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
> Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
> Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) su: warning: cannot change
> directory to /var/log/nagios: No such file or directory
>
> Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) /etc/init.d/opsview: line 262:
> /usr/local/nagios/bin/profile: No such file or directory
>
> Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) su: warning: cannot change
> directory to /var/log/nagios: No such file or directory
>
> Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) /etc/init.d/opsview-web: line 171:
> /usr/local/nagios/bin/opsview.sh: No such file or directory
>
> Dec 24 11:06:27 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-core_lsb_monitor_0 (call=5, rc=7,
> cib-update=11, confirmed=true) not running
> Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-web_lsb_monitor_0 (call=6, rc=7,
> cib-update=12, confirmed=true) not running
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 15: probe_complete=true
> Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=61:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=drbd_data:0_notify_0 )
> Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:8: notify
> Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_notify_0 (call=8, rc=0,
> cib-update=13, confirmed=true) ok
> Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
> op=drbd_data:0_stop_0 )
> Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:9: stop
> Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_stop_0 (call=9, rc=6,
> cib-update=14, confirmed=true) not configured
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch: Update relayed from opsview-core01-tn
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for fail-count-drbd_data:0
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-drbd_data:0 (INFINITY)
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 18: fail-count-drbd_data:0=INFINITY
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch: Update relayed from opsview-core01-tn
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for last-failure-drbd_data:0
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-drbd_data:0 (1293185188)
> Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 21: last-failure-drbd_data:0=1293185188
> ****************************************
>
> Now the services are all DOWN.
> At this point the only way I can recover is to reboot cluster02; after corosync is started again it does NOT try to take over the services.
> The fence constraint is still there!
> DRBD is now in this state:
> Master/Slave Set: ServerData
> Masters: [ opsview-core01-tn ]
> Stopped: [ drbd_data:1 ]
> because of the fence constraint.
> If I try 'drbdadm -- --discard-my-data connect all' on cluster02 I get:
> [root@core02-tn ~]# drbdadm -- --discard-my-data connect all
> Could not stat("/proc/drbd"): No such file or directory
> do you need to load the module?
> try: modprobe drbd
> Command 'drbdsetup 1 net 192.168.100.12:7789 192.168.100.11:7789 C --set-defaults --create-device --rr-conflict=disconnect
> --after-sb-2pri=disconnect --after-sb-1pri=disconnect --after-sb-0pri=disconnect --discard-my-data' terminated with exit code 20
> drbdadm connect cluster_data: exited with code 20
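>
> (The "Could not stat /proc/drbd" / exit code 20 above only means the drbd kernel module is not loaded on this node, because the
> drbd resource is currently stopped here. A hedged sketch of the manual path, assuming Pacemaker is kept from managing drbd_data
> while you do it:)
>
> modprobe drbd                                    # load the module so /proc/drbd exists
> drbdadm up cluster_data                          # attach the disk and try to connect
> # if the connect is then refused because of split brain, discard this node's data:
> drbdadm disconnect cluster_data
> drbdadm -- --discard-my-data connect cluster_data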
>
> I have to remove the entry manually:
>
> location drbd-fence-by-handler-ServerData ServerData \
> rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf: #uname ne opsview-core01-tn
>
> Because I have no idea HOW to unfence the cluster so that the line above gets removed automatically.
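>
> (A minimal way to drop that constraint by its id from the crm shell, as a manual workaround only; crm-unfence-peer.sh is meant
> to do this itself after a successful resync:)
>
> crm configure delete drbd-fence-by-handler-ServerData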
>
> After removing the line, cluster02 reconnects to DRBD:
>
> Master/Slave Set: ServerData
> Masters: [ opsview-core01-tn ]
> Slaves: [ opsview-core02-tn ]
>
>
> While writing this I tested the inverse situation, and it only half works. That is, if cluster02 is the master and I disconnect
> eth1, the fence entry is added to the crm config, but cluster01 does *NOT* crash. So to get back to a normal situation I again
> have to remove the "location drbd-fence-by-handler-ServerData..." entry. However, when I remove that entry, cluster01 shows the
> same error and corosync dies:
>
> ********* cluster01 logs **********
> Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: info: update_dc: Unset DC opsview-core01-tn
> Dec 24 12:01:31 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 12:01:31 opsview-core01-tn cib: [22670]: info: cib_process_request: Operation complete: op cib_modify for section nodes
> (origin=local/crmd/165, version=0.491.1): ok (rc=0)
> Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
> temporarily unavailable (11)
> Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
> temporarily unavailable (11)
> Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: AIS connection failed
> Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: AIS connection failed
> Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: cib_ais_destroy: AIS connection terminated
> Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: crm_ais_destroy: AIS connection terminated
> Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
> Resource temporarily unavailable (11)
> Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: AIS connection failed
> Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: AIS connection terminated
> Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Archived previous version as
> /var/lib/heartbeat/crm/cib-23.raw
> Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Wrote version 0.491.0 of the CIB to disk (digest:
> ad222fed7ff40dc7093ffc6411079df4)
> Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: retrieveCib: Reading cluster configuration from:
> /var/lib/heartbeat/crm/cib.R3dVbk (digest: /var/lib/heartbeat/crm/cib.EllYEu)
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_text: Sending message 44: FAILED (rc=2): Library error:
> Connection timed out (110)
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete
> (true)
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC Channel to 22670 is not connected
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op: Sending message to CIB service FAILED
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: Sent update -5: probe_complete=true
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: attrd_cib_callback: Update -5 for probe_complete=true failed: send failed
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_message: Not connected to AIS
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: Sending flush op to all hosts for:
> master-drbd_data:1 (<null>)
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC Channel to 22670 is not connected
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op: Sending message to CIB service FAILED
> Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: Delete operation failed: node=opsview-core01-tn,
> attr=master-drbd_data:1, id=<n/a>, set=(null), section=status: send failed (-5)
>
> ***********************
>
> So, the questions:
>
> What's wrong? It seems everything starts when corosync on the secondary node crashes (or stops) when I disconnect the cable
> (because of the "Library error"?!?).
>
> If I solve the crash issue, how (and when) should the unfence step be executed? Shouldn't it be done automatically?
>
> Do I always have to remove the entry (location ...) from the crm config manually?
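>
> (One way to see when automatic unfencing can happen at all: the after-resync-target handler only fires once the resync on the
> fenced node has finished, which can be watched in /proc/drbd, e.g.:)
>
> # on the previously fenced node, wait for cs:Connected ds:UpToDate/UpToDate
> cat /proc/drbd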
>
> Sorry for the long mail and thanks for the support!
>
>
> Simon
>
> Config files:
>
> *************************************
> cat /etc/corosync/corosync.conf
>
>
> compatibility: whitetank
>
> totem {
> version: 2
> # How long before declaring a token lost (ms)
> token: 2000
> # How many token retransmits before forming a new configuration
> token_retransmits_before_loss_const: 10
> # How long to wait for join messages in the membership protocol (ms)
> join: 200
> # How long wait for consensus to be achieved before starting a new round of membership configuration (ms)
> consensus: 1000
> vsftype: none
> # Number of messages that may be sent by one processor on receipt of the token
> max_messages: 20
> send_join: 0
> # Limit generated nodeids to 31-bits (positive signed integers)
> clear_node_high_bit: yes
> secauth: off
> threads: 0
> rrp_mode: active
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.100.0
> mcastaddr: 226.100.1.1
> mcastport: 4000
> }
> interface {
> ringnumber: 1
> bindnetaddr: 172.18.17.0
> #broadcast: yes
> mcastaddr: 227.100.1.2
> mcastport: 4001
> }
> }
>
> logging {
> fileline: off
> to_stderr: no
> to_logfile: yes
> to_syslog: yes
> logfile: /var/log/cluster/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> aisexec {
> user: root
> group: root
> }
>
> service {
> # Load the Pacemaker Cluster Resource Manager
> name: pacemaker
> ver: 0
> }
>
> *************************************
> cat /etc/drbd.conf
>
> global {
> usage-count no;
> }
>
> common {
> protocol C;
>
> syncer {
> rate 70M;
> verify-alg sha1;
> }
>
> net {
> after-sb-0pri disconnect;
> after-sb-1pri disconnect;
> after-sb-2pri disconnect;
> rr-conflict disconnect;
> }
>
> handlers {
> pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
> pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
> local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
>
> startup {
> degr-wfc-timeout 120; # 2 minutes.
> }
>
> disk {
> fencing resource-only;
> on-io-error call-local-io-error;
> }
> }
>
> resource cluster_data {
> device /dev/drbd1;
> disk /dev/sda4;
> meta-disk internal;
>
> on opsview-core01-tn {
> address 192.168.100.11:7789;
> }
>
> on opsview-core02-tn {
> address 192.168.100.12:7789;
> }
> }
>
> *************************************
>
> crm configure show
> node opsview-core01-tn \
> attributes standby="off"
> node opsview-core02-tn \
> attributes standby="off"
> primitive ClusterIP01 ocf:heartbeat:IPaddr2 \
> params ip="172.18.17.10" cidr_netmask="32" \
> op monitor interval="30"
> primitive ServerFS ocf:heartbeat:Filesystem \
> params device="/dev/drbd1" directory="/data" fstype="ext3"
> primitive WebSite ocf:heartbeat:apache \
> params configfile="/etc/httpd/conf/httpd.conf" \
> op monitor interval="1min" \
> meta target-role="Started"
> primitive drbd_data ocf:linbit:drbd \
> params drbd_resource="cluster_data" \
> op monitor interval="60s"
> primitive opsview-core_lsb lsb:opsview \
> op start interval="0" timeout="350s" \
> op stop interval="0" timeout="350s" \
> op monitor interval="60s" timeout="350s"
> primitive opsview-web_lsb lsb:opsview-web \
> op start interval="0" timeout="350s" start-delay="15s" \
> op stop interval="0" timeout="350s" \
> op monitor interval="60s" timeout="350s" \
> meta target-role="Started"
> group OPSView-Apps ServerFS ClusterIP01 opsview-core_lsb opsview-web_lsb WebSite \
> meta target-role="Started"
> ms ServerData drbd_data \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
> colocation fs_on_drbd inf: OPSView-Apps ServerData:Master
> order ServerFS-after-ServerData inf: ServerData:promote OPSView-Apps:start
> property $id="cib-bootstrap-options" \
> dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100"
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


-- 
Simone Felici
Divisione Tecnica: Progettazione e Sviluppo

tel. +39 0461.030.111
fax. +39 0461 030.112
Via Fersina, 23 - 38123 Trento

-------------
MC-link S.p.A.
Sede Direzionale e Amministrativa
Via Carlo Perrier, 9/a - 00157 Roma
Sede Legale
Via Fersina, 23 - 38123 Trento

http://www.mclink.it




