[ClusterLabs] dlm_controld and fencing issue
Daniel Dehennin
daniel.dehennin at baby-gnu.org
Wed Apr 1 12:47:30 UTC 2015
Hello,
On a 4-node OpenNebula cluster running Ubuntu Trusty 14.04.2, with:
- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.3
- dlm 4.0.1-0ubuntu1
Here is the node list with their IDs, to follow the logs:
- 1084811137 nebula1
- 1084811138 nebula2
- 1084811139 nebula3
- 1084811140 nebula4 (the actual DC)
I have an issue where fencing works, but dlm always waits for
fencing: this morning I had to run “dlm_tool fence_ack 1084811138”
manually. Here are the logs:
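For reference, here is roughly what I did to inspect and clear the
stuck state on the DC (only ack by hand after confirming the node
really was rebooted):

```
# Lockspaces stay in "wait fencing" until the fence result is acknowledged.
dlm_tool ls
dlm_tool dump | grep fence

# Double-check that pacemaker really did fence the node.
stonith_admin --history nebula2

# Tell dlm_controld that fencing of nodeid 1084811138 (nebula2) is done.
dlm_tool fence_ack 1084811138
```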
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811137 walltime 1427844569 local 50759
Apr 1 01:29:29 nebula4 kernel: [50799.162381] dlm: closing connection to node 1084811138
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811139 walltime 1427844569 local 50759
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 pid 44527 nodedown time 1427844569 fence_all dlm_stonith
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811140 walltime 1427844569 local 50759
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 no actor
[...]
Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 datastores wait for fencing
Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 clvmd wait for fencing
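If I read the “fence result” line above correctly, dlm_stonith exited
non-zero, which would explain why the lockspaces keep waiting. A tiny
sketch pulling that field out of the log line (the extraction itself
is just an illustration):

```shell
# Extract the dlm_stonith exit status from the "fence result" line.
# A non-zero value means dlm_controld considers the fencing failed, so
# the "datastores" and "clvmd" lockspaces keep waiting for fencing.
line='dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status'
status=$(printf '%s\n' "$line" | sed -n 's/.*result \([0-9][0-9]*\) exit status.*/\1/p')
echo "dlm_stonith exit status: $status"
```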
The stonith actually worked:
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: handle_request: Client crmd.6490.2707e557 wants to fence (reboot) 'nebula2' with device '(any)'
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula2: 39eaf3a2-d7e0-417d-8a01-d2f373973d6b (0)
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula1-IPMILAN can not fence nebula2: static-list
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula2-IPMILAN can fence nebula2: static-list
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-one-frontend can not fence nebula2: static-list
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula3-IPMILAN can not fence nebula2: static-list
Apr 1 01:29:32 nebula4 stonith-ng[6486]: notice: remote_op_done: Operation reboot of nebula2 by nebula3 for crmd.6490 at nebula4.39eaf3a2: OK
I attach the logs of the DC nebula4 from around 01:29:03, when
everything was still fine (Got 4 replies, expecting: 4), to a little
after the incident.
To me, it looks like:
- dlm asked for fencing directly at 01:29:29; the node was indeed
  fenced, since its /var/log/syslog has garbage exactly at 01:29:29
  and its uptime confirms the reboot, but dlm did not get a
  successful result back
- pacemaker fenced nebula2 at 01:29:30 because it was no longer part
  of the cluster (since 01:29:26 [TOTEM ] ... Members left: 1084811138),
  and this fencing worked
Do you have any idea?
Regards.
--
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula2-down-2015-01-04.log
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20150401/0acf6bbd/attachment-0001.log>