[ClusterLabs] dlm_controld and fencing issue

Daniel Dehennin daniel.dehennin at baby-gnu.org
Wed Apr 1 12:47:30 UTC 2015


Hello,

On a 4-node OpenNebula cluster running Ubuntu Trusty 14.04.2, with:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.3
- dlm 4.0.1-0ubuntu1

Here is the node list with their corosync node IDs, to help follow the logs:

- 1084811137 nebula1
- 1084811138 nebula2
- 1084811139 nebula3
- 1084811140 nebula4 (the actual DC)
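
These IDs are what corosync itself reports; a quick way to check them, assuming the usual votequorum setup, is something like:

    # list member node IDs and names as corosync sees them
    corosync-quorumtool -l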

I have an issue where fencing works but dlm always waits for it; this
morning I had to run “dlm_tool fence_ack 1084811138” manually. Here are
the logs:

Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811137 walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 kernel: [50799.162381] dlm: closing connection to node 1084811138
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811139 walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 pid 44527 nodedown time 1427844569 fence_all dlm_stonith
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811140 walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 no actor
[...]
Apr  1 01:30:25 nebula4 dlm_controld[6737]: 50815 datastores wait for fencing
Apr  1 01:30:25 nebula4 dlm_controld[6737]: 50815 clvmd wait for fencing
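
The manual acknowledgement from this morning looked roughly like this
(dlm_tool is from the dlm package; the lockspace names match the "wait
for fencing" lines above):

    # both lockspaces (datastores, clvmd) were stuck waiting for fencing
    dlm_tool ls
    # manually acknowledge that 1084811138 (nebula2) has been fenced
    dlm_tool fence_ack 1084811138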


The stonith actually worked:

Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: handle_request: Client crmd.6490.2707e557 wants to fence (reboot) 'nebula2' with device '(any)'
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula2: 39eaf3a2-d7e0-417d-8a01-d2f373973d6b (0)
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device: stonith-nebula1-IPMILAN can not fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device: stonith-nebula2-IPMILAN can fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device: stonith-one-frontend can not fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device: stonith-nebula3-IPMILAN can not fence nebula2: static-list
Apr  1 01:29:32 nebula4 stonith-ng[6486]:   notice: remote_op_done: Operation reboot of nebula2 by nebula3 for crmd.6490 at nebula4.39eaf3a2: OK
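
For completeness, each stonith device is registered with a static host
list, along these lines (a crmsh sketch from memory; the external/ipmi
parameters may not match my real configuration exactly):

    primitive stonith-nebula2-IPMILAN stonith:external/ipmi \
        params hostname="nebula2" ipaddr="..." \
               pcmk_host_list="nebula2" pcmk_host_check="static-list"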

I attach the logs of the DC nebula4 from around 01:29:03, where
everything worked fine (Got 4 replies, expecting: 4), to a little bit
after.

To me, it looks like:

- dlm asks for fencing directly at 01:29:29; the node really was
  fenced, since its /var/log/syslog contains garbage exactly at
  01:29:29 and its uptime confirms the reboot, but dlm did not get a
  good response

- pacemaker fences nebula2 at 01:29:30 because it is no longer part of
  the cluster (since 01:29:26 [TOTEM ] ... Members left: 1084811138).
  This fencing works (see the timeline check below).
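
To compare the two timelines, I would look at what each daemon recorded
(a sketch; stonith_admin comes with pacemaker, dlm_tool with dlm):

    # pacemaker's fencing history for the lost node
    stonith_admin --history nebula2
    # dlm_controld's own record of the fence request
    dlm_tool dump | grep fence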

Do you have any idea?

Regards.
-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula2-down-2015-01-04.log
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20150401/0acf6bbd/attachment-0001.log>

