[ClusterLabs] Problem using IPMI for fencing

Jose Manuel Martínez jose.martinez at fcsc.es
Wed Mar 4 08:45:41 UTC 2015


This is also explained here:

http://clusterlabs.org/doc/crm_fencing.html

/The lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming 
increasingly popular and in future they may even become standard 
equipment of off-the-shelf computers. They are, however, inferior to UPS 
devices, because they share a power supply with their host (a cluster 
node). If a node stays without power, the device supposed to control it 
would be just as useless. Even though this is obvious to us, the cluster 
manager is not in the know and will try to fence the node in vain. This 
will continue forever because all other resource operations would wait 
for the fencing/stonith operation to succeed./
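
For illustration only (this is not from the linked document): a minimal Python sketch of why a second fencing level gets around the situation described above. It models the behaviour and is not Pacemaker's API; the function and device names are hypothetical. It assumes that fencing levels are tried in ascending order and that a level counts as successful only when every device in it succeeds.

```python
from typing import Callable, Dict, List

# A fence device is modelled as a callable that returns True once it has
# confirmed the target node is off, and False when it could not do so.
FenceDevice = Callable[[str], bool]


def fence_node(node: str, levels: Dict[int, List[FenceDevice]]) -> bool:
    """Try each fencing level in ascending order.

    A level succeeds only if all of its devices succeed. The first level
    that succeeds ends the operation. If no level succeeds, the node is
    not considered fenced and its resources cannot be recovered.
    """
    for index in sorted(levels):
        devices = levels[index]
        if devices and all(device(node) for device in devices):
            return True
    return False


def ipmi_without_power(node: str) -> bool:
    # Hypothetical device: the BMC shares the host's power supply,
    # so when the host loses power it cannot answer.
    return False


def pdu_a(node: str) -> bool:
    # Hypothetical switched PDU on the first power feed.
    return True


def pdu_b(node: str) -> bool:
    # Hypothetical switched PDU on the second power feed.
    return True


ipmi_only = {1: [ipmi_without_power]}
with_fallback = {1: [ipmi_without_power], 2: [pdu_a, pdu_b]}

print(fence_node("node2", ipmi_only))      # False: fencing never succeeds
print(fence_node("node2", with_fallback))  # True: level 2 confirms power-off
```

With only the IPMI device, fencing of the unpowered node never succeeds, which is the case where the surviving node cannot take over. With a second level that is independent of the host's power, the operation can complete.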


Regards.

On 04/03/15 at 09:28, Jose Manuel Martínez wrote:
> I understand the problem.
>
> Fencing and relocation of resources are not possible if the fencing 
> resource cannot determine the real status of the other node. So 
> I'm going to try to remove this single point of failure by adding a 
> second fencing method (thanks for the links). Our PDUs are not able 
> to disconnect a single power bank, so this is not actually an option 
> for us, but I think we have other options.
>
> I'll post if I can solve the problem. Thank you so much. You have been 
> very helpful.
>
> JoseM
>
>
> On 03/03/15 at 19:51, Digimer wrote:
>> On 03/03/15 01:42 PM, Ken Gaillot wrote:
>>> On 03/03/2015 01:14 PM, Jose Manuel Martínez wrote:
>>>> Hello everybody.
>>>>
>>>> I'm trying to build an active/passive cluster for the Lustre
>>>> filesystem. Pacemaker is working fine in most situations except
>>>> one: if a node loses power in a 2-node cluster, and I am
>>>> using fence_ipmilan as the fencing resource (for HP iLO2), the
>>>> surviving node is not able to take over the resources of the
>>>> failed node. It tries to use the fencing device to reboot the
>>>> failed node, but as that node is dead (no power), the IPMI
>>>> interface does not answer.
>>> Correct, IPMI that shares power with its host should not be used as
>>> the sole fencing device for this very reason. There is no way for
>>> the cluster to be certain that the host is down and not just the
>>> IPMI.
>>>
>>> IPMI is fine as the first-attempt fencing device, but there should
>>> be a fallback fencing device that is independent of the host (such
>>> as a remotely controllable power switch).
>> I agree 100%, and do this myself.
>>
>> IPMI fencing, when it works, is best because when it returns "off", we
>> can be *very* sure the node is actually off. As you said though, it is
>> electrically and mechanically coupled to the host, so it's vulnerable
>> to certain failure cases (e.g. total loss of power, mechanical
>> destruction, etc.).
>>
>> For this, I always use a pair of switched PDUs (APC and Raritan both
>> work well, TrippLite works but is slow). I use a pair because I also
>> have power redundancy (separate UPSes, PDUs, etc). So to handle this,
>> you need a fairly complex stonith configuration to make it work.
>>
>> In 'pcs', this is called 'STONITH Levels' and you can see how to
>> build this here:
>>
>> http://clusterlabs.org/wiki/STONITH_Levels
>>
>> In 'crm', this is called "Fencing Topology" and you can see how to
>> configure it here:
>>
>> http://clusterlabs.org/wiki/Fencing_topology
>>
>> In my opinion, this is the only proper configuration for fencing if
>> you want to remove all single points of failure.
>>
>> -- 
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>
>
> -- 
>
> *Jose Manuel Martínez García / Tel. 987 293 174*
>
> *Systems Coordinator*
>
> Fundación Centro de Supercomputación de Castilla y León
>
> Edificio CRAI-TIC, Campus de Vegazana, s/n
>
> Universidad de León
>
> 24071 León, Spain
>
> www.fcsc.es
>
> _________________________________
>
> This message is intended exclusively for its addressee and may 
> contain confidential information whose disclosure is not permitted 
> by law. If you are not the intended recipient, please notify the 
> sender immediately and delete this message from your system.
> Protect the environment. Avoid printing this message unless it is 
> strictly necessary.
>


