[ClusterLabs] Problem using IPMI for fencing
Jose Manuel Martínez
jose.martinez at fcsc.es
Tue Mar 3 19:14:14 CET 2015
Hello everybody.
I'm trying to build an active/passive cluster for the Lustre filesystem.
Pacemaker works fine in most situations except one: if a node in the
2-node cluster loses power, and I am using fence_ipmilan as the fencing
resource (for HP iLO2), the surviving node cannot take over the failed
node's resources. It tries to reboot the failed node through the fencing
device, but since the node is dead (no power), its IPMI interface does
not answer.
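For reference, the fencing resource is configured roughly like this (a sketch only; the address and credential values below are placeholders, not my real ones):

```shell
# Hypothetical sketch of the fence_ipmilan resource for lustre04's iLO2.
# ipaddr/login/passwd are placeholder values.
pcs stonith create Fencing_Lustre04 fence_ipmilan \
    ipaddr=10.0.0.104 login=admin passwd=secret \
    lanplus=1 pcmk_host_list="lustre04" \
    op monitor interval=60s
```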
The log says:

Mar 03 18:16:18 [20355] lustre03 stonith-ng: error: remote_op_done:
Operation reboot of lustre04 by lustre03 for
crmd.20359 at lustre03.7a198338: No route to host
The log shows what it should do ('lustre04' is the dead node and
'lustre03' is the surviving one):

warning: stage6: Scheduling Node lustre04 for STONITH
Mar 03 18:16:18 [20358] lustre03 pengine: info:
native_stop_constraints: Fencing_Lustre03_stop_0 is implicit
after lustre04 is fenced
Mar 03 18:16:18 [20358] lustre03 pengine: info:
native_stop_constraints: Resource_OST09_stop_0 is implicit after
lustre04 is fenced
Mar 03 18:16:18 [20358] lustre03 pengine: info:
native_stop_constraints: Resource_OST06_stop_0 is implicit after
lustre04 is fenced
Mar 03 18:16:18 [20358] lustre03 pengine: info:
native_stop_constraints: Resource_OST07_stop_0 is implicit after
lustre04 is fenced
Mar 03 18:16:18 [20358] lustre03 pengine: info:
native_stop_constraints: Resource_OST08_stop_0 is implicit after
lustre04 is fenced
Mar 03 18:16:18 [20358] lustre03 pengine: notice: LogActions:
Move Fencing_Lustre03 (Started lustre04 -> lustre03)
Mar 03 18:16:18 [20358] lustre03 pengine: info: LogActions:
Leave Fencing_Lustre04 (Started lustre03)
Mar 03 18:16:18 [20358] lustre03 pengine: notice: LogActions:
Move Resource_OST09 (Started lustre04 -> lustre03)
Mar 03 18:16:18 [20358] lustre03 pengine: notice: LogActions:
Move Resource_OST06 (Started lustre04 -> lustre03)
Mar 03 18:16:18 [20358] lustre03 pengine: info: LogActions:
Leave Resource_OST04 (Started lustre03)
Mar 03 18:16:18 [20358] lustre03 pengine: info: LogActions:
Leave Resource_OST05 (Started lustre03)
Mar 03 18:16:18 [20358] lustre03 pengine: notice: LogActions:
Move Resource_OST07 (Started lustre04 -> lustre03)
Mar 03 18:16:18 [20358] lustre03 pengine: notice: LogActions:
Move Resource_OST08 (Started lustre04 -> lustre03)
...but these operations never happen. Because the cluster cannot fence
the dead node, it retries fencing in an endless loop and the resources
are never taken over.
Is there a way to tell the cluster what to do in this case?
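For example, I wondered whether a fencing topology with a fallback device would help here, something like this untested sketch (the PDU device name is hypothetical; I don't actually have a second fence device configured yet):

```shell
# Untested sketch: register two fencing levels for lustre04, so that if
# the iLO2 (level 1) is unreachable, the cluster tries a second device
# (level 2, e.g. a PDU fence agent). Device names are hypothetical.
pcs stonith level add 1 lustre04 Fencing_Lustre04
pcs stonith level add 2 lustre04 Fencing_Lustre04_PDU
```

Would something along these lines let the surviving node complete fencing when the dead node's IPMI interface has no power?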
Best regards
--
*Jose Manuel Martínez García / Tel. 987 293 174 *
*Coordinador de Sistemas*
Fundación Centro de Supercomputación de Castilla y León
Edificio CRAI-TIC, Campus de Vegazana, s/n
Universidad de León
24071 León, España
www.fcsc.es