[ClusterLabs] Problem using IPMI for fencing

Ken Gaillot kgaillot at redhat.com
Tue Mar 3 13:42:18 EST 2015


On 03/03/2015 01:14 PM, Jose Manuel Martínez wrote:
> Hello everybody.
> 
> I'm trying to build an active/passive cluster for the Lustre filesystem.
> Pacemaker is working fine in most situations except one: if a node loses power 
> in a 2-node cluster, and I am using fence_ipmilan as the fencing resource (for 
> HP iLO2), the surviving node is not able to take over the resources of the failed 
> node. It tries to reboot the failed node through the fencing device, but since 
> the node is dead (no power), the IPMI interface does not answer.

Correct, IPMI that shares power with its host should not be used as the
sole fencing device for this very reason. There is no way for the
cluster to be certain that the host is down and not just the IPMI.

IPMI is fine as the first-attempt fencing device, but there should be a
fallback fencing device that is independent of the host (such as a
remotely controllable power switch).
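
As a concrete illustration, here is a minimal sketch of a two-level
fencing topology, assuming the cluster is managed with pcs. The device
names ilo-lustre04 and pdu-lustre04 are hypothetical; they would have to
match stonith resources you have already created (e.g. fence_ipmilan for
the iLO and a fence agent for a switched PDU):

  # Level 1: try the iLO/IPMI device first
  pcs stonith level add 1 lustre04 ilo-lustre04
  # Level 2: fall back to the switched PDU outlet if the iLO does not respond
  pcs stonith level add 2 lustre04 pdu-lustre04

Levels are tried in ascending order, and the fencing operation is only
reported as failed if every level fails, so losing the iLO together with
the host no longer blocks recovery.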

> Log says:
> Mar 03 18:16:18 [20355] lustre03 stonith-ng:    error: remote_op_done:  Operation reboot of lustre04 by lustre03 for crmd.20359@lustre03.7a198338: No route to host
> 
> 
> The log shows what the cluster should do ('lustre04' is the dead node and 
> 'lustre03' is the surviving one):
>   warning: stage6:  Scheduling Node lustre04 for STONITH
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: native_stop_constraints: Fencing_Lustre03_stop_0 is implicit after lustre04 is fenced
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: native_stop_constraints: Resource_OST09_stop_0 is implicit after lustre04 is fenced
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: native_stop_constraints: Resource_OST06_stop_0 is implicit after lustre04 is fenced
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: native_stop_constraints: Resource_OST07_stop_0 is implicit after lustre04 is fenced
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: native_stop_constraints: Resource_OST08_stop_0 is implicit after lustre04 is fenced
> Mar 03 18:16:18 [20358] lustre03    pengine:   notice: LogActions: Move    Fencing_Lustre03        (Started lustre04 -> lustre03)
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: LogActions: Leave   Fencing_Lustre04        (Started lustre03)
> Mar 03 18:16:18 [20358] lustre03    pengine:   notice: LogActions: Move    Resource_OST09  (Started lustre04 -> lustre03)
> Mar 03 18:16:18 [20358] lustre03    pengine:   notice: LogActions: Move    Resource_OST06  (Started lustre04 -> lustre03)
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: LogActions: Leave   Resource_OST04  (Started lustre03)
> Mar 03 18:16:18 [20358] lustre03    pengine:     info: LogActions: Leave   Resource_OST05  (Started lustre03)
> Mar 03 18:16:18 [20358] lustre03    pengine:   notice: LogActions: Move    Resource_OST07  (Started lustre04 -> lustre03)
> Mar 03 18:16:18 [20358] lustre03    pengine:   notice: LogActions: Move    Resource_OST08  (Started lustre04 -> lustre03)
> 
> ...but these operations never happen. If it can't fence the dead node, the 
> resources are never taken over.
> 
> This is an infinite loop and the resources are never taken over.
> 
> Is there a way to tell the cluster what to do in this case?
> 
> Best regards
> 
> 
> 
> -- 
> 
> Jose Manuel Martínez García / Tel. 987 293 174
> 
> Systems Coordinator
> 
> Fundación Centro de Supercomputación de Castilla y León
> 
> Edificio CRAI-TIC, Campus de Vegazana, s/n
> 
> Universidad de León
> 
> 24071 León, Spain
> 
> www.fcsc.es
> 
> 
> 
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





More information about the Users mailing list