[Pacemaker] crm resource cleanup ignored

Dejan Muhamedagic dejanmm at fastmail.fm
Fri Jul 2 09:32:02 EDT 2010


Hi,

On Fri, Jul 02, 2010 at 02:56:04PM +0200, Bernd Schubert wrote:
> Hello all,
> 
> after the update to 1.0.9 on our test cluster, new weird stonith issues
> have come up.
> 
> 1) It fails to start stonith resources on *some* nodes
> =======================================================
> 
> Jul 02 14:43:23 phys-oss3 pengine: [18077]: WARN: unpack_rsc_op: Processing failed op st-riloe-phys-oss1_start_0 on phys-oss3: unknown error (1)
> 
> Failed actions:
>     st-riloe-phys-oss1_start_0 (node=phys-oss3, call=25, rc=1, status=complete): unknown error
>     st-riloe-phys-oss2_start_0 (node=phys-oss0, call=25, rc=1, status=complete): unknown error
> 
> 
> On other nodes it starts properly:
> 
> Node phys-oss0 (d8b9b1c6-fdf4-40f1-be3d-9158237ad4cb): online
>         st-riloe-phys-oss1      (stonith:external/riloe) Started
> 
> 
> 2) When I try to clean it, it does not work:
> ============================================
> 
> root@rhel5-nfs@phys-oss3:~# date
> Fri Jul  2 14:50:15 CEST 2010
> 
> 
> root@rhel5-nfs@phys-oss3:~# crm resource cleanup st-riloe-phys-oss1 phys-oss3
> Cleaning up st-riloe-phys-oss1 on phys-oss3
> 
> crm_mon:
> 
> Failed actions:
>     st-riloe-phys-oss1_start_0 (node=phys-oss3, call=25, rc=1, status=complete): unknown error
>     st-riloe-phys-oss2_start_0 (node=phys-oss0, call=25, rc=1, status=complete): unknown error
> 
> 
> root@rhel5-nfs@phys-oss3:~# tail /var/log/ha-log
> Jul 02 14:48:40 phys-oss3 crmd: [18056]: info: ais_status_callback: status: phys-oss2 is now lost (was member)

Why did the node disappear? Any coredumps around?

> Jul 02 14:48:40 phys-oss3 crmd: [18056]: info: crm_update_peer: Node phys-oss2: id=4 state=lost (new) addr=(null) votes=-1 born=5 seen=6 proc=00000000000000000000000000000200
> Jul 02 14:48:40 phys-oss3 crmd: [18056]: info: erase_node_from_join: Removed node phys-oss1 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
> Jul 02 14:48:40 phys-oss3 crmd: [18056]: info: erase_node_from_join: Removed node phys-oss2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
> Jul 02 14:48:40 phys-oss3 crmd: [18056]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
> Jul 02 14:48:40 phys-oss3 cib: [18052]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/133, version=0.735.1): ok (rc=0)
> Jul 02 14:50:23 phys-oss3 crmd: [18056]: notice: do_lrm_invoke: Not creating resource for a delete event: (null)
> Jul 02 14:50:23 phys-oss3 crmd: [18056]: info: send_direct_ack: ACK'ing resource op st-riloe-phys-oss1_delete_60000 from 0:0:crm-resource-21728: lrm_invoke-lrmd-1278075023-300
> Jul 02 14:51:14 phys-oss3 crmd: [18056]: notice: do_lrm_invoke: Not creating resource for a delete event: (null)
> Jul 02 14:51:14 phys-oss3 crmd: [18056]: info: send_direct_ack: ACK'ing resource op st-riloe-phys-oss1_delete_60000 from 0:0:crm-resource-21797: lrm_invoke-lrmd-1278075074-302
> 
> 
> 
> Any ideas?

There should be more log entries, including some that show the
actual error behind the failed start. If you can't find them, then
please open a bugzilla and attach an hb_report.
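
To narrow the search, something like the following may help. It is
only a sketch: it assumes every log line has the layout seen in the
excerpts quoted above (timestamp, node, daemon, [pid], level,
message), and that the interesting levels are WARN, ERROR and CRIT.
The function name find_problems is made up for this example.

```python
import re

# Assumed line layout, taken from the quoted excerpts:
# Jul 02 14:43:23 phys-oss3 pengine: [18077]: WARN: unpack_rsc_op: ...
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<node>\S+) "
    r"(?P<daemon>[\w-]+): \[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)

def find_problems(lines, resource, levels=("WARN", "ERROR", "CRIT")):
    """Return (timestamp, daemon, message) for lines at one of the given
    levels whose message mentions the resource. The default levels are an
    assumption; adjust them to whatever your logs actually use."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") in levels and resource in m.group("msg"):
            hits.append((m.group("stamp"), m.group("daemon"), m.group("msg")))
    return hits

# Two lines copied from the quoted logs, used here as sample input.
sample = [
    "Jul 02 14:43:23 phys-oss3 pengine: [18077]: WARN: unpack_rsc_op: "
    "Processing failed op st-riloe-phys-oss1_start_0 on phys-oss3: "
    "unknown error (1)",
    "Jul 02 14:48:40 phys-oss3 crmd: [18056]: info: populate_cib_nodes_ha: "
    "Requesting the list of configured nodes",
]

for stamp, daemon, msg in find_problems(sample, "st-riloe-phys-oss1"):
    print(stamp, daemon, msg)
```

To run it against the real log, pass an open file instead of the
sample list, e.g. find_problems(open("/var/log/ha-log"),
"st-riloe-phys-oss1"). Lines that do not match the assumed layout are
silently skipped, so an empty result does not prove that nothing was
logged.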

Thanks,

Dejan

> Thanks,
> Bernd
> 



