[Pacemaker] resource starts but then fails right away

Sun May 12 19:13:25 EDT 2013

On 10/05/2013, at 9:23 PM, Brian J. Murrell <brian at interlinx.bc.ca> wrote:

> On 13-05-09 09:53 PM, Andrew Beekhof wrote:
>> 
>> May  7 02:36:16 node1 crmd[16836]:     info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1
>> May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed
>> May  7 02:36:16 node1 crmd[16836]:     info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1
>> 
>> So apparently a badly timed cleanup was run.
> 
> :-(  I didn't know there could be such timing problems.  I might have
> to change my process a bit then.
> 
>> Did you do that or was it the crm shell?
> 
> That was "me" doing a "crm resource cleanup" (soon to become
> "crm_resource -r ... --cleanup").  The process is typically:
> 
> - create resource
> - start resource
> - wait for resource to start
> 
> where "start resource" is:
> - "clean it to start with a known clean resource"
>  (crm resource cleanup)
> - "start resource"
>  (crm_resource -r ... -p target-role -m -v Started)
> 
> and "wait for resource" is a loop of "crm resource status ..." (soon to
> be "crm_resource -r ... --locate")
> 
> So the create, clean, start operations happen in quite quick succession
> (i.e. scripted).  Is that pathological?  Is a clean between create and
> start known to be problematic?

It's certainly known to be unnecessary.
In some older versions it is also problematic.
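
The "wait for resource" step described above could be sketched as a bounded polling loop, roughly as follows. This is only a sketch: the function name, the timeout and interval defaults are made up for illustration, and it assumes that "crm_resource --locate" prints a line containing "is running on:" once the resource is up.

```shell
#!/bin/sh
# Sketch only: poll until a resource is reported as running, or give up.
# wait_for_resource, the 60s timeout and the 2s interval are illustrative
# choices, not anything Pacemaker provides.
wait_for_resource() {
    rsc=$1
    timeout=${2:-60}    # seconds to wait before giving up (assumed default)
    interval=${3:-2}    # seconds between polls (assumed default, must be > 0)

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # Assumption: --locate output contains "is running on:" when started.
        if crm_resource -r "$rsc" --locate 2>&1 | grep -q "is running on:"; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done

    echo "timed out waiting for $rsc to start" >&2
    return 1
}
```

It would be called right after the start command, with no clean in between, e.g. "wait_for_resource testfs-resource1 120 5".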

> 
> FWIW, the reason for clean before the start, even after just creating
> the resource is that "clean" and "start" are lumped together into a
> function that is called after create, but can also be called at other
> times during the life-cycle, so it could be needed to clean a resource
> before trying to start it.  I was hoping the cleaning of a just created
> resource was going to be effectively a NOOP.

It's never a no-op, and at that particular point the cluster is still probing the resource to discover its status (the monitor_0 operation in the log above).
Running a clean in the middle of that interferes with the probe.
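
One way to keep "clean" and "start" in a single helper while skipping the clean right after creation is to make the clean opt-in. The following is a minimal sketch under that assumption; the helper name and its "--clean" switch are hypothetical, and only the two crm_resource invocations are taken from the process quoted above.

```shell
#!/bin/sh
# Sketch only: start a resource, cleaning it first only when asked to.
# start_resource and its --clean switch are hypothetical helper names.
start_resource() {
    clean=no
    if [ "$1" = "--clean" ]; then
        clean=yes
        shift
    fi
    rsc=$1

    if [ "$clean" = yes ]; then
        # Only for later life-cycle calls, never right after creation.
        crm_resource -r "$rsc" --cleanup || return 1
    fi

    # Same start command as in the quoted process.
    crm_resource -r "$rsc" -p target-role -m -v Started
}
```

Right after creating the resource the call would be "start_resource testfs-resource1"; at other points in the life-cycle, where stale state may exist, "start_resource --clean testfs-resource1".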

> 
> I guess for completeness, I should add here that creating the resource
> is a "cibadmin -o resource -C ..." operation.
> 
>> If the machine is heavily loaded, or just very busy with file I/O, that can still take quite a long time.
> 
> Yeah, not very loaded at all, especially at this point.  This is all
> happening before anything really gets started on the machine... this is
> the process of getting the resources up and running and the machine is
> dedicated to running the tasks associated with these resources.
> 
> Cheers,
> b.