[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Thu May 31 21:22:13 UTC 2018

> Quick look at PAF manual gives
> 
> you need to rebuild the PostgreSQL instance on the failed node
> 
> did you do it? I am not intimately familiar with Postgres, but in this
> case I expect that you need to make database on node B secondary (slave,
> whatever it is called) to new master on node A. That is exactly what I
> described as "manually fixing configuration outside of pacemaker".

I had not seen this before today, but someone pointed me to it a little while ago.  I did not realize that a rebuild would be necessary, so I have written a script that rebuilds the database and then runs `pcs cluster start`, which I'll make part of our standard recovery procedure.
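
For anyone finding this thread later, here is a minimal sketch of what such a recovery script might look like. It is not the actual script. The host name, data directory, and replication user are hypothetical placeholders, and it assumes the rebuild is done with `pg_basebackup` run as a user that is allowed to write the data directory. Any site-specific steps (for example, putting the standby's recovery configuration back in place) are left out.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical placeholders: adjust for the actual cluster.
MASTER_HOST = "node-a"
PGDATA = Path("/var/lib/postgresql/data")
REPL_USER = "replication"


def rebuild_commands(master_host, pgdata, repl_user):
    """Return the external commands to run, in order, as argument lists."""
    return [
        # Take a fresh base backup from the current master.
        ["pg_basebackup", "-h", master_host, "-U", repl_user,
         "-D", str(pgdata), "-X", "stream", "-P"],
        # Rejoin the cluster once the instance has been rebuilt.
        ["pcs", "cluster", "start"],
    ]


def rebuild_standby(master_host, pgdata, repl_user, dry_run=True):
    """Rebuild the local instance from the master, then start the cluster.

    With dry_run=True (the default) nothing is changed; the commands are
    only printed so they can be reviewed first.
    """
    commands = rebuild_commands(master_host, pgdata, repl_user)
    if dry_run:
        for command in commands:
            print(" ".join(command))
        return commands

    # pg_basebackup needs an empty or missing target directory, so the
    # old, possibly diverged data directory is removed first.
    if pgdata.exists():
        shutil.rmtree(pgdata)
    for command in commands:
        subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    rebuild_standby(MASTER_HOST, PGDATA, REPL_USER, dry_run=True)
```

The dry-run default is deliberate: the real run deletes the old data directory, so it seems safer to make that an explicit choice.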

I guess I expected that pacemaker would be able to handle this case automatically - if the resource agent reported a resource in a potentially-corrupt state, pacemaker could then call the resource agent to start the rebuild.  But there are probably some reasons that's not a great idea, and I think that I understand things enough now to be confident in just using a custom script for this purpose when necessary.

When I set up clusters in the past with heartbeat, I put the database on a DRBD partition, which simplified matters: there was never a possibility of new writes to the master not yet being replicated to the slave.  In development testing, I found that I did not need to rebuild the database, just start it up manually in slave mode.  But now that I've thought this through better, I realize that in a production environment, should the master crash, it is quite likely to hold some data that has not yet replicated to the slaves.  It could then not come up cleanly as a standby, since it would contain writes that the new master never received.

> pacemaker is too old. The error most likely comes from missing
> OCF_RESKEY_crm_feature_set which is exported by crm_resource starting
> with 1.1.17. I am not that familiar with debian packaging, but I'd
> expect resource-agents-paf require suitable pacemaker version. Of course
> Ubuntu package may be patched to include necessary code ...

I'm not sure why that would be: the resource agent works fine with this version of pacemaker, and according to https://github.com/ClusterLabs/PAF/releases, it only requires pacemaker >=1.1.13.  I think that something is wrong with the command that I was trying to run, since pacemaker 1.1.14 successfully uses this resource agent to start, stop, and monitor the service in normal operation, outside of the manual debugging context.
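
To make the two version thresholds in this exchange concrete, here is a small sketch that compares the installed version against each one. The numbers are taken from this thread (1.1.13 from the PAF releases page, 1.1.17 from the quoted reply about `crm_resource`), and the helper names are just for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.1.14' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def at_least(installed, required):
    """Return True if the installed version is the required one or newer."""
    return parse_version(installed) >= parse_version(required)


INSTALLED = "1.1.14"

# Minimum pacemaker version listed for the resource agent.
print(at_least(INSTALLED, "1.1.13"))  # True

# Version from which crm_resource is said to export
# OCF_RESKEY_crm_feature_set, per the quoted reply.
print(at_least(INSTALLED, "1.1.17"))  # False
```

If the quoted reply is right, both statements could hold at once: 1.1.14 would be new enough for the agent under normal cluster operation, but too old for the manual `crm_resource` invocation to set that variable.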

Thank you!
-- 
Casey