From kgaillot at redhat.com Thu Oct 3 19:11:45 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Thu, 03 Oct 2024 14:11:45 -0500 Subject: [ClusterLabs] Pacemaker 2.1.9-rc1 released Message-ID: Hi all, The first release candidate for Pacemaker 2.1.9 is available at: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.9-rc1 This is primarily a bug-fix release to give a clean separation point with the upcoming 3.0.0 release. It also supports the ability to build with the latest version of libxml2, and introduces no-quorum- policy="fence" as a synonym for the newly deprecated "suicide". For details, see the above link. Many thanks to all contributors to this release, including Chris Lumens, Hideo Yamauchi, Ken Gaillot, and Reid Wahl. -- Ken Gaillot From mrt_nl at hotmail.com Sun Oct 6 19:25:21 2024 From: mrt_nl at hotmail.com (Murat Inal) Date: Sun, 6 Oct 2024 22:25:21 +0300 Subject: [ClusterLabs] corosync won't start after node failure In-Reply-To: References: Message-ID: More progress on this issue; I have noticed that a corosync start initiates PTR queries for all of the local IP addresses. My production cluster node has many: ?1. area0: 172.30.1.1/27 ?2. ctd: 10.1.5.16/31 ?3. dep: 10.1.4.2/24 ?4. docker0: 172.17.0.1/16 ?5. fast: 10.1.5.1/28 ?6. gst: 192.168.5.1/24 ?7. ha: 10.1.5.25/29 ?8. inet: 100.64.64.10/29 ?9. iscsi1: 10.1.8.195/28 10. iscsi2: 10.1.8.211/28 11. iscsi3: 10.1.8.227/28 12. knet: 10.1.5.33/28 13. lo0: 10.1.255.1/32 14. lo: 127.0.0.1/8 15. mgmt: 10.1.3.4/24 16. nfpeeringout: 10.1.102.64/31 I created entries at /etc/hosts for all of the above. Corosync freeze NEVER happened after that. I have two production clusters, 5 nodes in total. I did the same for remaining nodes. Not a single freeze. BTW, I deleted the (false) DNS=1.2.3.4 entry at /etc/systemd/resolved.conf. There is no "workaround" at cluster configurations. Based on the above, I guess that corosync somehow crashes after an accumulated period of PTR query timeouts. Please note that there is NO name server at the time of cluster launch. So there is no response to these queries. If you think this is a bug, please lead me on how to proceed for creating a report. Thanks, On 9/12/24 00:27, Murat Inal wrote: > Hello Ken, > > I think I have resolved the problem on my own. > > Yes, right after the boot, corosync fails to come up. Problem appears > to be related to name resolution. I ran corosync foreground and did a > stack trace: corosync froze and strace output was suspicious with many > name resolution-like calls. > > In my failing cluster, I am running containerized BIND9 for regular > name resolution services. Both nodes are running systemd-resolved for > localhost's name resolution. Below are relevant directives of > resolved.conf: > > DNS=10.1.5.30 > #DNS=1.2.3.4 > #FallbackDNS= > > 10.1.5.30/29 is the virtual IP address for the nodes where BIND9 can > be queried. This VIP and BIND9 container are managed by pacemaker, so > after a reboot, node does NOT have the VIP and there is NO container > running. > > When I changed the directives as; > > #DNS=10.1.5.30 > DNS=1.2.3.4 > #FallbackDNS= > > corosync runs perfectly, successful cluster launch follows. 1.2.3.4 is > a false address. Node does NOT have a default route before cluster > launch. Obviously node does NOT receive any replies to its name > queries while corosync is coming up. However, both nodes have a valid > address, 10.1.5.25/29 and 10.1.5.26/29 after a reboot. It is a fact > that 10.1.5.24/29 subnet is locally attached at both nodes. 
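For illustration, the /etc/hosts workaround described above amounts to one static entry per local interface address, so that the PTR lookups corosync triggers at startup resolve without any reachable name server. The hostnames below are made up; only the addresses are taken from the interface list earlier in this message:

    172.30.1.1    charon-area0
    10.1.5.16     charon-ctd
    10.1.4.2      charon-dep
    10.1.5.25     charon-ha
    10.1.5.33     charon-knet
    10.1.3.4      charon-mgmt
    127.0.0.1     localhost
    # ...and likewise for the remaining local addresses listed above
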
> > Last discovery to mention is that I monitored LOCAL name resolutions > while corosync starts ("sudo resolvectl monitor"). Monitoring > immediately displayed PTR queries for ALL LOCAL IP addresses of the node. > > Based on the above, my conclusion is -there is something going bad > with name resolutions using non-existent VIP address-. In my first > message, I mentioned that I was only able to recover corosync by > REINSTALLING it from the repo. In order to reinstall, I was setting > the default route and name server address (8.8.8.8) manually in order > to run an effective "apt reinstall corosync". Hence, I was > unintentionally configuring a DNS server for systemd-resolved. So it > was NOT about reinstalling corosync but letting systemd-resolved use > some non-local name server address. > > I am using corosync/pacemaker for a couple of years in production, > probably since Ubuntu Server release 21.10 and never encountered such > a problem until now. I wrote an ansible playbook to toggle > systemd-resolved's DNS directive, however I think this glitch SHOULD > NOT exist. > > I will be glad if I receive comments on the above. > > Regards, > > > On 8/20/24 21:55, Ken Gaillot wrote: >> On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote: >>> [Resending the below due to message format problem] >>> >>> >>> Dear List, >>> >>> I have been running two different 3-node clusters for some time. I >>> am >>> having a fatal problem with corosync: After a node failure, rebooted >>> node does NOT start corosync. >>> >>> Clusters; >>> >>> ?? * All nodes are running Ubuntu Server 24.04 >>> ?? * corosync is 3.1.7 >>> ?? * corosync-qdevice is 3.0.3 >>> ?? * pacemaker is 2.1.6 >>> ?? * The third node at both clusters is a quorum device. Cluster is on >>> ???? ffsplit algorithm. >>> ?? * All nodes are baremetal & attached to a dedicated kronosnet >>> network. >>> ?? * STONITH is enabled in one of the clusters and disabled for the >>> other. >>> >>> corosync & pacemaker service starts (systemd) are disabled. I am >>> starting any cluster with the command pcs cluster start. >>> >>> corosync NEVER starts AFTER a node failure (node is rebooted). There >> Do you mean that the first time you run "pcs cluster start" after a >> node reboot, corosync does not come up completely? >> >> Try adding "debug: on" to the logging section of >> /etc/corosync/corosync.conf >> >>> is >>> nothing in /var/log/corosync/corosync.log, service freezes as: >>> >>> Aug 01 12:54:56 [3193] charon corosync notice? [MAIN? ] Corosync >>> Cluster >>> Engine 3.1.7 starting up >>> Aug 01 12:54:56 [3193] charon corosync info??? [MAIN? ] Corosync >>> built-in features: dbus monitoring watchdog augeas systemd xmlconf >>> vqsim >>> nozzle snmp pie relro bindnow >>> >>> corosync never starts kronosnet. I checked kronosnet interfaces, all >>> OK, >>> there is IP connectivity in between. If I do corosync -t, it is the >>> same >>> freeze. >>> >>> I could ONLY manage to start corosync by reinstalling it: apt >>> reinstall >>> corosync ; pcs cluster start. >>> >>> The above issue repeated itself at least 5-6 times. I do NOT see >>> anything in syslog either. I will be glad if you lead me on how to >>> solve >>> this. 
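For reference, the "debug: on" change suggested above would look roughly like this in /etc/corosync/corosync.conf (directive names as in corosync.conf(5); the logfile path matches the one quoted in this thread):

    logging {
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        timestamp: on
        debug: on
    }
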
>>> >>> Thanks, >>> >>> Murat >>> > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ From mrt_nl at hotmail.com Sun Oct 6 19:46:34 2024 From: mrt_nl at hotmail.com (Murat Inal) Date: Sun, 6 Oct 2024 22:46:34 +0300 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock Message-ID: Hello, I'd like to confirm with you the mechanism of ocf:heartbeat:portblock. Given a resource definition; Resource: r41_LIO (class=ocf provider=heartbeat type=portblock) ? Attributes: r41_LIO-instance_attributes ??? action=unblock ??? ip=10.1.8.194 ??? portno=3260 ??? protocol=tcp - If resource starts, TCP:3260 is UNBLOCKED. - If resource is stopped, TCP:3260 is BLOCKED. Is that correct? If action=block, it will run just the opposite, correct? To toggle a port, a single portblock resource is enough, correct? Thanks, From oalbrigt at redhat.com Wed Oct 9 08:26:05 2024 From: oalbrigt at redhat.com (Oyvind Albrigtsen) Date: Wed, 9 Oct 2024 10:26:05 +0200 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock In-Reply-To: References: Message-ID: Correct. That should block the port when the resource is stopped on a node (e.g. if you have it grouped with the service you're using on the port). I would do some testing to ensure it works exactly as you expect. E.g. you can telnet to the port, or you can run nc/socat on the port and telnet to it from the node it blocks/unblocks. If it doesnt accept the connection you know it's blocked. Oyvind Albrigtsen On 06/10/24 22:46 GMT, Murat Inal wrote: >Hello, > >I'd like to confirm with you the mechanism of ocf:heartbeat:portblock. > >Given a resource definition; > >Resource: r41_LIO (class=ocf provider=heartbeat type=portblock) >? Attributes: r41_LIO-instance_attributes >??? action=unblock >??? ip=10.1.8.194 >??? portno=3260 >??? protocol=tcp > >- If resource starts, TCP:3260 is UNBLOCKED. > >- If resource is stopped, TCP:3260 is BLOCKED. > >Is that correct? If action=block, it will run just the opposite, correct? > >To toggle a port, a single portblock resource is enough, correct? > >Thanks, > >_______________________________________________ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ From angeloruggiero at yahoo.com Wed Oct 9 13:07:30 2024 From: angeloruggiero at yahoo.com (Angelo Ruggiero) Date: Wed, 9 Oct 2024 13:07:30 +0000 Subject: [ClusterLabs] Fencing Approach References: Message-ID: Hello, My setup.... * We are setting up a pacemaker cluster to run SAP runnig on RHEL on Vmware virtual machines. * We will have two nodes for the application server of SAP and 2 nodes for the Hana database. SAP/RHEL provide good support on how to setup the cluster. ? * SAP will need a number of floating Ips to be moved around as well mountin/unmounted NFS file system coming from a NetApp device. SAP will need processes switching on and off when something happens planned or unplanned.I am not clear if the netapp devic is active and the other site is DR but what i know is the ip addresses just get moved during a DR incident. Just to be complete the HANA data sync is done by HANA itself most probably async with an RPO of 15mins or so. * We will have a quorum node also with hopefully a seperate network not sure if it will be on a seperate vmware infra though. 
* I am hoping to be allowed to use the vmware watchdog although it might take some persuading as declared as "non standard" for us by our infra people. I have it already in DEV to play with now. I managed to set the above working just using a floating ip and a nfs mount as my resources and I can see the following. The self fencing approach works fine i.e the servers reboot when they loose network connectivity and/or become in quorate as long as they are offering resources. So my questions are in relation to further fencing .... I did a lot of reading and saw various reference... 1. Use of sbd shared storage The question is what does using sbd with a shated storage really give me. I need to justify why i need this shared storage again to the infra guys but to be honest also to myself. I have been given this infra and will play with it next few days. 1. Use of fence vmware In addition there is the ability of course to fence using the fence_vmware agents and I again I need to justify why i need this. In this particular cases it will be a very hard sell because the dev/test and prod environments run on the same vmware infra so to use fence_vmware would effectively mean dev is connected to prod i.e the user id for a dev or test box is being provided by a production environment. I do not have this ability at all so cannot play with it. My current thought train...i.e the typical things i think about... Perhaps someone can help me be clear on the benefits of 1 and 2 over and above the setup i think it doable. 1. gives me the ability to use poison pill But what scenarios does poison pill really help why would the other parts of the cluster want to fence the node if the node itself has not killed it self as it lost quorum either because quorum devcice gone or network connectivity failed and resources needs to be switched off. What i get is that it is very explict i.e the others nodes tell the other server to die. So it must be a case initiated by the other nodes. I am struggling to think of a scenarios where the other nodes would want to fence it. Possible Scenarios, did i miss any? * Loss of network connection to the node. But that is covered by the node self fencing * If some monitoring said the node was not healthly or responding... Maybe this is the case it is good for but then it must be a partial failure where the node is still part fof the cluster and can respond. I.e not OS freeze or only it looses connection as then the watchdog or the self fencing will kick in. * HW failures, cpu, memory, disk For virtual hardware does that actually ever fail? Sorry if stupid question. I could ask our infra guys but...., So is virtual hardware so reliable that hw failures can be ignored. * Loss of shared storage SAP uses a lot of shared storage via NFS. Not sure what happens when that fails need to research it a bit but each node will sort that out itself I am presuming. * Human error: but no cluster will fix that and the human who makes a change will realise it and revert. ? 2. Fence vmware I see this as a better poision pill as it works at the hardware level. But if I do not need poision pill then i do not need this. In general OS freezes or even panics if take took long are covered by the watchdog. regards Angelo -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kwenning at redhat.com Wed Oct 9 17:03:09 2024 From: kwenning at redhat.com (Klaus Wenninger) Date: Wed, 9 Oct 2024 19:03:09 +0200 Subject: [ClusterLabs] Fencing Approach In-Reply-To: References: Message-ID: On Wed, Oct 9, 2024 at 3:08?PM Angelo Ruggiero via Users < users at clusterlabs.org> wrote: > Hello, > > My setup.... > > > - We are setting up a pacemaker cluster to run SAP runnig on RHEL on > Vmware virtual machines. > - We will have two nodes for the application server of SAP and 2 nodes > for the Hana database. SAP/RHEL provide good support on how to setup the > cluster. ? > - SAP will need a number of floating Ips to be moved around as well > mountin/unmounted NFS file system coming from a NetApp device. SAP will > need processes switching on and off when something happens planned or > unplanned.I am not clear if the netapp devic is active and the other site > is DR but what i know is the ip addresses just get moved during a DR > incident. Just to be complete the HANA data sync is done by HANA itself > most probably async with an RPO of 15mins or so. > - We will have a quorum node also with hopefully a seperate network > not sure if it will be on a seperate vmware infra though. > - I am hoping to be allowed to use the vmware watchdog although it > might take some persuading as declared as "non standard" for us by our > infra people. I have it already in DEV to play with now. > > I managed to set the above working just using a floating ip and a nfs > mount as my resources and I can see the following. The self fencing > approach works fine i.e the servers reboot when they loose network > connectivity and/or become in quorate as long as they are offering > resources. > > So my questions are in relation to further fencing .... I did a lot of > reading and saw various reference... > > > 1. Use of sbd shared storage > > The question is what does using sbd with a shated storage really give me. > I need to justify why i need this shared storage again to the infra guys > but to be honest also to myself. I have been given this infra and will > play with it next few days. > > > 2. Use of fence vmware > > In addition there is the ability of course to fence using the fence_vmware > agents and I again I need to justify why i need this. In this particular > cases it will be a very hard sell because the dev/test and prod > environments run on the same vmware infra so to use fence_vmware would > effectively mean dev is connected to prod i.e the user id for a dev or test > box is being provided by a production environment. I do not have this > ability at all so cannot play with it. > > > > My current thought train...i.e the typical things i think about... > > Perhaps someone can help me be clear on the benefits of 1 and 2 over and > above the setup i think it doable. > > > 1. gives me the ability to use poison pill > > But what scenarios does poison pill really help why would the other > parts of the cluster want to fence the node if the node itself has not > killed it self as it lost quorum either because quorum devcice gone or > network connectivity failed and resources needs to be switched off. > > What i get is that it is very explict i.e the others nodes > tell the other server to die. So it must be a case initiated by the other > nodes. > I am struggling to think of a scenarios where the other > nodes would want to fence it. > Main scenario where poison pill shines is 2-node-clusters where you don't have usable quorum for watchdog-fencing. 
Configured with pacemaker-awareness - default - availability of the shared-disk doesn't become an issue as, due to fallback to availability of the 2nd node, the disk is no spof (single point of failure) in these clusters. Other nodes btw. can still kill a node with watchdog-fencing. If the node isn't able to accept that wish of another node for it to die it will have lost quorum, have stopped triggering the watchdog anyway. Regards, Klaus > > Possible Scenarios, did i miss any? > > - Loss of network connection to the node. But that is covered by the > node self fencing > - If some monitoring said the node was not healthly or responding... > Maybe this is the case it is good for but then it must be a partial failure > where the node is still part fof the cluster and can respond. I.e not OS > freeze or only it looses connection as then the watchdog or the self > fencing will kick in. > - HW failures, cpu, memory, disk For virtual hardware does that > actually ever fail? Sorry if stupid question. I could ask our infra guys > but...., > So is virtual hardware so reliable that hw failures can be ignored. > - Loss of shared storage SAP uses a lot of shared storage via NFS. Not > sure what happens when that fails need to research it a bit but each node > will sort that out itself I am presuming. > - Human error: but no cluster will fix that and the human who makes a > change will realise it and revert. ? > > 2. Fence vmware > > I see this as a better poision pill as it works at the hardware > level. But if I do not need poision pill then i do not need this. > > In general OS freezes or even panics if take took long are covered by the > watchdog. > > regards > Angelo > > > > > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From angeloruggiero at yahoo.com Thu Oct 10 13:58:13 2024 From: angeloruggiero at yahoo.com (Angelo Ruggiero) Date: Thu, 10 Oct 2024 13:58:13 +0000 Subject: [ClusterLabs] Users Digest, Vol 117, Issue 5 In-Reply-To: References: Message-ID: Thanks for answering. It helps. >Main scenario where poison pill shines is 2-node-clusters where you don't >have usable quorum for watchdog-fencing. Not sure i understand. As if just 2 node and one node fails it cannot respond to the poision pilll. Maybe i mis your point. This also begs the followup question, what defines "usable quroum". Do you mean for example on seperate independent network hardware and power supply? >Configured with pacemaker-awareness - default - availability of the shared-disk doesn't become an issue as, due to fallback to availability of the 2nd node, the disk is >no spof (single point of failure) in these clusters. I did not get the jist of what you are trying to say here. ? >Other nodes btw. can still kill a node with watchdog-fencing. I How does that work when would the killing node tell the other node not to keep triggering its watchdog? Having written the above sentence maybe it should go and read up when does the poison pill get sent by the killing node! >If the node isn't able to accept that wish of another >node for it to die it will have lost quorum, have stopped triggering the watchdog anyway. Yes that is clear to mean the self-fencing is quite powerful. Thanks for the response. 
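To make the comparison concrete, the SBD variants discussed in this thread can be sketched roughly as follows. The disk path and timeout values are placeholders, and the pcs syntax should be checked against "pcs stonith sbd help" on the installed version:

    # /etc/sysconfig/sbd -- watchdog-only sketch (no SBD_DEVICE configured)
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    # let pacemaker assume a lost node has self-fenced once the watchdog
    # window has expired (commonly about 2x SBD_WATCHDOG_TIMEOUT)
    pcs property set stonith-watchdog-timeout=10s

    # poison-pill variant: initialize a small shared LUN and enable sbd with it
    pcs stonith sbd device setup device=/dev/disk/by-id/<shared-lun>
    pcs stonith sbd enable device=/dev/disk/by-id/<shared-lun>
    # plus a stonith resource so peers can request poison-pill fencing
    pcs stonith create fence-sbd fence_sbd devices=/dev/disk/by-id/<shared-lun>
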
________________________________ From: Users on behalf of users-request at clusterlabs.org Sent: 10 October 2024 2:00 PM To: users at clusterlabs.org Subject: Users Digest, Vol 117, Issue 5 Send Users mailing list submissions to users at clusterlabs.org To subscribe or unsubscribe via the World Wide Web, visit https://lists.clusterlabs.org/mailman/listinfo/users or, via email, send a message with subject or body 'help' to users-request at clusterlabs.org You can reach the person managing the list at users-owner at clusterlabs.org When replying, please edit your Subject line so it is more specific than "Re: Contents of Users digest..." Today's Topics: 1. Re: Fencing Approach (Klaus Wenninger) ---------------------------------------------------------------------- Message: 1 Date: Wed, 9 Oct 2024 19:03:09 +0200 From: Klaus Wenninger To: Cluster Labs - All topics related to open-source clustering welcomed Cc: Angelo Ruggiero Subject: Re: [ClusterLabs] Fencing Approach Message-ID: Content-Type: text/plain; charset="utf-8" On Wed, Oct 9, 2024 at 3:08?PM Angelo Ruggiero via Users < users at clusterlabs.org> wrote: > Hello, > > My setup.... > > > - We are setting up a pacemaker cluster to run SAP runnig on RHEL on > Vmware virtual machines. > - We will have two nodes for the application server of SAP and 2 nodes > for the Hana database. SAP/RHEL provide good support on how to setup the > cluster. ? > - SAP will need a number of floating Ips to be moved around as well > mountin/unmounted NFS file system coming from a NetApp device. SAP will > need processes switching on and off when something happens planned or > unplanned.I am not clear if the netapp devic is active and the other site > is DR but what i know is the ip addresses just get moved during a DR > incident. Just to be complete the HANA data sync is done by HANA itself > most probably async with an RPO of 15mins or so. > - We will have a quorum node also with hopefully a seperate network > not sure if it will be on a seperate vmware infra though. > - I am hoping to be allowed to use the vmware watchdog although it > might take some persuading as declared as "non standard" for us by our > infra people. I have it already in DEV to play with now. > > I managed to set the above working just using a floating ip and a nfs > mount as my resources and I can see the following. The self fencing > approach works fine i.e the servers reboot when they loose network > connectivity and/or become in quorate as long as they are offering > resources. > > So my questions are in relation to further fencing .... I did a lot of > reading and saw various reference... > > > 1. Use of sbd shared storage > > The question is what does using sbd with a shated storage really give me. > I need to justify why i need this shared storage again to the infra guys > but to be honest also to myself. I have been given this infra and will > play with it next few days. > > > 2. Use of fence vmware > > In addition there is the ability of course to fence using the fence_vmware > agents and I again I need to justify why i need this. In this particular > cases it will be a very hard sell because the dev/test and prod > environments run on the same vmware infra so to use fence_vmware would > effectively mean dev is connected to prod i.e the user id for a dev or test > box is being provided by a production environment. I do not have this > ability at all so cannot play with it. > > > > My current thought train...i.e the typical things i think about... 
> > Perhaps someone can help me be clear on the benefits of 1 and 2 over and > above the setup i think it doable. > > > 1. gives me the ability to use poison pill > > But what scenarios does poison pill really help why would the other > parts of the cluster want to fence the node if the node itself has not > killed it self as it lost quorum either because quorum devcice gone or > network connectivity failed and resources needs to be switched off. > > What i get is that it is very explict i.e the others nodes > tell the other server to die. So it must be a case initiated by the other > nodes. > I am struggling to think of a scenarios where the other > nodes would want to fence it. > Main scenario where poison pill shines is 2-node-clusters where you don't have usable quorum for watchdog-fencing. Configured with pacemaker-awareness - default - availability of the shared-disk doesn't become an issue as, due to fallback to availability of the 2nd node, the disk is no spof (single point of failure) in these clusters. Other nodes btw. can still kill a node with watchdog-fencing. If the node isn't able to accept that wish of another node for it to die it will have lost quorum, have stopped triggering the watchdog anyway. Regards, Klaus > > Possible Scenarios, did i miss any? > > - Loss of network connection to the node. But that is covered by the > node self fencing > - If some monitoring said the node was not healthly or responding... > Maybe this is the case it is good for but then it must be a partial failure > where the node is still part fof the cluster and can respond. I.e not OS > freeze or only it looses connection as then the watchdog or the self > fencing will kick in. > - HW failures, cpu, memory, disk For virtual hardware does that > actually ever fail? Sorry if stupid question. I could ask our infra guys > but...., > So is virtual hardware so reliable that hw failures can be ignored. > - Loss of shared storage SAP uses a lot of shared storage via NFS. Not > sure what happens when that fails need to research it a bit but each node > will sort that out itself I am presuming. > - Human error: but no cluster will fix that and the human who makes a > change will realise it and revert. ? > > 2. Fence vmware > > I see this as a better poision pill as it works at the hardware > level. But if I do not need poision pill then i do not need this. > > In general OS freezes or even panics if take took long are covered by the > watchdog. > > regards > Angelo > > > > > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Subject: Digest Footer _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/ ------------------------------ End of Users Digest, Vol 117, Issue 5 ************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: From angeloruggiero at yahoo.com Thu Oct 10 14:03:45 2024 From: angeloruggiero at yahoo.com (Angelo Ruggiero) Date: Thu, 10 Oct 2024 14:03:45 +0000 Subject: [ClusterLabs] Resource Fencing NetApp References: Message-ID: Hello, Parellel to my other thread... Hope ok to ask something a bit more specific Has anyone had any experience of resource fencing netapp storage. 
I did a bit of googling and i think the netapp itself might use pacemaker or can be integrated using there own resource/fence agents. Could not find a clear answer. regards Angelo -------------- next part -------------- An HTML attachment was scrubbed... URL: From kwenning at redhat.com Thu Oct 10 14:52:54 2024 From: kwenning at redhat.com (Klaus Wenninger) Date: Thu, 10 Oct 2024 16:52:54 +0200 Subject: [ClusterLabs] Users Digest, Vol 117, Issue 5 In-Reply-To: References: Message-ID: On Thu, Oct 10, 2024 at 3:58?PM Angelo Ruggiero via Users < users at clusterlabs.org> wrote: > Thanks for answering. It helps. > > >Main scenario where poison pill shines is 2-node-clusters where you don't > >have usable quorum for watchdog-fencing. > > Not sure i understand. As if just 2 node and one node fails it cannot > respond to the poision pilll. Maybe i mis your point. > If in a 2 node setup one node loses contact to the other or sees some other reason why it would like the partner-node to be fenced it will try to write the poison-pill message to the shared disk and if that goes Ok and after a configured wait time for the other node to read the message, respond or the watchdog to kick in it will assume the other node to be fenced. > > This also begs the followup question, what defines "usable quroum". Do > you mean for example on seperate independent network hardware and power > supply? > Quorum in 2 node clusters is a bit different as they will stay quorate when losing connection. To prevent split brain there if they reboot on top they will just regain quorum once they've seen each other (search for 'wait-for-all' to read more). This behavior is of course not usable for watchdog-fencing and thus SBD automatically switches to not relying on quorum in those 2-node setups. > > >Configured with pacemaker-awareness - default - availability of the > shared-disk doesn't become an issue as, due to fallback to availability of > the 2nd node, the disk is >no spof (single point of failure) in these > clusters. > > I did not get the jist of what you are trying to say here. ? > > I was suggesting a scenario that has 2 cluster nodes + a single shared disk. With kind of 'pure' SBD this would mean that a node that is losing connection to the disk would have to self fence which would mean that this disk would become a so called single-point-of-failure - meaning that available of resources in the cluster would be reduced to availability of this single disk. So I tried to explain why you don't have to fear this reduction of availability using pacemaker-awareness. > >Other nodes btw. can still kill a node with watchdog-fencing. I > > How does that work when would the killing node tell the other node not to > keep triggering its watchdog? > Having written the above sentence maybe it should go and read up when does > the poison pill get sent by the killing node! > > It would either use cluster-communication to tell the node to self-fence and if that isn't available the case below kicks in. Hope that makes things a bit clearer. Regards, Klaus > >If the node isn't able to accept that wish of another > >node for it to die it will have lost quorum, have stopped triggering the > watchdog anyway. > > Yes that is clear to mean the self-fencing is quite powerful. > > Thanks for the response. 
> > ------------------------------ > *From:* Users on behalf of > users-request at clusterlabs.org > *Sent:* 10 October 2024 2:00 PM > *To:* users at clusterlabs.org > *Subject:* Users Digest, Vol 117, Issue 5 > > Send Users mailing list submissions to > users at clusterlabs.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.clusterlabs.org/mailman/listinfo/users > or, via email, send a message with subject or body 'help' to > users-request at clusterlabs.org > > You can reach the person managing the list at > users-owner at clusterlabs.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Users digest..." > > > Today's Topics: > > 1. Re: Fencing Approach (Klaus Wenninger) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 9 Oct 2024 19:03:09 +0200 > From: Klaus Wenninger > To: Cluster Labs - All topics related to open-source clustering > welcomed > Cc: Angelo Ruggiero > Subject: Re: [ClusterLabs] Fencing Approach > Message-ID: > Ea4n_citY71HLamSOv3Kw-cA at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > On Wed, Oct 9, 2024 at 3:08?PM Angelo Ruggiero via Users < > users at clusterlabs.org> wrote: > > > Hello, > > > > My setup.... > > > > > > - We are setting up a pacemaker cluster to run SAP runnig on RHEL on > > Vmware virtual machines. > > - We will have two nodes for the application server of SAP and 2 nodes > > for the Hana database. SAP/RHEL provide good support on how to setup > the > > cluster. ? > > - SAP will need a number of floating Ips to be moved around as well > > mountin/unmounted NFS file system coming from a NetApp device. SAP > will > > need processes switching on and off when something happens planned or > > unplanned.I am not clear if the netapp devic is active and the other > site > > is DR but what i know is the ip addresses just get moved during a DR > > incident. Just to be complete the HANA data sync is done by HANA > itself > > most probably async with an RPO of 15mins or so. > > - We will have a quorum node also with hopefully a seperate network > > not sure if it will be on a seperate vmware infra though. > > - I am hoping to be allowed to use the vmware watchdog although it > > might take some persuading as declared as "non standard" for us by our > > infra people. I have it already in DEV to play with now. > > > > I managed to set the above working just using a floating ip and a nfs > > mount as my resources and I can see the following. The self fencing > > approach works fine i.e the servers reboot when they loose network > > connectivity and/or become in quorate as long as they are offering > > resources. > > > > So my questions are in relation to further fencing .... I did a lot of > > reading and saw various reference... > > > > > > 1. Use of sbd shared storage > > > > The question is what does using sbd with a shated storage really give me. > > I need to justify why i need this shared storage again to the infra guys > > but to be honest also to myself. I have been given this infra and will > > play with it next few days. > > > > > > 2. Use of fence vmware > > > > In addition there is the ability of course to fence using the > fence_vmware > > agents and I again I need to justify why i need this. 
In this particular > > cases it will be a very hard sell because the dev/test and prod > > environments run on the same vmware infra so to use fence_vmware would > > effectively mean dev is connected to prod i.e the user id for a dev or > test > > box is being provided by a production environment. I do not have this > > ability at all so cannot play with it. > > > > > > > > My current thought train...i.e the typical things i think about... > > > > Perhaps someone can help me be clear on the benefits of 1 and 2 over and > > above the setup i think it doable. > > > > > > 1. gives me the ability to use poison pill > > > > But what scenarios does poison pill really help why would the other > > parts of the cluster want to fence the node if the node itself has not > > killed it self as it lost quorum either because quorum devcice gone or > > network connectivity failed and resources needs to be switched off. > > > > What i get is that it is very explict i.e the others nodes > > tell the other server to die. So it must be a case initiated by the other > > nodes. > > I am struggling to think of a scenarios where the other > > nodes would want to fence it. > > > > Main scenario where poison pill shines is 2-node-clusters where you don't > have usable quorum for watchdog-fencing. > Configured with pacemaker-awareness - default - availability of the > shared-disk doesn't become an issue as, due to > fallback to availability of the 2nd node, the disk is no spof (single > point of failure) in these clusters. > Other nodes btw. can still kill a node with watchdog-fencing. If the node > isn't able to accept that wish of another > node for it to die it will have lost quorum, have stopped triggering the > watchdog anyway. > > Regards, > Klaus > > > > > Possible Scenarios, did i miss any? > > > > - Loss of network connection to the node. But that is covered by the > > node self fencing > > - If some monitoring said the node was not healthly or responding... > > Maybe this is the case it is good for but then it must be a partial > failure > > where the node is still part fof the cluster and can respond. I.e not > OS > > freeze or only it looses connection as then the watchdog or the self > > fencing will kick in. > > - HW failures, cpu, memory, disk For virtual hardware does that > > actually ever fail? Sorry if stupid question. I could ask our infra > guys > > but...., > > So is virtual hardware so reliable that hw failures can be ignored. > > - Loss of shared storage SAP uses a lot of shared storage via NFS. Not > > sure what happens when that fails need to research it a bit but each > node > > will sort that out itself I am presuming. > > - Human error: but no cluster will fix that and the human who makes a > > change will realise it and revert. ? > > > > 2. Fence vmware > > > > I see this as a better poision pill as it works at the hardware > > level. But if I do not need poision pill then i do not need this. > > > > In general OS freezes or even panics if take took long are covered by the > > watchdog. > > > > regards > > Angelo > > > > > > > > > > > > _______________________________________________ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > > > -------------- next part -------------- > An HTML attachment was scrubbed... 
> URL: < > https://lists.clusterlabs.org/pipermail/users/attachments/20241009/b9a58eb1/attachment-0001.htm > > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > > > ------------------------------ > > End of Users Digest, Vol 117, Issue 5 > ************************************* > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arvidjaar at gmail.com Thu Oct 10 18:58:14 2024 From: arvidjaar at gmail.com (Andrei Borzenkov) Date: Thu, 10 Oct 2024 21:58:14 +0300 Subject: [ClusterLabs] Resource Fencing NetApp In-Reply-To: References: Message-ID: <574e7a98-2edc-4bcb-9208-a21c36bdf6fc@gmail.com> 10.10.2024 17:03, Angelo Ruggiero via Users wrote: > Hello, > > Parellel to my other thread... Hope ok to ask something a bit more specific > > Has anyone had any experience of resource fencing netapp storage. > > I did a bit of googling and i think the netapp itself might use pacemaker or can be integrated using there own resource/fence agents. > > Could not find a clear answer. > NetApp offers a lot of different products and technologies. You need to be more specific what you are talking about. From kwenning at redhat.com Fri Oct 11 08:18:53 2024 From: kwenning at redhat.com (Klaus Wenninger) Date: Fri, 11 Oct 2024 10:18:53 +0200 Subject: [ClusterLabs] Users Digest, Vol 117, Issue 5 In-Reply-To: References: Message-ID: On Thu, Oct 10, 2024 at 9:52?PM Angelo Ruggiero wrote: > > > ------------------------------ > *From:* Klaus Wenninger > *Sent:* 10 October 2024 4:52 PM > *To:* Cluster Labs - All topics related to open-source clustering > welcomed > *Cc:* Angelo Ruggiero > *Subject:* Re: [ClusterLabs] Users Digest, Vol 117, Issue 5 > > > > On Thu, Oct 10, 2024 at 3:58?PM Angelo Ruggiero via Users < > users at clusterlabs.org> wrote: > > Thanks for answering. It helps. > > >Main scenario where poison pill shines is 2-node-clusters where you don't > >have usable quorum for watchdog-fencing. > > Not sure i understand. As if just 2 node and one node fails it cannot > respond to the poision pilll. Maybe i mis your point. > > > If in a 2 node setup one node loses contact to the other or sees some > other reason why it would like > the partner-node to be fenced it will try to write the poison-pill message > to the shared disk and if that > goes Ok and after a configured wait time for the other node to read the > message, respond or the > watchdog to kick in it will assume the other node to be fenced. > > AR: Yes, understood. > AR: I guess i am looking for the killer requirement for my setup that say > for 2 node cluster with an usable quorum device (usable to be defined > later). Does poison pill via SBD or even fence_vnware give me anything. I > am struggling to find a scenario. See my final comment in this reply on > monitoring below. > > > This also begs the followup question, what defines "usable quroum". Do > you mean for example on seperate independent network hardware and power > supply? > > > Quorum in 2 node clusters is a bit different as they will stay quorate > when losing connection. 
To prevent split brain there if they > reboot on top they will just regain quorum once they've seen each other > (search for 'wait-for-all' to read more). > This behavior is of course not usable for watchdog-fencing and thus SBD > automatically switches to not relying on quorum in > those 2-node setups. > > > > >Configured with pacemaker-awareness - default - availability of the > shared-disk doesn't become an issue as, due to fallback to availability of > the 2nd node, the disk is >no spof (single point of failure) in these > clusters. > > I did not get the jist of what you are trying to say here. ? > > > I was suggesting a scenario that has 2 cluster nodes + a single shared > disk. With kind of 'pure' SBD this would mean that a node > that is losing connection to the disk would have to self fence which would > mean that this disk would become a so called > single-point-of-failure - meaning that available of resources in the > cluster would be reduced to availability of this single disk. > So I tried to explain why you don't have to fear this reduction of > availability using pacemaker-awareness. > > > >Other nodes btw. can still kill a node with watchdog-fencing. I > > How does that work when would the killing node tell the other node not to > keep triggering its watchdog? > Having written the above sentence maybe it should go and read up when does > the poison pill get sent by the killing node! > > > It would either use cluster-communication to tell the node to self-fence > and if that isn't available the case > below kicks in. > > AR: ok > > >Quorum in 2 node clusters is a bit different as they will stay quorate > when losing connection. > AR: here you refer to 2 node cluster without a quorum device right? > AR: futhermore are you saying that poison pill and maybe even node fencing > from the cluster is not needed when you do not have a quroum device for 2 > node clusters. > No that is a misunderstanding. For all I described some sort of SBD setup is needed. And yes - when I was talking about 2-node-clusters I meant those without a quorum device - those which have the 2-node config set in the corosync-config-file. I was just saying that without quorum device (or of course 3 and up full cluster nodes) you can't use watchdog-fencing. What you still can use is poison-pill fencing if you want to go for SBD. If it is viable for you considering other aspects like credentials or accessibility over the network I guess it is alway worth while looking into fencing via the hypervisor. There are definitely benefits in getting a response from the hypervisor that a node is down instead of having to wait some time - including some safety addon - for it to self-fence. There are as well benefits if pacemaker can explicitly turn a node off and on afterwards instead of triggering a reboot (out of obvious reasons the only way it works with SBD). If working with hypervisors using their maintenance features (pausing, migration, ...) together with their virtual watchdog implementation or softdog you as well have to consider situations where the watchdog timeout might not happen reliably within the specified timeout. Regards, Klaus > > > Hope that makes things a bit clearer. > AR: always ? such discussions are hard in both ways to be clear. > > AR: As mentioned in an earlier reply. 
I think I need to dwell on what > failure cases i could have and i should go and research the monitoring the > resource agents i intened to use offer > I.e IPAddr2, FileSytems and the SAP instance agents as i guess they are > the ones that would decide to fence another node. The general case where > nodes cannot communicate via the network is builtin. > > Regards, > Klaus > > > >If the node isn't able to accept that wish of another > >node for it to die it will have lost quorum, have stopped triggering the > watchdog anyway. > > Yes that is clear to mean the self-fencing is quite powerful. > > Thanks for the response. > > ------------------------------ > *From:* Users on behalf of > users-request at clusterlabs.org > *Sent:* 10 October 2024 2:00 PM > *To:* users at clusterlabs.org > *Subject:* Users Digest, Vol 117, Issue 5 > > Send Users mailing list submissions to > users at clusterlabs.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.clusterlabs.org/mailman/listinfo/users > or, via email, send a message with subject or body 'help' to > users-request at clusterlabs.org > > You can reach the person managing the list at > users-owner at clusterlabs.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Users digest..." > > > Today's Topics: > > 1. Re: Fencing Approach (Klaus Wenninger) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 9 Oct 2024 19:03:09 +0200 > From: Klaus Wenninger > To: Cluster Labs - All topics related to open-source clustering > welcomed > Cc: Angelo Ruggiero > Subject: Re: [ClusterLabs] Fencing Approach > Message-ID: > Ea4n_citY71HLamSOv3Kw-cA at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > On Wed, Oct 9, 2024 at 3:08?PM Angelo Ruggiero via Users < > users at clusterlabs.org> wrote: > > > Hello, > > > > My setup.... > > > > > > - We are setting up a pacemaker cluster to run SAP runnig on RHEL on > > Vmware virtual machines. > > - We will have two nodes for the application server of SAP and 2 nodes > > for the Hana database. SAP/RHEL provide good support on how to setup > the > > cluster. ? > > - SAP will need a number of floating Ips to be moved around as well > > mountin/unmounted NFS file system coming from a NetApp device. SAP > will > > need processes switching on and off when something happens planned or > > unplanned.I am not clear if the netapp devic is active and the other > site > > is DR but what i know is the ip addresses just get moved during a DR > > incident. Just to be complete the HANA data sync is done by HANA > itself > > most probably async with an RPO of 15mins or so. > > - We will have a quorum node also with hopefully a seperate network > > not sure if it will be on a seperate vmware infra though. > > - I am hoping to be allowed to use the vmware watchdog although it > > might take some persuading as declared as "non standard" for us by our > > infra people. I have it already in DEV to play with now. > > > > I managed to set the above working just using a floating ip and a nfs > > mount as my resources and I can see the following. The self fencing > > approach works fine i.e the servers reboot when they loose network > > connectivity and/or become in quorate as long as they are offering > > resources. > > > > So my questions are in relation to further fencing .... I did a lot of > > reading and saw various reference... > > > > > > 1. 
Use of sbd shared storage > > > > The question is what does using sbd with a shated storage really give me. > > I need to justify why i need this shared storage again to the infra guys > > but to be honest also to myself. I have been given this infra and will > > play with it next few days. > > > > > > 2. Use of fence vmware > > > > In addition there is the ability of course to fence using the > fence_vmware > > agents and I again I need to justify why i need this. In this particular > > cases it will be a very hard sell because the dev/test and prod > > environments run on the same vmware infra so to use fence_vmware would > > effectively mean dev is connected to prod i.e the user id for a dev or > test > > box is being provided by a production environment. I do not have this > > ability at all so cannot play with it. > > > > > > > > My current thought train...i.e the typical things i think about... > > > > Perhaps someone can help me be clear on the benefits of 1 and 2 over and > > above the setup i think it doable. > > > > > > 1. gives me the ability to use poison pill > > > > But what scenarios does poison pill really help why would the other > > parts of the cluster want to fence the node if the node itself has not > > killed it self as it lost quorum either because quorum devcice gone or > > network connectivity failed and resources needs to be switched off. > > > > What i get is that it is very explict i.e the others nodes > > tell the other server to die. So it must be a case initiated by the other > > nodes. > > I am struggling to think of a scenarios where the other > > nodes would want to fence it. > > > > Main scenario where poison pill shines is 2-node-clusters where you don't > have usable quorum for watchdog-fencing. > Configured with pacemaker-awareness - default - availability of the > shared-disk doesn't become an issue as, due to > fallback to availability of the 2nd node, the disk is no spof (single > point of failure) in these clusters. > Other nodes btw. can still kill a node with watchdog-fencing. If the node > isn't able to accept that wish of another > node for it to die it will have lost quorum, have stopped triggering the > watchdog anyway. > > Regards, > Klaus > > > > > Possible Scenarios, did i miss any? > > > > - Loss of network connection to the node. But that is covered by the > > node self fencing > > - If some monitoring said the node was not healthly or responding... > > Maybe this is the case it is good for but then it must be a partial > failure > > where the node is still part fof the cluster and can respond. I.e not > OS > > freeze or only it looses connection as then the watchdog or the self > > fencing will kick in. > > - HW failures, cpu, memory, disk For virtual hardware does that > > actually ever fail? Sorry if stupid question. I could ask our infra > guys > > but...., > > So is virtual hardware so reliable that hw failures can be ignored. > > - Loss of shared storage SAP uses a lot of shared storage via NFS. Not > > sure what happens when that fails need to research it a bit but each > node > > will sort that out itself I am presuming. > > - Human error: but no cluster will fix that and the human who makes a > > change will realise it and revert. ? > > > > 2. Fence vmware > > > > I see this as a better poision pill as it works at the hardware > > level. But if I do not need poision pill then i do not need this. > > > > In general OS freezes or even panics if take took long are covered by the > > watchdog. 
> > > > regards > > Angelo > > > > > > > > > > > > _______________________________________________ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://lists.clusterlabs.org/pipermail/users/attachments/20241009/b9a58eb1/attachment-0001.htm > > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > > > ------------------------------ > > End of Users Digest, Vol 117, Issue 5 > ************************************* > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ns.lokesh at ericsson.com Sun Oct 13 04:40:42 2024 From: ns.lokesh at ericsson.com (NS Lokesh) Date: Sun, 13 Oct 2024 04:40:42 +0000 Subject: [ClusterLabs] Fix for CVE-2024-41123, CVE-2024-41946, CVE-2024-43398 Message-ID: Hi Team, Please be informed, we have got notified from our security tool that our pcs version 0.10 is affected by the CVE-2024-41123,CVE-2024-41946,CVE-2024-43398 It would be great if we help to get answers for the below queries. 1. Is clusterlab pcs affected by the above mention CVE's? 2. Is there any fix planned/available for this affection version (0.10.x) of pcs ? 3. Let us know in which release this CVEs fix are planned ? We are currently in RHEL 8.6 OS and using pcs 0.10 version, Our system Details:- OS Version: RHEL 8.6 Name : pcs Version : 0.10.16 Release : 1.el8 Architecture: x86_64 Regards, Lokesh NS -------------- next part -------------- An HTML attachment was scrubbed... URL: From piled.email at gmail.com Mon Oct 14 16:49:57 2024 From: piled.email at gmail.com (Jochen) Date: Mon, 14 Oct 2024 18:49:57 +0200 Subject: [ClusterLabs] Interleaving clones with different number of instances per node Message-ID: <0560ECFB-1C73-4903-BE4C-CD7B4A4CE192@gmail.com> Hi, I have two cloned resources in my cluster that have the following properties: * There are a maximum of two instances of R1 in the cluster, with a maximum of two per node * When any instance of R1 is started on a node, exactly one instance of R2 should run on that node When I configure this, and verify the configuration with "crm_verify -LV", I get the following error: clone_rsc_colocation_rh) error: Cannot interleave R2-clone and R1-clone because they do not support the same number of instances per node How can I make this work? Any help would be greatly appreciated. Current configuration is as follows: From kgaillot at redhat.com Tue Oct 15 14:08:00 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Tue, 15 Oct 2024 09:08:00 -0500 Subject: [ClusterLabs] Pacemaker 2.1.9-rc2 released Message-ID: Hi all, The second (and possibly final) release candidate for Pacemaker 2.1.9 is available at: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.9-rc2 This adds a few more bug fixes. For details, see the above link. Everyone is encouraged to download, compile and test the new release. We do many regression tests and simulations, but we can't cover all possible use cases, so your feedback is important and appreciated. 
If no one reports any issues with this candidate, it will likely become the final release in a couple of weeks. Many thanks to all contributors to this release, including Chris Lumens, Ken Gaillot, and Reid Wahl. -- Ken Gaillot From kgaillot at redhat.com Wed Oct 16 14:22:46 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Wed, 16 Oct 2024 09:22:46 -0500 Subject: [ClusterLabs] Interleaving clones with different number of instances per node In-Reply-To: <0560ECFB-1C73-4903-BE4C-CD7B4A4CE192@gmail.com> References: <0560ECFB-1C73-4903-BE4C-CD7B4A4CE192@gmail.com> Message-ID: On Mon, 2024-10-14 at 18:49 +0200, Jochen wrote: > Hi, I have two cloned resources in my cluster that have the following > properties: > > * There are a maximum of two instances of R1 in the cluster, with a > maximum of two per node > * When any instance of R1 is started on a node, exactly one instance > of R2 should run on that node > > When I configure this, and verify the configuration with "crm_verify > -LV", I get the following error: > > clone_rsc_colocation_rh) error: Cannot interleave R2-clone and > R1-clone because they do not support the same number of instances per > node > > How can I make this work? Any help would be greatly appreciated. Hi, I believe the number of instances has to be the same because each instance pair on a single node is interleaved. There's no direct way to configure what you want, but it might be possible with a custom OCF agent for R1 and attribute-based rules. On start, the R1 agent could set a custom node attribute to some value. On stop, it could check whether any other instances are active (assuming that's possible), and if not, clear the attribute. Then, R2 could have a location rule enabling it only on nodes where the attribute has the desired value. R2 wouldn't stop until *after* the last instance of R1 stops, which could be a problem depending on the particulars of the service. There might also be a race condition if two instances are stopping at the same time, so it might be worthwhile to set ordered=true on the clone. > > > Current configuration is as follows: > > > > > > > > > > > > > > > > > > > > > > > > > > > then="R2-clone"/> > > > > -- Ken Gaillot From piled.email at gmail.com Thu Oct 17 14:34:52 2024 From: piled.email at gmail.com (Jochen) Date: Thu, 17 Oct 2024 16:34:52 +0200 Subject: [ClusterLabs] Interleaving clones with different number of instances per node In-Reply-To: References: <0560ECFB-1C73-4903-BE4C-CD7B4A4CE192@gmail.com> Message-ID: <953EDA77-ADF2-4F45-A7A0-A1ACD648F5C5@gmail.com> Thanks for the help! Before I break out my editor and start writing custom resource agents, one question: Is there a way to use a cloned ocf:pacemaker:attribute resource to set a clone-specific attribute on a node? I.e. attribute "started-0=1" and "started-1=1", depending on the clone ID? For this I would need e.g. a rule to configure a clone specific resource parameter, or is there something like variable substitution in resource parameters? > On 16. 
Oct 2024, at 16:22, Ken Gaillot wrote: > > On Mon, 2024-10-14 at 18:49 +0200, Jochen wrote: >> Hi, I have two cloned resources in my cluster that have the following >> properties: >> >> * There are a maximum of two instances of R1 in the cluster, with a >> maximum of two per node >> * When any instance of R1 is started on a node, exactly one instance >> of R2 should run on that node >> >> When I configure this, and verify the configuration with "crm_verify >> -LV", I get the following error: >> >> clone_rsc_colocation_rh) error: Cannot interleave R2-clone and >> R1-clone because they do not support the same number of instances per >> node >> >> How can I make this work? Any help would be greatly appreciated. > > Hi, > > I believe the number of instances has to be the same because each > instance pair on a single node is interleaved. > > There's no direct way to configure what you want, but it might be > possible with a custom OCF agent for R1 and attribute-based rules. > > On start, the R1 agent could set a custom node attribute to some value. > On stop, it could check whether any other instances are active > (assuming that's possible), and if not, clear the attribute. Then, R2 > could have a location rule enabling it only on nodes where the > attribute has the desired value. > > R2 wouldn't stop until *after* the last instance of R1 stops, which > could be a problem depending on the particulars of the service. There > might also be a race condition if two instances are stopping at the > same time, so it might be worthwhile to set ordered=true on the clone. > >> >> >> Current configuration is as follows: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > then="R2-clone"/> >> >> >> >> > -- > Ken Gaillot > > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From kgaillot at redhat.com Thu Oct 17 14:50:36 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Thu, 17 Oct 2024 09:50:36 -0500 Subject: [ClusterLabs] Interleaving clones with different number of instances per node In-Reply-To: <953EDA77-ADF2-4F45-A7A0-A1ACD648F5C5@gmail.com> References: <0560ECFB-1C73-4903-BE4C-CD7B4A4CE192@gmail.com> <953EDA77-ADF2-4F45-A7A0-A1ACD648F5C5@gmail.com> Message-ID: <75a23421beec634ebcda390e84320eee42769e6d.camel@redhat.com> On Thu, 2024-10-17 at 16:34 +0200, Jochen wrote: > Thanks for the help! > > Before I break out my editor and start writing custom resource > agents, one question: Is there a way to use a cloned > ocf:pacemaker:attribute resource to set a clone-specific attribute on > a node? I.e. attribute "started-0=1" and "started-1=1", depending on > the clone ID? For this I would need e.g. a rule to configure a clone > specific resource parameter, or is there something like variable > substitution in resource parameters? No, ocf:pacemaker:attribute won't work properly as a unique clone. If one instance is started and another stopped, it will get the status of one of them wrong. I noticed that yesterday and came up with an idea for a general solution if you feel like tackling it: https://projects.clusterlabs.org/T899 > > > On 16. 
Oct 2024, at 16:22, Ken Gaillot wrote: > > > > On Mon, 2024-10-14 at 18:49 +0200, Jochen wrote: > > > Hi, I have two cloned resources in my cluster that have the > > > following > > > properties: > > > > > > * There are a maximum of two instances of R1 in the cluster, with > > > a > > > maximum of two per node > > > * When any instance of R1 is started on a node, exactly one > > > instance > > > of R2 should run on that node > > > > > > When I configure this, and verify the configuration with > > > "crm_verify > > > -LV", I get the following error: > > > > > > clone_rsc_colocation_rh) error: Cannot interleave R2-clone and > > > R1-clone because they do not support the same number of instances > > > per > > > node > > > > > > How can I make this work? Any help would be greatly appreciated. > > > > Hi, > > > > I believe the number of instances has to be the same because each > > instance pair on a single node is interleaved. > > > > There's no direct way to configure what you want, but it might be > > possible with a custom OCF agent for R1 and attribute-based rules. > > > > On start, the R1 agent could set a custom node attribute to some > > value. > > On stop, it could check whether any other instances are active > > (assuming that's possible), and if not, clear the attribute. Then, > > R2 > > could have a location rule enabling it only on nodes where the > > attribute has the desired value. > > > > R2 wouldn't stop until *after* the last instance of R1 stops, which > > could be a problem depending on the particulars of the service. > > There > > might also be a race condition if two instances are stopping at the > > same time, so it might be worthwhile to set ordered=true on the > > clone. > > > > > > > > Current configuration is as follows: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="target- > > > role" value="Stopped"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="target- > > > role" value="Stopped"/> > > > > > > > > > > > > > > > > > > > > then="R2-clone"/> > > > > > rsc="R2- > > > clone" with-rsc="R1-clone"/> > > > > > > > > > > > -- > > Ken Gaillot > > > > _______________________________________________ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ -- Ken Gaillot From piled.email at gmail.com Thu Oct 17 15:51:12 2024 From: piled.email at gmail.com (Jochen) Date: Thu, 17 Oct 2024 17:51:12 +0200 Subject: [ClusterLabs] Interleaving clones with different number of instances per node In-Reply-To: <75a23421beec634ebcda390e84320eee42769e6d.camel@redhat.com> References: <0560ECFB-1C73-4903-BE4C-CD7B4A4CE192@gmail.com> <953EDA77-ADF2-4F45-A7A0-A1ACD648F5C5@gmail.com> <75a23421beec634ebcda390e84320eee42769e6d.camel@redhat.com> Message-ID: <2CE5F4B2-3D87-4FF7-8D94-4213A9B3449C@gmail.com> Thanks very much for the info and the help. For my problem I have decided to use systemd and its dependency management to create an "instance" service for each clone instance, that then starts the "main" service on a node if any instance service is started. Hopefully works without any coding required... I still might have a try at extending the attribute resource. One question though: Is incrementing and decrementing the count robust enough? 
I would think a solution that actually counts the running instances each time so we don't get any drift would be preferable. But what would be the best way for the agent to get this count? > On 17. Oct 2024, at 16:50, Ken Gaillot wrote: > > On Thu, 2024-10-17 at 16:34 +0200, Jochen wrote: >> Thanks for the help! >> >> Before I break out my editor and start writing custom resource >> agents, one question: Is there a way to use a cloned >> ocf:pacemaker:attribute resource to set a clone-specific attribute on >> a node? I.e. attribute "started-0=1" and "started-1=1", depending on >> the clone ID? For this I would need e.g. a rule to configure a clone >> specific resource parameter, or is there something like variable >> substitution in resource parameters? > > No, ocf:pacemaker:attribute won't work properly as a unique clone. If > one instance is started and another stopped, it will get the status of > one of them wrong. > > I noticed that yesterday and came up with an idea for a general > solution if you feel like tackling it: > > https://projects.clusterlabs.org/T899 > >> >>> On 16. Oct 2024, at 16:22, Ken Gaillot wrote: >>> >>> On Mon, 2024-10-14 at 18:49 +0200, Jochen wrote: >>>> Hi, I have two cloned resources in my cluster that have the >>>> following >>>> properties: >>>> >>>> * There are a maximum of two instances of R1 in the cluster, with >>>> a >>>> maximum of two per node >>>> * When any instance of R1 is started on a node, exactly one >>>> instance >>>> of R2 should run on that node >>>> >>>> When I configure this, and verify the configuration with >>>> "crm_verify >>>> -LV", I get the following error: >>>> >>>> clone_rsc_colocation_rh) error: Cannot interleave R2-clone and >>>> R1-clone because they do not support the same number of instances >>>> per >>>> node >>>> >>>> How can I make this work? Any help would be greatly appreciated. >>> >>> Hi, >>> >>> I believe the number of instances has to be the same because each >>> instance pair on a single node is interleaved. >>> >>> There's no direct way to configure what you want, but it might be >>> possible with a custom OCF agent for R1 and attribute-based rules. >>> >>> On start, the R1 agent could set a custom node attribute to some >>> value. >>> On stop, it could check whether any other instances are active >>> (assuming that's possible), and if not, clear the attribute. Then, >>> R2 >>> could have a location rule enabling it only on nodes where the >>> attribute has the desired value. >>> >>> R2 wouldn't stop until *after* the last instance of R1 stops, which >>> could be a problem depending on the particulars of the service. >>> There >>> might also be a race condition if two instances are stopping at the >>> same time, so it might be worthwhile to set ordered=true on the >>> clone. 
>>> >>>> >>>> Current configuration is as follows: >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>> name="target- >>>> role" value="Stopped"/> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>> name="target- >>>> role" value="Stopped"/> >>>> >>>> >>>> >>>> >>>> >>>> >>> then="R2-clone"/> >>>> >>> rsc="R2- >>>> clone" with-rsc="R1-clone"/> >>>> >>>> >>>> >>> -- >>> Ken Gaillot >>> >>> _______________________________________________ >>> Manage your subscription: >>> https://lists.clusterlabs.org/mailman/listinfo/users >>> >>> ClusterLabs home: https://www.clusterlabs.org/ >> >> _______________________________________________ >> Manage your subscription: >> https://lists.clusterlabs.org/mailman/listinfo/users >> >> ClusterLabs home: https://www.clusterlabs.org/ > -- > Ken Gaillot > > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From kgaillot at redhat.com Thu Oct 17 16:21:34 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Thu, 17 Oct 2024 11:21:34 -0500 Subject: [ClusterLabs] Interleaving clones with different number of instances per node In-Reply-To: <2CE5F4B2-3D87-4FF7-8D94-4213A9B3449C@gmail.com> References: <0560ECFB-1C73-4903-BE4C-CD7B4A4CE192@gmail.com> <953EDA77-ADF2-4F45-A7A0-A1ACD648F5C5@gmail.com> <75a23421beec634ebcda390e84320eee42769e6d.camel@redhat.com> <2CE5F4B2-3D87-4FF7-8D94-4213A9B3449C@gmail.com> Message-ID: On Thu, 2024-10-17 at 17:51 +0200, Jochen wrote: > Thanks very much for the info and the help. > > For my problem I have decided to use systemd and its dependency > management to create an "instance" service for each clone instance, > that then starts the "main" service on a node if any instance service > is started. Hopefully works without any coding required... > > I still might have a try at extending the attribute resource. One > question though: Is incrementing and decrementing the count robust > enough? I would think a solution that actually counts the running > instances each time so we don't get any drift would be preferable. > But what would be the best way for the agent to get this count? Incrementing and decrementing should be sufficient in general. If an instance crashes without decrementing the count, Pacemaker will stop it as part of recovery. The main opportunity for trouble would be an instance started outside Pacemaker control. Pacemaker would detect it and either stop it (decrementing when we shouldn't) or leave it alone (not incrementing when we should). To count each time instead, probably the best way would be to look for state files with instance numbers. > > > On 17. Oct 2024, at 16:50, Ken Gaillot wrote: > > > > On Thu, 2024-10-17 at 16:34 +0200, Jochen wrote: > > > Thanks for the help! > > > > > > Before I break out my editor and start writing custom resource > > > agents, one question: Is there a way to use a cloned > > > ocf:pacemaker:attribute resource to set a clone-specific > > > attribute on > > > a node? I.e. attribute "started-0=1" and "started-1=1", depending > > > on > > > the clone ID? For this I would need e.g. a rule to configure a > > > clone > > > specific resource parameter, or is there something like variable > > > substitution in resource parameters? > > > > No, ocf:pacemaker:attribute won't work properly as a unique clone. 
> > If > > one instance is started and another stopped, it will get the status > > of > > one of them wrong. > > > > I noticed that yesterday and came up with an idea for a general > > solution if you feel like tackling it: > > > > https://projects.clusterlabs.org/T899 > > > > > > On 16. Oct 2024, at 16:22, Ken Gaillot > > > > wrote: > > > > > > > > On Mon, 2024-10-14 at 18:49 +0200, Jochen wrote: > > > > > Hi, I have two cloned resources in my cluster that have the > > > > > following > > > > > properties: > > > > > > > > > > * There are a maximum of two instances of R1 in the cluster, > > > > > with > > > > > a > > > > > maximum of two per node > > > > > * When any instance of R1 is started on a node, exactly one > > > > > instance > > > > > of R2 should run on that node > > > > > > > > > > When I configure this, and verify the configuration with > > > > > "crm_verify > > > > > -LV", I get the following error: > > > > > > > > > > clone_rsc_colocation_rh) error: Cannot interleave R2- > > > > > clone and > > > > > R1-clone because they do not support the same number of > > > > > instances > > > > > per > > > > > node > > > > > > > > > > How can I make this work? Any help would be greatly > > > > > appreciated. > > > > > > > > Hi, > > > > > > > > I believe the number of instances has to be the same because > > > > each > > > > instance pair on a single node is interleaved. > > > > > > > > There's no direct way to configure what you want, but it might > > > > be > > > > possible with a custom OCF agent for R1 and attribute-based > > > > rules. > > > > > > > > On start, the R1 agent could set a custom node attribute to > > > > some > > > > value. > > > > On stop, it could check whether any other instances are active > > > > (assuming that's possible), and if not, clear the attribute. > > > > Then, > > > > R2 > > > > could have a location rule enabling it only on nodes where the > > > > attribute has the desired value. > > > > > > > > R2 wouldn't stop until *after* the last instance of R1 stops, > > > > which > > > > could be a problem depending on the particulars of the service. > > > > There > > > > might also be a race condition if two instances are stopping at > > > > the > > > > same time, so it might be worthwhile to set ordered=true on the > > > > clone. 
> > > > > > > > > Current configuration is as follows: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="target- > > > > > role" value="Stopped"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="target- > > > > > role" value="Stopped"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > then="R2-clone"/> > > > > > > > > > rsc="R2- > > > > > clone" with-rsc="R1-clone"/> > > > > > > > > > > > > > > > > > > > -- > > > > Ken Gaillot > > > > > > > > _______________________________________________ > > > > Manage your subscription: > > > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > > > > > ClusterLabs home: https://www.clusterlabs.org/ > > > > > > _______________________________________________ > > > Manage your subscription: > > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > > > ClusterLabs home: https://www.clusterlabs.org/ > > -- > > Ken Gaillot > > > > _______________________________________________ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ -- Ken Gaillot From s-sumomozawa at sumomozawa.com Thu Oct 17 11:19:23 2024 From: s-sumomozawa at sumomozawa.com (=?UTF-8?B?566h55CG6ICF?=) Date: Thu, 17 Oct 2024 11:19:23 +0000 Subject: [ClusterLabs] pacemaker & rsyslog References: <07923881-a746-47cd-981d-2d951c1c565c@sumomozawa.com> Message-ID: <010001929a342a79-8d02558d-f1c3-4c21-89c4-21f0f5ebfc66-000000@email.amazonses.com> Nice to meet you. Thank you for your help. I would like your opinion on setting up rsyslog on a spacemaker resource and giving it VIP. I am aware that if rsyslog is clustered with pacemaker, rsyslog will be active standby, so the logs will not be output on the standby machine. Is there any way to make only certain logs active standby? If so, should we consider other means instead of using pacemaker? -- ?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/ ?? ?? (Seiji Sumomozawa) TEL?080-5099-4247 Mail?s-sumomozawa at sumomozawa.com ?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/ From mrt_nl at hotmail.com Fri Oct 18 18:45:48 2024 From: mrt_nl at hotmail.com (Murat Inal) Date: Fri, 18 Oct 2024 21:45:48 +0300 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock In-Reply-To: References: Message-ID: Hi Oyvind, Probably current portblock has a bug. It CREATES netfilter rule on start(), however DOES NOT DELETE the rule on stop(). Here is the configuration of my simple 2 node + 1 qdevice cluster; node 1: node-a-knet \ ??? attributes standby=off node 2: node-b-knet \ ??? attributes standby=off primitive r-porttoggle portblock \ ??? params action=block direction=out ip=172.16.0.1 portno=1234 protocol=udp \ ??? op monitor interval=10s timeout=10s \ ??? op start interval=0s timeout=20s \ ??? op stop interval=0s timeout=20s primitive r-vip IPaddr2 \ ??? params cidr_netmask=24 ip=10.1.6.253 \ ??? op monitor interval=10s timeout=20s \ ??? op start interval=0s timeout=20s \ ??? op stop interval=0s timeout=20s colocation c1 inf: r-porttoggle r-vip order o1 r-vip r-porttoggle property cib-bootstrap-options: \ ??? have-watchdog=false \ ??? dc-version=2.1.6-6fdc9deea29 \ ??? cluster-infrastructure=corosync \ ??? cluster-name=testcluster \ ??? 
stonith-enabled=false \ last-lrm-refresh=1729272215 - I checked the switchover and observed netfilter chain (watch sudo iptables -L OUTPUT) real-time, - Tried portblock with parameter direction=out & both. 
> >- Checked if the relevant functions IptablesBLOCK() & >IptablesUNBLOCK() are executing (by inserting syslog mark messages >inside). They do run. > >However rule is ONLY created, NEVER deleted. > >Any suggestions? > > >On 10/9/24 11:26, Oyvind Albrigtsen wrote: > >>Correct. That should block the port when the resource is stopped on a >>node (e.g. if you have it grouped with the service you're using on the >>port). >> >>I would do some testing to ensure it works exactly as you expect. E.g. >>you can telnet to the port, or you can run nc/socat on the port and >>telnet to it from the node it blocks/unblocks. If it doesnt accept >>the connection you know it's blocked. >> >> >>Oyvind Albrigtsen >> >>On 06/10/24 22:46 GMT, Murat Inal wrote: >>>Hello, >>> >>>I'd like to confirm with you the mechanism of ocf:heartbeat:portblock. >>> >>>Given a resource definition; >>> >>>Resource: r41_LIO (class=ocf provider=heartbeat type=portblock) >>>? Attributes: r41_LIO-instance_attributes >>>??? action=unblock >>>??? ip=10.1.8.194 >>>??? portno=3260 >>>??? protocol=tcp >>> >>>- If resource starts, TCP:3260 is UNBLOCKED. >>> >>>- If resource is stopped, TCP:3260 is BLOCKED. >>> >>>Is that correct? If action=block, it will run just the opposite, >>>correct? >>> >>>To toggle a port, a single portblock resource is enough, correct? >>> >>>Thanks, >>> >>>_______________________________________________ >>>Manage your subscription: >>>https://lists.clusterlabs.org/mailman/listinfo/users >>> >>>ClusterLabs home: https://www.clusterlabs.org/ >> >>_______________________________________________ >>Manage your subscription: >>https://lists.clusterlabs.org/mailman/listinfo/users >> >>ClusterLabs home: https://www.clusterlabs.org/ >_______________________________________________ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ From oalbrigt at redhat.com Mon Oct 21 09:44:08 2024 From: oalbrigt at redhat.com (Oyvind Albrigtsen) Date: Mon, 21 Oct 2024 11:44:08 +0200 Subject: [ClusterLabs] announcement: schedule for resource-agents release 4.16.0 Message-ID: Hi, This is a tentative schedule for resource-agents v4.16.0: 4.16.0-rc1: Oct 30. 4.16.0: Nov 6. Full list of changes: https://github.com/ClusterLabs/resource-agents/compare/v4.15.1...main I've modified the corresponding milestones at: https://github.com/ClusterLabs/resource-agents/milestones If there's anything you think should be part of the release please open an issue, a pull request, or a bugzilla, as you see fit. If there's anything that hasn't received due attention, please let us know. Finally, if you can help with resolving issues consider yourself invited to do so. There are currently 160 issues and 49 pull requests still open. Cheers, Oyvind Albrigtsen From chenzufei at gmail.com Mon Oct 21 11:07:01 2024 From: chenzufei at gmail.com (zufei chen) Date: Mon, 21 Oct 2024 19:07:01 +0800 Subject: [ClusterLabs] poor performance for large resource configuration Message-ID: Hi all, background? 1. lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8) 2. there are 11 nodes in total, divided into 3 groups. If a node fails within a group, the resources can only be taken over by nodes within that group. 3. Each node has 2 MDTs and 16 OSTs. Issues: 1. The resource configuration time progressively increases. the second mdt-0 cost only 8s?the last ost-175 cost 1min:37s 2. The total time taken for the configuration is approximately 2 hours and 31 minutes. 
Is there a way to improve it? attachment: create bash: pcs_create.sh create log: pcs_create.log -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pcs_create.sh Type: text/x-sh Size: 5571 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pcs_create.log Type: application/octet-stream Size: 28129 bytes Desc: not available URL: From social at bohboh.info Mon Oct 21 12:54:26 2024 From: social at bohboh.info (Social Boh) Date: Mon, 21 Oct 2024 07:54:26 -0500 Subject: [ClusterLabs] announcement: schedule for resource-agents release 4.16.0 In-Reply-To: References: Message-ID: thank you very much!!! --- I'm SoCIaL, MayBe El 21/10/2024 a las 4:44 a.?m., Oyvind Albrigtsen escribi?: > Hi, > > This is a tentative schedule for resource-agents v4.16.0: > 4.16.0-rc1: Oct 30. > 4.16.0: Nov 6. > > Full list of changes: > https://github.com/ClusterLabs/resource-agents/compare/v4.15.1...main > > I've modified the corresponding milestones at: > https://github.com/ClusterLabs/resource-agents/milestones > > If there's anything you think should be part of the release > please open an issue, a pull request, or a bugzilla, as you see > fit. > > If there's anything that hasn't received due attention, please > let us know. > > Finally, if you can help with resolving issues consider yourself > invited to do so. There are currently 160 issues and 49 pull > requests still open. > > > Cheers, > Oyvind Albrigtsen > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ From kgaillot at redhat.com Mon Oct 21 14:30:42 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Mon, 21 Oct 2024 09:30:42 -0500 Subject: [ClusterLabs] pacemaker & rsyslog In-Reply-To: <010001929a342a79-8d02558d-f1c3-4c21-89c4-21f0f5ebfc66-000000@email.amazonses.com> References: <07923881-a746-47cd-981d-2d951c1c565c@sumomozawa.com> <010001929a342a79-8d02558d-f1c3-4c21-89c4-21f0f5ebfc66-000000@email.amazonses.com> Message-ID: <4f5a0b24ec5c5b77b04dba4598bae4d14798ed92.camel@redhat.com> On Thu, 2024-10-17 at 11:19 +0000, ??? via Users wrote: > Nice to meet you. Thank you for your help. > > I would like your opinion on setting up rsyslog on a spacemaker > resource > and giving it VIP. > I am aware that if rsyslog is clustered with pacemaker, rsyslog will > be > active standby, so the logs will not be output on the standby > machine. Hi, Welcome to the community. What is your use case for rsyslog? Do you want local logging on each node, or should one node be an aggregator for all the other nodes, or is this cluster providing a rsyslog aggregator for hosts elsewhere? > Is there any way to make only certain logs active standby? > If so, should we consider other means instead of using pacemaker? > > > -- > ?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/ > ?? ?? (Seiji Sumomozawa) > TEL?080-5099-4247 > Mail?s-sumomozawa at sumomozawa.com > ?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/?/ > -- Ken Gaillot From tojeline at redhat.com Mon Oct 21 15:19:00 2024 From: tojeline at redhat.com (Tomas Jelinek) Date: Mon, 21 Oct 2024 17:19:00 +0200 Subject: [ClusterLabs] Fix for CVE-2024-41123, CVE-2024-41946, CVE-2024-43398 In-Reply-To: References: Message-ID: Hi, The listed CVEs describe vulnerabilities in REXML library. Pcs source code is not affected. 
Therefore, no fix is available / planned in pcs source code to address these. However, if you are using rexml packages or pcs packages which contain a copy of REXML, I suggest to keep them upgraded to the latest available version. Regards, Tomas Dne 13. 10. 24 v 6:40 NS Lokesh via Users napsal(a): > Hi Team, > > Please be informed, we have got notified from our security tool that our > pcs version 0.10 is affected by the > *CVE-2024-41123,CVE-2024-41946,CVE-2024-43398* > > It would be great if we help to get answers for the below queries. > > 1. Is clusterlab pcs affected by the above mention CVE?s? > 2. Is there any fix planned/available for this affection version > (0.10.x) of pcs ? > 3. Let us know in which release this CVEs fix are planned ? > > We are currently in RHEL 8.6 OS and using pcs 0.10 version, ** > > *Our system Details:-* > > OS Version: RHEL 8.6 > > Name??????? : pcs > > Version???? : 0.10.16 > > Release???? : 1.el8 > > Architecture: x86_64 > > Regards, > > Lokesh NS > > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ From kgaillot at redhat.com Mon Oct 21 20:03:54 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Mon, 21 Oct 2024 15:03:54 -0500 Subject: [ClusterLabs] Pacemaker 2.1.9-rc3 released Message-ID: <815826572613072a7957152a92d712f7ad4f8967.camel@redhat.com> Hi all, The third (and likely final) release candidate for Pacemaker 2.1.9 is now available at: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.9-rc3 I decided to squeeze in a couple more minor fixes. For details, see the above link. Everyone is encouraged to download, compile and test the new release. We do many regression tests and simulations, but we can't cover all possible use cases, so your feedback is important and appreciated. If no one reports any issues with this candidate, it will likely become the final release around the end of the month. Many thanks to all contributors to this release, including Aleksei Burlakov and Ken Gaillot. -- Ken Gaillot From fatcharly at gmx.de Tue Oct 22 12:18:44 2024 From: fatcharly at gmx.de (Testuser SST) Date: Tue, 22 Oct 2024 12:18:44 +0000 Subject: [ClusterLabs] Problem with a new cluster with drbd on AlmaLinux 9 Message-ID: Hi, I'm running a 2-node-web-cluster on Almalinux-9, pacemaker 2.1.7, drbd9 and corosync 3.1. I have trouble with the promoting and mounting of the drbd-device. After activating the cluster, the drbd-device is not getting mounted and is showing quite fast an error message: pacemaker-schedulerd[4879]: warning: Unexpected result (error: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) was recorded for start of Webcontent_FS on ... pacemaker-schedulerd[4879]: warning: Webcontent_FS cannot run on kathie3 due to reaching migration threshold (clean up resource to allow again) It's like it's trying to mount the device, but the device is not ready yet. The device is the drbd1 and I'm trying to mount it on /mnt/clusterfs. After the error occoured, and I do a "pcs resource cleanup" the cluster is able to mount it. the drbd-resource is named webcontend_DRBD the mounted filesystem is named webcontend_FS All other resources like httpd and HA-IP's working like a charm. 
This is the log from the start of the cluster: Oct 22 11:48:12 kathie3 pacemaker-controld[4880]: notice: State transition S_ELECTION -> S_INTEGRATION Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_1 ( kathie3 ) Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_2 ( kathie3 ) Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_3 ( kathie3 ) Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_DRBD:0 ( kathie3 ) Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_FS ( kathie3 ) Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start ping_fw:0 ( kathie3 ) Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Calculated transition 1106, saving inputs in /var/lib/pacemaker/pengine/pe-input-336.bz2 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_1_start_0 locally on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_FS_start_0 locally on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation ping_fw_start_0 locally on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_DRBD_start_0 locally on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_1 on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for ping_fw on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_DRBD on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_FS on kathie3 Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682892]: INFO: Adding inet address 192.168.16.75/24 with broadcast address 192.168.16.255 to device ens3 Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682912]: INFO: Bringing device ens3 up Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1682923]: INFO: Running start for /dev/drbd1 on /mnt/clusterfs Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682929]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.75 ens3 192.168.16.75 auto not_used not_used Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Starting worker thread (node-id 0) Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for HA-IP_1 on kathie3: ok Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating monitor operation HA-IP_1_monitor_30000 locally on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of monitor operation for HA-IP_1 on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_2_start_0 locally on kathie3 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_2 on kathie3 Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Auto-promote failed: Need access to UpToDate data (-2) Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't open blockdev Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't open blockdev Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: meta-data IO uses: blk-bio Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Diskless -> Attaching 
) [attach] Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Maximum number of peer devices = 1 Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Method to ensure write ordering: flush Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: drbd_bm_resize called with capacity == 104854328 Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: resync bitmap: bits=13106791 words=204794 pages=400 Oct 22 11:48:13 kathie3 kernel: drbd1: detected capacity change from 0 to 104854328 Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: size = 50 GB (52427164 KB) Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1683017]: ERROR: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_FS on kathie3: error (Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Webcontent_FS_start_0 at kathie3 output [ blockdev: cannot open /dev/drbd1: No data available\nmount: /mnt/clusterfs: mount(2) system call failed: No data available.\nocf-exit-reason:Couldn't mount device [/dev/drbd1] as /mnt/clusterfs\n ] Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by operation Webcontent_FS_start_0 'modify' on kathie3: Event failed Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 action 37 (Webcontent_FS_start_0 on kathie3): expected 'ok' but got 'error' Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting last-failure-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> 1729590493 Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting fail-count-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> INFINITY Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by status-1-last-failure-Webcontent_FS.start_0 doing create last-failure-Webcontent_FS#start_0=1729590493: Transient attribute change Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: bitmap READ of 400 pages took 34 ms Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Attaching -> UpToDate ) [attach] Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: attached to current UUID: 826E8850CF10C812 Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Setting exposed data uuid: 826E8850CF10C812 Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of monitor operation for HA-IP_1 on kathie3: ok Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting sender thread (peer-node-id 1) Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( StandAlone -> Unconnected ) [connect] Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting receiver thread (peer-node-id 1) Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( Unconnected -> Connecting ) [connecting] Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683100]: INFO: Adding inet address 192.168.16.76/24 with broadcast address 192.168.16.255 to device ens3 Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683106]: INFO: Bringing device ens3 up Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683112]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.76 ens3 192.168.16.76 auto not_used not_used Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for HA-IP_2 on kathie3: ok Oct 22 11:48:15 kathie3 pacemaker-attrd[4878]: notice: Setting 
pingd[kathie3] in instance_attributes: (unset) -> 1000 Oct 22 11:48:15 kathie3 pacemaker-controld[4880]: notice: Result of start operation for ping_fw on kathie3: ok Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_1)[1683126]: INFO: ARPING 192.168.16.75 from 192.168.16.75 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_2)[1683130]: INFO: ARPING 192.168.16.76 from 192.168.16.76 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683138]: INFO: webcontent_data: Called drbdsetup wait-connect-resource webcontent_data --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5 Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683142]: INFO: webcontent_data: Exit code 5 Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683146]: INFO: webcontent_data: Command output: Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683150]: INFO: webcontent_data: Command stderr: Oct 22 11:48:19 kathie3 pacemaker-attrd[4878]: notice: Setting master-Webcontent_DRBD[kathie3] in instance_attributes: (unset) -> 1000 Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_DRBD on kathie3: ok Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Initiating notify operation Webcontent_DRBD_post_notify_start_0 locally on kathie3 ... Is there some kind of timeout wrong or what am I missing ? Any suggestions are welcome Kind regards fatcharly From arvidjaar at gmail.com Tue Oct 22 13:41:09 2024 From: arvidjaar at gmail.com (Andrei Borzenkov) Date: Tue, 22 Oct 2024 16:41:09 +0300 Subject: [ClusterLabs] Problem with a new cluster with drbd on AlmaLinux 9 In-Reply-To: References: Message-ID: On Tue, Oct 22, 2024 at 3:18?PM Testuser SST via Users wrote: > > Hi, > I'm running a 2-node-web-cluster on Almalinux-9, pacemaker 2.1.7, drbd9 and corosync 3.1. > I have trouble with the promoting and mounting of the drbd-device. After activating the cluster, > the drbd-device is not getting mounted and is showing quite fast an error message: > > pacemaker-schedulerd[4879]: warning: Unexpected result (error: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) was recorded for start of Webcontent_FS on ... > pacemaker-schedulerd[4879]: warning: Webcontent_FS cannot run on kathie3 due to reaching migration threshold (clean up resource to allow again) > Do you have any ordering constraints between Webcontent_DRBD and Webcontent_FS? > It's like it's trying to mount the device, but the device is not ready yet. > The device is the drbd1 and I'm trying to mount it on /mnt/clusterfs. After the error occoured, and I do a "pcs resource cleanup" the cluster is able to mount it. > the drbd-resource is named webcontend_DRBD > the mounted filesystem is named webcontend_FS > All other resources like httpd and HA-IP's working like a charm. 
> > This is the log from the start of the cluster: > > Oct 22 11:48:12 kathie3 pacemaker-controld[4880]: notice: State transition S_ELECTION -> S_INTEGRATION > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_1 ( kathie3 ) > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_2 ( kathie3 ) > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_3 ( kathie3 ) > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_DRBD:0 ( kathie3 ) > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_FS ( kathie3 ) > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start ping_fw:0 ( kathie3 ) > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Calculated transition 1106, saving inputs in /var/lib/pacemaker/pengine/pe-input-336.bz2 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_1_start_0 locally on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_FS_start_0 locally on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation ping_fw_start_0 locally on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_DRBD_start_0 locally on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_1 on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for ping_fw on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_DRBD on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_FS on kathie3 > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682892]: INFO: Adding inet address 192.168.16.75/24 with broadcast address 192.168.16.255 to device ens3 > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682912]: INFO: Bringing device ens3 up > Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1682923]: INFO: Running start for /dev/drbd1 on /mnt/clusterfs > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682929]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.75 ens3 192.168.16.75 auto not_used not_used > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Starting worker thread (node-id 0) > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for HA-IP_1 on kathie3: ok > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating monitor operation HA-IP_1_monitor_30000 locally on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of monitor operation for HA-IP_1 on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_2_start_0 locally on kathie3 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_2 on kathie3 > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Auto-promote failed: Need access to UpToDate data (-2) > Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't open blockdev > Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't open blockdev > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: meta-data IO uses: blk-bio > Oct 22 11:48:13 
kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Diskless -> Attaching ) [attach] > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Maximum number of peer devices = 1 > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Method to ensure write ordering: flush > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: drbd_bm_resize called with capacity == 104854328 > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: resync bitmap: bits=13106791 words=204794 pages=400 > Oct 22 11:48:13 kathie3 kernel: drbd1: detected capacity change from 0 to 104854328 > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: size = 50 GB (52427164 KB) > Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1683017]: ERROR: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_FS on kathie3: error (Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Webcontent_FS_start_0 at kathie3 output [ blockdev: cannot open /dev/drbd1: No data available\nmount: /mnt/clusterfs: mount(2) system call failed: No data available.\nocf-exit-reason:Couldn't mount device [/dev/drbd1] as /mnt/clusterfs\n ] > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by operation Webcontent_FS_start_0 'modify' on kathie3: Event failed > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 action 37 (Webcontent_FS_start_0 on kathie3): expected 'ok' but got 'error' > Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting last-failure-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> 1729590493 > Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting fail-count-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> INFINITY > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by status-1-last-failure-Webcontent_FS.start_0 doing create last-failure-Webcontent_FS#start_0=1729590493: Transient attribute change > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: bitmap READ of 400 pages took 34 ms > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Attaching -> UpToDate ) [attach] > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: attached to current UUID: 826E8850CF10C812 > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Setting exposed data uuid: 826E8850CF10C812 > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of monitor operation for HA-IP_1 on kathie3: ok > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting sender thread (peer-node-id 1) > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( StandAlone -> Unconnected ) [connect] > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting receiver thread (peer-node-id 1) > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( Unconnected -> Connecting ) [connecting] > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683100]: INFO: Adding inet address 192.168.16.76/24 with broadcast address 192.168.16.255 to device ens3 > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683106]: INFO: Bringing device ens3 up > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683112]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.76 ens3 192.168.16.76 auto not_used not_used > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: 
Result of start operation for HA-IP_2 on kathie3: ok > Oct 22 11:48:15 kathie3 pacemaker-attrd[4878]: notice: Setting pingd[kathie3] in instance_attributes: (unset) -> 1000 > Oct 22 11:48:15 kathie3 pacemaker-controld[4880]: notice: Result of start operation for ping_fw on kathie3: ok > Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_1)[1683126]: INFO: ARPING 192.168.16.75 from 192.168.16.75 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) > Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_2)[1683130]: INFO: ARPING 192.168.16.76 from 192.168.16.76 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683138]: INFO: webcontent_data: Called drbdsetup wait-connect-resource webcontent_data --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5 > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683142]: INFO: webcontent_data: Exit code 5 > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683146]: INFO: webcontent_data: Command output: > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683150]: INFO: webcontent_data: Command stderr: > Oct 22 11:48:19 kathie3 pacemaker-attrd[4878]: notice: Setting master-Webcontent_DRBD[kathie3] in instance_attributes: (unset) -> 1000 > Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_DRBD on kathie3: ok > Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Initiating notify operation Webcontent_DRBD_post_notify_start_0 locally on kathie3 > ... > > Is there some kind of timeout wrong or what am I missing ? > > Any suggestions are welcome > > Kind regards > > fatcharly > > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ From fatcharly at gmx.de Tue Oct 22 13:44:53 2024 From: fatcharly at gmx.de (Testuser SST) Date: Tue, 22 Oct 2024 13:44:53 +0000 Subject: [ClusterLabs] Problem with a new cluster with drbd on AlmaLinux 9 In-Reply-To: References: Message-ID: Hi Andrei, no, this are the only ones: Location Constraints: resource 'Apache' (id: location-Apache) Rules: Rule: boolean-op=or score=-INFINITY (id: location-Apache-rule) Expression: pingd lt 1 (id: location-Apache-rule-expr) Expression: not_defined pingd (id: location-Apache-rule-expr-1) Colocation Constraints: resource 'Apache' with resource 'HA-IPs' (id: colocation-Apache-HA-IPs-INFINITY) score=INFINITY resource 'Apache' with resource 'Webcontent_FS' (id: colocation-Apache-Webcontent_FS-INFINITY) score=INFINITY Order Constraints: start resource 'HA-IPs' then start resource 'Apache' (id: order-HA-IPs-Apache-mandatory) start resource 'Webcontent_FS' then start resource 'Apache' (id: order-Webcontent_FS-Apache-mandatory) > Gesendet: Dienstag, 22. Oktober 2024 um 15:41 > Von: "Andrei Borzenkov" > An: "Cluster Labs - All topics related to open-source clustering welcomed" > CC: "Testuser SST" > Betreff: Re: [ClusterLabs] Problem with a new cluster with drbd on AlmaLinux 9 > > On Tue, Oct 22, 2024 at 3:18?PM Testuser SST via Users > wrote: > > > > Hi, > > I'm running a 2-node-web-cluster on Almalinux-9, pacemaker 2.1.7, drbd9 and corosync 3.1. > > I have trouble with the promoting and mounting of the drbd-device. 
After activating the cluster, > > the drbd-device is not getting mounted and is showing quite fast an error message: > > > > pacemaker-schedulerd[4879]: warning: Unexpected result (error: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) was recorded for start of Webcontent_FS on ... > > pacemaker-schedulerd[4879]: warning: Webcontent_FS cannot run on kathie3 due to reaching migration threshold (clean up resource to allow again) > > > > Do you have any ordering constraints between Webcontent_DRBD and Webcontent_FS? > > > It's like it's trying to mount the device, but the device is not ready yet. > > The device is the drbd1 and I'm trying to mount it on /mnt/clusterfs. After the error occoured, and I do a "pcs resource cleanup" the cluster is able to mount it. > > the drbd-resource is named webcontend_DRBD > > the mounted filesystem is named webcontend_FS > > All other resources like httpd and HA-IP's working like a charm. > > > > This is the log from the start of the cluster: > > > > Oct 22 11:48:12 kathie3 pacemaker-controld[4880]: notice: State transition S_ELECTION -> S_INTEGRATION > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_1 ( kathie3 ) > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_2 ( kathie3 ) > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_3 ( kathie3 ) > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_DRBD:0 ( kathie3 ) > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_FS ( kathie3 ) > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start ping_fw:0 ( kathie3 ) > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Calculated transition 1106, saving inputs in /var/lib/pacemaker/pengine/pe-input-336.bz2 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_1_start_0 locally on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_FS_start_0 locally on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation ping_fw_start_0 locally on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_DRBD_start_0 locally on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_1 on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for ping_fw on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_DRBD on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_FS on kathie3 > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682892]: INFO: Adding inet address 192.168.16.75/24 with broadcast address 192.168.16.255 to device ens3 > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682912]: INFO: Bringing device ens3 up > > Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1682923]: INFO: Running start for /dev/drbd1 on /mnt/clusterfs > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682929]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.75 ens3 192.168.16.75 auto not_used not_used > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Starting worker thread 
(node-id 0) > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for HA-IP_1 on kathie3: ok > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating monitor operation HA-IP_1_monitor_30000 locally on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of monitor operation for HA-IP_1 on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_2_start_0 locally on kathie3 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_2 on kathie3 > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Auto-promote failed: Need access to UpToDate data (-2) > > Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't open blockdev > > Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't open blockdev > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: meta-data IO uses: blk-bio > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Diskless -> Attaching ) [attach] > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Maximum number of peer devices = 1 > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Method to ensure write ordering: flush > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: drbd_bm_resize called with capacity == 104854328 > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: resync bitmap: bits=13106791 words=204794 pages=400 > > Oct 22 11:48:13 kathie3 kernel: drbd1: detected capacity change from 0 to 104854328 > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: size = 50 GB (52427164 KB) > > Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1683017]: ERROR: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_FS on kathie3: error (Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Webcontent_FS_start_0 at kathie3 output [ blockdev: cannot open /dev/drbd1: No data available\nmount: /mnt/clusterfs: mount(2) system call failed: No data available.\nocf-exit-reason:Couldn't mount device [/dev/drbd1] as /mnt/clusterfs\n ] > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by operation Webcontent_FS_start_0 'modify' on kathie3: Event failed > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 action 37 (Webcontent_FS_start_0 on kathie3): expected 'ok' but got 'error' > > Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting last-failure-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> 1729590493 > > Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting fail-count-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> INFINITY > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by status-1-last-failure-Webcontent_FS.start_0 doing create last-failure-Webcontent_FS#start_0=1729590493: Transient attribute change > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: bitmap READ of 400 pages took 34 ms > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Attaching -> UpToDate ) [attach] > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: attached to current UUID: 826E8850CF10C812 > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Setting exposed data 
uuid: 826E8850CF10C812 > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of monitor operation for HA-IP_1 on kathie3: ok > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting sender thread (peer-node-id 1) > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( StandAlone -> Unconnected ) [connect] > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting receiver thread (peer-node-id 1) > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( Unconnected -> Connecting ) [connecting] > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683100]: INFO: Adding inet address 192.168.16.76/24 with broadcast address 192.168.16.255 to device ens3 > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683106]: INFO: Bringing device ens3 up > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683112]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.76 ens3 192.168.16.76 auto not_used not_used > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for HA-IP_2 on kathie3: ok > > Oct 22 11:48:15 kathie3 pacemaker-attrd[4878]: notice: Setting pingd[kathie3] in instance_attributes: (unset) -> 1000 > > Oct 22 11:48:15 kathie3 pacemaker-controld[4880]: notice: Result of start operation for ping_fw on kathie3: ok > > Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_1)[1683126]: INFO: ARPING 192.168.16.75 from 192.168.16.75 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) > > Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_2)[1683130]: INFO: ARPING 192.168.16.76 from 192.168.16.76 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683138]: INFO: webcontent_data: Called drbdsetup wait-connect-resource webcontent_data --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5 > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683142]: INFO: webcontent_data: Exit code 5 > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683146]: INFO: webcontent_data: Command output: > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683150]: INFO: webcontent_data: Command stderr: > > Oct 22 11:48:19 kathie3 pacemaker-attrd[4878]: notice: Setting master-Webcontent_DRBD[kathie3] in instance_attributes: (unset) -> 1000 > > Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_DRBD on kathie3: ok > > Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Initiating notify operation Webcontent_DRBD_post_notify_start_0 locally on kathie3 > > ... > > > > Is there some kind of timeout wrong or what am I missing ? > > > > Any suggestions are welcome > > > > Kind regards > > > > fatcharly > > > > > > _______________________________________________ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > From fatcharly at gmx.de Tue Oct 22 16:24:11 2024 From: fatcharly at gmx.de (Testuser SST) Date: Tue, 22 Oct 2024 16:24:11 +0000 Subject: [ClusterLabs] Problem with a new cluster with drbd on AlmaLinux 9 In-Reply-To: References: Message-ID: Hi again, looks like to double apache order Constraint was the problem. Thanks for the hint ! Kind regards fatcharly > Gesendet: Dienstag, 22. 
Oktober 2024 um 15:44 > Von: "Testuser SST via Users" > An: arvidjaar at gmail.com, users at clusterlabs.org > CC: "Testuser SST" > Betreff: Re: [ClusterLabs] Problem with a new cluster with drbd on AlmaLinux 9 > > Hi Andrei, > > no, this are the only ones: > > > Location Constraints: > resource 'Apache' (id: location-Apache) > Rules: > Rule: boolean-op=or score=-INFINITY (id: location-Apache-rule) > Expression: pingd lt 1 (id: location-Apache-rule-expr) > Expression: not_defined pingd (id: location-Apache-rule-expr-1) > Colocation Constraints: > resource 'Apache' with resource 'HA-IPs' (id: colocation-Apache-HA-IPs-INFINITY) > score=INFINITY > resource 'Apache' with resource 'Webcontent_FS' (id: colocation-Apache-Webcontent_FS-INFINITY) > score=INFINITY > Order Constraints: > start resource 'HA-IPs' then start resource 'Apache' (id: order-HA-IPs-Apache-mandatory) > start resource 'Webcontent_FS' then start resource 'Apache' (id: order-Webcontent_FS-Apache-mandatory) > > > > > > > Gesendet: Dienstag, 22. Oktober 2024 um 15:41 > > Von: "Andrei Borzenkov" > > An: "Cluster Labs - All topics related to open-source clustering welcomed" > > CC: "Testuser SST" > > Betreff: Re: [ClusterLabs] Problem with a new cluster with drbd on AlmaLinux 9 > > > > On Tue, Oct 22, 2024 at 3:18?PM Testuser SST via Users > > wrote: > > > > > > Hi, > > > I'm running a 2-node-web-cluster on Almalinux-9, pacemaker 2.1.7, drbd9 and corosync 3.1. > > > I have trouble with the promoting and mounting of the drbd-device. After activating the cluster, > > > the drbd-device is not getting mounted and is showing quite fast an error message: > > > > > > pacemaker-schedulerd[4879]: warning: Unexpected result (error: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) was recorded for start of Webcontent_FS on ... > > > pacemaker-schedulerd[4879]: warning: Webcontent_FS cannot run on kathie3 due to reaching migration threshold (clean up resource to allow again) > > > > > > > Do you have any ordering constraints between Webcontent_DRBD and Webcontent_FS? > > > > > It's like it's trying to mount the device, but the device is not ready yet. > > > The device is the drbd1 and I'm trying to mount it on /mnt/clusterfs. After the error occoured, and I do a "pcs resource cleanup" the cluster is able to mount it. > > > the drbd-resource is named webcontend_DRBD > > > the mounted filesystem is named webcontend_FS > > > All other resources like httpd and HA-IP's working like a charm. 
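For reference, the constraint list shown earlier in this message only ties Apache to the HA-IPs and to Webcontent_FS; nothing ties Webcontent_FS to the promoted DRBD instance. A typical way to add that with pcs would be the sketch below (the clone id Webcontent_DRBD-clone is an assumption here, so substitute the actual promotable clone name; pcs 0.11 uses the role keyword "Promoted" where older pcs releases used "master"):

    # Keep the filesystem on the node where DRBD is promoted ...
    pcs constraint colocation add Webcontent_FS with Promoted Webcontent_DRBD-clone INFINITY
    # ... and only mount it after the promotion has actually finished.
    pcs constraint order promote Webcontent_DRBD-clone then start Webcontent_FS

Without an order on the promote action, the Filesystem start can race the DRBD attach/promote, which matches the "Auto-promote failed: Need access to UpToDate data" entries in the log below.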
> > > > > > This is the log from the start of the cluster: > > > > > > Oct 22 11:48:12 kathie3 pacemaker-controld[4880]: notice: State transition S_ELECTION -> S_INTEGRATION > > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_1 ( kathie3 ) > > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_2 ( kathie3 ) > > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start HA-IP_3 ( kathie3 ) > > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_DRBD:0 ( kathie3 ) > > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start Webcontent_FS ( kathie3 ) > > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Actions: Start ping_fw:0 ( kathie3 ) > > > Oct 22 11:48:13 kathie3 pacemaker-schedulerd[4879]: notice: Calculated transition 1106, saving inputs in /var/lib/pacemaker/pengine/pe-input-336.bz2 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_1_start_0 locally on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_FS_start_0 locally on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation ping_fw_start_0 locally on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation Webcontent_DRBD_start_0 locally on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_1 on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for ping_fw on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_DRBD on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for Webcontent_FS on kathie3 > > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682892]: INFO: Adding inet address 192.168.16.75/24 with broadcast address 192.168.16.255 to device ens3 > > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682912]: INFO: Bringing device ens3 up > > > Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1682923]: INFO: Running start for /dev/drbd1 on /mnt/clusterfs > > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_1)[1682929]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.75 ens3 192.168.16.75 auto not_used not_used > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Starting worker thread (node-id 0) > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for HA-IP_1 on kathie3: ok > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating monitor operation HA-IP_1_monitor_30000 locally on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of monitor operation for HA-IP_1 on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Initiating start operation HA-IP_2_start_0 locally on kathie3 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Requesting local execution of start operation for HA-IP_2 on kathie3 > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Auto-promote failed: Need access to UpToDate data (-2) > > > Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't open blockdev > > > Oct 22 11:48:13 kathie3 kernel: /dev/drbd1: Can't 
open blockdev > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: meta-data IO uses: blk-bio > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Diskless -> Attaching ) [attach] > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Maximum number of peer devices = 1 > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data: Method to ensure write ordering: flush > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: drbd_bm_resize called with capacity == 104854328 > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: resync bitmap: bits=13106791 words=204794 pages=400 > > > Oct 22 11:48:13 kathie3 kernel: drbd1: detected capacity change from 0 to 104854328 > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: size = 50 GB (52427164 KB) > > > Oct 22 11:48:13 kathie3 Filesystem(Webcontent_FS)[1683017]: ERROR: Couldn't mount device [/dev/drbd1] as /mnt/clusterfs > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_FS on kathie3: error (Couldn't mount device [/dev/drbd1] as /mnt/clusterfs) > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Webcontent_FS_start_0 at kathie3 output [ blockdev: cannot open /dev/drbd1: No data available\nmount: /mnt/clusterfs: mount(2) system call failed: No data available.\nocf-exit-reason:Couldn't mount device [/dev/drbd1] as /mnt/clusterfs\n ] > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by operation Webcontent_FS_start_0 'modify' on kathie3: Event failed > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 action 37 (Webcontent_FS_start_0 on kathie3): expected 'ok' but got 'error' > > > Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting last-failure-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> 1729590493 > > > Oct 22 11:48:13 kathie3 pacemaker-attrd[4878]: notice: Setting fail-count-Webcontent_FS#start_0[kathie3] in instance_attributes: (unset) -> INFINITY > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Transition 1106 aborted by status-1-last-failure-Webcontent_FS.start_0 doing create last-failure-Webcontent_FS#start_0=1729590493: Transient attribute change > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: bitmap READ of 400 pages took 34 ms > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: disk( Attaching -> UpToDate ) [attach] > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: attached to current UUID: 826E8850CF10C812 > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data/0 drbd1: Setting exposed data uuid: 826E8850CF10C812 > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of monitor operation for HA-IP_1 on kathie3: ok > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting sender thread (peer-node-id 1) > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( StandAlone -> Unconnected ) [connect] > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: Starting receiver thread (peer-node-id 1) > > > Oct 22 11:48:13 kathie3 kernel: drbd webcontent_data stacy3: conn( Unconnected -> Connecting ) [connecting] > > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683100]: INFO: Adding inet address 192.168.16.76/24 with broadcast address 192.168.16.255 to device ens3 > > > Oct 22 11:48:13 kathie3 IPaddr2(HA-IP_2)[1683106]: INFO: Bringing device ens3 up > > > Oct 22 
11:48:13 kathie3 IPaddr2(HA-IP_2)[1683112]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-192.168.16.76 ens3 192.168.16.76 auto not_used not_used > > > Oct 22 11:48:13 kathie3 pacemaker-controld[4880]: notice: Result of start operation for HA-IP_2 on kathie3: ok > > > Oct 22 11:48:15 kathie3 pacemaker-attrd[4878]: notice: Setting pingd[kathie3] in instance_attributes: (unset) -> 1000 > > > Oct 22 11:48:15 kathie3 pacemaker-controld[4880]: notice: Result of start operation for ping_fw on kathie3: ok > > > Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_1)[1683126]: INFO: ARPING 192.168.16.75 from 192.168.16.75 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) > > > Oct 22 11:48:17 kathie3 IPaddr2(HA-IP_2)[1683130]: INFO: ARPING 192.168.16.76 from 192.168.16.76 ens3#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) > > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683138]: INFO: webcontent_data: Called drbdsetup wait-connect-resource webcontent_data --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5 > > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683142]: INFO: webcontent_data: Exit code 5 > > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683146]: INFO: webcontent_data: Command output: > > > Oct 22 11:48:18 kathie3 drbd(Webcontent_DRBD)[1683150]: INFO: webcontent_data: Command stderr: > > > Oct 22 11:48:19 kathie3 pacemaker-attrd[4878]: notice: Setting master-Webcontent_DRBD[kathie3] in instance_attributes: (unset) -> 1000 > > > Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Result of start operation for Webcontent_DRBD on kathie3: ok > > > Oct 22 11:48:19 kathie3 pacemaker-controld[4880]: notice: Initiating notify operation Webcontent_DRBD_post_notify_start_0 locally on kathie3 > > > ... > > > > > > Is there some kind of timeout wrong or what am I missing ? > > > > > > Any suggestions are welcome > > > > > > Kind regards > > > > > > fatcharly > > > > > > > > > _______________________________________________ > > > Manage your subscription: > > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > > > ClusterLabs home: https://www.clusterlabs.org/ > > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > From mrt_nl at hotmail.com Tue Oct 22 18:44:42 2024 From: mrt_nl at hotmail.com (Murat Inal) Date: Tue, 22 Oct 2024 21:44:42 +0300 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock In-Reply-To: References: Message-ID: Hello Oyvind, Using your suggestion, I located the issue at function chain_isactive(). This function greps the "generated" rule string (via function active_grep_pat()) in the rule table. Generated string does NOT match with iptables output anymore. Consequently, RA decides that the rule is ABSENT, although it is PRESENT. I opted to use "iptables --check" command for rule existence detection. Below is the function with modification comments; #chain_isactive? {udp|tcp} portno,portno ip chain chain_isactive() { ?? ?[ "$4" = "OUTPUT" ] && ds="s" || ds="d" ??? #PAT=$(active_grep_pat "$1" "$2" "$3" "$ds") # grep pattern ??? #$IPTABLES $wait -n -L "$4" | grep "$PAT" >/dev/null ??? ??? ??? ??? ??? ??? ?? ? ? ? ? ? ?? ??? # old detection line ?? ?iptables -C "$4" -p "$1" -${ds} "$3" -m multiport --${ds}ports "$2" -j DROP??? ??? ??? ??? # new detection using iptables --check/-C } I tested the modified RA with both actions (block & unblock). It works. 
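For a concrete sense of what the new check does: iptables -C (--check) returns exit status 0 only if the exact rule already exists in the chain, so no output parsing is needed. Plugging in the r-porttoggle parameters from the configuration quoted further down in this message (action=block, direction=out, ip=172.16.0.1, portno=1234, protocol=udp, hence the OUTPUT chain matched on the source side), a rough manual equivalent of the modified detection would be:

    # Exit status 0 = DROP rule present in OUTPUT, non-zero = absent.
    # OUTPUT rules are matched on the source side, hence -s and --sports.
    iptables -C OUTPUT -p udp -s 172.16.0.1 -m multiport --sports 1234 -j DROP \
        && echo "rule present" || echo "rule absent"

This is only a sketch of the detection step; the agent itself also honours its $IPTABLES and $wait settings, which are left out here.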
If you agree with the above, active_grep_pat() has NO use, it can be deleted from the script. On 10/21/24 12:25, Oyvind Albrigtsen wrote: > I would try running "pcs resource debug-stop --full " to see > what's happening, and try to run the "iptables -D" line manually if it > doesnt show you an error. > > > Oyvind > > On 18/10/24 21:45 +0300, Murat Inal wrote: >> Hi Oyvind, >> >> Probably current portblock has a bug. It CREATES netfilter rule on >> start(), however DOES NOT DELETE the rule on stop(). >> >> Here is the configuration of my simple 2 node + 1 qdevice cluster; >> >> >> node 1: node-a-knet \ >> ??? attributes standby=off >> node 2: node-b-knet \ >> ??? attributes standby=off >> primitive r-porttoggle portblock \ >> ??? params action=block direction=out ip=172.16.0.1 portno=1234 >> protocol=udp \ >> ??? op monitor interval=10s timeout=10s \ >> ??? op start interval=0s timeout=20s \ >> ??? op stop interval=0s timeout=20s >> primitive r-vip IPaddr2 \ >> ??? params cidr_netmask=24 ip=10.1.6.253 \ >> ??? op monitor interval=10s timeout=20s \ >> ??? op start interval=0s timeout=20s \ >> ??? op stop interval=0s timeout=20s >> colocation c1 inf: r-porttoggle r-vip >> order o1 r-vip r-porttoggle >> property cib-bootstrap-options: \ >> ??? have-watchdog=false \ >> ??? dc-version=2.1.6-6fdc9deea29 \ >> ??? cluster-infrastructure=corosync \ >> ??? cluster-name=testcluster \ >> ??? stonith-enabled=false \ >> ??? last-lrm-refresh=1729272215 >> >> >> - I checked the switchover and observed netfilter chain (watch sudo >> iptables -L OUTPUT) real-time, >> >> - Tried portblock with parameter direction=out & both. >> >> - Checked if the relevant functions IptablesBLOCK() & >> IptablesUNBLOCK() are executing (by inserting syslog mark messages >> inside). They do run. >> >> However rule is ONLY created, NEVER deleted. >> >> Any suggestions? >> >> >> On 10/9/24 11:26, Oyvind Albrigtsen wrote: >> >>> Correct. That should block the port when the resource is stopped on a >>> node (e.g. if you have it grouped with the service you're using on the >>> port). >>> >>> I would do some testing to ensure it works exactly as you expect. E.g. >>> you can telnet to the port, or you can run nc/socat on the port and >>> telnet to it from the node it blocks/unblocks. If it doesnt accept >>> the connection you know it's blocked. >>> >>> >>> Oyvind Albrigtsen >>> >>> On 06/10/24 22:46 GMT, Murat Inal wrote: >>>> Hello, >>>> >>>> I'd like to confirm with you the mechanism of ocf:heartbeat:portblock. >>>> >>>> Given a resource definition; >>>> >>>> Resource: r41_LIO (class=ocf provider=heartbeat type=portblock) >>>> ? Attributes: r41_LIO-instance_attributes >>>> ??? action=unblock >>>> ??? ip=10.1.8.194 >>>> ??? portno=3260 >>>> ??? protocol=tcp >>>> >>>> - If resource starts, TCP:3260 is UNBLOCKED. >>>> >>>> - If resource is stopped, TCP:3260 is BLOCKED. >>>> >>>> Is that correct? If action=block, it will run just the opposite, >>>> correct? >>>> >>>> To toggle a port, a single portblock resource is enough, correct? 
>>>> >>>> Thanks, >>>> >>>> _______________________________________________ >>>> Manage your subscription: >>>> https://lists.clusterlabs.org/mailman/listinfo/users >>>> >>>> ClusterLabs home: https://www.clusterlabs.org/ >>> >>> _______________________________________________ >>> Manage your subscription: >>> https://lists.clusterlabs.org/mailman/listinfo/users >>> >>> ClusterLabs home: https://www.clusterlabs.org/ >> _______________________________________________ >> Manage your subscription: >> https://lists.clusterlabs.org/mailman/listinfo/users >> >> ClusterLabs home: https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ From oalbrigt at redhat.com Wed Oct 23 11:43:51 2024 From: oalbrigt at redhat.com (Oyvind Albrigtsen) Date: Wed, 23 Oct 2024 13:43:51 +0200 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock In-Reply-To: References: Message-ID: This could be related to the following PR: https://github.com/ClusterLabs/resource-agents/pull/1924/files The github version of portblock works fine on Fedora 40, so that's my best guess. Oyvind On 22/10/24 21:44 +0300, Murat Inal wrote: >Hello Oyvind, > >Using your suggestion, I located the issue at function chain_isactive(). > >This function greps the "generated" rule string (via function >active_grep_pat()) in the rule table. Generated string does NOT match >with iptables output anymore. Consequently, RA decides that the rule >is ABSENT, although it is PRESENT. > >I opted to use "iptables --check" command for rule existence >detection. Below is the function with modification comments; > > >#chain_isactive? {udp|tcp} portno,portno ip chain >chain_isactive() >{ >?? ?[ "$4" = "OUTPUT" ] && ds="s" || ds="d" >??? #PAT=$(active_grep_pat "$1" "$2" "$3" "$ds") # grep pattern >??? #$IPTABLES $wait -n -L "$4" | grep "$PAT" >/dev/null ??? ??? ??? >??? ??? ??? ?? ? ? ? ? ? ?? ??? # old detection line >?? ?iptables -C "$4" -p "$1" -${ds} "$3" -m multiport --${ds}ports >"$2" -j DROP??? ??? ??? ??? # new detection using iptables --check/-C >} > >I tested the modified RA with both actions (block & unblock). It >works. If you agree with the above, active_grep_pat() has NO use, it >can be deleted from the script. > > >On 10/21/24 12:25, Oyvind Albrigtsen wrote: >>I would try running "pcs resource debug-stop --full " to see >>what's happening, and try to run the "iptables -D" line manually if it >>doesnt show you an error. >> >> >>Oyvind >> >>On 18/10/24 21:45 +0300, Murat Inal wrote: >>>Hi Oyvind, >>> >>>Probably current portblock has a bug. It CREATES netfilter rule on >>>start(), however DOES NOT DELETE the rule on stop(). >>> >>>Here is the configuration of my simple 2 node + 1 qdevice cluster; >>> >>> >>>node 1: node-a-knet \ >>>??? attributes standby=off >>>node 2: node-b-knet \ >>>??? attributes standby=off >>>primitive r-porttoggle portblock \ >>>??? params action=block direction=out ip=172.16.0.1 portno=1234 >>>protocol=udp \ >>>??? op monitor interval=10s timeout=10s \ >>>??? op start interval=0s timeout=20s \ >>>??? op stop interval=0s timeout=20s >>>primitive r-vip IPaddr2 \ >>>??? params cidr_netmask=24 ip=10.1.6.253 \ >>>??? op monitor interval=10s timeout=20s \ >>>??? op start interval=0s timeout=20s \ >>>??? op stop interval=0s timeout=20s >>>colocation c1 inf: r-porttoggle r-vip >>>order o1 r-vip r-porttoggle >>>property cib-bootstrap-options: \ >>>??? have-watchdog=false \ >>>??? 
dc-version=2.1.6-6fdc9deea29 \ >>>??? cluster-infrastructure=corosync \ >>>??? cluster-name=testcluster \ >>>??? stonith-enabled=false \ >>>??? last-lrm-refresh=1729272215 >>> >>> >>>- I checked the switchover and observed netfilter chain (watch >>>sudo iptables -L OUTPUT) real-time, >>> >>>- Tried portblock with parameter direction=out & both. >>> >>>- Checked if the relevant functions IptablesBLOCK() & >>>IptablesUNBLOCK() are executing (by inserting syslog mark messages >>>inside). They do run. >>> >>>However rule is ONLY created, NEVER deleted. >>> >>>Any suggestions? >>> >>> >>>On 10/9/24 11:26, Oyvind Albrigtsen wrote: >>> >>>>Correct. That should block the port when the resource is stopped on a >>>>node (e.g. if you have it grouped with the service you're using on the >>>>port). >>>> >>>>I would do some testing to ensure it works exactly as you expect. E.g. >>>>you can telnet to the port, or you can run nc/socat on the port and >>>>telnet to it from the node it blocks/unblocks. If it doesnt accept >>>>the connection you know it's blocked. >>>> >>>> >>>>Oyvind Albrigtsen >>>> >>>>On 06/10/24 22:46 GMT, Murat Inal wrote: >>>>>Hello, >>>>> >>>>>I'd like to confirm with you the mechanism of ocf:heartbeat:portblock. >>>>> >>>>>Given a resource definition; >>>>> >>>>>Resource: r41_LIO (class=ocf provider=heartbeat type=portblock) >>>>>? Attributes: r41_LIO-instance_attributes >>>>>??? action=unblock >>>>>??? ip=10.1.8.194 >>>>>??? portno=3260 >>>>>??? protocol=tcp >>>>> >>>>>- If resource starts, TCP:3260 is UNBLOCKED. >>>>> >>>>>- If resource is stopped, TCP:3260 is BLOCKED. >>>>> >>>>>Is that correct? If action=block, it will run just the >>>>>opposite, correct? >>>>> >>>>>To toggle a port, a single portblock resource is enough, correct? >>>>> >>>>>Thanks, >>>>> >>>>>_______________________________________________ >>>>>Manage your subscription: >>>>>https://lists.clusterlabs.org/mailman/listinfo/users >>>>> >>>>>ClusterLabs home: https://www.clusterlabs.org/ >>>> >>>>_______________________________________________ >>>>Manage your subscription: >>>>https://lists.clusterlabs.org/mailman/listinfo/users >>>> >>>>ClusterLabs home: https://www.clusterlabs.org/ >>>_______________________________________________ >>>Manage your subscription: >>>https://lists.clusterlabs.org/mailman/listinfo/users >>> >>>ClusterLabs home: https://www.clusterlabs.org/ >> >>_______________________________________________ >>Manage your subscription: >>https://lists.clusterlabs.org/mailman/listinfo/users >> >>ClusterLabs home: https://www.clusterlabs.org/ >_______________________________________________ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ From mrt_nl at hotmail.com Wed Oct 23 12:49:16 2024 From: mrt_nl at hotmail.com (Murat Inal) Date: Wed, 23 Oct 2024 15:49:16 +0300 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From oalbrigt at redhat.com Wed Oct 23 13:08:07 2024 From: oalbrigt at redhat.com (Oyvind Albrigtsen) Date: Wed, 23 Oct 2024 15:08:07 +0200 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock In-Reply-To: References: Message-ID: In that case I would report the bug to Ubuntu. Oyvind On 23/10/24 15:49 +0300, Murat Inal wrote: > Hi Oyvind, > > I checked out PR1924 and exacty applied it to my test cluster. > > Problem still exists. Rules do not get deleted, only created. 
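As was suggested earlier in the thread, one quick way to separate an iptables problem from an agent problem is to run the delete by hand with the same rule specification the agent uses. Again borrowing the r-porttoggle parameters from the configuration quoted elsewhere in this thread (so this is an illustration, not a line taken from the agent's logs):

    # If this removes the rule cleanly, deletion itself works fine and the
    # stop failure sits in the agent's rule detection rather than in iptables.
    iptables -D OUTPUT -p udp -s 172.16.0.1 -m multiport --sports 1234 -j DROP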
> > Note that; > > - My cluster runs Ubuntu Server 24.04 > > - grep is GNU 3.11 > > - Switches -qE are valid & exist in grep man page. > > On 10/23/24 14:43, Oyvind Albrigtsen wrote: > > This could be related to the following PR: > [1]https://github.com/ClusterLabs/resource-agents/pull/1924/files > > The github version of portblock works fine on Fedora 40, so that's my > best guess. > > Oyvind > > On 22/10/24 21:44 +0300, Murat Inal wrote: > > Hello Oyvind, > > Using your suggestion, I located the issue at function > chain_isactive(). > > This function greps the "generated" rule string (via function > active_grep_pat()) in the rule table. Generated string does NOT match > with iptables output anymore. Consequently, RA decides that the rule > is ABSENT, although it is PRESENT. > > I opted to use "iptables --check" command for rule existence > detection. Below is the function with modification comments; > > #chain_isactive {udp|tcp} portno,portno ip chain > chain_isactive() > { > [ "$4" = "OUTPUT" ] && ds="s" || ds="d" > #PAT=$(active_grep_pat "$1" "$2" "$3" "$ds") # grep pattern > #$IPTABLES $wait -n -L "$4" | grep "$PAT" >/dev/null > # old detection line > iptables -C "$4" -p "$1" -${ds} "$3" -m multiport --${ds}ports > "$2" -j DROP # new detection using iptables --check/-C > } > > I tested the modified RA with both actions (block & unblock). It > works. If you agree with the above, active_grep_pat() has NO use, it > can be deleted from the script. > > On 10/21/24 12:25, Oyvind Albrigtsen wrote: > > I would try running "pcs resource debug-stop --full " to > see > what's happening, and try to run the "iptables -D" line manually if > it > doesnt show you an error. > > Oyvind > > On 18/10/24 21:45 +0300, Murat Inal wrote: > > Hi Oyvind, > > Probably current portblock has a bug. It CREATES netfilter rule on > start(), however DOES NOT DELETE the rule on stop(). > > Here is the configuration of my simple 2 node + 1 qdevice cluster; > > node 1: node-a-knet \ > attributes standby=off > node 2: node-b-knet \ > attributes standby=off > primitive r-porttoggle portblock \ > params action=block direction=out ip=172.16.0.1 portno=1234 > protocol=udp \ > op monitor interval=10s timeout=10s \ > op start interval=0s timeout=20s \ > op stop interval=0s timeout=20s > primitive r-vip IPaddr2 \ > params cidr_netmask=24 ip=10.1.6.253 \ > op monitor interval=10s timeout=20s \ > op start interval=0s timeout=20s \ > op stop interval=0s timeout=20s > colocation c1 inf: r-porttoggle r-vip > order o1 r-vip r-porttoggle > property cib-bootstrap-options: \ > have-watchdog=false \ > dc-version=2.1.6-6fdc9deea29 \ > cluster-infrastructure=corosync \ > cluster-name=testcluster \ > stonith-enabled=false \ > last-lrm-refresh=1729272215 > > - I checked the switchover and observed netfilter chain (watch > sudo iptables -L OUTPUT) real-time, > > - Tried portblock with parameter direction=out & both. > > - Checked if the relevant functions IptablesBLOCK() & > IptablesUNBLOCK() are executing (by inserting syslog mark messages > inside). They do run. > > However rule is ONLY created, NEVER deleted. > > Any suggestions? > > On 10/9/24 11:26, Oyvind Albrigtsen wrote: > > Correct. That should block the port when the resource is stopped > on a > node (e.g. if you have it grouped with the service you're using > on the > port). > > I would do some testing to ensure it works exactly as you > expect. E.g. > you can telnet to the port, or you can run nc/socat on the port > and > telnet to it from the node it blocks/unblocks. 
If it doesnt > accept > the connection you know it's blocked. > > Oyvind Albrigtsen > > On 06/10/24 22:46 GMT, Murat Inal wrote: > > Hello, > > I'd like to confirm with you the mechanism of > ocf:heartbeat:portblock. > > Given a resource definition; > > Resource: r41_LIO (class=ocf provider=heartbeat > type=portblock) > Attributes: r41_LIO-instance_attributes > action=unblock > ip=10.1.8.194 > portno=3260 > protocol=tcp > > - If resource starts, TCP:3260 is UNBLOCKED. > > - If resource is stopped, TCP:3260 is BLOCKED. > > Is that correct? If action=block, it will run just the > opposite, correct? > > To toggle a port, a single portblock resource is enough, > correct? > > Thanks, > > _______________________________________________ > Manage your subscription: > [2]https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: [3]https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > [4]https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: [5]https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > [6]https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: [7]https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > [8]https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: [9]https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > [10]https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: [11]https://www.clusterlabs.org/ > > _______________________________________________ > Manage your subscription: > [12]https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: [13]https://www.clusterlabs.org/ > >Links: >1. https://github.com/ClusterLabs/resource-agents/pull/1924/files >2. https://lists.clusterlabs.org/mailman/listinfo/users >3. https://www.clusterlabs.org/ >4. https://lists.clusterlabs.org/mailman/listinfo/users >5. https://www.clusterlabs.org/ >6. https://lists.clusterlabs.org/mailman/listinfo/users >7. https://www.clusterlabs.org/ >8. https://lists.clusterlabs.org/mailman/listinfo/users >9. https://www.clusterlabs.org/ >10. https://lists.clusterlabs.org/mailman/listinfo/users >11. https://www.clusterlabs.org/ >12. https://lists.clusterlabs.org/mailman/listinfo/users >13. https://www.clusterlabs.org/ >_______________________________________________ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ From mrt_nl at hotmail.com Wed Oct 23 13:46:47 2024 From: mrt_nl at hotmail.com (Murat Inal) Date: Wed, 23 Oct 2024 16:46:47 +0300 Subject: [ClusterLabs] About RA ocf:heartbeat:portblock In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From mlisik at redhat.com Thu Oct 24 12:01:57 2024 From: mlisik at redhat.com (Miroslav Lisik) Date: Thu, 24 Oct 2024 14:01:57 +0200 Subject: [ClusterLabs] poor performance for large resource configuration In-Reply-To: References: Message-ID: On 10/21/24 13:07, zufei chen wrote: > Hi all, > > background? > > 1. lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + > pcs(0.10.8) > 2. there are 11 nodes in total, divided into 3 groups. If a node > fails within a group, the resources can only be taken over by > nodes within that group. > 3. Each node has 2 MDTs and 16 OSTs. > > Issues: > > 1. 
The resource configuration time progressively increases. the > second mdt-0? cost only???8s?the last?ost-175 cost??1min:37s > 2. The total time taken for the configuration is approximately 2 > hours and 31 minutes. Is there a way to improve it? > > > attachment: > create bash: pcs_create.sh > create log:?pcs_create.log > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ Hi, you could try to create cluster CIB configuration with pcs commands on a file using the '-f' option and then push it to the pacemaker all at once. pcs cluster cib > original.xml cp original.xml new.xml pcs -f new.xml ... ... pcs cluster cib-push new.xml diff-against=original.xml And then wait for the cluster to settle into stable state: crm_resource --wait Or there is pcs command since version v0.11.8: pcs status wait [] I hope this will help you to improve the performance. Regards, Miroslav From oalbrigt at redhat.com Wed Oct 30 09:00:41 2024 From: oalbrigt at redhat.com (Oyvind Albrigtsen) Date: Wed, 30 Oct 2024 10:00:41 +0100 Subject: [ClusterLabs] resource-agents v4.16.0 rc1 Message-ID: ClusterLabs is happy to announce resource-agents v4.16.0 rc1. Source code is available at: https://github.com/ClusterLabs/resource-agents/releases/tag/v4.16.0rc1 The most significant enhancements in this release are: - bugfixes and enhancements: - ocf-shellfuncs: only create/update and reload systemd drop-in if needed - spec: drop BuildReq python3-pyroute2 for RHEL/CentOS - Filesystem: dont sleep during stop-action when there are no processes to kill - Filesystem: on stop, try umount directly, before scanning for users - Filesystem: only use $umount_force after sending kill_signals - Filesystem: stop/get_pids: improve logic to find processes - Filesystem: add azure aznfs filesystem support - IPaddr2: add proto-parameter to be able to match a specific route - IPaddr2: improve fail logic and check ip_status after adding IP - IPaddr2: use dev keyword when bringing up device - IPsrcaddr: specify dev for default route, as e.g. fe80:: routes can be present on multiple interfaces - apache/http-mon.sh: change curl opts to match wget - azure-events*: use node name from cluster instead of hostname to avoid failing if they're not the same - docker-compose: use "docker compose" when not using older docker-compose command - findif.sh: ignore unreachable, blackhole, and prohibit routes - nfsserver: also stop rpc-statd for nfsv4_only to avoid stop failing in some cases - podman: force-remove containers in stopping state if necessary (#1973) - powervs-subnet: add optional argument route_table (#1966) - powervs-subnet: modify gathering of Apikey, calculation of timeout - powervs-subnet: enable access via private endpoint for IBM IAM The full list of changes for resource-agents is available at: https://github.com/ClusterLabs/resource-agents/blob/v4.16.0rc1/ChangeLog Everyone is encouraged to download and test the new release candidate. We do many regression tests and simulations, but we can't cover all possible use cases, so your feedback is important and appreciated. Many thanks to all the contributors to this release. 
Best, The resource-agents maintainers From kgaillot at redhat.com Wed Oct 30 14:42:13 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Wed, 30 Oct 2024 09:42:13 -0500 Subject: [ClusterLabs] ClusterLabs website overhauled Message-ID: Hi all, Today, we unveiled the new ClusterLabs website design: https://clusterlabs.org/ The old site had a lot of outdated info as well as broken CSS after multiple OS upgrades, and the Jekyll-based source for site generation was difficult to maintain. The new site is Hugo-based with a much simpler theme and design. It's not the most beautiful or modern site in the world, but it's simple and clean. If someone wants to pitch in and improve it, you can see the source and submit pull requests at: https://github.com/ClusterLabs/clusterlabs-www Enjoy! -- Ken Gaillot From amir.eibagi at nutanix.com Wed Oct 30 18:49:56 2024 From: amir.eibagi at nutanix.com (Amir Eibagi) Date: Wed, 30 Oct 2024 18:49:56 +0000 Subject: [ClusterLabs] Pull request requirements Message-ID: Hello team, My name is Amir Eibagi and I am working for Nutanix. My colleague (CCed on this email) and I are working to introduce a new fence agent for the AHV host. We have followed the guidelines under https://github.com/ClusterLabs/fence-agents/blob/main/doc/FenceAgentAPI.md while developing the fence agent. I just would like to check with the team whether there are any specific requirements we need to follow before submitting our pull request upstream. Thanks for your time and consideration. [https://opengraph.githubassets.com/71768307b1f7a262e53035982d12e5f779c1f83cf243cf036526e339194b53a8/ClusterLabs/fence-agents] fence-agents/doc/FenceAgentAPI.md at main · ClusterLabs/fence-agents Fence agents. Contribute to ClusterLabs/fence-agents development by creating an account on GitHub.
>github.com > > >_______________________________________________ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ From kgaillot at redhat.com Thu Oct 31 20:33:23 2024 From: kgaillot at redhat.com (Ken Gaillot) Date: Thu, 31 Oct 2024 15:33:23 -0500 Subject: [ClusterLabs] Pacemaker 2.1.9 released Message-ID: <46315aef1bd258629c7e71abed1c46d08d6fd514.camel@redhat.com> Hi all, The final release of Pacemaker 2.1.9 is now available at: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.9 This is primarily a bug fix release, to provide a clean separation point for the upcoming 3.0.0 release. See the link above for more details. Many thanks to all contributors to this release, including Aleksei Burlakov, Chris Lumens, Hideo Yamauchi, Ken Gaillot, and Reid Wahl. -- Ken Gaillot