[Linux-ha-dev] issues I'm working on at the moment
Andrew Beekhof
beekhof at gmail.com
Tue Mar 13 01:38:12 MDT 2007
Where did you find these bugs listed?
On 3/13/07, Aníbal Monsalve Salazar <anibal at sgi.com> wrote:
> Hello,
>
> All the following bugs have been reported by Russell Coker. I'm
> posting them here in case someone knows if they have been fixed
> already. I'll appreciate any help/comment/pointer about these bugs.
> I'm working to get them fixed anyway and submit the patches to
> this list.
>
> heartbeat should detect and recover from corrupt CIB
> ====================================================
>
> Under XFS failure modes a recently created file may end up filled
> with zeros if there is a power outage (IPMI fence) at an
> inconvenient time.
>
> Heartbeat keeps a backup copy of /var/lib/heartbeat/crm/cib.xml but
> if the primary copy is filled with zeros it doesn't use the backup!
>
> I believe that heartbeat should use the backup copy of cib.xml
> whenever the primary does not conform to the XML schema.
>
> I also believe that given the XFS issue one backup copy is not
> enough and that we should have multiple backups so that if there is
> more than one change made to cib.xml in the 30 seconds before the
> reboot there will still be a good copy after the reboot. As the
> files in question are small there is no reason why we couldn't have
> 20 backup copies.
>
> Heartbeat has code to handle the situation of different versions of
> the cib.xml on different nodes, but currently does not seem to
> handle the situation of a corrupt cib.xml. To cope with the failure
> conditions of filesystems other than XFS some of these backups
> should be stored in different directories.
>
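
A minimal sketch of the fallback described above, in Python for brevity rather than heartbeat's C. It assumes the backups are plain files that can be tried in order. It only checks that a candidate parses as XML and has a <cib> root element, which is weaker than the schema check being asked for. The function name is made up for illustration and is not part of heartbeat.

```python
import xml.etree.ElementTree as ET


def load_first_valid(paths):
    """Return (path, root) for the first candidate that parses as XML.

    Only well-formedness and the root tag are checked; validating against
    the real CIB schema would need more than the standard library.
    """
    for path in paths:
        try:
            root = ET.parse(path).getroot()
        except (OSError, ET.ParseError):
            continue  # missing, unreadable, zero-filled or truncated file
        if root.tag == "cib":  # assumed root element of cib.xml
            return path, root
    return None, None
```

The caller would pass cib.xml first, followed by however many backup copies exist, possibly in different directories. A result of (None, None) means no local copy is usable.
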
> heartbeat needs all code related to writing cib.xml audited
> ===========================================================
>
> There is a potential SEGV in the code for writing the cib.xml
> (fopen() followed immediately by fclose() with no error checking on
> line 636 of lib/crm/common/xml.c).
>
> The code for writing cib.xml also calls fflush() as its only
> mechanism for ensuring that the data gets to disk.
>
> The above two bugs need to be fixed and the code in question needs
> to be audited to ensure that there are no more bugs of the same
> nature.
>
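
For reference, fflush() only hands stdio's buffer to the kernel; fsync() is what asks the kernel to push the data to disk. A rough sketch of the usual write-to-temporary-then-rename pattern follows, again in Python rather than C. The ".tmp" suffix, the 0600 mode and the function name are assumptions for illustration, not what lib/crm/common/xml.c does.

```python
import os


def write_durably(path, data):
    """Write bytes so that a crash leaves either the old or the new file."""
    tmp = path + ".tmp"  # hypothetical temporary name, same directory
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # like fflush(): user space -> kernel
            os.fsync(f.fileno())  # like fsync(): kernel -> disk
    except BaseException:
        os.unlink(tmp)            # do not leave a half-written file behind
        raise
    os.rename(tmp, path)          # atomic replacement on POSIX filesystems
    # Sync the directory too so the rename itself survives a power loss.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Every failing step raises OSError here. The C equivalent is checking each return value, including testing the result of fopen() for NULL before passing it to fclose().
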
> cibadmin -Q should indicate if the data is not available locally
> ================================================================
>
> Currently if you have a cib.xml file that is corrupt it seems that
> there is no way for heartbeat to recover (neither automatically nor
> through manual intervention), see previous bug.
>
> When the machine is in this state (which could happen after the
> previous bug is fixed in the case of a disk error affecting the
> directory which contains the file in question) "cibadmin -Q" should
> indicate that the data is being obtained from another machine. If
> you have a two-node cluster and one node is in such a state then
> there is no redundancy and shutting down the node with the good
> copy of the CIB will destroy the cluster.
>
> The sys-admin should have some way of knowing that a routine
> operation (rebooting one node of a cluster) is certain to cause a
> catastrophe.
>
> It could be argued that a node that can't write to its cib.xml file
> should disable itself and demand that the sys-admin fix the problem.
> I have no strong feelings on this issue apart from the fact that the
> current operation is wrong!
>
> cibadmin -E should verify that the change took place and report errors
> ======================================================================
>
> The "cibadmin -E" operation should make a minimal effort to verify
> that the requested change took place.
>
> Currently "cibadmin -E" will return 0 and display no error message
> even in a situation where the local cib.xml file is corrupted (which
> with current code means that heartbeat won't modify it) and remote
> nodes are not available - this means silent data loss!
>
> This may require changes to the "cib" daemon as it may not be
> returning the status to the cibadmin tool.
>
> ADDITIONAL INFORMATION
>
> If the "-s" option to cibadmin is used for a synchronous operation
> when the local cib.xml is corrupt and the other node is not running
> then cibadmin will hang seemingly indefinitely (I observed it
> hanging for 20 minutes) without printing any error message.
>
> So I guess we can run cibadmin -s to get an indication that things
> are going wrong, but with no idea of what is going wrong.
>
> cibadmin (heartbeat) should flag error conditions on delete
> ===========================================================
>
> When cibadmin can't complete an operation it should display an error
> message to stderr and return an error code to the environment.
>
> cibadmin -s --obj_type status -D -X "<lrm_resource id=\"appman-ui-resource\"/>"
>
> A command such as the above can be run repeatedly and you never know
> how many instances of it succeeded (if any).
>
> (heartbeat) ha_logd read process goes into an infinite loop
> ===========================================================
>
> Periodically when shutting down heartbeat I see the following error
> from the read process of ha_logd. It has 1024 open file handles,
> most of which are socket handles for /var/lib/log_daemon, the
> process will be in an infinite loop at the time and will be using
> 100% of one CPU core.
>
> Feb 22 14:57:00 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
> Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
> Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
> Feb 22 14:57:01 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
> Feb 22 14:57:02 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
> Feb 22 14:57:02 ha-node-1 logd: [5642]: ERROR: socket_accept_connection: accept: Too many open files
>
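
The log above suggests accept() keeps failing with EMFILE and is retried straight away. As a mitigation only (it does not explain why the descriptors are never closed), an accept loop can back off when it runs out of descriptors instead of spinning. A small Python sketch; the function name and the delay value are hypothetical, and this is not ha_logd's code.

```python
import errno
import time


def accept_with_backoff(listener, delay=0.1, sleep=time.sleep):
    """Accept one connection, sleeping between retries on fd exhaustion."""
    while True:
        try:
            return listener.accept()
        except OSError as e:
            if e.errno not in (errno.EMFILE, errno.ENFILE):
                raise      # unrelated error: let the caller handle it
            sleep(delay)   # out of descriptors: wait instead of busy-looping
```

The sleep stops the process from using 100% of a core while descriptors are exhausted. The underlying leak of socket handles still has to be found.
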
> inappropriate warnings about core files logged by heartbeat
> ===========================================================
>
> When /proc/sys/kernel/core_pattern is set, warning messages such as
> the following should not be logged.
>
> Mar 1 14:55:28 ha-node-0 logd: [29422]: WARN: Core dumps could be lost if multiple dumps occur
> Mar 1 14:55:28 ha-node-0 logd: [29422]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>
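
One possible reading of the rule requested above, sketched in Python: warn only when core_pattern still has the kernel default of "core" and core_uses_pid is not 1. The function names are made up, and treating any non-default pattern as deliberate is an assumption.

```python
def read_kernel_setting(name):
    """Read a value from /proc/sys/kernel (Linux only)."""
    with open("/proc/sys/kernel/" + name) as f:
        return f.read()


def core_dumps_may_collide(core_pattern, core_uses_pid):
    """True if successive core dumps are likely to overwrite each other."""
    if core_pattern.strip() != "core":
        return False  # the admin set a pattern; assume it is deliberate
    return core_uses_pid.strip() != "1"
```

On a live system the two arguments would come from read_kernel_setting("core_pattern") and read_kernel_setting("core_uses_pid").
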
> wrong permissions on cib.xml
> ============================
>
> Mar 1 15:42:23 ha-node-1 cib: [6010]: WARN: crm_is_writable: /var/lib/heartbeat/crm/cib.xml should be owned and r/w by group haclient
>
> The above message appears in /var/log/heartbeat.log on ha-node-1. The
> file in question is mode 0600; according to the warning it should be
> 0660. Either the warning or the permissions of the file (as created
> by heartbeat) needs to change.
>
> Also I think that both the xml file and the signature should have
> the same permissions; currently the .sig file is mode 0644.
>
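
A quick way to compare the modes of both files against what the warning expects, sketched in Python. The 0660 default comes from the warning quoted above; the function name is hypothetical.

```python
import os
import stat


def mode_mismatches(paths, expected=0o660):
    """Return {path: actual_mode} for every path whose mode differs."""
    wrong = {}
    for path in paths:
        actual = stat.S_IMODE(os.stat(path).st_mode)  # permission bits only
        if actual != expected:
            wrong[path] = actual
    return wrong
```

Run against cib.xml and its .sig file, an empty result would mean both already have the expected mode. With the modes reported above, both files would be listed.
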
> Thank you,
>
> Aníbal Monsalve Salazar
> --
> R&D Software Engineer, SGI Australian Software Group
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>