[Linux-HA] Emergency Shutdown(MCP dead): Killing ourselves

johng at auctionsolutions.com johng at auctionsolutions.com
Tue Mar 21 12:53:29 MST 2006


Another thing to note: when heartbeat died on server-1, ps -eaf showed
that ipfail was still running.

I then killed ipfail on server-1.

I restarted heartbeat on server-1, and the following messages started to
appear in the /var/log/ha-log file:

heartbeat[23378]: 2006/03/21_14:44:27 WARN: Rexmit of seq 596 requested. 17 is max.
heartbeat[23378]: 2006/03/21_14:44:27 WARN: Rexmit of seq 596 requested. 17 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 17 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 17 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 17 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 17 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 18 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 18 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 18 is max.
heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 18 is max.

The messages eventually stop being written to the ha-log file, although
that takes a while.

After that, everything seems to be back to normal.

I was able to make server-1 primary again by running
/usr/lib/heartbeat/hb_standby on server-2, without any issues, even while
the above messages were still being written to the log.  Everything went
back to server-1 from server-2 just fine.

Those messages did not seem to hurt anything at that point.
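
In case it is useful for judging how long the retransmit requests go on,
here is a rough Python sketch that tallies the Rexmit warnings per sequence
number. The regular expression is only inferred from the lines pasted
above, and the names summarize_rexmit and SAMPLE are made up for this
example; none of it is part of heartbeat.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the ha-log excerpt in this message:
# heartbeat[PID]: YYYY/MM/DD_HH:MM:SS WARN: Rexmit of seq N requested. M is max.
REXMIT_RE = re.compile(
    r"heartbeat\[(?P<pid>\d+)\]: (?P<stamp>\S+) WARN: "
    r"Rexmit of seq (?P<seq>\d+) requested\. (?P<max>\d+) is max\."
)


def summarize_rexmit(lines):
    """Count Rexmit warnings per sequence number and note first/last timestamps."""
    counts = Counter()
    first = last = None
    for line in lines:
        match = REXMIT_RE.search(line)
        if match is None:
            continue  # not a Rexmit warning, ignore it
        counts[int(match.group("seq"))] += 1
        if first is None:
            first = match.group("stamp")
        last = match.group("stamp")
    return counts, first, last


# Three lines copied from the log excerpt, used only as sample input.
SAMPLE = [
    "heartbeat[23378]: 2006/03/21_14:44:27 WARN: Rexmit of seq 596 requested. 17 is max.",
    "heartbeat[23378]: 2006/03/21_14:44:27 WARN: Rexmit of seq 596 requested. 17 is max.",
    "heartbeat[23378]: 2006/03/21_14:44:28 WARN: Rexmit of seq 596 requested. 18 is max.",
]

if __name__ == "__main__":
    counts, first, last = summarize_rexmit(SAMPLE)
    for seq, n in counts.items():
        print(f"seq {seq}: {n} requests between {first} and {last}")
```

Against the real file, an open file object for /var/log/ha-log could be
passed in place of SAMPLE.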



Thanks for your help.







                                                                           
From: Guochun Shi <gshi at ncsa.uiuc.edu>
Sent by: linux-ha-bounces at lists.linux-ha.org
Date: 03/21/2006 01:52 PM
Please respond to: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>
To: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>,
    Alan Robertson <alanr at unix.sh>
Subject: Re: [Linux-HA] Emergency Shutdown(MCP dead): Killing ourselves
                                                                           
                                                                           




ok, thanks

So the error did come from eth0 when it was writing ping (ICMP) packets,
and it began to fill up the queue in the Ethernet card when the cable was
unplugged.

I believe this problem has not been reported before.

-Guochun
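
The "Resource temporarily unavailable" text in the log quoted below is the
standard message for EAGAIN, which a non-blocking write returns when the
receiving buffer is already full. A minimal Python sketch of that
behaviour, assuming a Unix-like system. It uses an ordinary pipe that
nobody reads and a hypothetical helper fill_nonblocking_pipe; it only
illustrates the errno and is not heartbeat's own code.

```python
import errno
import os


def fill_nonblocking_pipe(chunk=b"x" * 4096):
    """Write to a pipe that nobody reads until the kernel refuses more data.

    Returns (bytes_written, errno_of_the_refusal).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # writes now fail instead of waiting
    written = 0
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError as exc:
        # Raised for EAGAIN / EWOULDBLOCK once the pipe buffer is full.
        return written, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    written, err = fill_nonblocking_pipe()
    print(f"wrote {written} bytes, then {errno.errorcode[err]}: {os.strerror(err)}")
```

The point of the sketch is only that the error appears once the consumer
stops draining the buffer, which matches the unplugged-cable explanation
above.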

johng at auctionsolutions.com wrote:

> #
> # There are lots of options in this file.  All you have to have is a set
> # of nodes listed {"node ...} one of {serial, bcast, mcast, or ucast},
> # and a value for "auto_failback".
> #
> # ATTENTION: As the configuration file is read line by line,
> #    THE ORDER OF DIRECTIVE MATTERS!
> #
> # In particular, make sure that the udpport, serial baud rate
> # etc. are set before the heartbeat media are defined!
> # debug and log file directives go into effect when they
> # are encountered.
> #
> # All will be fine if you keep them ordered as in this example.
> #
> #
> #       Note on logging:
> #       If any of debugfile, logfile and logfacility are defined then
> #       they will be used. If debugfile and/or logfile are not defined and
> #       logfacility is defined then the respective logging and debug
> #       messages will be logged to syslog. If logfacility is not defined
> #       then debugfile and logfile will be used to log messages. If
> #       logfacility is not defined and debugfile and/or logfile are not
> #       defined then defaults will be used for debugfile and logfile as
> #       required and messages will be sent there.
> #
> # File to write debug messages to
> #debugfile /var/log/ha-debug
> #
> #
> # File to write other messages to
> #
> logfile /var/log/ha-log
> #
> #
> # Facility to use for syslog()/logger
> #
> logfacility local0
> #
> #
> # A note on specifying "how long" times below...
> #
> # The default time unit is seconds
> # 10 means ten seconds
> #
> # You can also specify them in milliseconds
> # 1500ms means 1.5 seconds
> #
> #
> # keepalive: how long between heartbeats?
> #
> keepalive 2
> #
> # deadtime: how long-to-declare-host-dead?
> #
> # If you set this too low you will get the problematic
> # split-brain (or cluster partition) problem.
> # See the FAQ for how to use warntime to tune deadtime.
> #
> deadtime 30
> #
> # warntime: how long before issuing "late heartbeat" warning?
> # See the FAQ for how to use warntime to tune deadtime.
> #
> warntime 10
> #
> #
> # Very first dead time (initdead)
> #
> # On some machines/OSes, etc. the network takes a while to come up
> # and start working right after you've been rebooted.  As a result
> # we have a separate dead time for when things first come up.
> # It should be at least twice the normal dead time.
> #
> initdead 120
> #
> #
> # What UDP port to use for bcast/ucast communication?
> #
> udpport 694
> #
> # Baud rate for serial ports...
> #
> #baud 19200
> #
> # serial serialportname ...
> #serial /dev/ttyS0 # Linux
> #serial /dev/cuaa0 # FreeBSD
> #serial /dev/cua/a # Solaris
> #
> #
> # What interfaces to broadcast heartbeats over?
> #
> #bcast eth0 # Linux
> bcast eth2 # Linux
> #bcast le0 # Solaris
> #bcast le1 le2 # Solaris
> #
> # Set up a multicast heartbeat medium
> # mcast [dev] [mcast group] [port] [ttl] [loop]
> #
> # [dev] device to send/rcv heartbeats on
> # [mcast group] multicast group to join (class D multicast address
> # 224.0.0.0 - 239.255.255.255)
> # [port] udp port to sendto/rcvfrom (set this value to the
> # same value as "udpport" above)
> # [ttl] the ttl value for outbound heartbeats.  this affects
> # how far the multicast packet will propagate.  (0-255)
> # Must be greater than zero.
> # [loop] toggles loopback for outbound multicast heartbeats.
> # if enabled, an outbound packet will be looped back and
> # received by the interface it was sent on. (0 or 1)
> # Set this value to zero.
> #
> #
> #mcast eth1 225.0.0.1 694 1 0
> #
> # Set up a unicast / udp heartbeat medium
> # ucast [dev] [peer-ip-addr]
> #
> # [dev] device to send/rcv heartbeats on
> # [peer-ip-addr] IP address of peer to send packets to
> #
> # ucast eth1 192.168.1.82
> #
> #
> # About boolean values...
> #
> # Any of the following case-insensitive values will work for true:
> # true, on, yes, y, 1
> # Any of the following case-insensitive values will work for false:
> # false, off, no, n, 0
> #
> #
> #
> # auto_failback:  determines whether a resource will
> # automatically fail back to its "primary" node, or remain
> # on whatever node is serving it until that node fails, or
> # an administrator intervenes.
> #
> # The possible values for auto_failback are:
> # on - enable automatic failbacks
> # off - disable automatic failbacks
> # legacy - enable automatic failbacks in systems
> # where all nodes do not yet support
> # the auto_failback option.
> #
> # auto_failback "on" and "off" are backwards compatible with the old
> # "nice_failback on" setting.
> #
> # See the FAQ for information on how to convert
> # from "legacy" to "on" without a flash cut.
> # (i.e., using a "rolling upgrade" process)
> #
> # The default value for auto_failback is "legacy", which
> # will issue a warning at startup.  So, make sure you put
> # an auto_failback directive in your ha.cf file.
> # (note: auto_failback can be any boolean or "legacy")
> #
> auto_failback on
> #
> #
> #       Basic STONITH support
> #       Using this directive assumes that there is one stonith
> #       device in the cluster.  Parameters to this device are
> #       read from a configuration file. The format of this line is:
> #
> #         stonith <stonith_type> <configfile>
> #
> #       NOTE: it is up to you to maintain this file on each node in the
> #       cluster!
> #
> stonith securelinx /etc/ha.d/conf/securelinx.cfg
> #
> #       STONITH support
> #       You can configure multiple stonith devices using this directive.
> #       The format of the line is:
> #         stonith_host <hostfrom> <stonith_type> <params...>
> #         <hostfrom> is the machine the stonith device is attached
> #              to or * to mean it is accessible from any host.
> #         <stonith_type> is the type of stonith device (a list of
> #              supported drives is in /usr/lib/stonith.)
> #         <params...> are driver specific parameters.  To see the
> #              format for a particular device, run:
> #           stonith -l -t <stonith_type>
> #
> #
> # Note that if you put your stonith device access information in
> # here, and you make this file publically readable, you're asking
> # for a denial of service attack ;-)
> #
> # To get a list of supported stonith devices, run
> # stonith -L
> # For detailed information on which stonith devices are supported
> # and their detailed configuration options, run this command:
> # stonith -h
> #
> #stonith_host *     baytech 10.0.0.3 mylogin mysecretpassword
> #stonith_host ken3  rps10 /dev/ttyS1 kathy 0
> #stonith_host kathy rps10 /dev/ttyS1 ken3 0
> #
> # Watchdog is the watchdog timer.  If our own heart doesn't beat for
> # a minute, then our machine will reboot.
> # NOTE: If you are using the software watchdog, you very likely
> # wish to load the module with the parameter "nowayout=0" or
> # compile it without CONFIG_WATCHDOG_NOWAYOUT set. Otherwise even
> # an orderly shutdown of heartbeat will trigger a reboot, which is
> # very likely NOT what you want.
> #
> #watchdog /dev/watchdog
> #
> # Tell what machines are in the cluster
> # node nodename ... -- must match uname -n
> #
> node   Flatline1 Flatline2
> # node   heartbeat2
> # Less common options...
> #
> # Treats 10.10.10.254 as a pseudo-cluster-member
> # Used together with ipfail below...
> #
> ping 10.1.1.209
> #ping 192.168.1.4
> #
> # Treats 10.10.10.254 and 10.10.10.253 as a pseudo-cluster-member
> #       called group1. If either 10.10.10.254 or 10.10.10.253 is up
> #       then group1 is up
> # Used together with ipfail below...
> #
> #ping_group group1 10.10.10.254 10.10.10.253
> #
> # HBA ping directive for Fiber Channel
> # Treats fc-card-name as a pseudo-cluster-member
> # used with ipfail below ...
> #
> # You can obtain HBAAPI from http://hbaapi.sourceforge.net.  You need
> # to get the library specific to your HBA directly from the vendor.
> # To install HBAAPI, all you need to do is compile the common
> # part you obtained from SourceForge. This will produce libHBAAPI.so
> # which you need to copy to /usr/lib. You also need to copy hbaapi.h to
> # /usr/include.
> #
> # The fc-card-name is the name obtained from the hbaapitest program
> # that is part of the hbaapi package. Running hbaapitest will produce
> # verbose output. One of the first lines is similar to:
> # Apapter number 0 is named: qlogic-qla2200-0
> # Here fc-card-name is qlogic-qla2200-0.
> #
> #hbaping fc-card-name
> #
> #
> # Processes started and stopped with heartbeat.  Restarted unless
> # they exit with rc=100
> #
> #respawn userid /path/name/to/run
> respawn hacluster /usr/lib/heartbeat/ipfail
> #
> # Access control for client api
> #       default is no access
> #
> #apiauth client-name gid=gidlist uid=uidlist
> #apiauth ipfail gid=haclient uid=hacluster
>
> ###########################
> #
> # Unusual options.
> #
> ###########################
> #
> # hopfudge maximum hop count minus number of nodes in config
> #hopfudge 1
> #
> # deadping - dead time for ping nodes
> #deadping 30
> #
> # hbgenmethod - Heartbeat generation number creation method
> # Normally these are stored on disk and incremented as needed.
> #hbgenmethod time
> #
> # realtime - enable/disable realtime execution (high priority, etc.)
> # defaults to on
> #realtime off
> #
> # debug - set debug level
> # defaults to zero
> #debug 1
> #
> # API Authentication - replaces the fifo-permissions-based system of
> # the past
> #
> #
> # You can put a uid list and/or a gid list.
> # If you put both, then a process is authorized if it qualifies under
> # either the uid list, or under the gid list.
> #
> # The groupname "default" has special meaning.  If it is specified, then
> # this will be used for authorizing groupless clients, and any client
> # groups not otherwise specified.
> #
> # There is a subtle exception to this.  "default" will never be used in
> # the following cases (actual default auth directives noted in brackets)
> #   ipfail (uid=HA_CCMUSER)
> #   ccm   (uid=HA_CCMUSER)
> #   ping (gid=HA_APIGROUP)
> #   cl_status (gid=HA_APIGROUP)
> #
> # This is done to avoid creating a gaping security hole and matches the
> # most likely desired configuration.
> #
> #apiauth ipfail uid=hacluster
> #apiauth ccm uid=hacluster
> #apiauth cms uid=hacluster
> #apiauth ping gid=haclient uid=alanr,root
> #apiauth default gid=haclient
>
> # message format in the wire, it can be classic or netstring,
> # default: classic
> #msgfmt  classic/netstring
>
> # do we use logging daemon?
> # detail policy:
> # 1. if there is any entry for debugfile/logfile/logfacility in ha.cf
> #      a) if use_logd is not set, logging daemon will not be used
> #      b) if use_logd is set to on, logging daemon will be used
> #      c) if use_logd is set to off, logging daemon will not be used
> #
> #
> # 2. if there is no entry for debugfile/logfile/logfacility in ha.cf
> #      a) if use_logd is not set, logging daemon will be used
> #      b) if use_logd is set to on, logging daemon will be used
> #      c) if use_logd is set to off, config error, i.e. you can not turn
> #           off all logging options
> #
> # If logging daemon is used, logfile/debugfile/logfacility in this file
> # are not meaningful any longer. You should check the config file for
> # logging daemon (the default is /etc/logd.cf)
> #
> # If you are not sure about this option, don't configure it
> #
> # use_logd yes/no
> #
> # the interval we  reconnect to logging daemon if the previous
> # connection failed
> # default: 60 seconds
> #conn_logd_time 60
> #
> #
> # Configure compression module
> # It could be zlib or bz2, depending on whether you have the corresponding
> # library in the system.
> #compression bz2
> #
> # Configure compression threshold
> # This value determines the threshold to compress a message,
> # e.g. if the threshold is 1, then any message with size greater than 1 KB
> # will be compressed, the default is 2 (KB)
> #compression_threshold 2
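
The timing directives in the ha.cf above can be sanity-checked
mechanically. A minimal Python sketch, assuming only what the file's own
comments say: plain numbers are seconds, an "ms" suffix means
milliseconds, and initdead should be at least twice deadtime. The ordering
check keepalive < warntime < deadtime is my own assumption rather than a
documented rule, and all function names here are hypothetical.

```python
def parse_duration(value):
    """Convert an ha.cf time value to seconds ('10' -> 10.0, '1500ms' -> 1.5)."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    return float(value)


def read_timings(text):
    """Collect the four timing directives, skipping comments and blank lines."""
    wanted = {"keepalive", "warntime", "deadtime", "initdead"}
    found = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        parts = line.split()
        if parts[0] in wanted and len(parts) >= 2:
            found[parts[0]] = parse_duration(parts[1])
    return found


def check_timings(t):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if {"keepalive", "warntime", "deadtime"} <= t.keys():
        # Assumption for this sketch, not a rule stated in ha.cf.
        if not t["keepalive"] < t["warntime"] < t["deadtime"]:
            problems.append("expected keepalive < warntime < deadtime")
    if {"deadtime", "initdead"} <= t.keys():
        # From the ha.cf comment: "at least twice the normal dead time".
        if t["initdead"] < 2 * t["deadtime"]:
            problems.append("initdead should be at least twice deadtime")
    return problems


# Values as posted in the ha.cf in this thread.
SAMPLE = """
keepalive 2
deadtime 30
warntime 10
initdead 120
"""

if __name__ == "__main__":
    timings = read_timings(SAMPLE)
    print(timings, check_timings(timings) or "no problems found")
```

With the values posted (keepalive 2, warntime 10, deadtime 30, initdead
120) both checks pass, so the timings themselves do not look like the
cause of the shutdown.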
>
>
> From: Guochun Shi <gshi at ncsa.uiuc.edu>
> Sent by: linux-ha-bounces at lists.linux-ha.org
> Date: 03/21/2006 01:20 PM
> Please respond to: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>
> To: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>
> Subject: Re: [Linux-HA] Emergency Shutdown(MCP dead): Killing ourselves
>
>
> Can you post ha.cf to the list?
>
> -Guochun
>
> johng at auctionsolutions.com wrote:
>
> > Not quite, because the network cable that I pulled was not one that
> > heartbeat was using for bcast.
> >
> > The network cable that I pulled was the public network cable that was
> > being monitored by ipfail.
> >
> > I have 3 NICs in these servers.
> >
> > eth0 - public
> > eth1 - DRBD
> > eth2 - Heartbeat
> >
> > I pulled the cable for eth0, which is monitored by ipfail.
> >
> > Thanks,
> >
> >
> >
> > From: Guochun Shi <gshi at ncsa.uiuc.edu>
> > Sent by: linux-ha-bounces at lists.linux-ha.org
> > Date: 03/21/2006 12:56 PM
> > Please respond to: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>
> > To: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>
> > Subject: Re: [Linux-HA] Emergency Shutdown(MCP dead): Killing ourselves
> >
> >
> > Your problem seems similar to bug 699:
> > http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=699
> >
> > Try the solutions mentioned there.
> >
> > -Guochun
> >
> > johng at auctionsolutions.com wrote:
> >
> > > I have a 2 node cluster using heartbeat-2.0.4
> > >
> > > I am running ipfail on my public network.
> > >
> > > I removed the public network cable on server-1
> > >
> > > server-1 then requested that server-2 take over resources, which
> > > server-2 did successfully.
> > >
> > > I left the network cable unplugged on server-1 for a while, then
> > > started getting these error messages in /var/log/ha-log on server-1:
> > >
> > > heartbeat[18084]: 2006/03/21_10:43:36 ERROR: Shutting down.
> > > heartbeat[18084]: 2006/03/21_10:43:36 ERROR: Cannot write to media pipe 1: Resource temporarily unavailable
> > > heartbeat[18084]: 2006/03/21_10:43:36 ERROR: Shutting down.
> > > heartbeat[18084]: 2006/03/21_10:43:36 ERROR: Cannot write to media pipe 1: Resource temporarily unavailable
> > > heartbeat[18084]: 2006/03/21_10:43:36 ERROR: Shutting down.
> > > heartbeat[18084]: 2006/03/21_10:43:36 ERROR: Cannot write to media pipe 1: Resource temporarily unavailable
> > > heartbeat[18084]: 2006/03/21_10:43:36 ERROR: Shutting down.
> > >
> > >
> > > server-1 then started logging many error messages in /var/log/ha-log
> > > like the following:
> > >
> > > heartbeat[18084]: 2006/03/21_10:44:59 ERROR: Message hist queue is filling up (1000 messages in queue)
> > > heartbeat[18084]: 2006/03/21_10:44:59 ERROR: Cannot write to media pipe 1: Resource temporarily unavailable
> > > heartbeat[18084]: 2006/03/21_10:44:59 ERROR: Shutting down.
> > > heartbeat[18084]: 2006/03/21_10:44:59 ERROR: Message hist queue is filling up (1000 messages in queue)
> > > heartbeat[18084]: 2006/03/21_10:44:59 ERROR: Cannot write to media pipe 1: Resource temporarily unavailable
> > > heartbeat[18084]: 2006/03/21_10:44:59 ERROR: Shutting down.
> > > heartbeat[18087]: 2006/03/21_10:45:00 CRIT: Emergency Shutdown: Master Control process died.
> > > heartbeat[18087]: 2006/03/21_10:45:00 CRIT: Killing pid 18084 with SIGTERM
> > > heartbeat[18087]: 2006/03/21_10:45:00 CRIT: Killing pid 18088 with SIGTERM
> > > heartbeat[18087]: 2006/03/21_10:45:00 CRIT: Killing pid 18089 with SIGTERM
> > > heartbeat[18087]: 2006/03/21_10:45:00 CRIT: Killing pid 18090 with SIGTERM
> > > heartbeat[18087]: 2006/03/21_10:45:00 CRIT: Killing pid 18091 with SIGTERM
> > > heartbeat[18089]: 2006/03/21_10:45:00 ERROR: socket_waitout failure: rc = 3
> > > heartbeat[18087]: 2006/03/21_10:45:00 CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> > > heartbeat[18090]: 2006/03/21_10:45:00 ERROR: glib: Error sending packet: Interrupted system call
> > > heartbeat[18089]: 2006/03/21_10:45:00 ERROR: read_child send: RCs: 0 3
> > > heartbeat[18090]: 2006/03/21_10:45:00 ERROR: write failure on ping 10.1.1.209.:Interrupted system call
> > >
> > > I ran ps on server-1: all heartbeat daemons were gone, but the
> > > ipfail daemon was still running.
> > >
> > > Is this supposed to happen?
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >------------------------------------------------------------------------
> > >
> > >_______________________________________________
> > >Linux-HA mailing list
> > >Linux-HA at lists.linux-ha.org
> > >http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > >See also: http://linux-ha.org/ReportingProblems
> > >