AW: [Linux-HA] ERROR: 100 NULL cf-read() running heartbeat 2.0.0

Alan Robertson alanr at unix.sh
Wed Aug 3 04:15:04 MDT 2005


Ulrich H. Thomas wrote:
> Okay, here are the log file and the configs.
> 
> http://www.rz.unibw-muenchen.de/~j4tu0736/ha/
> 
> Sorry, but my webmailer has a problem with the attachments. Hope that's
> okay now.

Thanks for the information.  There are some inconsistencies in it, which 
I describe below...


Here are the interfaces you show in your ha.cf file.

serial	/dev/ttyS1	# Linux
bcast	bond2		# Linux
ping_group group1 10.102.101.113 10.102.101.241


You're doing broadcast heartbeats over a channel-bonded interface, and 
you're pinging a single ping group.

And here are the processes your logs show as running at the time of the 
occurrence at 06:00:03.

31180	master control process
31182	FIFO reader
31183	serial write
31184	serial read
31185	bcast write
31186	bcast read
31187	ping_group write
31188	ping_group read
31189	??? write                WHERE DID THIS COME FROM?
31190   ??? read                 WHERE DID THIS COME FROM?

Aug  2 06:00:03 bnhpsryy heartbeat: [31188]: ERROR: 100 NULL vf->read() 
returns in a row. Exiting.: Resource temporarily unavailable
Aug  2 06:00:03 bnhpsryy heartbeat: [31190]: ERROR: 100 NULL vf->read() 
returns in a row. Exiting.: Resource temporarily unavailable

The last two processes should not exist.  Did you maybe delete a 
ping_group from ha.cf after you started heartbeat?

If so, then the explanation below really makes lots of sense...

In any case, it looks like the errors are coming from the "ping" 
interface - assuming you deleted from the end and not from the middle of 
the ha.cf file.  PLEASE explain what you did to the ha.cf file after 
this run, and before attaching the ha.cf file.
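
As a rough cross-check of the process list against ha.cf, here is a 
minimal sketch in C.  It assumes one medium per ha.cf line and one write 
child plus one read child per medium, on top of the master control 
process and the FIFO reader, which is what the list above suggests. 
count_media_lines() and expected_process_count() are hypothetical 
helpers written for this mail, not heartbeat functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of heartbeat: count the lines of an
 * ha.cf text that begin with a media directive.  Assumes one medium
 * per line (a "bcast" line naming several interfaces would be
 * undercounted). */
static int count_media_lines(const char *cf_text)
{
    static const char *const media[] = {
        "serial", "bcast", "mcast", "ucast", "ping_group", "ping"
    };
    const size_t nmedia = sizeof(media) / sizeof(media[0]);
    const char *line = cf_text;
    int count = 0;

    while (*line != '\0') {
        size_t linelen = strcspn(line, "\n");
        const char *word = line + strspn(line, " \t");
        size_t wordlen = strcspn(word, " \t\n");
        size_t i;

        /* Comment lines start with '#', so they never match. */
        for (i = 0; i < nmedia; i++) {
            if (strlen(media[i]) == wordlen
            &&  strncmp(word, media[i], wordlen) == 0) {
                count++;
                break;
            }
        }
        line += linelen;
        if (*line == '\n') {
            line++;
        }
    }
    return count;
}

/* Master control process + FIFO reader + (write, read) per medium. */
static int expected_process_count(int nmedia)
{
    return 2 + 2 * nmedia;
}
```

For the three media lines shown above this gives 8 processes, while the 
log shows 10, which is what one extra medium would account for.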



If you just deleted another ping_group, and it wasn't above the bcast 
line in ha.cf, then this is the code in ping_group.c which is failing 
100 times in a row:


     *lenp = 0;
     if ((numbytes=recvfrom(ei->sock, (void *) &buf.cbuf
     ,       sizeof(buf.cbuf)-1, 0,  (struct sockaddr *)&their_addr
     ,       &addr_len)) < 0) {
          if (errno != EINTR) {
               PILCallLog(LOG, PIL_CRIT, "Error receiving from socket: %s"
               ,       strerror(errno));
          }
          return(NULL);
     }

Or maybe it's failing later on, since we're not getting any other 
failure messages, and 100 EINTRs in a row seems unlikely...

         if(!node) {
                 return(NULL);
         }

Or maybe this code...

         msg = wirefmt2msg(msgstart, bufmax - msgstart, MSG_NEEDAUTH);
         if(msg == NULL) {
                 return(NULL);
         }


What either of these would mean is that some other process on this 
machine is pinging something, and we received 100 of its reply packets 
before we got back any of our own ping packets...

Is it possible that something on this machine (nagios?) is doing massive 
pings to other machines periodically?

Like maybe this?

Aug  1 15:57:03 bnhpsryy nagios: HOST ALERT: EDS_eds-db;UP;SOFT;3;PING 
OK - Packet loss = 0%, RTA = 1.44 ms

I think nagios is pinging the heck out of something, and it's making our 
code a little sick.  Now, this isn't to say that this _should_ make us 
sick, but at least it's a start on diagnosing it.
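
To illustrate the suspected mechanism: on Linux, a raw ICMP socket is 
handed the echo replies arriving for every pinging process on the host, 
not just its own.  A receiver therefore has to recognize and skip 
foreign replies.  The sketch below shows one common way to do that, by 
comparing the ICMP identifier field.  It is not the ping_group.c code, 
and is_our_echo_reply() is a hypothetical name; it assumes the buffer 
starts with the IPv4 header, as it does when read from an IPv4 raw 
socket.

```c
#include <stddef.h>

#define ICMP_ECHO_REPLY_TYPE 0   /* ICMP type of an echo reply */

/* Hypothetical helper: return 1 if pkt is an IPv4 ICMP echo reply
 * carrying our identifier, 0 otherwise. */
static int is_our_echo_reply(const unsigned char *pkt, size_t len,
                             unsigned our_id)
{
    size_t ihl;
    unsigned id;

    if (pkt == NULL || len < 20) {
        return 0;                       /* too short for an IPv4 header */
    }
    if ((pkt[0] >> 4) != 4) {
        return 0;                       /* not IPv4 */
    }
    ihl = (size_t)(pkt[0] & 0x0f) * 4;  /* IP header length in bytes */
    if (ihl < 20 || len < ihl + 8) {
        return 0;                       /* no room for the ICMP header */
    }
    if (pkt[ihl] != ICMP_ECHO_REPLY_TYPE) {
        return 0;
    }
    /* The identifier is bytes 4-5 of the ICMP header, network order. */
    id = ((unsigned)pkt[ihl + 4] << 8) | pkt[ihl + 5];
    return id == (our_id & 0xffffu);
}
```

If replies that fail a check like this are counted as NULL reads, then 
a burst of someone else's ping traffic is enough to trip the limit.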

Can you turn off, or slow down, the nagios ping activity for a while and 
see if this helps?

Aug  1 15:52:43 bnhpsryy nagios: SERVICE ALERT: 
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;WARNING;SOFT;1;PING WARNING - 
Packet loss = 0%, RTA = 199.56 ms
Aug  1 15:53:43 bnhpsryy nagios: SERVICE ALERT: 
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;OK;SOFT;2;PING OK - Packet loss = 
0%, RTA = 0.94 ms
Aug  1 15:57:03 bnhpsryy nagios: HOST ALERT: EDS_eds-db;UP;SOFT;3;PING 
OK - Packet loss = 0%, RTA = 1.44 ms
Aug  1 15:57:03 bnhpsryy nagios: SERVICE ALERT: 
EDS_eds-db;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug  1 15:57:13 bnhpsryy nagios: HOST ALERT: EDS_eds-omb;UP;SOFT;2;PING 
OK - Packet loss = 0%, RTA = 1.18 ms
Aug  1 15:57:13 bnhpsryy nagios: SERVICE ALERT: 
EDS_eds2b;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug  1 16:00:13 bnhpsryy nagios: SERVICE ALERT: 
EDS_eds-db;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 1.15 ms
Aug  1 16:00:23 bnhpsryy nagios: SERVICE ALERT: 
EDS_eds2b;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 1.41 ms
Aug  1 16:05:33 bnhpsryy nagios: SERVICE ALERT: 
MDI-bnhpsrg7;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 
seconds
Aug  1 16:07:13 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING 
Management;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
Aug  1 16:10:33 bnhpsryy nagios: HOST ALERT: MDI-bnhpsrg7;UP;HARD;1;PING 
OK - Packet loss = 0%, RTA = 1.05 ms
Aug  1 16:10:33 bnhpsryy nagios: SERVICE ALERT: 
MDI-bnhpsrg7;PING;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 0.93 ms
Aug  1 16:12:13 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING 
Management;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 0.29 ms
Aug  1 18:58:54 bnhpsryy nagios: HOST ALERT: EDS_eds2a;UP;HARD;1;PING OK 
- Packet loss = 0%, RTA = 1.48 ms
Aug  1 19:11:14 bnhpsryy nagios: SERVICE ALERT: WAP-GW-SUT-bnsnsrgk;PING 
Management;WARNING;SOFT;1;PING WARNING - Packet loss = 0%, RTA = 199.38 ms
Aug  1 19:12:14 bnhpsryy nagios: SERVICE ALERT: WAP-GW-SUT-bnsnsrgk;PING 
Management;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.25 ms
Aug  1 19:13:04 bnhpsryy nagios: SERVICE ALERT: 
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;WARNING;SOFT;1;PING WARNING - 
Packet loss = 0%, RTA = 199.32 ms
Aug  1 19:14:04 bnhpsryy nagios: SERVICE ALERT: 
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;OK;SOFT;2;PING OK - Packet loss = 
0%, RTA = 0.97 ms

A reasonable fix might be to change the value of "maxnullcount" in 
heartbeat.c to 10000 or something like that.  It's really only intended 
to detect something which is badly broken.  If it's badly broken, then 
it'll fail pretty soon anyway...

/* Create a read child process (to read messages from hb medium) */
static void
read_child(struct hb_media* mp)
{
         IPC_Channel* ourchan =  mp->rchan[P_READFD];
         int             nullcount=0;
         const int       maxnullcount=10000;
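
To show why the larger limit helps, here is a minimal simulation of a 
consecutive-NULL counter.  It is a sketch, not the read_child() loop 
itself: read_fn, noisy_source, noisy_read() and reader_gives_up() are 
all made up for illustration, and it assumes the counter is reset by 
every successful read, as "in a row" in the error message implies.

```c
#include <stddef.h>

/* Hypothetical stand-in for a medium's read function. */
typedef const char *(*read_fn)(void *state);

/* Simulated medium: only every period-th read yields a message. */
struct noisy_source {
    int calls;
    int period;
};

static const char *noisy_read(void *state)
{
    struct noisy_source *s = state;

    s->calls++;
    return (s->calls % s->period == 0) ? "msg" : NULL;
}

/* Return 1 if the reader would exit within total_reads reads because
 * maxnullcount NULL returns arrived in a row, 0 if it survives. */
static int reader_gives_up(read_fn rd, void *state,
                           int maxnullcount, int total_reads)
{
    int nullcount = 0;
    int i;

    for (i = 0; i < total_reads; i++) {
        if (rd(state) == NULL) {
            if (++nullcount >= maxnullcount) {
                return 1;
            }
        } else {
            nullcount = 0;      /* a good read resets the counter */
        }
    }
    return 0;
}
```

With a source that delivers one usable message per 150 reads, a limit 
of 100 makes the reader give up, while a limit of 10000 lets it ride 
out the noise.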




-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce


