AW: [Linux-HA] ERROR: 100 NULL cf-read() running heartbeat 2.0.0
Alan Robertson
alanr at unix.sh
Wed Aug 3 04:15:04 MDT 2005
Ulrich H. Thomas wrote:
> Okay, here are the log file and the configs.
>
> http://www.rz.unibw-muenchen.de/~j4tu0736/ha/
>
> Sorry, but my webmailer has a problem with attachments. I hope that's
> okay now.
Thanks for the information. There are some inconsistencies in it below...
Here are the interfaces you show in your ha.cf file.
serial /dev/ttyS1 # Linux
bcast bond2 # Linux
ping_group group1 10.102.101.113 10.102.101.241
You're doing broadcast heartbeats over a channel bonded interface, and
you're pinging over a single ping group.
And here are the processes your logs show as running at the
occurrence at 06:00:03.
31180 master control process
31182 FIFO reader
31183 serial write
31184 serial read
31185 bcast write
31186 bcast read
31187 ping_group write
31188 ping_group read
31189 ??? write WHERE DID THIS COME FROM?
31190 ??? read WHERE DID THIS COME FROM?
Aug 2 06:00:03 bnhpsryy heartbeat: [31188]: ERROR: 100 NULL vf->read()
returns in a row. Exiting.: Resource temporarily unavailable
Aug 2 06:00:03 bnhpsryy heartbeat: [31190]: ERROR: 100 NULL vf->read()
returns in a row. Exiting.: Resource temporarily unavailable
The last two processes should not exist. Did you maybe delete a
ping_group from ha.cf after you started heartbeat?
If so, then the explanation below really makes lots of sense...
In any case, it looks like the errors are coming from the "ping"
interface - assuming you deleted from the end and not from the middle of
the ha.cf file. PLEASE explain what you did to the ha.cf file after
this run, and before attaching the ha.cf file.
If you just deleted another ping_group, and it wasn't above the bcast in
the ha.cf, then this is the code that is failing 100 times in a row in
ping_group.c:
    *lenp = 0;
    if ((numbytes=recvfrom(ei->sock, (void *) &buf.cbuf
    ,   sizeof(buf.cbuf)-1, 0, (struct sockaddr *)&their_addr
    ,   &addr_len)) < 0) {
        if (errno != EINTR) {
            PILCallLog(LOG, PIL_CRIT, "Error receiving from socket: %s"
            ,   strerror(errno));
        }
        return(NULL);
    }
Or maybe it's failing later on, since we're not getting any other
failure messages, and 100 EINTRs in a row seems unlikely...
    if(!node) {
        return(NULL);
    }
Or maybe this code...
    msg = wirefmt2msg(msgstart, bufmax - msgstart, MSG_NEEDAUTH);
    if(msg == NULL) {
        return(NULL);
    }
Now what either of these would mean is that some process on this machine
is pinging something else on the machine and has gotten back 100 packets
before we got back any of our own ping packets...
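To make that scenario concrete, here is a minimal sketch, not the code in ping_group.c. On Linux, a raw ICMP socket is handed a copy of every ICMP message arriving at the host, so a reader has to look at the echo identifier to tell its own replies from those meant for some other pinger. The helper name is_our_echo_reply and the choice to match only on type and identifier are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICMP_TYPE_ECHO_REPLY 0  /* ICMP type 0 = echo reply (RFC 792) */

/*
 * Hypothetical helper (not part of heartbeat): decide whether a packet
 * read from a raw ICMP socket is an echo reply carrying our identifier.
 * buf points at the IPv4 header, as delivered by a raw socket.
 * our_id is in host byte order.
 * Returns 1 for our reply, 0 for anything else (including short packets).
 */
static int
is_our_echo_reply(const unsigned char *buf, size_t len, uint16_t our_id)
{
    size_t iphlen;
    uint16_t id;

    assert(buf != NULL);
    if (len < 20) {
        return 0;               /* too short for an IPv4 header */
    }
    iphlen = (size_t)(buf[0] & 0x0f) * 4;   /* IHL field, in bytes */
    if (iphlen < 20 || len < iphlen + 8) {
        return 0;               /* no room for the 8-byte ICMP header */
    }
    if (buf[iphlen] != ICMP_TYPE_ECHO_REPLY) {
        return 0;               /* some other ICMP message */
    }
    /* The identifier is bytes 4-5 of the ICMP header, network byte order */
    id = (uint16_t)((buf[iphlen + 4] << 8) | buf[iphlen + 5]);
    return id == our_id;
}
```

A reader that sees 100 packets in a row for which a check like this returns 0 has not necessarily hit a broken medium; it may just be sharing the host with a busy pinger.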
Is it possible that something on this machine (nagios?) is doing massive
pings to other machines periodically?
Like maybe this?
Aug 1 15:57:03 bnhpsryy nagios: HOST ALERT: EDS_eds-db;UP;SOFT;3;PING
OK - Packet loss = 0%, RTA = 1.44 ms
I think nagios is pinging the heck out of something, and it's making our
code a little sick. Now, this isn't to say that this _should_ make us
sick, but at least it's a start on diagnosing it.
Can you turn off, or slow down, the nagios ping activity for a while and
see if this helps?
Aug 1 15:52:43 bnhpsryy nagios: SERVICE ALERT:
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;WARNING;SOFT;1;PING WARNING -
Packet loss = 0%, RTA = 199.56 ms
Aug 1 15:53:43 bnhpsryy nagios: SERVICE ALERT:
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;OK;SOFT;2;PING OK - Packet loss =
0%, RTA = 0.94 ms
Aug 1 15:57:03 bnhpsryy nagios: HOST ALERT: EDS_eds-db;UP;SOFT;3;PING
OK - Packet loss = 0%, RTA = 1.44 ms
Aug 1 15:57:03 bnhpsryy nagios: SERVICE ALERT:
EDS_eds-db;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 1 15:57:13 bnhpsryy nagios: HOST ALERT: EDS_eds-omb;UP;SOFT;2;PING
OK - Packet loss = 0%, RTA = 1.18 ms
Aug 1 15:57:13 bnhpsryy nagios: SERVICE ALERT:
EDS_eds2b;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 1 16:00:13 bnhpsryy nagios: SERVICE ALERT:
EDS_eds-db;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 1.15 ms
Aug 1 16:00:23 bnhpsryy nagios: SERVICE ALERT:
EDS_eds2b;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 1.41 ms
Aug 1 16:05:33 bnhpsryy nagios: SERVICE ALERT:
MDI-bnhpsrg7;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10
seconds
Aug 1 16:07:13 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING
Management;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
Aug 1 16:10:33 bnhpsryy nagios: HOST ALERT: MDI-bnhpsrg7;UP;HARD;1;PING
OK - Packet loss = 0%, RTA = 1.05 ms
Aug 1 16:10:33 bnhpsryy nagios: SERVICE ALERT:
MDI-bnhpsrg7;PING;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 0.93 ms
Aug 1 16:12:13 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING
Management;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 0.29 ms
Aug 1 18:58:54 bnhpsryy nagios: HOST ALERT: EDS_eds2a;UP;HARD;1;PING OK
- Packet loss = 0%, RTA = 1.48 ms
Aug 1 19:11:14 bnhpsryy nagios: SERVICE ALERT: WAP-GW-SUT-bnsnsrgk;PING
Management;WARNING;SOFT;1;PING WARNING - Packet loss = 0%, RTA = 199.38 ms
Aug 1 19:12:14 bnhpsryy nagios: SERVICE ALERT: WAP-GW-SUT-bnsnsrgk;PING
Management;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.25 ms
Aug 1 19:13:04 bnhpsryy nagios: SERVICE ALERT:
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;WARNING;SOFT;1;PING WARNING -
Packet loss = 0%, RTA = 199.32 ms
Aug 1 19:14:04 bnhpsryy nagios: SERVICE ALERT:
WAP-GW-bypass-LH-sc-wap;PING bnsncla2;OK;SOFT;2;PING OK - Packet loss =
0%, RTA = 0.97 ms
A reasonable fix might be to change the value of "maxnullcount" in
heartbeat.c to 10000 or something like that. It's really only intended
to detect something which is badly broken. If it's badly broken, then
it'll fail pretty soon anyway...
/* Create a read child process (to read messages from hb medium) */
static void
read_child(struct hb_media* mp)
{
    IPC_Channel* ourchan = mp->rchan[P_READFD];
    int nullcount=0;
    const int maxnullcount=10000;
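
For readers who do not have heartbeat.c at hand, the sketch below shows roughly how a "NULL reads in a row" limit of this kind behaves. It is an illustration only, not the actual read_child() loop: the helper name note_read_result, the reset on every successful read, and the >= comparison are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch of a consecutive-NULL counter (not heartbeat's code).
 * msg is the result of one read attempt; NULL means nothing usable came back.
 * Returns 1 when the streak of NULL reads has reached maxnullcount and the
 * reader should give up, 0 otherwise.
 */
static int
note_read_result(const void *msg, int *nullcount, int maxnullcount)
{
    assert(nullcount != NULL);
    if (msg != NULL) {
        *nullcount = 0;         /* any good read resets the streak */
        return 0;
    }
    ++*nullcount;
    return *nullcount >= maxnullcount;
}
```

With a limit of 100, a short burst of foreign packets is enough to end the streak in an exit; with 10000, only a medium that keeps returning NULL for a long time would reach the limit.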
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce