Heartbeat failure

Brian Tinsley btinsley@emageon.com
Thu, 24 Oct 2002 20:47:54 -0500


>Second, it appears that you are hitting a kernel bug, so working around it in user-space doesn't strike me as a good idea.
>
I'm in about 95% agreement with the kernel bug theory (the other 5% is 
the possibility of a driver bug). We've been having problems with lost 
heartbeats and with heartbeats "disappearing" for 60+ seconds at a time, 
at which point we get split brain and trashed resources. We now have a lot of 
developers and QA folks hammering away on clusters in our lab, and every 
time we think we have found the cause (e.g., the IBM serial driver for 
their system management adapter, the kernel idle thread polling, or 
miscompiled LVS kernel modules), it happens again (as it did 
just a couple of hours ago). The only heartbeat configuration I know for 
certain works (*for us*) is one with just a single heartbeat interface 
configured (bcast between two 100TX NICs over a crossover cable). But 
I refuse to put a system into production in this configuration. I don't 
have any problem compiling kernels myself and using them (been doing it 
since version 1.2), but my bosses do for some reason. That's why I was 
hopeful that the RH 7.3 kernel would work for us. Apparently not! Looks 
like I've got a good fight coming ;^)

---
Brian Tinsley
Chief Systems Engineer
Emageon