Heartbeat failure
Brian Tinsley
btinsley@emageon.com
Thu, 24 Oct 2002 20:47:54 -0500
>
>
>Second, it appears that you are hitting a kernel bug, so working around it in user-space doesn't strike me as a good idea.
>
>
I'm in about 95% agreement with the kernel bug theory (the other 5% is
the possibility of a driver bug). We've been having problems with lost
heartbeats and heartbeats "disappearing" for 60+ seconds, in which case
we get split brain and trashed resources. We now have a lot of
developers and QA folks hammering away on clusters in our lab, and every
time we think we have found the cause (e.g., the IBM serial driver for
their system management adapter, the kernel idle thread polling,
miscompiled LVS kernel modules, etc.), it happens again (as it did
just a couple of hours ago). The only heartbeat configuration I know for
absolute sure works (*for us*) is a single heartbeat interface
(bcast between two 100TX NICs joined by a crossover cable). But
I refuse to put a system into production in this configuration. I don't
have any problem compiling kernels myself and using them (been doing it
since version 1.2), but my bosses do for some reason. That's why I was
hopeful that the RH 7.3 kernel would work for us. Apparently not! Looks
like I've got a good fight coming ;^)
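
For anyone wanting to check their own logs for the same symptom, here is a
minimal sketch of how gaps like the 60+ second ones described above could be
spotted from a list of heartbeat arrival times. The function name, the sample
timestamps, and the threshold parameter are all assumptions made for
illustration; none of this is part of heartbeat itself.

```python
def find_heartbeat_gaps(timestamps, max_gap=60.0):
    """Return (previous, current, gap) tuples for every pair of consecutive
    heartbeat arrival times that are more than max_gap seconds apart."""
    gaps = []
    ordered = sorted(timestamps)
    # Walk consecutive pairs of arrival times and measure the spacing.
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > max_gap:
            gaps.append((prev, curr, delta))
    return gaps


# Hypothetical arrival times in seconds (made up for this example).
arrivals = [0.0, 2.0, 4.0, 6.0, 71.0, 73.0]
print(find_heartbeat_gaps(arrivals))  # → [(6.0, 71.0, 65.0)]
```

The threshold is a parameter so the same check can be run against whatever
interval a given cluster treats as a dead node.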
---
Brian Tinsley
Chief Systems Engineer
Emageon