[Linux-ha-dev] 2.0.8 bug in handling of clock wrap in longclock.c

Graham, Simon Simon.Graham at stratus.com
Sat Jun 16 18:56:09 MDT 2007


I've been struggling for a while with a problem whereby the wrap count
in time_longclock was apparently being incremented twice instead of
once, leading to apparent timeouts when none happened - I instrumented
things a little and came up with the following fairly reproducible
result approx 5 mins after boot time (since that's when the first wrap
occurs on Linux):

Jun 14 21:22:22 godzilla heartbeat: [8483]: info: time_longclock:
clock_t wrapped around (uptime), wc 100000000 tv 10
Jun 14 21:22:22 godzilla heartbeat: [8483]: WARN:
Gmain_timeout_dispatch: Dispatch function for cpu limit was delayed
4294953646 ms (> 50 ms) before being called (GSource: 0x80f0458)
Jun 14 21:22:22 godzilla heartbeat: [8483]: info:
Gmain_timeout_dispatch: started at 8589934608 should have started at
4294968677
Jun 14 21:22:22 godzilla heartbeat: [8483]: WARN:
Gmain_timeout_dispatch: Dispatch function for send local status was
delayed 4294966556 ms (> 510 ms) before being called (GSource:
0x80f6500)
Jun 14 21:22:22 godzilla heartbeat: [8483]: info:
Gmain_timeout_dispatch: started at 8589934608 should have started at
4294967386
Jun 14 21:22:22 godzilla heartbeat: [8483]: WARN: node
kingkong.sn.stratus.com: is dead
Jun 14 21:22:22 godzilla heartbeat: [8483]: info: Link
kingkong.sn.stratus.com:xenbr0 dead.
Jun 14 21:22:22 godzilla heartbeat: [8483]: info: Link
kingkong.sn.stratus.com:eth1 dead.

The correct time should have been 0x100000010, but was actually
0x200000010 (8589934608)

After much struggling, I finally found the issue today - inside
time_longclock(0 we log the fact that a wrap occurred; IF you are using
the log daemon (as recommended), then in the guts of this, we end up
calling time_longclock() recursively which goes through the same path
and increments the wrapcount a second time (but does it silently because
the recursive log is ignored) - gdb session attached which proves this.

I think the fix is to update the static data in time_longclock() BEFORE
logging - proposed patch attached. Assuming it's OK, what's the right
way to get this applied to the sources?

Simon Graham

----extract of gdb session---

(gdb) watch lc_wrapcount
Hardware watchpoint 2: lc_wrapcount
(gdb) commands
Type commands for when breakpoint 2 is hit, one per line.
End with a line saying just "end".
>bt
>c
>end
(gdb) c
Continuing.
Hardware watchpoint 2: lc_wrapcount

Old value = 0
New value = 4294967296
time_longclock () at longclock.c:169
#0  time_longclock () at longclock.c:169
#1  0xb7f763ea in Gmain_timeout_check (src=0x80f0458) at GSource.c:1504
#2  0xb7f0e635 in g_main_context_check () from /usr/lib/libglib-2.0.so.0
#3  0xb7f10f84 in g_main_context_acquire () from
/usr/lib/libglib-2.0.so.0
#4  0xb7f11264 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#5  0x0805235d in initialize_heartbeat () at heartbeat.c:1535
#6  0x08056abb in main (argc=1, argv=0xbfbb4f14, envp=0xbfbb4f1c)
    at heartbeat.c:4842
Hardware watchpoint 2: lc_wrapcount

Old value = 4294967296
New value = 8589934592
time_longclock () at longclock.c:169
#0  time_longclock () at longclock.c:169
#1  0xb7f79dd1 in LogToLoggingDaemon (priority=6, 
    buf=0xbfbb3640 "time_longclock: clock_t wrapped around (uptime), wc
100000000 tv 26", bufstrlen=68, use_pri_str=1) at cl_log.c:746
#2  0xb7f79fd2 in cl_log (priority=6, 
    fmt=0xb7f8b0c0 "%s: clock_t wrapped around (uptime), wc %llx tv
%lx")
    at cl_log.c:602
#3  0xb7f83689 in time_longclock () at longclock.c:169
#4  0xb7f763ea in Gmain_timeout_check (src=0x80f0458) at GSource.c:1504
#5  0xb7f0e635 in g_main_context_check () from /usr/lib/libglib-2.0.so.0
#6  0xb7f10f84 in g_main_context_acquire () from
/usr/lib/libglib-2.0.so.0
#7  0xb7f11264 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#8  0x0805235d in initialize_heartbeat () at heartbeat.c:1535
#9  0x08056abb in main (argc=1, argv=0xbfbb4f14, envp=0xbfbb4f1c)
    at heartbeat.c:4842

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb-longclock.patch
Type: application/octet-stream
Size: 1070 bytes
Desc: hb-longclock.patch
Url : http://lists.community.tummy.com/pipermail/linux-ha-dev/attachments/20070616/d85a4fb3/hb-longclock.obj


More information about the Linux-HA-Dev mailing list