[ENBD] 2.4.32
Peter T. Breuer
ptb at it.uc3m.es
Thu Mar 11 07:44:01 MST 2004
"Also sprach Anders Blomdell:"
> First of all '-n' on the server seems to take care of the hangs.
On the client, you mean.
> So now I have new problems...
> When running the full configuration, performance is very bad
>
> Client 1:
>
> 11:45:41 _llseek(3, 6041313280, [6041313280], SEEK_SET) = 0
> 11:45:41 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
> ..., 32768) = 32768
> 11:46:23 _llseek(3, 6041346048, [6041346048], SEEK_SET) = 0
> 11:46:23 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
> ..., 32768) = 32768
Uh, if it is doing an lseek, then it's surely a server, not a client?
The clients talk to the nbd devices via ioctls.
>
> Server 3 (top):
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2172 root 10 0 2060 2060 1612 S 45.0 0.2 12:05.35 enbd-server
> 2163 root 9 0 2060 2060 1612 S 41.2 0.2 11:46.30 enbd-server
> 2162 root 11 0 2060 2060 1612 S 28.3 0.2 10:09.98 enbd-server
> 2173 root 9 0 2060 2060 1612 S 21.9 0.2 10:49.02 enbd-server
>
> All these servers are doing a select (that eventually times out), like this:
Yes, I have seen this too. It seems to be a new 2.6 thing. I'll try and
figure out what it is.
> 11:48:35 select(2, [1], NULL, NULL, {5, 0}) = 0 (Timeout)
> 11:48:40 select(2, [1], NULL, NULL, {5, 0}) = 0 (Timeout)
They're waiting 5s for incoming packets on the network socket. There are
none. But the 2.6 kernels seem to do a select wait in a busy loop!
Anyone know what is going on there?
If as you say those servers are just sitting there for 5s waiting on a
byte to appear in a network socket, then it is the kernel's or glibc's
fault if that is now implemented in such a way as to take cpu cycles.
Yes?
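
For illustration only, here is a minimal C sketch of the kind of wait shown
in the strace above: a select() with a timeout on a single descriptor. It is
not the enbd-server code; the helper names wait_readable and cpu_seconds are
made up for this example. A wait like this is expected to sleep in the
kernel, so the CPU time consumed across the call should be close to zero.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: block until fd is readable or `seconds` elapse.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int seconds)
{
    fd_set readfds;
    struct timeval tv;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = seconds;   /* the strace above shows a 5 second timeout */
    tv.tv_usec = 0;

    int rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/* Hypothetical helper: CPU seconds consumed by this process so far. */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Comparing cpu_seconds() before and after a wait_readable() call that times
out gives a rough check of whether the wait is really idle on a given
kernel and glibc.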
> Disk load and network load is very light on the servers (since servers and
> clients are moving forward very slowly):
>
> Network load is 20-50 kBytes/s
> Disk load is a few writes/second
>
> Seems like scheduling is broken, could it be TASK_INTERRUPTIBLE that is
> needed?
No. Nothing like that. For a start you are looking at a server, not a
client!
> >> There is a loop between machines, this is what I intend to do
> > Aaaaaaaaaaaargh!
> That's what the machines says as well :(
>
> > Is that diagram copyright, or can I borrow it?
> You're welcome...
>
> > No loop. Here's one server.
> >
> > | |
> > md0 esrv esrv
> > ============ | |
> > hda1 nda ndb hda2 hda3
> > | |
> > eclt eclt
> > | |
> >
> > Now, is that what you have?
> Yes.
>
> > Well, you would deadlock without locks if you really had a loop. It's a
> > buffer deadlock. When one server tries to push data to disks, it needs
> > buffers to hold them, but when memory is full that can only come from
> > flushing other buffers to enbd, which sends them to the other server,
> > which unfortunately is trying to push data to disks ... (repeat
> > previous half-sentence and go round in loop).
> OK, fits the problem.
>
> > pushing to nda on both. So buffers will fill both sides, then the
> > kernel will act to send buffers to disk, which is out to enbd, out to
> > the other side, which will receive and send buffers to disk -
> > unfortunately it can't. Buffers are full. Memory deadlock.
> >
> > The only hope is to run without buffering or to push buffers to disk
> > before memory fills. Try using O_DIRECT. I forget the flag ... "-n" on
> > the server.
> >
> > You can also try making the server async ("-a"). I quote the man page:
> >
> > -a This option tells the server to run asynchronously
> > with respect to the network. It will acknowledge
> > writes from the client before it has written them
> > to disk. There is a potential to lose data here.
> > This is faster than the normal service mode and it
> > can avoid a deadlock under some situations. For
> > example two servers with two clients both writing
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > to each other at the same time will deadlock when
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > both kernels are simultaneously full with dirty
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > buffers aimed at the clients. Each nbd server
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > wants to write to disk to relieve the pressure but
> > cannot get buffers to do so because they are locked
> > up in its client. This happens localhost to
> > localhost too. Using a raw device as resource or
> ^^^^^^^^^^^
> Is /dev/hdaX raw device or is it only /dev/hda, obviously /dev/hdaX
> deadlocks.
I am not sure what you mean ... oh, the -n flag on the server? It's
labelled as experimental, and last I heard Arne was trying to figure
out under what conditions one can write data under O_DIRECT ordinarily.
You will have to set "-b 4096" also on the server, in order to
force writes to be aligned to a page, which we are fairly confident is
OK under O_DIRECT. With -n the server will do its opens with O_DIRECT.
With -n and -b 4096, its writes will be to a fd opened O_DIRECT and
will be written from buffers aligned to 4096.
Check the code in file.c if in doubt.
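
As a rough sketch of what "opened O_DIRECT and written from buffers aligned
to 4096" means in practice, the C fragment below allocates a page-aligned
block and writes it at a block-aligned offset. It is not taken from file.c;
BLOCK_SIZE, alloc_block, block_aligned and write_block_direct are names
invented for this example, and the assumption that buffer address, length
and offset all want 4096 alignment is the conservative reading of the
paragraph above.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT on glibc */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* no O_DIRECT here: falls back to buffered I/O */
#endif

#define BLOCK_SIZE 4096      /* mirrors "-b 4096" (assumed page size) */

/* Return a zeroed buffer whose address is a multiple of BLOCK_SIZE, or NULL. */
static void *alloc_block(void)
{
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return NULL;
    memset(buf, 0, BLOCK_SIZE);
    return buf;
}

/* True if an offset or length is a whole number of blocks. */
static int block_aligned(uint64_t n)
{
    return (n % BLOCK_SIZE) == 0;
}

/* Write one block at a block-aligned offset through an O_DIRECT descriptor.
 * Returns 0 on success, -1 on failure (misalignment, or a filesystem
 * that refuses O_DIRECT). */
static int write_block_direct(const char *path, const void *buf, off_t offset)
{
    if (!block_aligned((uint64_t)offset) || ((uintptr_t)buf % BLOCK_SIZE) != 0)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = pwrite(fd, buf, BLOCK_SIZE, offset);
    int rc = (n == BLOCK_SIZE) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Note that the offsets in the strace output quoted earlier (6041313280 and
6041346048) are both multiples of 4096, so 32768-byte writes at those
offsets would satisfy this alignment rule.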
Neither /dev/hda nor /dev/hdaX is a raw device. You used to be
able to bind them to raw devices, but I think that is now deprecated
in 2.6 in favour of the O_DIRECT trick. I did have a trick in the
kernel code which caused all opens on the enbd device to become
O_DIRECT ... yes, you can set direct=1 as a module parameter, or
do it device by device via the /proc/sys/dev/enbd interface. I presume
one can also write something like "direct[a]=1" to /proc/nbdinfo.
Be careful to distinguish client from server. I was talking about
serverside options. But yes, you can also remove buffering clientside.
There also used to be a raw interface for enbd ... the enbd_raw.o
module? Is that only in the 2.4 code?
> My limited experience has been that TASK_INTERRUPTIBLE was necessary to make
> schedule have any effect, but I may well be wrong (as you have kindly
> pointed out).
No, you are right. Adding it will cause the clr_kernel_queue function
to behave more kindly to other kernel threads.
Peter