[ENBD] 2.4.32

Peter T. Breuer ptb at it.uc3m.es
Thu Mar 11 07:44:01 MST 2004


"Also sprach Anders Blomdell:"
> First of all '-n' on the server seems to take care of the hangs.

On the client, you mean.

> So now I have new problems...
> When running the full configuration, performance is very bad
> 
> Client 1:
> 
> 11:45:41 _llseek(3, 6041313280, [6041313280], SEEK_SET) = 0
> 11:45:41 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
> ..., 32768) = 32768
> 11:46:23 _llseek(3, 6041346048, [6041346048], SEEK_SET) = 0
> 11:46:23 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
> ..., 32768) = 32768

Uh, if it is doing an lseek, then it's surely a server, not a client?
The clients talk to the nbd devices via ioctls.
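
For what it's worth, that seek-then-write pair is just what you would
expect from any userland process storing a request at an offset in its
backing resource. A minimal sketch of the pattern follows, assuming a
plain file descriptor; write_at and demo_write_at are made-up names for
illustration and are not taken from the enbd source.

```c
#define _XOPEN_SOURCE 700     /* for pread() and mkstemp() */
#define _FILE_OFFSET_BITS 64  /* 64-bit off_t even on 32-bit hosts */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 32768           /* same size as the writes in the trace */

/* Hypothetical helper: write len bytes from buf at absolute offset off.
 * Returns 0 on success, -1 on error. */
static int write_at(int fd, off_t off, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    if (lseek(fd, off, SEEK_SET) == (off_t)-1)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;        /* a real server would also retry on EINTR */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Usage: write one chunk into a scratch file and read it back. */
static int demo_write_at(void)
{
    char path[] = "/tmp/seekwrite-XXXXXX";
    static char out[CHUNK], in[CHUNK];
    off_t off = (off_t)4 * CHUNK;
    int ok;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);             /* file disappears once fd is closed */
    memset(out, 0x5a, sizeof out);
    ok = write_at(fd, off, out, sizeof out) == 0
      && pread(fd, in, sizeof in, off) == (ssize_t)sizeof in
      && memcmp(in, out, sizeof in) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

An strace of demo_write_at() should show the same lseek/write pairing
as in the trace quoted above.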


> 
> Server 3 (top):
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   2172 root      10   0  2060 2060 1612 S 45.0  0.2  12:05.35 enbd-server
>   2163 root       9   0  2060 2060 1612 S 41.2  0.2  11:46.30 enbd-server
>   2162 root      11   0  2060 2060 1612 S 28.3  0.2  10:09.98 enbd-server
>   2173 root       9   0  2060 2060 1612 S 21.9  0.2  10:49.02 enbd-server
> 
> All these servers are doing a select (that eventually times out), like this:

Yes, I have seen this too. It seems to be a new 2.6 thing. I'll try and
figure out what it is.

> 11:48:35 select(2, [1], NULL, NULL, {5, 0}) = 0 (Timeout)
> 11:48:40 select(2, [1], NULL, NULL, {5, 0}) = 0 (Timeout)

They're waiting 5s for incoming packets on the network socket. There are
none. But the 2.6 kernels seem to do a select wait in a busy loop!
Anyone know what is going on there?

If, as you say, those servers are just sitting there for 5s waiting on a
byte to appear in a network socket, then it is the kernel's or glibc's
fault if that is now implemented in such a way as to take cpu cycles.

Yes?
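
One way to check this outside enbd is a small userland test: do the same
kind of select wait on a descriptor that never becomes readable and
compare the CPU time used against the wall-clock time waited. This is
only a sketch; wait_readable and idle_wait_cpu_seconds are names I made
up for it, and it uses an empty pipe in place of the network socket.

```c
#include <assert.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/* Wait up to sec seconds + usec microseconds for fd to become readable.
 * Returns what select() returns: 0 on timeout, 1 if readable, -1 on error. */
static int wait_readable(int fd, long sec, long usec)
{
    fd_set rfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = sec;
    tv.tv_usec = usec;
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* CPU seconds used by this process during one idle wait on an empty
 * pipe. Returns a negative value if the setup fails or the wait does
 * not end in a timeout. */
static double idle_wait_cpu_seconds(long sec, long usec)
{
    int p[2];
    int r;
    clock_t before, after;

    if (pipe(p) < 0)
        return -1.0;
    before = clock();                 /* process CPU time, not wall time */
    r = wait_readable(p[0], sec, usec);
    after = clock();
    close(p[0]);
    close(p[1]);
    if (r != 0)
        return -1.0;
    return (double)(after - before) / CLOCKS_PER_SEC;
}
```

If the wait really sleeps, idle_wait_cpu_seconds(5, 0) should come back
close to zero; if something is spinning, it will be nearer to 5.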


> Disk load and network load is very light on the servers (since servers and 
> clients are moving forward very slowly):
> 
>    Network load is 20-50 kBytes/s
>    Disk load is a few writes/second
> 
> Seems like scheduling is broken, could it be TASK_INTERRUPTIBLE that is 
> needed?

No. Nothing like that. For a start you are looking at a server, not a
client!


> >> There is a loop between machines, this is what I intend to do
> > Aaaaaaaaaaaargh!
> That's what the machines says as well :(
> 
> > Is that diagram copyright, or can I borrow it?
> You're welcome...
> 
> > No loop. Here's one server.
> >
> >                   |    |
> >        md0       esrv esrv
> >     ============  |    |
> >     hda1 nda ndb hda2 hda3
> >           |   |
> >          eclt eclt
> >           |   |
> >
> > Now, is that what you have?
> Yes.
> 
> > Well, you would deadlock without locks if you really had a loop. It's a
> > buffer deadlock.  When one server tries to push data to disks, it needs
> > buffers to hold them, but when memory is full that can only come from
> > flushing other buffers to enbd, which sends them to the other server,
> > which unfortunately is trying to push data to disks ...  (repeat
> > previous half-sentence and go round in loop).
> OK, fits the problem.
> 
> > pushing to nda on both. So buffers will fill both sides, then the
> > kernel will act to send buffers to disk, which is out to enbd, out to
> > the other side, which will receive and send buffers to disk -
> > unfortunately it can't. Buffers are full.  Memory deadlock.
> >
> > The only hope is to run without buffering or to push buffers to disk
> > before memory fills. Try using O_DIRECT. I forget the flag ... "-n" on
> > the server.
> >
> > You can also try making the server async ("-a"). I quote the man page:
> >
> >        -a     This  option tells the server to run asynchronously
> >               with respect to the network.  It  will  acknowledge
> >               writes  from  the client before it has written them
> >               to disk.  There is a potential to lose  data  here.
> >               This  is faster than the normal service mode and it
> >               can avoid a deadlock under  some  situations.   For
> >               example  two  servers with two clients both writing
> >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >               to each other at the same time will  deadlock  when
> >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >               both  kernels  are  simultaneously  full with dirty
> >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >               buffers aimed at  the  clients.   Each  nbd  server
> >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >               wants  to write to disk to relieve the pressure but
> >               cannot get buffers to do so because they are locked
> >               up  in  its client.  This happens localhost to lo-
> >               calhost too.  Using a raw  device  as  resource  or
>                                       ^^^^^^^^^^^
> Is /dev/hdaX raw device or is it only /dev/hda, obviously /dev/hdaX 
> deadlocks.

I am not sure what you mean ... oh, the -n flag on the server? It's
labelled as experimental, and last I heard Arne was trying to figure
out under what conditions one can write data under O_DIRECT ordinarily.
You will also have to set "-b 4096" on the server, in order to
force writes to be aligned to a page, which we are fairly confident is
OK under O_DIRECT. With -n the server will do its opens with O_DIRECT.
With -n and -b 4096, its writes will go to an fd opened O_DIRECT and
will be written from buffers aligned to 4096 bytes.

Check the code in file.c if in doubt.
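
To show what "opened O_DIRECT and written from aligned buffers" amounts
to in userland, here is a rough sketch. It is not the code in file.c;
all the function names are invented for illustration, and the 4096-byte
alignment of buffer, offset and length is an assumption carried over
from the "-b 4096" setting above. The O_DIRECT fallback define is only
there so the sketch compiles on platforms that lack the flag.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0            /* assumption: no O_DIRECT here, plain open */
#endif

#define BLKSZ 4096            /* block size, as with "-b 4096" */

/* Allocate a zeroed buffer of nblocks * BLKSZ bytes aligned to BLKSZ.
 * Returns NULL on failure. Release it with free(). */
static void *alloc_aligned_blocks(size_t nblocks)
{
    void *buf = NULL;

    if (nblocks == 0 || posix_memalign(&buf, BLKSZ, nblocks * BLKSZ) != 0)
        return NULL;
    memset(buf, 0, nblocks * BLKSZ);
    return buf;
}

/* Nonzero when buffer address, offset and length are multiples of BLKSZ. */
static int direct_io_ok(const void *buf, off_t off, size_t len)
{
    return (uintptr_t)buf % BLKSZ == 0 && off % BLKSZ == 0
        && len % BLKSZ == 0;
}

/* Open path with O_DIRECT and write one aligned block at block index blk.
 * Returns 0 on success, -1 on failure (for instance EINVAL from open()
 * on a filesystem that does not support O_DIRECT). */
static int direct_write_block(const char *path, off_t blk, const void *buf)
{
    off_t off = blk * BLKSZ;
    int fd, rc;

    assert(path != NULL);
    if (!direct_io_ok(buf, off, BLKSZ)) {
        errno = EINVAL;       /* refuse misaligned I/O before the kernel does */
        return -1;
    }
    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return -1;
    rc = pwrite(fd, buf, BLKSZ, off) == (ssize_t)BLKSZ ? 0 : -1;
    close(fd);
    return rc;
}
```

Whether the open succeeds depends on the filesystem underneath, so treat
a failure from direct_write_block as something to report, not to assert.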

Neither /dev/hda nor /dev/hdaX is a raw device. You used to be
able to bind them to raw devices, but I think that is now deprecated
in 2.6 in favour of the O_DIRECT trick. I did have a trick in the
kernel code which caused all opens on the enbd device to become
O_DIRECT ... yes, you can set direct=1 as a module parameter, or
do it device by device via the /proc/sys/dev/enbd interface. I presume
one can also write something like "direct[a]=1" to /proc/nbdinfo.

Be careful to distinguish client from server. I was talking about
serverside options. But yes, you can also remove buffering clientside.

There also used to be a raw interface for enbd .. the enbd_raw.o
module? Is that only in the 2.4 code?

> My limited experience has been that TASK_INTERRUPTIBLE was necessary to make
> schedule have any effect, but I may well be wrong (as you have kindly 
> pointed out).

No, you are right. Adding it will cause the clr_kernel_queue function
to behave more kindly to other kernel threads.

Peter


