[ENBD] Re: Some Questions
Peter T. Breuer
ptb@it.uc3m.es
Fri, 20 Oct 2000 16:37:52 +0200 (MET DST)
Hi .. sorry I couldn't answer properly before. I wasn't at my desk
(more like in my bed). I know you're on the list so I'll answer you via
it - this time!
> I'm Wang Gang. I'm working on network softraid, NBD is a very
> important part in the project. Three weeks ago, I found ENBD. I once
> modified the old NBD code, add some functions, such as timeout. But
> compared with your code, it is too unshaped. So now we use ENBD in
> place of old NBD. After several weeks' using, I found some problems.
> 1.ENBD can't get the correct size of disk in the reiserfs. Because
> the size of device file in the reiserfs is not 0, but 35! I simply
> change "if (!size)" into "if (size < blksize)" (in the function
> "getsize1") to avoid this problem, is there more elegant method?
You have done the right thing. I recall some discussion on reiserfs a
long while ago about why this is. Maybe they had a reason and have
forgotten it! Anyway, it is worth asking. I presume this is either in
response to an fstat() or a GETBLKSIZE call? Please review the
modified nbd-server.c code in http://ww.it.uc3m.es/ptb/nbd/src in
the nbd-2.4.15 directory, and check that my changes match yours. We
are likely to make different bugs, so the intersection should be
correct.
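For reference, the check under discussion can be sketched roughly as below. This is only an illustration, not the actual getsize1() from nbd-server.c: the name probe_size and the lseek() fallback are assumptions made here, and only the "size < blksize" comparison comes from the exchange above.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, NOT the real getsize1(): returns the size in
 * bytes of the object behind fd, or -1 on failure. */
off_t probe_size(int fd, off_t blksize)
{
    struct stat st;
    off_t size = 0;

    if (fstat(fd, &st) == 0)
        size = st.st_size;

    /* Treat anything smaller than one block as "unknown", so that a
     * small nonzero value (such as the 35 reported on reiserfs) does
     * not slip past the old "if (!size)" test. */
    if (size < blksize) {
        size = lseek(fd, 0, SEEK_END);   /* assumed fallback method */
        if (size < 0)
            return -1;
        if (lseek(fd, 0, SEEK_SET) < 0)  /* rewind for later reads */
            return -1;
    }
    return size;
}
```

Whether lseek() is the right fallback for a given device is a separate question; the point is only that the first answer is distrusted when it is smaller than a block.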
> 2.I think your timeout mechanism has some confusion. User must
You are talking about negotiation timeout? I agree that you should
probably set it to more than the default (which is 90s, I think).
I am willing to take patches. There are two problems here:
1) I don't get negative info from the other end about the state of
the machine or net, so I have to use timeouts
2) timeouts are badly handled in unix, I think. I may have to provide
some generic stacking library. I can only set one timeout with alarm() and
it interferes with select() (or is ignored by select, I am not clear
which). Different timeouts at different levels of the code may
collide: the lowest level timeout will always win, not the shortest
one.
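To make the stacking idea concrete, here is a minimal sketch of what such a library might look like. None of this exists in ENBD; the names (deadline_push, deadline_pop, deadline_remaining) and the fixed-depth stack are assumptions for illustration. Each level of the code pushes its own absolute deadline, and the one in force is always the earliest, so the shortest timeout wins rather than the innermost.

```c
#include <stddef.h>
#include <time.h>

#define DEADLINE_MAX 16            /* assumed maximum nesting depth */

static time_t deadline_stack[DEADLINE_MAX];
static size_t deadline_depth = 0;

/* Push a timeout of `secs` seconds measured from `now`.
 * The stored value is the earlier of the new deadline and the one
 * already in force, so an inner level can never extend an outer one.
 * Returns 0 on success, -1 if the stack is full. */
int deadline_push(time_t now, unsigned secs)
{
    time_t d = now + (time_t)secs;

    if (deadline_depth == DEADLINE_MAX)
        return -1;
    if (deadline_depth > 0 && deadline_stack[deadline_depth - 1] < d)
        d = deadline_stack[deadline_depth - 1];
    deadline_stack[deadline_depth++] = d;
    return 0;
}

/* Leave the current level; the enclosing deadline is in force again. */
void deadline_pop(void)
{
    if (deadline_depth > 0)
        deadline_depth--;
}

/* Seconds left before the effective deadline: 0 if it has already
 * expired, -1 if no timeout is set at all. */
long deadline_remaining(time_t now)
{
    time_t d;

    if (deadline_depth == 0)
        return -1;
    d = deadline_stack[deadline_depth - 1];
    return d > now ? (long)(d - now) : 0;
}
```

The value from deadline_remaining() would then go into the timeout argument of select() instead of arming alarm(), which avoids the question of how the two interact.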
> carefully set the timeout values in the client daemon, kernel and server
> daemon to make them consistent(your default value is not consistent, the
> client/server timeout continuously). Thereby the nbd module can't be
Well, I don't see that!
> loaded automatically, it is too inconvenient. In addition, I think the
I have very little experience with loading the module automatically,
since I just do "make test". But you should be able to rig kerneld
to load the module on demand, and then add a post-install command to
conf.modules to start the client. But the client to where? Are you
saying that there needs to be some kind of /etc/nbdtab that defines
the default server targets of different devices? That's easy.
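Such a table could be as simple as one line per device. The sketch below parses a hypothetical "device host port" layout; neither /etc/nbdtab nor this format exists in ENBD today, and the struct and function names are made up for illustration.

```c
#include <stdio.h>

/* One entry of the hypothetical /etc/nbdtab: "device host port". */
struct nbdtab_entry {
    char device[64];
    char host[256];
    int  port;
};

/* Parse a single line. Returns 1 if an entry was filled in,
 * 0 for a blank line or a '#' comment, -1 for a malformed line. */
int nbdtab_parse_line(const char *line, struct nbdtab_entry *e)
{
    /* Skip leading blanks. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n' || *line == '#')
        return 0;

    /* Field widths match the buffer sizes above, minus the NUL. */
    if (sscanf(line, "%63s %255s %d", e->device, e->host, &e->port) != 3)
        return -1;
    if (e->port <= 0 || e->port > 65535)
        return -1;
    return 1;
}
```

A post-install hook could then read the table with fgets() and start one client per valid entry.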
> timeout value should not be per machine, but per connection. So I
> modified your code, add a "timeout negotiation" and a "MY_NBD_SET_INTVL"
> ioctl command to solve these problems.
I'd be interested to see the code. However, as I remarked to you,
speaking architecturally I don't see why the kernel needs to know
about negotiation times, so I am dubious about this.
> 3.As for whether ENBD should report error when it find outages. In my
> view, if use ENBD solely, your retry model is very good. But if use it
> with other softwares(such as RAID), it should report errors, the error
> control is not its business.(I put modified file in the attachment).
I agree that the modes should be switchable. One problem is knowing
when there IS an error. For example, is it an error when all the client
daemons die? Or is it just a bored admin playing around a bit? He
can always start new clients again in a second or two. I frequently
kill and restart servers and clients in testing precisely in order to
provoke them to reconnect, invisibly to the higher levels, and I don't
expect to switch off the device while I'm doing it. I expect requests
on the device to be stored until I say to error them out or I
reconnect the network. Those are user space activities that the kernel
should have no part of. It's comparable to wondering about hotswapping
an ide device ... do you store or error requests while you swap?
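A switchable mode might look something like the sketch below. This is not taken from the ENBD source; the enum, struct and function names are assumptions for illustration. The policy is set from user space, and the decision only consults it when no client daemon is connected.

```c
/* What to do with requests that arrive during an outage.
 * Hypothetical names; the policy is meant to be set from user space. */
enum outage_policy {
    OUTAGE_QUEUE,   /* store requests until reconnect (standalone use) */
    OUTAGE_ERROR    /* fail requests at once (e.g. underneath RAID)    */
};

enum req_action { REQ_SEND, REQ_HOLD, REQ_FAIL };

struct dev_state {
    enum outage_policy policy;
    int connected_clients;      /* client daemons currently attached */
};

/* Decide the fate of one incoming request. */
enum req_action decide_request(const struct dev_state *d)
{
    if (d->connected_clients > 0)
        return REQ_SEND;
    return d->policy == OUTAGE_ERROR ? REQ_FAIL : REQ_HOLD;
}
```

With OUTAGE_QUEUE, killing and restarting all the clients stays invisible to the higher levels; with OUTAGE_ERROR, a RAID layer on top sees the failure immediately and can do its own error handling.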
> 4.It seems that NBD can't suffer "fast and massive workload". When I
> use NBD locally or solely in fast network, running big size bonnie test
> will cause system halt, but small size will not. Who cause this
> problem, NBD or kernel?
Depends on nbd version and kernel version. That problem is not known to
exist on 2.2 kernels. OTOH, I can't run more than a few megabytes of
tests on 2.4.0t1 without the kernel locking up. When they fix loop.c
so that it works in the new kernels I'll look again, as nbd is derived
from loop.c (it writes to the net instead of to the disk).
This is a very serious problem, if you can reproduce it. Please
recompile with -g and attach gdb to the running processes. When
they get stuck, tell me what is going on. I need a clue (apart from
details of kernel version, nbd version and so on). In particular
show me kernel oopses (translated) if they occur. Run with only
one daemon per device to rule out synchronicity problems. The
code is written to sustain async accesses but if there is a bug,
it is likely in that sort of thing.
> 5.In my experiment, I use one machine as center node, run nbd client
> and softraid on it, and use 4 machines as I/O nodes, run nbd server on
> them. The center node has 4 10/100M ethernet NICs, each I/O node has
> one 10/100M NIC, and let each NIC of center node correspond to one I/O
> node by set them into different subnet. I use bonnie to test the speed
> of disk, but result is not satisfactory. Single NBD can achieve 11MB/s,
> almost is equal to the peak value of 100M ethernet. But the raid(or 4
> simultaneously running NBD) only can achieve 17MB/s. Which part is the
> bottleneck, NBD, TCP or hardware?
As I remarked to you in private, these are very impressive figures. I am
unable to exceed 5 or 6MB/s under the same kind of conditions. I don't
know what my bottleneck is - the kernel is not at more than 15% of the
load. I guess that it is the kernel i/o layer. If you can get me data
on your bottleneck I would be interested. The fact that you can improve
on a single card's performance by using more than one card proves that
there is available bandwidth. You will have to experiment by going to
200Mb/s ethernet and passing all channels through one interface
instead of two! Seriously, you have to vary the conditions in order to
discover the determining variables.
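As a rough yardstick when varying the conditions, it helps to express each measurement as a fraction of the raw line rate. The helper below is only a sketch (the function name is made up, and protocol overhead is ignored): a 100Mb/s link carries at most 12.5MB/s, so 11MB/s on one link is 88% of the raw rate, while 17MB/s across four links is 34% of the combined 50MB/s.

```c
/* Fraction of the raw line rate that a measured throughput represents.
 * Hypothetical helper; ignores TCP/IP and ethernet framing overhead.
 *   mbytes_per_s : measured throughput in megabytes per second
 *   link_mbits   : nominal rate of one link in megabits per second
 *   links        : number of links used in parallel
 */
double link_utilisation(double mbytes_per_s, double link_mbits, int links)
{
    double raw_mbytes = link_mbits * links / 8.0;  /* 8 bits per byte */
    return mbytes_per_s / raw_mbytes;
}
```

If the fraction drops sharply as links are added, as it does here, the limit is more likely on the shared side (the centre node or the layers above the network) than on any single wire.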
> 6.Do you think changing nbd server into a nfs-like style is a good idea?
I intend to do it. It would require broadcasting invalidation
information.
Peter