[ENBD] Re: new feature in ENBD.
Peter T. Breuer
ptb@it.uc3m.es
Fri, 8 Sep 2000 21:01:51 +0200 (MET DST)
"A month of sundays ago Daniel Shane wrote:"
> By the way, I found out that /dev/zero wasn't there in my ramdisk so it
> was mmapping in a file all the time. Nevertheless, it works. If I
> create /dev/zero then I get an "invalid argument", as you said earlier.
I fixed that after I found out what you were doing. I have been running
with /dev/zero as the journal for a while, with no problem. I was
confused about the proper mmap treatment for a while, but I think it
was straightened out correctly by 2.4.11. The difference is in the
parameters to the mmap ...
> Ok, I just tested 2.4.11 and found out that something must have changed
> since 2.4.9 because I get an ENOMEM with the mmap (on 2.4.11), darn. But
> it works with 2.4.9. I'm sifting through the diffs (will try 2.4.10
> also) and just can't seem to figure out what changed in regard to this
> problem.
Look at the mmap call. In earlier versions I (wrongly) detected when you
tried to open() /dev/zero, and instead of opening it I set the file
descriptor to -1. Later, in the mmap, I used MAP_PRIVATE|MAP_ANON or
some other half-understood permutation of the possibilities. I got
that straightened out at some point and understood that /dev/zero
was OK without MAP_ANON. At any rate the mmap call is now
    self->cache = mmap(NULL+4096, jsize, PROT_WRITE|PROT_READ,
           MAP_PRIVATE|((journal<0)?MAP_ANON:MAP_FILE),
and the open is just straight:
    self->journal = open (journalname, O_RDWR|O_CREAT,S_IRUSR|S_IWUSR);
so unless I use a special switch, the mmap will always be backed by a
real, open file descriptor.
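
For illustration, here is a minimal sketch of that open-then-mmap
sequence as a standalone function. It is not the nbd-client code: the
helper name map_journal, the NULL address hint, the ftruncate() sizing
step and the choice of fd and offset arguments are all assumptions made
for the example, since the remaining mmap arguments are not shown above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_FILE
#define MAP_FILE 0   /* compatibility flag; it has no effect on Linux */
#endif

/* Hypothetical helper: map jsize bytes privately, backed by the file at
 * journalname when it can be opened, or by anonymous memory otherwise.
 * On return *fd_out holds the open descriptor, or -1 if there is none. */
static void *map_journal(const char *journalname, size_t jsize, int *fd_out)
{
    assert(fd_out != NULL && jsize > 0);

    int fd = -1;
    if (journalname != NULL)
        fd = open(journalname, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);

    /* Assumption: grow a too-short regular file with ftruncate() so that
     * every mapped page lies inside the file. The new tail is a hole,
     * so the file stays sparse. Devices such as /dev/zero are skipped. */
    struct stat st;
    if (fd >= 0 && fstat(fd, &st) == 0 && S_ISREG(st.st_mode)
        && st.st_size < (off_t)jsize && ftruncate(fd, (off_t)jsize) < 0) {
        close(fd);
        *fd_out = -1;
        return MAP_FAILED;
    }

    /* Same flag choice as in the call quoted above: anonymous only when
     * there is no descriptor, file-backed otherwise. */
    int flags = MAP_PRIVATE | ((fd < 0) ? MAP_ANON : MAP_FILE);
    void *cache = mmap(NULL, jsize, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (cache == MAP_FAILED && fd >= 0) {
        close(fd);
        fd = -1;
    }
    *fd_out = fd;
    return cache;
}
```

The caller compares the result against MAP_FAILED and inspects errno;
passing NULL as the name exercises the anonymous branch.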
> Obviously, the mmap is identical, done at the same time. I have
> absolutely no idea why I get ENOMEM.
Well, let's see. In 2.4.9:
    self->journal = open (journalname, O_RDWR,S_IRUSR|S_IWUSR);
    self->cache = mmap(NULL+4096, jsize, PROT_WRITE|PROT_READ,
           MAP_SHARED|((journal<=-1)?MAP_ANON:MAP_FILE),
so offhand, I would say the difference is MAP_SHARED and MAP_PRIVATE!
But ...
> I therefore wrote a small C program that called mmap with different
> sizes, and now, I very much question the fact that it works at all. I
> always get an ENOMEM when mmapping a size > mem + swap.
Ahhhhh. I see. Yes. The trick is that you are mapping /dev/zero, aren't
you? I get it. When you mmap /dev/zero as the file, the kernel probably
provides backing blocks for the area requested (though I think it
should only provide them on demand!). If you mmap an ordinary file,
however, the kernel only provides backing blocks for the _blocks in the
file_. The file is sparse in the case of an nbd journal, so not
much backing is required.
But I don't think the kernel should be backing /dev/zero before
anything is requested from it! And I'm fairly sure that I've been
testing with more than my ram+swap (128+256). I was testing
against the 6GB hard disk.
> Here is what the info page has to say :
>
> "Since mmapped pages can be stored back to their file when physical
> memory is low, it is possible to mmap files orders of magnitude larger
> than both the physical memory _and_ swap space. The only limit is
Yes, that's good. If there is address space available.
> address space. The theoretical limit is 4GB on a 32-bit machine -
> however, the actual limit will be smaller since some areas will be
> reserved for other purposes."
They are talking about address space.
I believe the man page confirms my mental image of what happens. I
believe you may want to map a file instead of /dev/zero, but I also
think that if /dev/zero behaves the way you suggest when it is mmapped,
it is aberrant. The kernel should back it on demand, not on opening.
> "Since private mappings effectively revert to ordinary memory when
> written to, you must have enough virtual memory for a copy of the entire
> mmapped region if you use this mode with PROT_WRITE."
There is also a possible difference of behaviour between MAP_SHARED
and MAP_PRIVATE. I thought that MAP_SHARED required more overhead.
MAP_SHARED is impossible with anonymous maps (i.e. MAP_ANON and
fd=-1), I think.
> So it seems that ENOMEM should always happen. Curiously, in your
> nbd-client, the mmap works (as a FILE, and with version 2.4.9) but
> doesn't work with my small C code or version 2.4.11. It always returns
> ENOMEM.
>
> These are my findings so far... what do you think?
Can you check out the variants between file/dev_zero/anon mapping
and SHARED/PRIVATE? Make sure the file is sparse.
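
One way to run that comparison is a small probe like the sketch below.
The helper name try_map and its interface are made up for the example;
it only attempts the mmap call and reports the errno, without touching
any pages, so it says nothing about what happens on first access.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: attempt one read/write mapping of `size` bytes
 * and return 0 on success, or the errno of the call that failed.
 * `path` names the file to map (a sparse file, or "/dev/zero"), or is
 * NULL for an anonymous mapping. `sharing` is MAP_SHARED or MAP_PRIVATE. */
static int try_map(const char *path, int sharing, size_t size)
{
    assert(sharing == MAP_SHARED || sharing == MAP_PRIVATE);

    int fd = -1;
    int flags = sharing;

    if (path == NULL) {
        flags |= MAP_ANON;          /* anonymous map: no descriptor */
    } else {
        fd = open(path, O_RDWR);
        if (fd < 0)
            return errno;           /* could not even open the file */
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);
    int err = (p == MAP_FAILED) ? errno : 0;   /* save before cleanup */

    if (p != MAP_FAILED)
        munmap(p, size);
    if (fd >= 0)
        close(fd);
    return err;
}
```

Calling it for each of the six combinations (sparse file, /dev/zero and
NULL, each with MAP_SHARED and MAP_PRIVATE) at sizes on either side of
ram+swap, and printing strerror() of the result, should show which
variants return ENOMEM on a given kernel.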
I think file + MAP_anything should always work. Make really sure the
file is sparse. When NBD opens the journal file it takes care to seek to
the end of it, read 1 char (trapping the segfault if one happens) and
write it back. It then mmaps the sparse file that results. You can make
a sparse file with
    dd if=/dev/zero of=/tmp/foo seek=100 count=1
or some such.
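
The same thing can be done from C, which is handy inside a test
program. This is only a sketch: the helper name make_sparse_file is
invented for the example, and it assumes dd's default 512-byte block
size and a freshly created output file.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper mirroring `dd seek=N count=1` on a fresh file:
 * skip seek_blocks 512-byte blocks without writing them, then write a
 * single zero-filled block. The skipped range becomes a hole.
 * Returns 0 on success and fills *st, or returns -1 on failure. */
static int make_sparse_file(const char *path, off_t seek_blocks,
                            struct stat *st)
{
    assert(path != NULL && st != NULL && seek_blocks >= 0);

    char block[512] = {0};
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;

    int ok = lseek(fd, seek_blocks * 512, SEEK_SET) != (off_t)-1
          && write(fd, block, sizeof block) == (ssize_t)sizeof block
          && fstat(fd, st) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```

With seek_blocks = 100 the file is 101 * 512 bytes long. On Linux,
comparing st.st_size against st.st_blocks * 512 shows how much of that
is actually allocated; on a filesystem that supports holes the second
number should be much smaller.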
Peter