[ENBD] Re: new feature in ENBD.

Peter T. Breuer ptb@it.uc3m.es
Fri, 8 Sep 2000 21:01:51 +0200 (MET DST)


"A month of sundays ago Daniel Shane wrote:"
> By the way, I found out that /dev/zero wasn't there in my ramdisk so it
> was mmapping in a file all the time. Nevertheless, it works. If I
> create /dev/zero then I get an "invalid argument", as you said earlier.

I fixed that after I found out what you were doing. I have been running
with /dev/zero as the journal for a while, with no problem. I was confused
about the proper mmap treatment for a while, but I think it was
straightened out correctly by 2.4.11. The difference is the
parameters to the mmap ...

> Ok, I just tested 2.4.11 and found out that something must have changed
> since 2.4.9 because I get an ENOMEM with the mmap (on 2.4.11), darn. But
> it works with 2.4.9. I'm sifting through the diffs (will try 2.4.10
> also) and just can't seem to figure out what changed in regard to this
> problem.

Look at the mmap call. In earlier versions I (wrongly) special-cased
/dev/zero: when you asked to open() it, I didn't open it but set the
file descriptor to -1 instead. Later, in the mmap, I used
MAP_PRIVATE|MAP_ANON or some other half-understood permutation of the
possibilities. I got that straightened out at some point and understood
that /dev/zero was OK without MAP_ANON. At any rate the mmap call now is

  self->cache = mmap(NULL+4096, jsize, PROT_WRITE|PROT_READ,
             MAP_PRIVATE|((journal<0)?MAP_ANON:MAP_FILE),

and the open is just straight:

   self->journal = open (journalname, O_RDWR|O_CREAT,S_IRUSR|S_IWUSR);

so unless I use a special switch, the mmap will always be backed by a
real, open file descriptor.

> Obviously, the mmap is identical, done at the same time. I have
> absolutely no idea why I get ENOMEM.

Well, let's see. In 2.4.9:

   self->journal = open (journalname, O_RDWR,S_IRUSR|S_IWUSR);

   self->cache = mmap(NULL+4096, jsize, PROT_WRITE|PROT_READ,
              MAP_SHARED|((journal<=-1)?MAP_ANON:MAP_FILE),

so offhand, I would say the difference is MAP_SHARED and MAP_PRIVATE!
But ...

> I therefore wrote a small C program that called mmap with different
> sizes, and now, I very much question the fact that it works at all. I
> always get an ENOMEM when mmapping a size > mem + swap.

Ahhhhh. I see. Yes. The trick is that you are mapping /dev/zero, aren't
you?  I get it. When you mmap /dev/zero as the file, the kernel probably
provides backing blocks for the area requested (though I think it
should only provide them on demand!). If you mmap an ordinary file,
however, the kernel only provides backing blocks for the _blocks in the
file_. The file is sparse in the case of an nbd journal, so not
much backing is required.

But I don't think the kernel should be backing /dev/zero before
anything is requested from it! And I'm fairly sure that I've been
testing with more than my ram+swap (128+256). I was testing
against the 6GB hard disk.

> Here is what the info page has to say :
> 
> "Since mmapped pages can be stored back to their file when physical
> memory is low, it is possible to mmap files orders of magnitude larger
> than both the physical memory _and_ swap space.  The only limit is

Yes, that's good. If there is address space available.

> address space.  The theoretical limit is 4GB on a 32-bit machine -
> however, the actual limit will be smaller since some areas will be
> reserved for other purposes."

They are talking about address space.

I believe the info page confirms my mental image of what happens. I
believe you may want to map a file instead of /dev/zero, but I also
think that if /dev/zero behaves the way you suggest when it is mmapped,
it is aberrant. The kernel should back it on demand, not on opening.

> "Since private mappings effectively revert to ordinary memory when
> written to, you must have enough virtual memory for a copy of the entire
> mmapped region if you use this mode with PROT_WRITE."

There is also a possible difference of behaviour between MAP_SHARED
and MAP_PRIVATE. I thought that MAP_SHARED required more overhead.
MAP_SHARED is impossible with anonymous maps (i.e. MAP_ANON and
fd=-1), I think.
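
To make the MAP_SHARED / MAP_PRIVATE distinction in the quoted info
page text concrete, here is a minimal sketch. It is not taken from the
nbd-client source; the function name store_via_mapping and its
parameters are made up for illustration. It stores one byte through a
mapping and then reads the file back with pread(): with MAP_SHARED the
store should reach the file, with MAP_PRIVATE it should stay in a
private copy in memory.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from nbd-client).
 * Map the first `len` bytes of `path` with `flags` (MAP_SHARED or
 * MAP_PRIVATE), store `value` in byte 0 through the mapping, unmap,
 * and return what byte 0 of the file holds afterwards.
 * The file is created if missing and sized to `len` bytes.
 * Returns -1 on any error. */
int store_via_mapping(const char *path, int flags, size_t len,
                      unsigned char value)
{
    assert(len > 0);

    int fd = open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    p[0] = value;              /* the store under test */
    msync(p, len, MS_SYNC);    /* flush; only matters for MAP_SHARED */
    munmap(p, len);

    unsigned char in_file;
    ssize_t n = pread(fd, &in_file, 1, 0);
    close(fd);
    return n == 1 ? in_file : -1;
}
```

On a freshly created file, a MAP_PRIVATE call should return 0 (the file
still holds its initial zero byte) and a MAP_SHARED call should return
the stored value.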

> So it seems that ENOMEM should always happen. Curiously, in your
> nbd-client, the mmap works (as a FILE, and with version 2.4.9) but
> doesn't work with my small C code or version 2.4.11. It always returns
> ENOMEM. 
> 
> These are my findings so far... what do you think?

Can you check out the variants between file/dev_zero/anon mapping
and SHARED/PRIVATE? Make sure the file is sparse.

I think file + MAP_anything should always work. Make really sure the file
is sparse. NBD takes care when opening the journal file: it seeks to the
end of it, reads 1 char (trapping the segfault if one happens) and writes
it back. It then mmaps the sparse file that results.  You can make a
sparse file with

dd if=/dev/zero of=/tmp/foo seek=100 count=1

or some such.
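
The same thing can be done from C. The following is a minimal sketch,
not the code NBD itself uses: make_sparse_file and allocated_bytes are
hypothetical names, and unlike the dd line (which writes one 512-byte
block after a 100-block hole) it writes a single zero byte at the last
offset of the requested size.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) `path` as a sparse file of
 * exactly `size` bytes by seeking to the last byte and writing one zero.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int make_sparse_file(const char *path, off_t size)
{
    assert(size > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;

    char zero = 0;
    if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1 ||
        write(fd, &zero, 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: bytes actually allocated to `path` on disk
 * (st_blocks counts 512-byte units), or -1 on error. */
long long allocated_bytes(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_blocks * 512;
}
```

On a filesystem that supports holes, allocated_bytes() should come out
far below the st_size of the file, which is the quick way to check that
the file really is sparse before mmapping it.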

Peter