FreeBSD overcommit enable/disable patch

The patch for FreeBSD adds a switch that disables allocation of anonymous memory that cannot be backed by swap. With the switch turned on, the total amount of anonymous memory in the system cannot exceed the swap size. In addition, the amount of memory allocated under each real uid is tracked and can be limited (by RLIMIT_SWAP).

The patch does this by accounting for mapped and brk-ed memory, /dev/zero mappings, sysv shm allocated from swap (when kern.ipc.shm_use_phys = 0), swap-based md disks and objects created by shm_open().

How to use

The current amount of accounted swap space is exported as the sysctl vm.swap_reserved (the count is in bytes). The sysctl vm.overcommit controls the swap allocation policy: setting bit 0 to 1 denies an allocation request if, after the request, the total reserved swap space would exceed the size of the configured swap.

Bit 2 allows non-wired physical memory to be counted as swap. This is similar to how swap reservation works on Solaris. Additionally, free_reserved pages (exported as vm.stats.vm.v_free_target) are never allowed to be allocated from userspace, to help avoid deadlocks.

Setting bit 1 enables enforcement of the per-user RLIMIT_SWAP limits. These limits may be set in login.conf through the swapuse capability. /bin/sh, /bin/csh and /usr/bin/limits are all patched to know about RLIMIT_SWAP.
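
The interaction of the three vm.overcommit bits can be modelled in a few lines of userland C. This is only a sketch of the policy as described above, not the code from the patch: the macro names, both structs and the function swap_reserve_model() are hypothetical, and the phys_nonwired field is assumed to already exclude the free_reserved pages. Overflow of the 64-bit counters is ignored for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the vm.overcommit bits described in the text. */
#define OVERCOMMIT_ENFORCE_TOTAL  (1u << 0) /* bit 0: cap the total reservation */
#define OVERCOMMIT_ENFORCE_RLIMIT (1u << 1) /* bit 1: enforce per-user RLIMIT_SWAP */
#define OVERCOMMIT_COUNT_PHYSMEM  (1u << 2) /* bit 2: count non-wired RAM as swap */

/* System-wide counters, all in bytes (model only). */
struct swap_state {
    uint64_t swap_total;    /* size of the configured swap */
    uint64_t swap_reserved; /* what vm.swap_reserved would report */
    uint64_t phys_nonwired; /* non-wired physical memory usable as "swap" */
};

/* Per-real-uid counters, in bytes (model only). */
struct user_state {
    uint64_t reserved;    /* swap currently charged to this uid */
    uint64_t rlimit_swap; /* RLIMIT_SWAP for this uid */
};

/*
 * Try to reserve incr bytes. Returns true and updates both counters on
 * success; returns false and leaves them untouched when a limit enabled
 * by the overcommit bits would be exceeded.
 */
bool
swap_reserve_model(unsigned overcommit, struct swap_state *sys,
    struct user_state *usr, uint64_t incr)
{
    uint64_t limit;

    assert(sys != NULL && usr != NULL);

    limit = sys->swap_total;
    if (overcommit & OVERCOMMIT_COUNT_PHYSMEM)
        limit += sys->phys_nonwired;

    if ((overcommit & OVERCOMMIT_ENFORCE_TOTAL) &&
        sys->swap_reserved + incr > limit)
        return (false);
    if ((overcommit & OVERCOMMIT_ENFORCE_RLIMIT) &&
        usr->reserved + incr > usr->rlimit_swap)
        return (false);

    sys->swap_reserved += incr;
    usr->reserved += incr;
    return (true);
}
```

With vm.overcommit set to 0 the model never refuses a request, which corresponds to the default overcommit behaviour; the reservation is still counted, so vm.swap_reserved stays meaningful.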

See also tuning(7) and getrlimit(2) in the patched sources.

Implementation overview

The goal of the patch is to charge for OBJT_SWAP and OBJT_DEFAULT objects. Usually, the objects backing anonymous memory are created at fault time (or when a vm_map_entry is clipped, etc.), not when the entry is created. So both vm_map_entry and vm_object got a uip field that points to a struct uidinfo. A non-null value in this field means that the entry or object is charged, and it points to the uidinfo of the ruid that allocated the memory. When the vm_object for a vm_map_entry is created, the uip is migrated from the entry to the object.

The vm_map_insert function decides whether the mapped entry is to be charged. The decision may be influenced by the MAP_ACC_CHARGED and MAP_ACC_NO_CHARGE flags. MAP_ACC_CHARGED means that the memory is already charged by some other means; MAP_ACC_NO_CHARGE forbids charging even if the entry looks like it should be charged. For example, buffers are inserted into the kernel map like anonymous memory.
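
The effect of the two flags on the charging decision might be pictured as follows. This is a simplified userland model, not the logic of vm_map_insert itself: the MODEL_ flag values, the enum and the function name are made up for illustration, and letting MAP_ACC_NO_CHARGE win when both flags are passed is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real flag values live in the patch. */
#define MODEL_MAP_ACC_CHARGED   0x1u /* memory was already charged elsewhere */
#define MODEL_MAP_ACC_NO_CHARGE 0x2u /* never charge this entry */

enum charge_action {
    CHARGE_NONE,    /* entry stays uncharged */
    CHARGE_RESERVE, /* swap must be reserved now for this entry */
    CHARGE_ALREADY  /* entry is marked charged, nothing more is reserved */
};

/*
 * Decide what to do for a new map entry. looks_anonymous stands for
 * "the entry looks like memory that may end up in swap".
 */
enum charge_action
map_insert_charge_model(unsigned flags, bool looks_anonymous)
{
    /* Only the two modelled flags are accepted. */
    assert((flags & ~(MODEL_MAP_ACC_CHARGED | MODEL_MAP_ACC_NO_CHARGE)) == 0);

    if (flags & MODEL_MAP_ACC_NO_CHARGE)
        return (CHARGE_NONE);    /* e.g. buffers in the kernel map */
    if (flags & MODEL_MAP_ACC_CHARGED)
        return (CHARGE_ALREADY); /* the caller accounted for it */
    return (looks_anonymous ? CHARGE_RESERVE : CHARGE_NONE);
}
```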

Objects sometimes have dead pieces that will never reference pages and will not be accessed by any map entry. This can happen, e.g., after vm_object_split. So a charge field was added to vm_object that shows how much swap is really reserved for the object.
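
The hand-off of uip from a map entry to its backing object, together with the charge field, can be sketched with toy structures. The struct and function names below are hypothetical stand-ins for vm_map_entry, vm_object and struct uidinfo, and setting the initial charge to the size of the entry is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct uidinfo. */
struct uidinfo_model {
    uint64_t swap_reserved; /* bytes charged to this real uid */
};

/* Toy stand-in for vm_object: owner of the charge plus its size. */
struct object_model {
    struct uidinfo_model *uip; /* non-NULL means the object is charged */
    uint64_t charge;           /* bytes of swap reserved for the object */
};

/* Toy stand-in for vm_map_entry. */
struct entry_model {
    uint64_t start, end;       /* address range covered by the entry */
    struct uidinfo_model *uip; /* non-NULL means the entry is charged */
    struct object_model *object;
};

/*
 * Attach a freshly created backing object to an entry, e.g. at fault
 * time. A charge held by the entry moves to the object, so that exactly
 * one of the two carries it afterwards.
 */
void
entry_attach_object(struct entry_model *e, struct object_model *obj)
{
    assert(e != NULL && obj != NULL && e->object == NULL);

    obj->uip = e->uip;
    obj->charge = (e->uip != NULL) ? e->end - e->start : 0;
    e->uip = NULL;
    e->object = obj;
}
```

Keeping the byte count on the object rather than deriving it from the object size is what lets the reservation stay correct when parts of the object become dead.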

The patch accounts for any mapping that could lead to the use of swap space. For example, shared anonymous memory and private mappings of files are charged. However, executables and shared libraries have their text segments mapped private and read-only (which can be overridden by the VM_PROT_OVERRIDE_WRITE fault flag). Accordingly, a kludge was added to not charge for private read-only mappings of files. But if the area is later made writable with mprotect(2), the object is charged. As a consequence, mprotect(2) and ptrace(2) may return ENOMEM.
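
Why mprotect(2) can fail with ENOMEM is easiest to see in a small model of the deferred charge. Everything here is hypothetical (the struct, the function and the single swap_avail counter standing in for the whole reservation policy); it only illustrates that the charge skipped at mapping time has to be taken when the mapping becomes writable.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy description of one mapping. */
struct mapping_model {
    uint64_t size;        /* bytes covered by the mapping */
    bool is_private_file; /* private (copy-on-write) mapping of a file */
    bool writable;        /* current protection allows writes */
    bool charged;         /* swap has been reserved for it */
};

/*
 * Make the mapping writable. A private file mapping that was left
 * uncharged while read-only must be charged now, since written pages
 * become anonymous memory. Returns 0 on success or ENOMEM when the
 * reservation does not fit into *swap_avail.
 */
int
mprotect_write_model(struct mapping_model *m, uint64_t *swap_avail)
{
    assert(m != NULL && swap_avail != NULL);

    if (m->is_private_file && !m->charged) {
        if (m->size > *swap_avail)
            return (ENOMEM); /* protection is left unchanged */
        *swap_avail -= m->size;
        m->charged = true;
    }
    m->writable = true;
    return (0);
}
```

A second call on an already charged mapping reserves nothing more, so changing the protection back and forth does not leak reservation in this model.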

Implementation quirks

Status

The patch was intensively tested by Peter Holm.

Your feedback is welcome. My mail is kostikbel gmail com

The patch