What started as an effort to automate sysbench testing with MySQL and Postgres became a bit more involved. The initial parts were relatively straightforward and mainly about getting the build automation right. Boring, but time-consuming. Eventually it got there and I was able to show that anti-fragmentation did not hurt that workload, something I was reasonably sure about but wanted to double-check.
Then it seemed like it should be straightforward to configure the benchmark to use large pages, but there were two upsets that made the job harder than it should have been. The first was that there was no automatic way of making shmget() use large pages. I worked around this with a basic LD_PRELOAD hack which at some point should be done properly and added to libhugetlbfs. The second was tuning the hugepage pool size so the application could run reliably without consuming too much memory.
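For the curious, the hack amounts to intercepting shmget() and forcing the SHM_HUGETLB flag. The following is only a minimal sketch of that idea rather than the actual shim; the 2MB huge page size and the fall-back to small pages on failure are my assumptions. Build it with `gcc -shared -fPIC -o shmget_huge.so shmget_huge.c -ldl` and run the target with `LD_PRELOAD=./shmget_huge.so`:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000                       /* value from the kernel headers */
#endif

#define HPAGE_SIZE (2UL * 1024 * 1024)          /* assume 2MB huge pages */

int shmget(key_t key, size_t size, int shmflg)
{
        static int (*real_shmget)(key_t, size_t, int);

        if (!real_shmget)
                real_shmget = (int (*)(key_t, size_t, int))
                                dlsym(RTLD_NEXT, "shmget");

        /* Round the size up to a huge page boundary and ask for huge
         * pages; fall back to the original small-page request if the
         * pool cannot satisfy it */
        size_t rounded = (size + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
        int shmid = real_shmget(key, rounded, shmflg | SHM_HUGETLB);

        if (shmid == -1)
                shmid = real_shmget(key, size, shmflg);
        return shmid;
}
```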
Tuning the pool size was harder. Originally with hugetlbfs, huge pages were prefaulted at mmap() time. If mmap() returned successfully, all future references would succeed, end of story. However, prefaulting increases the cost of mmap() and can lead to poor NUMA placement. Support was added in 2.6.18 to reserve pages for MAP_SHARED mappings that would then be faulted in as normal. This fixed a performance problem but left MAP_PRIVATE in the lurch. As mmap() would not reserve pages for private mappings, it just returned success; a fault later with an insufficient pool would result in a SIGKILL. Even if the application prefaulted with mlock(), bringing back the NUMA placement problem, it might still not be safe: on fork(), a COW fault could take place if the child was long-lived, resulting in another SIGKILL.
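To make that failure mode concrete, here is a hypothetical demonstration. It assumes hugetlbfs is mounted at /mnt/huge and that the pool (nr_hugepages) holds fewer than the 64 huge pages the mapping needs:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)          /* assume 2MB huge pages */
#define LEN (64 * HPAGE_SIZE)                   /* 64 huge pages */

int main(void)
{
        int fd = open("/mnt/huge/demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                /* with reservations, a too-small pool fails here ... */
                perror("mmap");
                return 1;
        }

        /* ... but without them, mmap() succeeded no matter what and one
         * of these writes gets the process killed instead */
        for (size_t off = 0; off < LEN; off += HPAGE_SIZE)
                p[off] = 1;

        printf("all %lu huge pages faulted successfully\n", LEN / HPAGE_SIZE);
        return 0;
}
```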
The sysbench runs with libhugetlbfs used MAP_PRIVATE mappings, so when the configuration was wrong, the benchmark would unceremoniously exit, and not even particularly consistently between machines. For the purposes of completing the benchmark, the pool was simply sized larger than I thought it needed to be, but for the long term I found it irritating. The results were more or less what I was expecting for a database load (about a 5% improvement) but the details of how it was set up are for another time.
The first step to reducing the pain for someone using large pages was to make MAP_PRIVATE reliable without resorting to prefaulting. The obvious solution was to always reserve the pages, but that was still problematic on fork(). The reserve would need to double at that point, something that is potentially very expensive if the pool is being dynamically resized. The work would be wasted if the fork() was simply followed by an exec(), and using vfork() may not be suitable in all situations either. Failing fork() due to being unable to reserve hugepages would also be very unwelcome.
Hence, the solution that was put forward instead was to have reliable behaviour for the original mapper at the cost of the child. There are a few situations to deal with, but the basic idea is that if an original mapper takes a COW fault that is going to fail due to a small pool and a child holding a reference to the page, the process finds the children and unmaps the large page at the faulting address. The COW is then no longer necessary and the original mapper continues as normal. If the child later takes a fault in that area, it gets killed.
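From userspace, the semantics look something like the sketch below. This is purely illustrative, again assuming a /mnt/huge mount and a pool with no free page left to satisfy the COW; the point is that the parent's write always succeeds while the child pays the price:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)          /* assume 2MB huge pages */

int main(void)
{
        int fd = open("/mnt/huge/cow-demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        p[0] = 1;               /* original mapper instantiates the page */

        pid_t pid = fork();
        if (pid == 0) {
                sleep(2);       /* give the parent time to COW first */
                p[0] = 2;       /* if the parent's COW could not get a page,
                                 * this address was unmapped and the access
                                 * kills the child */
                _exit(0);
        }

        p[0] = 3;               /* the parent's COW: guaranteed to succeed,
                                 * stealing the page from the child if the
                                 * pool is exhausted */
        int status;
        waitpid(pid, &status, 0);
        printf("child %s\n", WIFSIGNALED(status) ?
               "was killed, pool too small for two copies" :
               "completed normally");
        return 0;
}
```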
On the face of it, this appears to be bizarre behaviour, but the reality is that random killing of the original mapper is simply unacceptable. An application that expects to use MAP_PRIVATE hugetlbfs mappings and have a child get its own reliable copy is probably doing something very strange, and it's unlikely such applications exist given the history of hugetlbfs. In case one does, a fairly self-explanatory message appears in dmesg when the pool is too small for a child to operate in this fashion. Existing applications should already be able to cope gracefully with mmap() failing.
One objection that was raised is that applications that create a large mapping that is only to be used sparsely could suffer due to mmap() requiring the full reserve, even if the sysadmin knows that only a small fraction of the pages is needed. Such applications are unlikely to exist given the history of hugetlbfs, but just in case, Andy Whitcroft developed MAP_NORESERVE support for hugetlbfs that bypasses the reserve. mmap() will succeed regardless of pool size and if it is too small, the developer gets to keep both pieces :/
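Usage is about what you would expect; a sketch, with the /mnt/huge mount and 2MB page size again being my assumptions:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)          /* assume 2MB huge pages */
#define LEN (1024 * HPAGE_SIZE)                 /* 1024 pages, used sparsely */

int main(void)
{
        int fd = open("/mnt/huge/sparse-demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* MAP_NORESERVE bypasses the reserve, so this succeeds even if the
         * pool holds far fewer than 1024 huge pages */
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_NORESERVE, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Only the pages actually touched consume the pool; touch more
         * than the pool can supply and the process gets killed, old-style */
        p[0] = 1;
        munmap(p, LEN);
        return 0;
}
```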
The patches are visible here and here for those that want to take a closer look. They have been merged to -mm for wider testing and should make working with hugetlbfs a more positive experience.