Thursday, August 4. 2011
At LSF/MM a long time ago, there were a few people looking for an MM-oriented set of benchmarks. At the time I talked about some of the benchmarks I used and explained that I had some scripts lying around. I meant to release a collection quickly but until a few months ago, it was a very low priority. Recently I needed a testing framework so I finally put something together that would last longer than a single patch series. To rattle out problems related to it, I ran it against a number of mainline kernels and recorded the results, although I have not actually read through the results properly. Rather than repeating myself, the details are posted on the mailing list.
Tuesday, January 25. 2011
It took a while and the last 12 months have been extremely packed but now I get to update my mail signature.
| Mel Gorman |
Part-time PhD Student | Linux Technology Center |
University of Limerick | IBM Dublin Software Lab |
As of Wednesday 19th, I've finished in UL as I graduated with my PhD. I took the last week off to relax hence a total lack of responsiveness to mail or IRC for those that were trying to get in touch. I expect the next few weeks to be heavily disrupted as the next long-term plan is put together but I'll get back on track eventually.
Thursday, November 12. 2009
So, in recent kernels since 2.6.31-rc1, there is a seemingly benign problem whose apparent manifestation is page allocation failures of GFP_ATOMIC allocations. The system recovers but there are large stalls, even though on server systems everything goes faster overall. The problem is particularly pronounced when using certain wireless cards but manifests as harder-to-diagnose stalls on machines with low memory under stress. The development methodology means that kernels come out very quickly, even though right now I would really prefer if the world would slow down while my poor test machines try to catch up.
I think I have a solution to this but it takes several hours each time to figure out if forward progress has been made or not.
The lesson learnt here? Panic makes for poor decisions. I sent one patch that looked great at the time but found out in the last few hours that it really sucks. While figuring this out for sure, I have to sit watching a screen update painfully slowly. To help the waiting, I found some beer, it's the Irish thing to do. Wonder what the rest of ye do :/
Friday, July 10. 2009
I'm revisiting some old topics as part of research that requires me to extract information out of the kernel but whose instrumentation is not really worth merging to mainline. At one point, I extracted this information using a kernel module relaying additional information stored in struct page to userspace. It's not smart or clever but the options were limited at the time. This time around, I intend to give SystemTap a go as it should be able to do this type of job.
When I last used SystemTap, it was a total and utter pain to install, which was also one of the main criticisms levelled at it during one of the kernel summits. I had an installation script at the time to automate installation but it was a kludge of workarounds and patches. Since then, things have improved considerably: the reworked installation script for the current release is downright trivial and 180 lines shorter than the previous version. A 2 minute glance through the example scripts shows considerable improvement in terms of usability and readability as well.
I haven't figured out if systemtap is usable for my needs or not yet but things certainly appear to be going in the right direction on that front.
Tuesday, January 27. 2009
For much of the last 30 days or so[1], I've been writing a paper as part of college work. A large part of that was running tests 24 hours a day over the course of 20 days[2]. On the day they finished, I reran one of the tests (crypto from the SPECjvm2008 suite) with more debugging information as I wanted to confirm a suspicion, but the machine spontaneously rebooted. Not good, but the room was cold, nothing was on the serial console or in the logs, so I thought it might just have been a power flicker. It rebooted again an hour later so I popped the case to see whether something had become unseated or there was some other problem. There were a few hints as to what the problem might be;
Hint 1 ..... memory modules are usually in a nice neat line together, maybe it somehow fell out
wrong
Hint 2 ..... heat sinks should not melt off

Sometimes a cold room with case fans just isn't cold enough if you run that machine hard enough for long enough. Makes for bad surprises but a funny picture. Module is pretty hosed but easily replaced at least.
[1] Minus 7 days during which I was skiing for the first time in Austria. Skiing rules
[2] Based on mainline kernel 2.6.27 installed on Debian Lenny Beta from Dec 2008, using an AMD Phenom-based machine for x86-64 and a Terrasoft Powerstation for ppc64.
Monday, December 15. 2008
The userspace allocation API was only one of the big additions made in libhugetlbfs 2.1. A number of helper utilities were also added for configuring the system easily (hugeadm), launching applications to automatically use large pages (hugectl), setting defaults for applications (hugeedit) and displaying information about the page sizes supported by the system (pagesize). In combination, they should make using huge pages under Linux that bit easier, and here is a document that briefly describes how to install the utilities, configure the system and launch the sysbench benchmark using them.
Wednesday, December 3. 2008
On Linux, there are two basic ways of creating a region of memory backed by huge pages. The first is to use shmget() with the SHM_HUGETLB flag set, but this can leak memory if the application fails to clean up properly. The second is to mmap() a file created on a hugetlbfs mount, but this requires a lot of boring boilerplate code. Things are more painful than they should be and this led to whinging.
As the necessary code existed in libhugetlbfs to discover the mount and create a file, two APIs were created: get_huge_pages() for use when implementing custom allocators and get_hugepage_region() for use as a drop-in replacement for malloc() of large buffers. These are available and documented with manual pages in libhugetlbfs 2.1-pre5 with a final release expected in the near future.
I put together this document describing how to alter STREAM to use malloc with small pages, malloc with large pages and the two direct hugepage allocation APIs now supported in libhugetlbfs. It should make life easier for anyone writing a hugepage-aware application that cannot use the automatic support in libhugetlbfs for whatever reason.
Friday, September 26. 2008
Last week the Linux Plumbers Conference was held in Portland as a developers conference for those working close to the boundary between user and kernel space. Many have described it as one of the best conferences they have attended in a while and I have to agree. The talks were interesting and the people running them discussed their current activities rather than delivering a one-way monologue on their activities over the last year or so (for the most part anyway). For myself, I met a number of people to iron out issues that have been bugging me for a while and got a number of small hacking jobs prototyped that have been on my TODO list for too long. After working on large pages for some time, it was also a chance to get a quick tour of what is active in the lower levels of the Linux world at the moment.
One area of interest to me was the graphics layer, which has had a history of hilarity working outside of the kernel tree. This is out of necessity as the people willing to test an X driver are not necessarily the same people willing to test kernels. Hence, the out-of-tree driver is required to build against a number of different kernel versions and the resulting ifdef trickery would have a hard time living in mainline, not to mention keeping the API in sync. During the track, the possibility was raised that some of the interesting developments in the kernel, X and graphics drivers over the next year would require the user to update userspace before certain kernel features could be enabled. There was pressure to not require this update but the X guys seemed pretty insistent that a fully incremental effort would be a real pain and the end result ultimately worse.
In case a kernel feature requires an updated userspace in the future, I took a look at what was involved in building X these days. If nothing else, my laptop was using the vesa driver which broke when switching to a text console (fonts were the wrong size) and generally performed worse than what the hardware should have been capable of. The distribution drivers for the card were less than satisfactory for a number of reasons that I never got around to ironing out and I knew there was a lot of additional support for the ATI M56GL Mobility card in my machine added over the last year. Plenty of incentive to get this working.
I vaguely recall from years ago that building X from scratch was no fun whatsoever. Others must have had similar experiences as there is a general perception that X development is scary and building it from source is as much fun as a punch in the nose. X development may still be daunting, I haven't tried, but building from source is straight-forward today. Maybe I missed a pile of options and combinations, but getting the basics right involved an evening on the couch watching Flight of the Conchords on DVD and poking buttons periodically - hardly a taxing event.
The X server, the modules and supporting infrastructure consist of a large number of git repositories. If you were to download and build them by hand, you'd be there for a few hours and maybe that turns people away. There is a build script on the wiki, but it is a rushed hack by the looks of things and did not check that the build of a module actually succeeded, for example. I updated it to have some new smarts and it should be a fire-and-forget effort, although you may need to add your graphics driver to its list. I'll update the wiki when I get back to Ireland (on a plane at the moment) and have a chance to test it on my other machines to make sure it works in general, but here is the script for building X as it currently stands for those that are interested.
What did catch me was starting the new X properly. The site is very clear on starting X itself but I must have missed the instructions on how to give mesa the right paths. For the library paths, I added /opt/gfx-test/lib to /etc/ld.so.conf (it could also have been done in .bashrc with greater smarts but I was lazy). I then used a launcher script for X to load the kernel modules and
set LIBGL_DRIVERS_PATH to find the new DRI drivers. Without LIBGL_DRIVERS_PATH, the system DRI libraries get used resulting in some weirdness which I only spotted after setting some debug options. gdm was starting the old X server so I simply disabled it rather than fixing the init script. End result? One very satisfactory X desktop running very smoothly - nice one.
Monday, June 2. 2008
What started as an effort to automate sysbench testing with mysql and postgres became a bit more involved. The initial parts were relatively straight-forward and mainly around getting the build automation right. Boring, but time-consuming. Eventually it got there and I was able to show that anti-fragmentation did not hurt that workload, something I was reasonably sure about but wanted to double check.
Then it seemed like it should be straight-forward to configure the benchmark to use large pages but there were two upsets that made the job harder than it should have been. The first was that there was no automatic way of making shmget() use large pages. I worked around this with a basic LD_PRELOAD hack which at some point should be done properly and added to libhugetlbfs. The second was tuning the hugepage pool size so the application could run reliably without consuming too much memory.
Tuning the pool size was harder. Originally with hugetlbfs, huge pages were prefaulted at mmap() time. If mmap() returned successfully, all future references would succeed, end of story. However, prefaulting increases the cost of mmap() and can lead to poor NUMA placement. Support was added in 2.6.18 to reserve pages for MAP_SHARED mappings that would then be faulted in as normal. This fixed a performance problem but left MAP_PRIVATE in the lurch. As mmap() would not reserve pages, it just returned success. A fault later with an insufficient pool would result in a SIGKILL. Even if the application used mlock(), thus bringing back the NUMA placement problem, it may still not be safe because a COW fault may take place after fork() if the child is long-lived, resulting in another SIGKILL.
Benchmarking sysbench with libhugetlbfs used MAP_PRIVATE mappings, so getting the configuration wrong meant the benchmark would unceremoniously exit, and this was not even particularly consistent between machines. For the purposes of completing the benchmark, the pool was simply sized larger than I thought it needed to be, but for the long term I found it irritating. The results were more or less what I was expecting for a database load (about 5% improvement) but the details of how it was set up are for another time.
The first step to reducing the pain for someone using large pages was to make MAP_PRIVATE reliable without resorting to prefaulting. The obvious solution was to always reserve the pages, but that was still problematic on fork(). The reserve would need to double at that point, something that is potentially very expensive if the pool is being dynamically resized. This work would be wasted if the fork() was simply for an exec(), and using vfork() may not be suitable in all situations either. Failing fork() due to being unable to reserve hugepages would also be very unwelcome.
Hence, the solution that was put forward instead was to have reliable behaviour for the original mapper at the cost of the child. There are a few situations to deal with but the basic idea is that if an original mapper takes a COW fault that is going to fail due to a small pool and a child holding a reference to the page, the process will find the children and unmap the large page at the faulting address. The COW is then no longer necessary and the original mapper continues as normal. If the child later takes a fault in that area, it gets killed.
On the face of it, this appears to be bizarre behaviour, but the reality is that random killing of the original mapper is simply unacceptable. An application that is expecting to use MAP_PRIVATE hugetlbfs mappings and have a child get its own reliable copy is probably doing something very strange, and it's unlikely such applications exist given the history of hugetlbfs. If the pool is too small for a child to operate in this fashion, a fairly self-explanatory message appears in dmesg to catch the situation where such an application exists. Existing applications should already be able to cope gracefully with mmap() failing.
One objection that was raised is that applications that create a large mapping that is only to be used sparsely could suffer due to mmap() requiring the full reserve, even if the sysadmin knows that only a small fraction of the pages is needed. Such applications are unlikely to exist given the history of hugetlbfs but just in case, Andy Whitcroft developed support for MAP_NORESERVE with hugetlbfs that bypasses the reserve. mmap() will succeed regardless of pool size and if it is too small, the developer gets to keep both pieces :/
The patches are visible here and here for those who want to take a closer look. They have been merged to -mm for wider testing and should make working with hugetlbfs a more positive experience.
Sunday, January 13. 2008
Since PowerTOP was released, I noticed that the number of wakeups on my Thinkpad T60p was excessively high. Usually it was around the 350 mark, even without much running, leading to about 3.5 hours of battery life or about an hour less than Windows XP. Even after disabling non-essential services and hardware and following other suggestions dotted around sites like lesswatts.org and thinkwiki.org, power usage was roughly 23-26 watts (might be off, I didn't record data in detail). I decided to take a closer look. Basically, almost all processes using X were waking up constantly with strace patterns looking vaguely like;
poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI},
{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI},
{fd=14, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN},
{fd=11, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN}], 9, 499) = 0
gettimeofday({1200178170, 948253}, NULL) = 0
ioctl(3, FIONREAD, [0]) = 0
gettimeofday({1200178170, 948443}, NULL) = 0
poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI},
{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI},
{fd=14, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN},
{fd=11, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN}], 9, 0) = 0
write(3, "\211\1\1\0", 4) = 4
read(3, "\1\0\340\0\0\0\0\0\1\0\0\0\0\0\0\0\4\0\0\0(\0\0\0\4\0\0"..., 32) = 32
write(3, "\211\7\1\0", 4) = 4
read(3, "\1\0\341\0\0\0\0\0\0\0\1\0\0\0\0\0\4\0\0\0(\0\0\0\4\0\0"..., 32) = 32
ioctl(3, FIONREAD, [0]) = 0
gettimeofday({1200178170, 949286}, NULL) = 0
Similar patterns were seen in the other processes. They were all waking up and examining the same file descriptor which turned out to be a socket /tmp/.X11-unix/X0 in the control of X. I did not look closely but made the assumption that processes were polling some sort of event queue. Knowing that a lot of fixes of this sort of nature have been worked on, I decided to try out Debian Testing. As I was already running it on my desktop, I was reasonably sure the upgrade would be smooth enough. Font settings got mucked up as well as locales but as dist-upgrades go, it was pretty smooth.
Power-wise, it made a big difference. Even with wireless running, wakeups went from around 350 and 23 watts to 150 and 19 watts. There are still processes waking up in similar style of loops but they are a lot less frequent (once every 1-4 seconds instead of many times per second). X was showing up high in the list with calls to do_setitimer() so I applied this patch to the xserver-xorg-core package and installing it made the problem go away. Annoyingly, i8042 was causing a large number of interrupts even when nothing was happening. Adding i8042.nomux i8042.reset to the kernel boot command-line removed most of these wakeups once the machine sits idle. Wakeups were down to 50 and 18.2 watts usage with the most common wakeup being the wireless. Turning off wireless brought it down to 36 wakeups, almost a tenth of what it was with Debian Etch, and battery life is comparable with Windows XP. There are still a few anomalies but clearly things are going in the right direction.
Thursday, November 15. 2007
Recently I got hold of a PlayStation 3 and pretty much instantly tried to install Linux on it. It is a bit time-consuming but a surprisingly straight-forward affair. Documents, articles and blogs already exist aplenty on how to install Linux on the PS3 so this blog is just to note what I found strange along the way.
Hooking up to a VGA Monitor
Considering the number of hits you find when googling for PS3 to VGA converter, it is surprising there is not an obviously named piece of kit out there already. If using the HDMI cable, it must be connected to an HDCP-compliant display, which rules out adapters of any sort if you do not own such a device. I don't, but I found a VGA TV Gamer Box called a TVBox 1440 on maplin.co.uk that did the job of hooking the PS3 up to a bog-standard monitor.
Installing Linux (Debian)
The most straight-forward guide I found to install was on IBM developerWorks. It is pretty dated but gets most of the basics. Early on, I installed Debian. At the time I tried, the Debian Live CD was not able to start X properly but the Debian Installer worked just fine. If going this route, be sure to avoid trying to setup a PReP partition. Not only do you not need it, but the installer makes a shambles of trying and gets seriously confused.
Getting ps3videomode
Many guides make reference to running this command to alter video (or getting it right in the first place in some cases). On install, I converted an RPM package from rpmfind.net although I've spotted since that I could have tried the Debian packages linked from here so look around. The actual git tree is git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-utils.git.
Other Post-Install Tasks
I found I had to add the ps3 sound module to modules.conf as it was not loaded automatically. Most stuff installed without headache although mileage varied considerably with movie players. xine had a strange echo effect but mplayer appeared to get it right. I did not track down why this is but I found it odd that I experienced a similar problem on my T60P ages back for a brief period of time.
Upgrading the Actual Kernel
The most straight-forward guide I found was . What it missed is that recent kernels require the device-tree compiler. This is not in the stable repository but it is available from testing so grab it from there. Initially, I tried installing a stock 2.6.23 but it could not even get past early boot and without a serial console, I had no idea where it was locking up. I suspect Geoff Levand, the maintainer of the PS3 kernel tree, has a developer version of the PS3 with a serial console or a machine simulator of some sort.
As described in the linked article and Geoff's , there is a git tree for patches against the mainline kernel at git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-linux.git . Using the distribution's config in /boot as a starting point, it was a simple affair to get something booting with kboot and it supported huge pages in the usual manner one would expect. The one gotcha was that the root device had been renamed to /dev/ps3da1 so a slightly different kboot entry is needed.
Wrapping up
All in all, getting the machine set up, CDs burned, the install done and the kernel upgraded took about 4 hours - much of it waiting for downloads. Between oprofile not working, no early printk support and the lack of a serial console, it is not the best box for kernel development if you are like me and bust up early-boot a lot. However, once beyond early-boot, it has been a decent box to try stuff on as long as the lack of a proper serial console is not a problem for you. I still have to try a few Cell-related things to see what sort of behaviour I get from them so that will either be hella-interesting or a pure waste of time. Worst comes to the worst, I'll open those two games!
Friday, October 26. 2007
Hugepages can potentially extract higher performance from the hardware by reducing TLB misses and CPU cache usage[1]. The benefits generally apply to applications that use large amounts of address space although there can be secondary benefits such as improved hardware pre-fetch.
For an application to exploit any of this though, hugepages must be available. This is not trivial as the vast majority of processors require that the memory be naturally aligned and contiguous, both physically and virtually. In the past, administrators were required to reserve the hugepages at boot-time as memory became too fragmented in a short period of time to allocate the pages normally. Sizing the pool presented a difficult problem for the administrator: too large and memory is wasted on long-lived systems; too small and applications will fail. This issue alone precludes hugepages from being used in more situations.
However, this restriction has been relaxed somewhat in kernel 2.6.23 due to memory partitions. Using partitioning, an administrator can grow and shrink the pool during the lifetime of the system instead of making the decisions at boot time. In this article, we will discuss how to make hugepages available with a greater degree of flexibility than was previously available.
Basic Hugepage Information
To discover if the kernel supports hugepages, read /proc/meminfo and look for the HugePages entries. For example, on my machine I see
mel@arnold:~$ grep Huge /proc/meminfo
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 4096 kB
This means the kernel supports hugepage usage but I have no hugepages in the pool and a hugepage is 4MiB in size. To add hugepages to the pool, a number is simply written to /proc/sys/vm/nr_hugepages like so;
arnold:/proc/sys/vm# echo 10 > nr_hugepages
arnold:/proc/sys/vm# cat nr_hugepages
10
arnold:/proc/sys/vm# grep Huge /proc/meminfo
HugePages_Total: 10
HugePages_Free: 10
HugePages_Rsvd: 0
Hugepagesize: 4096 kB
Discussing how to use hugepages in an application is beyond the scope of this article but information exists if you look around[1][2].
Where it Goes Wrong
In a standard system, once memory has been filled (updatedb, untarring lots of data, etc.), it becomes too fragmented and writing a value to nr_hugepages may not allocate all of the requested hugepages. The problem is that not all memory can be reclaimed by the kernel. Many pages allocated by the kernel cannot be easily reclaimed on demand and just one badly placed page can cause a hugepage allocation failure.
Solving the Problem with a Memory Partition
Since 2.6.23, memory can be split in two at boot-time creating a zone called ZONE_MOVABLE (see information on zones by reading /proc/zoneinfo). Only pages that can be reclaimed on demand by the kernel use this zone. Within this zone of memory, hugepage allocations will almost always (See Caveats later) succeed no matter how long the system is running.
So, let us say that an administrator knows that a number of jobs will be running on his system that want to use hugepages. The jobs use between 0 and 256 hugepages and he does not want to waste memory. Previously, the administrator would specify hugepages=256 on the command-line and waste the memory for jobs that are not using it.
Now instead, the administrator would specify movablecore=1024MB on the command line to setup a partition for movable pages 1GiB in size or 256 hugepages. Jobs that require hugepages can now request them from /proc/sys/vm/nr_hugepages and have a reasonable expectation of getting those pages.
The difference between the partitioning and configuring the pool at boot-time is that memory in the partition unused by hugepages can still be used for normal pages. This means that memory can be returned after the huge page process completes, and reallocated to small page processes; memory is never wasted.
Configuring the Memory Partition
The partition must be configured at boot-time using either kernelcore= or movablecore= as documented in Documentation/kernel-parameters.txt. movablecore specifies how much memory should be used for ZONE_MOVABLE. An alternative way of looking at movablecore is
Max Hugepages that can be allocated at any time = movablecore / hugepagesize
If on the other hand the administrator knows how much memory the rest of the kernel needs and wants as much memory as possible to be available for a varying number of hugepages, kernelcore= can be used instead. In this case, the size of ZONE_MOVABLE is whatever memory is left over or alternatively
Max Hugepages that can be allocated at any time = (TotalMem - kernelcore) / hugepagesize
Growing The Pool
Once the partition is setup, the hugepage pool can be easily grown. Depending on system activity the first attempt may not succeed so try a few times or use a script like this not-very-tested and very rough piece of work;
#!/bin/bash
# Attempt to grow the pool to the requested size
# This benchmark checks how many hugepages can be allocated in the hugepage
# pool
#
# Copyright Mel Gorman (c) 2007
# Licensed under GPL V2. See http://www.gnu.org/licenses/gpl-2.0.txt for details
SLEEP_INTERVAL=5
FAIL_AFTER_NO_CHANGE_ATTEMPTS=20
NUM_REQUIRED=0
usage() {
    echo "get_hugepages: Get the requested number of hugepages"
    echo
    echo "  -s Time to sleep between attempts to grow pool"
    echo "  -f Give up after failing this number of times"
    echo "  -n Number of hugepages that should be in the pool"
    echo
    exit $1
}

# Arg processing
while [ "$1" != "" ]; do
    case "$1" in
        -s) SLEEP_INTERVAL=$2; shift 2;;
        -f) FAIL_AFTER_NO_CHANGE_ATTEMPTS=$2; shift 2;;
        -n) NUM_REQUIRED=$2; shift 2;;
        *) usage 1;;
    esac
done

# Check proc entry exists
if [ ! -e /proc/sys/vm/nr_hugepages ]; then
    echo Attempting load of hugetlbfs module
    modprobe hugetlbfs
    if [ ! -e /proc/sys/vm/nr_hugepages ]; then
        echo ERROR: /proc/sys/vm/nr_hugepages does not exist
        exit 1
    fi
fi

# Check a number was requested
if [ "$NUM_REQUIRED" = "" ] || [ "$NUM_REQUIRED" -le 0 ]; then
    echo ERROR: You must specify a number of hugepages to alloc
    usage 2
fi

# Record existing hugepage count
STARTING_COUNT=`cat /proc/sys/vm/nr_hugepages`
echo Starting page count: $STARTING_COUNT

# Ensure we have permission to write
echo $STARTING_COUNT 2> /dev/null > /proc/sys/vm/nr_hugepages || {
    echo ERROR: Do not have permission to adjust nr_hugepages count
    exit 3
}

# Start attempt to grow pool
CURRENT_COUNT=$STARTING_COUNT
LAST_COUNT=$STARTING_COUNT
NOCHANGE_COUNT=0
ATTEMPT=0
while [ $NOCHANGE_COUNT -ne $FAIL_AFTER_NO_CHANGE_ATTEMPTS ] && [ $CURRENT_COUNT -ne $NUM_REQUIRED ]; do
    ATTEMPT=$((ATTEMPT+1))
    echo $NUM_REQUIRED > /proc/sys/vm/nr_hugepages
    CURRENT_COUNT=`cat /proc/sys/vm/nr_hugepages`

    if [ $CURRENT_COUNT -eq $LAST_COUNT ]; then
        NOCHANGE_COUNT=$(($NOCHANGE_COUNT+1))
    elif [ $CURRENT_COUNT -ne $NUM_REQUIRED ]; then
        NOCHANGE_COUNT=0
        echo Attempt $ATTEMPT: progress made with $(($CURRENT_COUNT-$LAST_COUNT)) pages
        LAST_COUNT=$CURRENT_COUNT
    fi

    # Give reclaim a chance to free contiguous memory before retrying
    if [ $CURRENT_COUNT -ne $NUM_REQUIRED ]; then
        sleep $SLEEP_INTERVAL
    fi
done

echo Final page count: $CURRENT_COUNT

# Exit with 0 if the requested number of pages was allocated
if [ $CURRENT_COUNT -eq $NUM_REQUIRED ]; then
    exit 0
else
    exit 4
fi
Caveats
Allocations almost always succeed. The one case where they do not is when memory is mlock()ed. Technically these pages could be moved and patches exist to do just that but there has not been demand for the feature to date.
Future
In 2.6.24, grouping pages by mobility may mean that the partition does not even have to be setup for the kernel to be able to grow/shrink the pool to a large extent. However, if hugepage availability must be guaranteed at all times, then the partition should be setup.
Summary
There you have it. In the past, hugepages had to be allocated at boot-time. Now, you can set up a partition at boot-time instead and allocate hugepages to the pool when they are needed, returning them to general use when not required, allowing the memory to be used as normal.
Acknowledgements
Thanks to Nishanth Aravamudan and Andy Whitcroft for reviewing drafts of this.
References
[1] Leverage transparent huge pages on Linux on POWER
http://www.ibm.com/developerworks/systems/library/es-lop-leveragepages/
[2] Kernel source: Documentation/vm/hugetlbpage.txt
Thursday, October 25. 2007
Just to clear up what a huge page is!
Any architecture supporting virtual memory is required to map virtual addresses to physical addresses through an address translation mechanism. Recent translations are stored in a cache called a Translation Lookaside Buffer (TLB). TLB Coverage is defined as memory addressable through this cache without having to access the master tables in main memory. When the master table is used to resolve a translation, a TLB Miss is incurred. This can have as significant an impact on Clock cycles Per Instruction (CPI) as CPU cache misses[1]. To compound the problem, the percentage of memory covered by the TLB has decreased from about 10% of physical memory in early machines to approximately 0.01% today. As a means of alleviating this, modern processors support multiple page sizes, usually up to several megabytes, but gigabyte pages are also possible. The downside is that processors commonly require that physical memory for a page entry be contiguous.
So, that is what a huge page is. Linux supports huge pages but the support is a bit primitive in spots and bolted on in others. Work is ongoing to make it better, although the public plans on what is happening are a bit spotty at best[2].
[1] Book: Computer Architecture a Quantitative Approach
[2] Some pretty minimal information at http://linux-mm.org/HugePages