First posted as a blog entry here. Last Updated: 26 October, 2007

Hugepages can potentially attain higher performance from the hardware by reducing TLB misses and CPU cache usage[1]. The benefits generally apply to applications that use large amounts of address space, although there can be secondary benefits such as improved hardware pre-fetch. For an application to exploit any of this, though, hugepages must be available. This is not trivial, as the vast majority of processors require that the memory be naturally aligned and contiguous, both physically and virtually.

In the past, administrators were required to reserve the hugepages at boot-time because memory became too fragmented in a short period of time to allocate the pages normally. Sizing the pool presented a difficult problem for the administrator: too large and memory is wasted on long-lived systems; too small and applications will fail. This issue alone precludes hugepages from being used in more situations.

However, this restriction has been relaxed somewhat in kernel 2.6.23 due to memory partitions. Using partitioning, an administrator can grow and shrink the pool during the lifetime of the system instead of making the decisions at boot time. In this article, we will discuss how to make hugepages available with a greater degree of flexibility than was previously available.

Basic Hugepage Information

To discover if the kernel supports hugepages, read /proc/meminfo and look for the HugePages entries. For example, on my machine I see
mel@arnold:~$ grep Huge /proc/meminfo
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 4096 kB
This means the kernel supports hugepage usage, but I have no hugepages in the pool; a hugepage is 4MiB in size. To add hugepages to the pool, simply write a number to /proc/sys/vm/nr_hugepages like so:
arnold:/proc/sys/vm# echo 10 > nr_hugepages
arnold:/proc/sys/vm# cat nr_hugepages
10
arnold:/proc/sys/vm# grep Huge /proc/meminfo
HugePages_Total: 10
HugePages_Free: 10
HugePages_Rsvd: 0
Hugepagesize: 4096 kB
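The same fields can be read from a script. The following is a minimal sketch only: the function names hugepage_field and hugepage_pool_kb are made up for this illustration, and the optional file argument is an assumption added so the helpers can be tried against a saved copy of /proc/meminfo.

```shell
# Illustrative helpers (names are hypothetical, not part of any tool).

# hugepage_field NAME [FILE]
# Print the numeric value of one field, e.g. HugePages_Total.
# FILE defaults to /proc/meminfo.
hugepage_field() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# hugepage_pool_kb [FILE]
# Print the memory currently held by the pool in kB
# (HugePages_Total multiplied by Hugepagesize).
hugepage_pool_kb() {
    file="${1:-/proc/meminfo}"
    total=$(hugepage_field HugePages_Total "$file")
    size_kb=$(hugepage_field Hugepagesize "$file")
    # Either field is missing when the kernel lacks hugepage support
    if [ -z "$total" ] || [ -z "$size_kb" ]; then
        return 1
    fi
    echo $(( total * size_kb ))
}
```

With the pool of 10 pages of 4096 kB shown above, hugepage_pool_kb would print 40960, which is 40MiB held in the pool.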
Discussing how to use hugepages in an application is beyond the scope of this article, but information exists if you look around[1][2].

Where it Goes Wrong

In a standard system, once memory has been filled (updatedb, untarring lots of data etc), the memory becomes too fragmented and writing a value to nr_hugepages may not allocate all of the requested hugepages. The problem is that not all memory can be reclaimed by the kernel. Many pages allocated by the kernel cannot be easily reclaimed on demand, and a single badly placed page can cause a hugepage allocation failure.

Solving the Problem with a Memory Partition

Since 2.6.23, memory can be split in two at boot-time, creating a zone called ZONE_MOVABLE (see information on zones by reading /proc/zoneinfo). Only pages that can be reclaimed on demand by the kernel use this zone. Within this zone of memory, hugepage allocations will almost always (see Caveats later) succeed no matter how long the system has been running.

So, let us say that an administrator knows that a number of jobs will be running on his system that want to use hugepages. The jobs use between 0 and 256 hugepages and he does not want to waste memory. Previously, the administrator would specify hugepages=256 on the command-line and waste the memory whenever the jobs are not using it. Now instead, the administrator would specify movablecore=1024MB on the command line to set up a partition for movable pages 1GiB in size, or 256 hugepages of 4MiB each. Jobs that require hugepages can now request them through /proc/sys/vm/nr_hugepages and have a reasonable expectation of getting those pages.

The difference between the partitioning and configuring the pool at boot-time is that memory in the partition unused by hugepages can still be used for normal pages. This means that memory can be returned after the huge page process completes, and reallocated to small page processes; memory is never wasted.
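After booting with a partition, /proc/zoneinfo can be used to confirm that the zone is present. The sketch below assumes the usual "Node N, zone NAME" header lines in that file; the function names list_zones and has_movable_zone are hypothetical, and the file argument exists only so the check can be run against a saved copy. Newer kernels may list a Movable zone even when it is empty, so treat this as a rough check rather than proof that the partition has memory in it.

```shell
# Illustrative helpers (names are hypothetical).

# list_zones [FILE]
# Print one "node zone" pair per zone header found in a zoneinfo-style file.
# FILE defaults to /proc/zoneinfo.
list_zones() {
    awk '$1 == "Node" && $3 == "zone" { sub(/,/, "", $2); print $2, $4 }' \
        "${1:-/proc/zoneinfo}"
}

# has_movable_zone [FILE]
# Succeed if any node reports a zone called Movable.
has_movable_zone() {
    list_zones "$1" | grep -q ' Movable$'
}
```

A boot script could then run something like `has_movable_zone || echo "no ZONE_MOVABLE, was movablecore= given?"` before attempting to grow the pool.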
Configuring the Memory Partition

The partition must be configured at boot-time using either kernelcore= or movablecore= as documented in Documentation/kernel-parameters.txt. movablecore specifies how much memory should be used for ZONE_MOVABLE. An alternative way of looking at movablecore is

Max Hugepages that can be allocated at any time = movablecore / hugepagesize

If, on the other hand, the administrator knows how much memory the rest of the kernel needs and wants as much memory as possible to be available for a varying number of hugepages, kernelcore= can be used instead. In this case, the size of ZONE_MOVABLE is whatever memory is left over, or alternatively

Max Hugepages that can be allocated at any time = (TotalMem - kernelcore) / hugepagesize

Once the partition is set up, hugepages will not be allocated from it automatically. This is because hugepages are, strictly speaking, not movable and the partition may also be set up for hot-removing memory at runtime. To allow hugepages to be allocated from the partition, do
arnold:/proc/sys/vm# echo 1 > hugepages_treat_as_movable
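The two sizing formulas in this section can be worked through with shell arithmetic. This is a sketch only: the function names are invented for illustration, all sizes are assumed to be given in kB, and the 4GiB machine with kernelcore=1G used in the note below is a made-up example rather than a system described in this article.

```shell
# Illustrative helpers (names are hypothetical). All arguments are in kB.

# max_hugepages_movablecore MOVABLECORE_KB HUGEPAGESIZE_KB
# Max hugepages = movablecore / hugepagesize
max_hugepages_movablecore() {
    echo $(( $1 / $2 ))
}

# max_hugepages_kernelcore TOTALMEM_KB KERNELCORE_KB HUGEPAGESIZE_KB
# Max hugepages = (TotalMem - kernelcore) / hugepagesize
max_hugepages_kernelcore() {
    echo $(( ($1 - $2) / $3 ))
}
```

For the movablecore=1024MB example with 4096 kB hugepages, `max_hugepages_movablecore 1048576 4096` prints 256. For the hypothetical 4GiB machine booted with 1GiB of kernelcore, `max_hugepages_kernelcore 4194304 1048576 4096` prints 768.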
Growing The Pool

Once the partition is set up, the hugepage pool can be easily grown. Depending on system activity, the first attempt may not succeed, so try a few times or use a script like this not-very-tested and very rough piece of work:
#!/bin/bash
# Attempt to grow the pool to the requested size
# This benchmark checks how many hugepages can be allocated in the hugepage
# pool
#
# Copyright Mel Gorman (c) 2007
# Licensed under GPL V2. See http://www.gnu.org/licenses/gpl-2.0.txt for details
SLEEP_INTERVAL=5
FAIL_AFTER_NO_CHANGE_ATTEMPTS=20
NUM_REQUIRED=0
usage() {
    echo "get_hugepages: Get the requested number of hugepages"
    echo
    echo " -s Time to sleep between attempts to grow pool"
    echo " -f Give up after failing this number of times"
    echo " -n Number of hugepages that should be in the pool"
    echo
    exit $1
}
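# Arg processing
while [ "$1" != "" ]; do
    case "$1" in
        -s) export SLEEP_INTERVAL=$2; shift 2;;
        -f) export FAIL_AFTER_NO_CHANGE_ATTEMPTS=$2; shift 2;;
        # -c is accepted but MAX_ATTEMPT is not used anywhere in this script
        -c) export MAX_ATTEMPT=$2; shift 2;;
        -n) export NUM_REQUIRED=$2; shift 2;;
        # Reject unknown options; without this the loop would never end
        *) echo "ERROR: Unknown option $1"; usage 1;;
    esac
done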
# Check proc entry exists
if [ ! -e /proc/sys/vm/nr_hugepages ]; then
    echo Attempting load of hugetlbfs module
    modprobe hugetlbfs
    if [ ! -e /proc/sys/vm/nr_hugepages ]; then
        echo ERROR: /proc/sys/vm/nr_hugepages does not exist
        # Exit statuses are limited to 0-255, so use a positive code
        exit 1
    fi
fi
# Check a number was requested
if [ -z "$NUM_REQUIRED" ] || [ "$NUM_REQUIRED" -lt 0 ]; then
    echo ERROR: You must specify a number of hugepages to alloc
    usage 2
fi
# Record existing hugepage count
STARTING_COUNT=`cat /proc/sys/vm/nr_hugepages`
# Ensure we have permission to write by writing the current count back
echo $STARTING_COUNT 2> /dev/null > /proc/sys/vm/nr_hugepages || {
    echo ERROR: Do not have permission to adjust nr_hugepages count
    exit 3
}
echo Starting page count: $STARTING_COUNT
# Start attempt to grow pool
CURRENT_COUNT=$STARTING_COUNT
LAST_COUNT=$STARTING_COUNT
NOCHANGE_COUNT=0
ATTEMPT=0
while [ $NOCHANGE_COUNT -ne $FAIL_AFTER_NO_CHANGE_ATTEMPTS ] && [ $CURRENT_COUNT -ne $NUM_REQUIRED ]; do
    ATTEMPT=$((ATTEMPT+1))
    echo $NUM_REQUIRED > /proc/sys/vm/nr_hugepages
    CURRENT_COUNT=`cat /proc/sys/vm/nr_hugepages`
    PROGRESS=
    if [ $CURRENT_COUNT -eq $LAST_COUNT ]; then
        NOCHANGE_COUNT=$(($NOCHANGE_COUNT+1))
    elif [ $CURRENT_COUNT -ne $NUM_REQUIRED ]; then
        NOCHANGE_COUNT=0
        PROGRESS="Progress made with $(($CURRENT_COUNT-$LAST_COUNT)) pages"
    fi
    # Report, remember the count and sleep on every attempt, not only on progress
    echo Attempt $ATTEMPT: $CURRENT_COUNT pages $PROGRESS
    LAST_COUNT=$CURRENT_COUNT
    sleep $SLEEP_INTERVAL
done
echo Final page count: $CURRENT_COUNT
# Exit with 0 if number of pages was successfully allocated
if [ $CURRENT_COUNT -eq $NUM_REQUIRED ]; then
    exit 0
else
    exit 4
fi
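Once the pool can be grown on demand, a job can borrow hugepages and hand them back when it finishes. The following is a rough sketch of that pattern, not something taken from the script above: with_hugepages is a hypothetical helper, NR_HUGEPAGES_FILE is an assumed override that exists only so the function can be tried against an ordinary file, and it makes a single attempt to grow the pool where a real deployment would probably retry as the script above does.

```shell
# Hypothetical helper: grow the pool, run a command, then restore the pool.
# NR_HUGEPAGES_FILE is an assumption for testing; it defaults to the real file.
NR_HUGEPAGES_FILE="${NR_HUGEPAGES_FILE:-/proc/sys/vm/nr_hugepages}"

# with_hugepages COUNT COMMAND [ARGS...]
with_hugepages() {
    wanted=$1
    shift
    # Remember the current pool size so it can be restored afterwards
    previous=$(cat "$NR_HUGEPAGES_FILE") || return 1
    echo "$wanted" > "$NR_HUGEPAGES_FILE" || return 1
    got=$(cat "$NR_HUGEPAGES_FILE")
    if [ "$got" -lt "$wanted" ]; then
        # The kernel may allocate fewer pages than requested
        echo "only $got of $wanted hugepages available" >&2
        echo "$previous" > "$NR_HUGEPAGES_FILE"
        return 1
    fi
    # Run the job, then return the pages to general use
    "$@"
    status=$?
    echo "$previous" > "$NR_HUGEPAGES_FILE"
    return $status
}
```

As root, this would be used along the lines of `with_hugepages 256 ./my_job`, where ./my_job stands in for whatever application wants the hugepages.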
Caveats

Allocations almost always succeed. The one case where they do not is when memory is mlock()ed. Technically these pages could be moved, and patches exist to do just that, but there has not been demand for the feature to date.

Future

In 2.6.24, grouping pages by mobility may mean that the partition does not even have to be set up for the kernel to be able to grow/shrink the pool to a large extent. However, if hugepage availability must be guaranteed at all times, then the partition should be set up.

Summary

There you have it. In the past, hugepages had to be allocated at boot-time. Now, you can set up a partition at boot-time instead, allocate hugepages to the pool when they are needed, and return them to general use when not required, allowing the memory to be used as normal.

Acknowledgements

Thanks to Nishanth Aravamudan and Andy Whitcroft for reviewing drafts of this.

References

[1] Leverage transparent huge pages on Linux on POWER http://www.ibm.com/developerworks/systems/library/es-lop-leveragepages/
[2] Kernel source: Documentation/vm/hugetlbpage.txt