Last Updated: April 29th, 2009

Huge pages can benefit applications that use large amounts of address space, both by reducing TLB misses and by increasing the amount of cache left usable by the application. On Linux, there are two basic ways of programmatically creating a region of memory backed by huge pages. The first is to call shmget() with the SHM_HUGETLB flag set. While convenient, this can result in regions persisting in memory due to application error. The second is to mmap() a file created on the hugetlbfs filesystem, but creating the file requires a significant amount of boilerplate code.
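
For reference, the shmget() route looks roughly like the following sketch (error handling kept to the essentials; the 2MB length is an assumption and must be a multiple of the system's huge page size):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    # define SHM_HUGETLB 04000          /* not exposed by all libc headers */
    #endif

    #define LENGTH (2UL * 1024 * 1024)  /* must be a multiple of the huge page size */

    int main(void)
    {
        int shmid;
        char *buf;

        /* Create a shared memory segment backed by huge pages */
        shmid = shmget(IPC_PRIVATE, LENGTH, SHM_HUGETLB | IPC_CREAT | 0600);
        if (shmid < 0) {
            perror("shmget");
            return 1;
        }

        buf = shmat(shmid, NULL, 0);
        if (buf == (char *)-1) {
            perror("shmat");
            shmctl(shmid, IPC_RMID, NULL);
            return 1;
        }
        buf[0] = 1; /* touch the segment so a huge page is faulted in */

        /* Without explicit cleanup the segment persists after exit, which is
         * the pitfall mentioned above */
        shmdt(buf);
        shmctl(shmid, IPC_RMID, NULL);
        return 0;
    }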

To alleviate the pain, an explicit API was added in libhugetlbfs 2.0 for the easy allocation and freeing of regions backed by huge pages: get_huge_pages() and free_huge_pages() respectively. The intention was that the API be used in the development of custom allocators, while users of glibc would continue to use malloc() with HUGETLB_MORECORE set. In 2.0 the documentation was non-existent and, as there was no suitable drop-in replacement for malloc(), the API was not widely advertised. libhugetlbfs 2.1 improved this API significantly with the addition of a manual page and a new call, get_hugepage_region(), which is more suitable as a drop-in replacement for malloc().
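
In outline, the two allocators are used something like the following sketch (a minimal illustration rather than part of STREAM; the lengths are arbitrary and the program must be built against libhugetlbfs and linked with -lhugetlbfs):

    #include <stdio.h>
    #include <stdlib.h>
    #include <hugetlbfs.h>

    int main(void)
    {
        long hpage_size = gethugepagesize();
        void *buf, *region;

        /* get_huge_pages(): the length must be a multiple of the huge page
         * size; intended as a building block for custom allocators */
        buf = get_huge_pages(hpage_size, GHP_DEFAULT);
        if (buf == NULL) {
            printf("get_huge_pages() failed\n");
            exit(EXIT_FAILURE);
        }
        free_huge_pages(buf);

        /* get_hugepage_region(): any length is accepted and, with GHR_DEFAULT,
         * the allocation falls back to small pages if no huge pages are
         * available; closer to a drop-in replacement for malloc() */
        region = get_hugepage_region(100 * 1024, GHR_DEFAULT);
        if (region == NULL) {
            printf("get_hugepage_region() failed\n");
            exit(EXIT_FAILURE);
        }
        free_hugepage_region(region);

        return 0;
    }

Both calls are documented in the get_huge_pages(3) and get_hugepage_region(3) manual pages shipped with 2.1 and later.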

This document begins by converting a memory bandwidth benchmark called STREAM to use malloc() backed by huge pages. The benchmark is then altered to use get_huge_pages() and get_hugepage_region() as examples of how the API can be used. This is based on libhugetlbfs 2.3. First of all, install libhugetlbfs in your home directory.

    $ wget \
        http://heanet.dl.sourceforge.net/sourceforge/libhugetlbfs/libhugetlbfs-2.3.tar.gz
    $ tar -zxf libhugetlbfs-2.3.tar.gz
    $ cd libhugetlbfs-2.3/
    $ make PREFIX=$HOME/opt/libhugetlbfs
    $ make PREFIX=$HOME/opt/libhugetlbfs install
Download STREAM and build it.
    $ mkdir stream
    $ cd stream
    $ wget \
        http://www.cs.virginia.edu/stream/FTP/Code/stream.c \
        -O stream-orig.c
    $ gcc -O2 stream-orig.c -o stream-orig
Apply this patch to use malloc() instead of statically allocating the arrays.
    $ cp stream-orig.c stream-one-malloc.c
    $ cat stream-one-malloc.patch | patch
    $ gcc -O2 stream-one-malloc.c -o stream-one-malloc
There is a very important point to note about the patch - it only calls malloc() once and uses the same buffer for the three arrays. This arranges the arrays in memory so that they look similar to a static declaration, making comparisons easier.
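
In essence, the allocation in the patched benchmark is arranged as in the sketch below (not the patch itself; N and OFFSET stand in for stream.c's compile-time array size and offset defines):

    #include <stdio.h>
    #include <stdlib.h>

    #define N      2000000
    #define OFFSET 0

    static double *a, *b, *c;

    int main(void)
    {
        /* One malloc() call for all three arrays keeps them adjacent in
         * memory, much as the original static declarations were */
        a = malloc(sizeof(double) * (N + OFFSET) * 3);
        if (a == NULL) {
            printf("Failed to allocate the arrays\n");
            exit(-1);
        }
        b = a + N + OFFSET;
        c = b + N + OFFSET;

        /* ... the STREAM kernels operate on a, b and c here ... */

        free(a);
        return 0;
    }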

Next, we use hugeadm to create a mount point for huge pages usable by any user on the system. See the manual page for other options relating to the creation of mount points. hugeadm then allows the dynamic hugepage pool to grow to 1GB of 2M pages (1024MB/2MB = 512 pages) with --pool-pages-max. Use the pagesize utility to see what the huge page size on your system is and adjust accordingly. With dynamic hugepage pool sizing, huge pages are created on demand. To allocate them statically, use --pool-pages-min instead. Again, details are in the manual pages.

    $ su
    # hugeadm --create-global-mounts
    # hugeadm --pool-pages-max 2M:$((1024/2))
    # hugeadm --pool-pages-min 2M:0
    # exit
Now the malloc() version of STREAM can be run either as-is or with its memory backed by huge pages using something like the following. Note that you may need to alter the paths for a 64-bit binary.
    $ LD_PRELOAD=$HOME/opt/libhugetlbfs/lib/libhugetlbfs.so \
                HUGETLB_MORECORE=yes \
                ./stream-one-malloc
The patch to use get_huge_pages() is relatively straightforward. It requires adding the header, calculating the hugepage-aligned length, a one-line change to the malloc() call and an updated informational message. It should be noted again that this is intended for illustrative purposes; get_huge_pages() generally should not be used as a drop-in replacement for malloc() - use get_hugepage_region() for that.
--- stream-one-gethuge.orig	2008-12-03 11:18:19.000000000 +0000
+++ stream-one-gethuge.c	2008-12-03 11:28:47.000000000 +0000
@@ -46,6 +46,7 @@
 # include <limits.h>
 # include <sys/time.h>
 # include <stdlib.h>
+# include <hugetlbfs.h>
 
 /* INSTRUCTIONS:
  *
@@ -122,9 +123,14 @@
     int			BytesPerWord;
     register int	j, k;
     double		scalar, t, times[4][NTIMES];
+    size_t		bufsize;
+
+    /* Calculate and hugepage-align the size of the arrays */
+    bufsize = sizeof(double) * (N + OFFSET) * 3;
+    bufsize = (bufsize + gethugepagesize() - 1) & ~(gethugepagesize() - 1);
 
     /* --- SETUP --- determine precision and check timing --- */
-    a = malloc(sizeof(double) * (N + OFFSET) * 3);
+    a = get_huge_pages(bufsize, GHP_DEFAULT);
     b = a + N + OFFSET;
     c = b + N + OFFSET;
     if (a == NULL) {
@@ -137,7 +144,7 @@
     BytesPerWord = sizeof(double);
     printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
 	BytesPerWord);
-    printf("The work arrays are allocated with malloc()\n");
+    printf("The work arrays are allocated with get_huge_pages()\n");
 
     printf(HLINE);
     printf("Array size = %d, Offset = %d\n" , N, OFFSET);
@@ -276,7 +283,7 @@
     checkSTREAMresults();
     printf(HLINE);
 
-    free(a);
+    free_huge_pages(a);
     return 0;
 }
To apply the patch and build, do something like the following. Note you may need to adjust the paths for 64-bit binaries.
    $ cp stream-one-malloc.c stream-one-gethuge.c
    $ cat stream-one-gethuge.patch | patch
    $ gcc -O2 \
        -I$HOME/opt/libhugetlbfs/include \
        -L$HOME/opt/libhugetlbfs/lib \
        -lhugetlbfs \
        stream-one-gethuge.c -o stream-one-gethuge
    $ LD_LIBRARY_PATH=$HOME/opt/libhugetlbfs/lib ./stream-one-gethuge
Now there are malloc() and get_huge_pages() versions of STREAM ready for benchmarking, each of which allocates one large buffer and splits it into three. The test runs are based on an AMD Phenom 9950 Quad-Core Processor with 4GB of RAM. Tests were run varying the size of the arrays to simulate an increasing working set size. Each size was tested five times and the average throughput recorded.

The first figure shows the throughput in MB/s for the STREAM Add and Scale operations. Each operation was tested in three configurations: small pages with static allocation, small pages with one malloc() call, and huge pages backing the one malloc() call.

          Figure 1: Comparison of STREAM operations using static allocation
          and malloc() with small and huge pages
There are some performance differences between the static and malloc() allocations. Which wins varies depending on the STREAM operation and the exact layout in memory. What is clear is that using huge pages for the buffers increases throughput, which is what one would expect. Note that huge pages are not always a win; whether they help depends on the workload, the TLB and the exact processor.

Contrast the patch that uses get_huge_pages() with the patch below that uses get_hugepage_region(), which is closer to being a drop-in replacement for malloc(). It does not require aligned lengths, can fall back to small pages if huge pages are unavailable, and can use the otherwise-wasted bytes to cache colour the buffer, which can improve performance as discussed later. Note, though, that each call still results in at least one huge page being allocated, so if it is used excessively for small buffers, memory will be wasted.

--- stream-one-getregion.orig	2008-12-03 11:44:45.000000000 +0000
+++ stream-one-getregion.c	2008-12-03 11:45:42.000000000 +0000
@@ -45,6 +45,8 @@
 # include <float.h>
 # include <limits.h>
 # include <sys/time.h>
+# include <stdlib.h>
+# include <hugetlbfs.h>
 
 /* INSTRUCTIONS:
  *
@@ -88,9 +90,7 @@
 # define MAX(x,y) ((x)>(y)?(x):(y))
 # endif
 
-static double	a[N+OFFSET],
-		b[N+OFFSET],
-		c[N+OFFSET];
+static double	*a, *b, *c;
 
 static double	avgtime[4] = {0}, maxtime[4] = {0},
 		mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
@@ -123,15 +123,26 @@
     int			BytesPerWord;
     register int	j, k;
     double		scalar, t, times[4][NTIMES];
+    size_t		bufsize;
 
-    /* --- SETUP --- determine precision and check timing --- */
+    /* Calculate the size of the arrays */
+    bufsize = sizeof(double) * (N + OFFSET) * 3;
 
+    /* --- SETUP --- determine precision and check timing --- */
+    a = get_hugepage_region(bufsize, GHR_DEFAULT);
+    b = a + N + OFFSET;
+    c = b + N + OFFSET;
+    if (a == NULL) {
+        printf("Failed to alloc arrays\n");
+        exit(-1);
+    }
     printf(HLINE);
     printf("STREAM version $Revision: 5.8 $\n");
     printf(HLINE);
     BytesPerWord = sizeof(double);
     printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
 	BytesPerWord);
+    printf("The work arrays are allocated with get_hugepage_region()\n");
 
     printf(HLINE);
     printf("Array size = %d, Offset = %d\n" , N, OFFSET);
@@ -270,6 +281,7 @@
     checkSTREAMresults();
     printf(HLINE);
 
+    free_hugepage_region(a);
     return 0;
 }

Figure 2 compares performance between malloc() using huge pages, get_huge_pages() and get_hugepage_region(). For the two uses of the API, one buffer has been allocated and split up between the three arrays. As you would expect, the performance of the two API-based versions is very similar. They offer a slight improvement over the malloc() version, but this is likely a coincidence due to memory layout.

          Figure 2: Comparison of STREAM operations using huge pages for
          malloc(), get_huge_pages() and get_hugepage_region()

The last comparison we will make is between get_huge_pages() and get_hugepage_region() when called once for all the buffers and once for each buffer (i.e. one call versus three calls). malloc()-style allocators cache colour buffers so that the same index in similarly-sized arrays does not map to the same cache lines, which can be very important for workloads like STREAM. As get_huge_pages() is a close-to-kernel interface intended to aid the implementation of custom allocators, its lack of cache colouring can have performance implications when it is used incorrectly. get_hugepage_region() does not have this problem.
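
The three-call variant is not shown as a patch, but it amounts to something like the sketch below (illustrative values again; the align_to_hugepage() helper is introduced here for brevity). This is the usage pattern that suffers without cache colouring:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hugetlbfs.h>

    #define N      2000000
    #define OFFSET 0

    static double *a, *b, *c;

    /* Round a length up to a multiple of the huge page size, as
     * get_huge_pages() requires */
    static size_t align_to_hugepage(size_t len)
    {
        size_t hpage_size = gethugepagesize();
        return (len + hpage_size - 1) & ~(hpage_size - 1);
    }

    int main(void)
    {
        size_t arraysize = align_to_hugepage(sizeof(double) * (N + OFFSET));

        /* Three separate calls: each array starts at the beginning of its own
         * huge page, so the same index in a, b and c maps to the same cache
         * lines. get_hugepage_region() avoids this by staggering the start of
         * each buffer (cache colouring). */
        a = get_huge_pages(arraysize, GHP_DEFAULT);
        b = get_huge_pages(arraysize, GHP_DEFAULT);
        c = get_huge_pages(arraysize, GHP_DEFAULT);
        if (a == NULL || b == NULL || c == NULL) {
            printf("Failed to allocate the arrays\n");
            exit(-1);
        }

        /* ... the STREAM kernels operate on a, b and c here ... */

        free_huge_pages(c);
        free_huge_pages(b);
        free_huge_pages(a);
        return 0;
    }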

Figure 3 shows the performance consequences of incorrect usage of get_huge_pages() by comparing one call to the allocator against three. As is clear, the version making three calls to get_huge_pages() performs extremely poorly due to the lack of cache colouring. As is also clear, get_hugepage_region(), which does cache colour, achieves performance comparable to allocating one single buffer. This is what makes it the more suitable drop-in replacement for malloc().

          Figure 3: Comparison of STREAM operations using one versus three
          calls to get_huge_pages() and get_hugepage_region()

In summary, the libhugetlbfs APIs get_huge_pages() and get_hugepage_region() can deliver the performance benefits of huge pages in situations where automatic backing of malloc() is not suitable. get_huge_pages() should be used when implementing custom allocators, and get_hugepage_region() should be used where a call to malloc() for a large buffer is being replaced. Happy hacking.