<?xml version="1.0" encoding="utf-8" ?>

<rss version="2.0" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:admin="http://webns.net/mvcb/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
   xmlns:wfw="http://wellformedweb.org/CommentAPI/"
   xmlns:content="http://purl.org/rss/1.0/modules/content/"
   >
<channel>
    <title>Mel Gorman: Kernel Spanner</title>
    <link>http://www.csn.ul.ie/~mel/blog/</link>
    <description>Postings from a small room in a cardboard estate...</description>
    <dc:language>en</dc:language>
    <generator>Serendipity 1.2 - http://www.s9y.org/</generator>
    <pubDate>Thu, 12 Nov 2009 00:36:59 GMT</pubDate>

    <image>
        <url>http://www.csn.ul.ie/~mel/blog/templates/default/img/s9y_banner_small.png</url>
        <title>RSS: Mel Gorman: Kernel Spanner - Postings from a small room in a cardboard estate...</title>
        <link>http://www.csn.ul.ie/~mel/blog/</link>
        <width>100</width>
        <height>21</height>
    </image>

<item>
    <title>Page allocator failure warnings in recent kernels</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/14-Page-allocator-failure-warnings-in-recent-kernels.html</link>
            <category>VMM</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/14-Page-allocator-failure-warnings-in-recent-kernels.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=14</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=14</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    So, in recent kernels since 2.6.31-rc1, there is a seemingly benign problem whose apparent manifestation is page allocation failures of GFP_ATOMIC allocations. The system recovers but there are large stalls even though on server systems, everything goes faster overall. The problem is particularly pronounced when using certain wireless cards but manifests in harder-to-diagnose stalls on machines with low memory under stress. The development methodology means that kernels come out very quickly even though right now, I would really prefer if the world would slow down while my poor test machines try to catch up.&lt;br /&gt;
&lt;br /&gt;
I think I have a solution to this but it takes several hours each time to figure out if forward progress has been made or not.&lt;br /&gt;
&lt;br /&gt;
The lesson learnt here? Panic makes for poor decisions. I sent one patch that looked great at the time but have found out in the last few hours that it really sucks. While figuring this out for sure, I have to wait, watching a screen update painfully slowly. To help the waiting, I found some beer, it&#039;s the Irish thing to do. Wonder what the rest of ye do :/ 
    </content:encoded>

    <pubDate>Thu, 12 Nov 2009 00:36:59 +0000</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/14-guid.html</guid>
    
</item>
<item>
    <title>SystemTap is no longer a pain to install</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/13-SystemTap-is-no-longer-a-pain-to-install.html</link>
            <category>SystemTap</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/13-SystemTap-is-no-longer-a-pain-to-install.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=13</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=13</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    I&#039;m revisiting some old topics as part of research that requires me to extract information out of the kernel but whose instrumentation is not really worth merging to mainline. At one point, I extracted this information using a kernel module relaying additional information stored in struct page to userspace. It&#039;s not smart or clever but the options were limited at the time. This time around, I intend to give SystemTap a go as it should be able to do this type of job.&lt;br /&gt;
&lt;br /&gt;
When I last used SystemTap, it was a total and utter pain to install which was also one of the main criticisms levelled at it during one of the kernel summits. I had an installation script at the time to automate installation but it was a kludge of workarounds and patches. Since then, things have improved considerably because the reworked &lt;a href=&quot;http://www.csn.ul.ie/~mel/postings/stap-20090710/stap-install&quot; title=&quot;SystemTap installation script&quot;&gt;installation script&lt;/a&gt; for the current release is downright trivial and 180 lines shorter than the previous version. A 2 minute glance through the example scripts shows considerable improvement in terms of usability and readability as well.&lt;br /&gt;
&lt;br /&gt;
I haven&#039;t figured out if systemtap is usable for my needs or not yet but things certainly appear to be going in the right direction on that front. 
    </content:encoded>

    <pubDate>Fri, 10 Jul 2009 09:38:49 +0100</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/13-guid.html</guid>
    
</item>
<item>
    <title>When is a cool room not cold enough</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/12-When-is-a-cool-room-not-cold-enough.html</link>
            <category>Linux</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/12-When-is-a-cool-room-not-cold-enough.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=12</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=12</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    For much of the last 30 days or so[1], I&#039;ve been writing a paper as part of college work. A large part of that was running tests 24 hours a day over the course of 20 days[2]. On the day they finished, I reran one of the tests (crypto from the specjvm2008 suite) with more debugging information as I wanted to confirm a suspicion but the machine spontaneously rebooted. Not good, but the room was cold, nothing was on serial console or in the logs so I thought it might just have been a power flicker. Rebooted again an hour later so I popped the case to see had something become unseated or some other problem. There were a few hints as to what the problem might be;&lt;br /&gt;
&lt;br /&gt;
Hint 1 ..... memory modules are usually in a nice neat line together, maybe it somehow fell out &lt;br /&gt;
&lt;br /&gt;
wrong&lt;br /&gt;
&lt;br /&gt;
Hint 2 ..... heat sinks should not melt off&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;http://www.csn.ul.ie/~mel/postings/whoops-20090127/IMG_0641.jpg&quot;&gt;&lt;br /&gt;
&lt;img src=&quot;http://www.csn.ul.ie/~mel/postings/whoops-20090127/IMG_0641_small.jpg&quot; alt=&quot;Slightly melted memory module&quot; /&gt;&lt;br /&gt;
&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Sometimes a cold room with case fans just isn&#039;t cold enough if you run that machine hard enough for long enough. Makes for bad surprises but a funny picture. Module is pretty hosed but easily replaced at least.&lt;br /&gt;
&lt;br /&gt;
[1] Minus 7 days during which I was skiing for the first time in Austria. Skiing rules&lt;br /&gt;
[2] Based on mainline kernel 2.6.27 installed on Debian Lenny Beta from Dec 2008 using an AMD Phenom based machine for x86-64 and a &lt;a href=&quot;http://us.fixstars.com/products/powerstation/&quot;&gt;Terrasoft Powerstation&lt;/a&gt; for ppc64. 
    </content:encoded>

    <pubDate>Tue, 27 Jan 2009 15:02:34 +0000</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/12-guid.html</guid>
    
</item>
<item>
    <title>Utilities for configuring and using huge pages</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/11-Utilities-for-configuring-and-using-huge-pages.html</link>
            <category>Large Pages</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/11-Utilities-for-configuring-and-using-huge-pages.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=11</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=11</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    The &lt;a href=&quot;http://www.csn.ul.ie/~mel/blog/index.php?/archives/10-Allocating-huge-pages-easily-from-userspace.html&quot;&gt;userspace allocation API&lt;/a&gt; was only one of the big additions made to libhugetlbfs 2.1. A number of helper utilities were also added for configuring the system easily (hugeadm), the launching of applications to automatically use large pages (hugectl), to set defaults for applications (hugeedit) and to display information about the page sizes supported by the system (pagesize). In combination, they should make using huge pages under Linux that bit easier but &lt;a href=&quot;http://www.csn.ul.ie/~mel/docs/sysbench-utils/&quot; title=&quot;Link to mini-doc on configuring and launching applications with libhugetlbfs utilities&quot;&gt;here&lt;/a&gt; is a document that briefly describes how to install the utilities, configure the system and launch the sysbench benchmark using the utilities. 
    </content:encoded>

    <pubDate>Mon, 15 Dec 2008 23:13:56 +0000</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/11-guid.html</guid>
    
</item>
<item>
    <title>Allocating huge pages easily from userspace</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/10-Allocating-huge-pages-easily-from-userspace.html</link>
            <category>Large Pages</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/10-Allocating-huge-pages-easily-from-userspace.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=10</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=10</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    On Linux, there are two basic ways of creating a region of memory backed by huge pages. The first is to use shmget() with the SHM_HUGETLB flag set but this can leak memory if the application fails to clean up memory properly. The second is to mmap() a file created on a hugetlbfs mount but this requires a lot of boring boilerplate code. Things are more painful than they should be and this led to whinging.&lt;br /&gt;
&lt;br /&gt;
As the necessary code existed in libhugetlbfs to discover the mount and create a file, two APIs were created called get_huge_pages() for use when implementing custom allocators and get_hugepage_regions() for use as a drop-in replacement for malloc() of large buffers. These are available and documented with manual pages in &lt;a href=&quot;http://libhugetlbfs.ozlabs.org/releases/libhugetlbfs-2.1-pre5.tar.gz&quot; title=&quot;Link to libhugetlbfs 2.1-pre5 tarball&quot;&gt;libhugetlbfs 2.1-pre5&lt;/a&gt; with a final release expected in the near future.&lt;br /&gt;
&lt;br /&gt;
I put together this &lt;a href=&quot;http://www.csn.ul.ie/~mel/docs/stream-api/&quot; title=&quot;Converting STREAM to use Direct Hugepage Allocation API&quot;&gt;document&lt;/a&gt; describing how to alter STREAM to use malloc with small pages, malloc with large pages and the two direct hugepage allocation APIs now supported in libhugetlbfs. It should make life easier for anyone writing a hugepage-aware application that cannot use the automatic support in libhugetlbfs for whatever reason. 
    </content:encoded>

    <pubDate>Wed, 03 Dec 2008 12:35:43 +0000</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/10-guid.html</guid>
    
</item>
<item>
    <title>YAFAP - What else is there to do when you are ill?</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/9-YAFAP-What-else-is-there-to-do-when-you-are-ill.html</link>
            <category>General</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/9-YAFAP-What-else-is-there-to-do-when-you-are-ill.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=9</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=9</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    I&#039;ve been out ill the last few days (and still am). As having a head full of goo made me dumber than the average stick and unable to do more than an hour or two of useful work in a day, I decided to fire up nethack again to chew up a bit of the day. The only games I play are guitar hero variants, final fantasy anything and nethack which was the first game I played on the PS3. I hadn&#039;t played properly in months as the last time I &lt;i&gt;almost&lt;/i&gt; made it which just sickened me. The one exception was a game a few weeks ago for a local competition that had a keg as a prize if someone ascended (no one did).&lt;br /&gt;
&lt;br /&gt;
The game was mainly a grind but being dumb also made me patient. One major setback was cancelling the whole inventory including the Bell of Opening which needs charged to finish the game. This is dumb, don&#039;t do it. I had the Wizard of Yendor killed by the time I realized the artifact was a no-go and there was no means of charging or wishing left in the game as I had them cleared out or used already. In a somewhat disgruntled mood, I put the character in a situation where it was attacked literally all the time to finish the game and call it a day. Fortunately for me, between the monsters summoned by being negatively aligned with negative luck and the repeated monsters summoned by the Wizard, eventually a scroll of charging fell out for the first time in the game putting me back in action and levelled me up considerably in the process. After opening the sanctum I found another scroll which was enough to get a magic marker to bring my AC back to something respectable and stand up to two wizards (double trouble), the high priest of Moloch and various beasties at the same time without using wands of death as I had lost all instant-kill methods at that time. Was a real messy fight but eventually with the Amulet of Yendor in hand and after poking the Wizard a few more times in the eye, I got to the Astral planes. Visiting all three altars later and last night I see the lovely message. &lt;br /&gt;
&lt;br /&gt;
&lt;blockquote&gt;&quot;In return for thy service, I grant thee the gift of Immortality!&quot;  You ascend to the status of Demigod...&lt;br /&gt;
&lt;br /&gt;
Mel the Lord                St:25 Dx:18 Co:18 In:16 Wi:18 Ch:18  Lawful&lt;br /&gt;
Astral Plane $:175 HP:441(441) Pw:141(141) AC:-23 Xp:30/100090501 T:99085 Satiated&lt;br /&gt;
&lt;/blockquote&gt;&lt;br /&gt;
&lt;br /&gt;
Wooooo! &lt;a href=&quot;http://www.csn.ul.ie/~mel/postings/nethack-20081115/mel-ascended-dump.txt&quot; title=&quot;nethack ascension dump file&quot;&gt;This&lt;/a&gt; is the full dump file and now, it&#039;s time to jam yet more wonderful drugs into my head and clear it out! 
    </content:encoded>

    <pubDate>Fri, 14 Nov 2008 15:22:12 +0000</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/9-guid.html</guid>
    
</item>
<item>
    <title>Plumbers and Building X</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/8-Plumbers-and-Building-X.html</link>
            <category>Graphics</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/8-Plumbers-and-Building-X.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=8</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=8</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    Last week the &lt;a href=&quot;http://linuxplumbersconf.org/&quot; title=&quot;Linux Plumbers Conference&quot;&gt;Linux Plumbers Conference&lt;/a&gt; was held in Portland as a developers conference for those working close to the boundary between user and kernel space. Many have described it as being one of the best conferences they have attended in a while and I have to agree. The talks were interesting and the people running them discussed their current activities rather than a one-way monologue on their activities over the last year or so (for the most part anyway). For myself, I met a number of people to iron out issues that have been bugging me for a while and got a number of small hacking jobs prototyped that have been on my TODO list for too long. After working on large pages for some time, it was also a chance to get a quick tour of what is active in the lower levels of the Linux world at the moment.&lt;br /&gt;
&lt;br /&gt;
One area of interest to me was the graphics layer which has had a history of hilarity working outside of the kernel tree. This is out of necessity as the people willing to test an X driver are not necessarily the same people willing to test kernels. Hence, the out-of-tree code is required to build against a number of different kernel versions and the resulting ifdef trickery would have a hard time living in mainline not to mention keeping the API in sync. During the track, the possibility was raised that some of the interesting developments in the kernel, X and graphics drivers over the next year would require the user to update userspace before enabling certain kernel features. There was pressure to not require this update but the X guys seemed pretty insistent that a fully incremental effort would be a real pain and the end result ultimately worse.&lt;br /&gt;
&lt;br /&gt;
In case a kernel feature requires an updated userspace in the future, I took a look at what was involved in building X these days. If nothing else, my laptop was using the vesa driver which broke when switching to a text console (fonts were the wrong size) and generally performed worse than what the hardware should have been capable of. The distribution drivers for the card were less than satisfactory for a number of reasons that I never got around to ironing out and I knew there was a lot of additional support for the ATI M56GL Mobility card in my machine added over the last year. Plenty of incentive to get this working.&lt;br /&gt;
&lt;br /&gt;
I vaguely recall from years ago that building X from scratch was no fun whatsoever. Others must have had similar experiences as there is a general perception that X development is scary and building it from source is as much fun as a punch in the nose. X development may still be daunting, I haven&#039;t tried, but building from source is straight-forward today. Maybe I missed a pile of options and combinations, but getting the basics right involved an evening on the couch watching Flight of the Conchords on DVD and poking buttons periodically - hardly a taxing event.&lt;br /&gt;
&lt;br /&gt;
The X server, the modules and supporting infrastructure consists of a large number of git repositories. If you were to download and build them by hand, you&#039;d be there for a few hours and maybe that turns people away. There is a &lt;a href=&quot;http://www.x.org/wiki/Development/git&quot; title=&quot;X Build Script&quot;&gt;build script&lt;/a&gt; on the wiki, but it is a rushed hack by the looks of things and did not check that the build of a module actually succeeded, for example. I updated it to have some new smarts and it should be a fire-and-forget effort although you may need to add your graphics driver to its list. I&#039;ll update the wiki when I get back to Ireland (on plane at the moment) and have a chance to test it on my other machines to make sure it works in general but here is the &lt;a href=&quot;http://www.csn.ul.ie/~mel/postings/buildx-20080926/build-x.sh&quot; title=&quot;Script for building X&quot;&gt;script for building X&lt;/a&gt; as it currently stands for those that are interested.&lt;br /&gt;
&lt;br /&gt;
What did catch me was starting the new X properly. The site is very clear on starting X itself but I must have missed the instructions on how to give mesa the right paths. For the library paths, I added /opt/gfx-test/lib to /etc/ld.so.conf (it could also have been done in .bashrc with greater smarts but I was lazy). I then used a &lt;a href=&quot;http://www.csn.ul.ie/~mel/postings/buildx-20080926/testgraphics-startx&quot; title=&quot;Launcher script for X&quot;&gt;launcher script for X&lt;/a&gt; to load the kernel modules and set LIBGL_DRIVERS_PATH to find the new DRI drivers. Without LIBGL_DRIVERS_PATH, the system DRI libraries get used resulting in some weirdness which I only spotted after setting some debug options. gdm was starting the old X server so I simply disabled it rather than fixing the init script.  End result? One very satisfactory X desktop running very smoothly - nice one.&lt;br /&gt;
 
    </content:encoded>

    <pubDate>Fri, 26 Sep 2008 18:07:38 +0100</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/8-guid.html</guid>
    
</item>
<item>
    <title>Parked up in Portland for a bit</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/7-Parked-up-in-Portland-for-a-bit.html</link>
            <category>General</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/7-Parked-up-in-Portland-for-a-bit.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=7</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=7</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    I am just back online after being absent for two weeks. If you were trying to get in touch with me in that time and I am apparently ignoring mails, nudge me again because it is lost in the mess. The start of the absence was due to presenting a paper at &lt;a href=&quot;http://www.cs.kent.ac.uk/people/staff/rej/ismm2008/&quot; title=&quot;ISMM 2008&quot;&gt;ISMM 2008&lt;/a&gt; in Tucson, Arizona. The conference is fairly in-depth and was of significant interest even though many of the discussions were around managed languages and garbage collection which I do not focus on ordinarily. Luckily for me, attending next year will be relatively handy as it is due to be held in Dublin, Ireland.&lt;br /&gt;
&lt;br /&gt;
Post conference, I drove to Portland, Oregon over the course of two weeks stopping off at various places along the way such as Vegas (a unique place), Bagdad (in Arizona, not the other one), Los Angeles and Yosemite park. I&#039;m now parked up in Portland for the next 4 weeks so if anyone in the area wants to meet up that has met me in the past, drop a line and we&#039;ll make something happen. 
    </content:encoded>

    <pubDate>Tue, 24 Jun 2008 07:22:13 +0100</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/7-guid.html</guid>
    
</item>
<item>
    <title>Reservations about MAP_PRIVATE in hugetlbfs</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/6-Reservations-about-MAP_PRIVATE-in-hugetlbfs.html</link>
            <category>Large Pages</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/6-Reservations-about-MAP_PRIVATE-in-hugetlbfs.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=6</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=6</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    What started as an effort to automate sysbench testing with mysql and postgres became a bit more involved. The initial parts were relatively straight-forward and mainly around getting the build automation right. Boring, but time-consuming. Eventually it got there and I was able to show that anti-fragmentation did not hurt that workload, something I was reasonably sure about but wanted to double check.&lt;br /&gt;
&lt;br /&gt;
Then it seemed like it should be straight-forward to configure the benchmark to use large pages but there were two upsets that made the job harder than it should have been. The first was that there was no automatic way of making shmget() use large pages. I worked around this with a basic LD_PRELOAD hack which at some point should be done properly and added to libhugetlbfs. The second was tuning the hugepage pool size so the application could run reliably without consuming too much memory.&lt;br /&gt;
&lt;br /&gt;
Tuning the pool size was harder. Originally with hugetlbfs, huge pages were pre-faulted at mmap() time. If mmap() returned successfully, all future references would succeed, end of story. However, prefaulting increases the cost of mmap() and can lead to poor NUMA placement. Support was added in 2.6.18 to reserve pages for MAP_SHARED mappings that would be faulted in as normal. This fixed a performance problem but left MAP_PRIVATE in the lurch. As mmap() would not reserve pages, it just returned success. A fault later with an insufficient pool would result in a SIGKILL. Even if the application used mlock(), thus bringing back the NUMA placement problem, it may still not be safe because on fork() a COW may take place if the child was long-lived resulting in another SIGKILL.&lt;br /&gt;
&lt;br /&gt;
Benchmarking sysbench with libhugetlbfs used MAP_PRIVATE mappings so on getting the configuration wrong, the benchmark would unceremoniously exit and this was not even particularly consistent between machines. For the purposes of completing the benchmark, the pool was simply sized larger than I thought it needed to be but for the long term I found it irritating. The results were more or less what I was expecting for a database load (about 5% improvement) but the details of how it was set up are for another time.&lt;br /&gt;
&lt;br /&gt;
The first step to reducing the pain for someone using large pages was to make MAP_PRIVATE reliable without resorting to prefaulting. The obvious solution was to reserve the pages always but that was still problematic on fork(). The reserve would need to double at that point, something that is potentially very expensive if the pool is being dynamically resized. This work would be wasted if the fork() was simply for an exec() and using vfork() may not be suitable in all situations either. Failing fork() due to being unable to reserve hugepages would also be very unwelcome.&lt;br /&gt;
&lt;br /&gt;
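A minimal sketch of that arithmetic in Python, assuming a 2MB huge page size. The helper names are hypothetical and not part of the kernel or libhugetlbfs; the sketch only counts the pages a full reserve would need at mmap() time and again for each fork().&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```python
# Assumed huge page size of 2 MiB; other sizes exist on other systems.
HUGEPAGE_SIZE = 2 * 1024 * 1024


def pages_to_reserve(mapping_bytes, hugepage_size=HUGEPAGE_SIZE):
    """Huge pages needed to fully back a mapping (rounded up)."""
    return (mapping_bytes + hugepage_size - 1) // hugepage_size


def reserve_with_children(mapping_bytes, children, hugepage_size=HUGEPAGE_SIZE):
    """Pages needed if the parent and every child each hold a full reserve."""
    return pages_to_reserve(mapping_bytes, hugepage_size) * (1 + children)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print("reserve at mmap():", pages_to_reserve(one_gib))
    print("reserve after one fork():", reserve_with_children(one_gib, 1))
```
&lt;/pre&gt;&lt;br /&gt;
&lt;br /&gt;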
Hence, the solution that was put forward instead was to have reliable behaviour for the original mapper at the cost of the child. There are a few situations to deal with but the basic idea is that if an original mapper takes a COW fault and it is going to fail due to a small pool and a child holding a reference to the page, the process will find the children and unmap the large page at the faulting address. The COW is then no longer necessary and the original mapper continues as normal. If the child later takes a fault in that area, it gets killed.&lt;br /&gt;
&lt;br /&gt;
On the face of it, this appears to be bizarre behaviour but the reality is that random killing of the original mapper is simply unacceptable. An application that is expecting to use MAP_PRIVATE hugetlbfs mappings and have a child get its own reliable copy is probably doing something very strange and it&#039;s unlikely such applications exist given the history of hugetlbfs. If the pool is too small for a child to operate in this fashion, a message appears in dmesg that is fairly self-explanatory to catch the situation where such an application exists. For existing applications, they should already be able to cope gracefully with mmap() failing.&lt;br /&gt;
&lt;br /&gt;
One objection that was raised is that applications that create a large mapping that is only to be used sparsely could suffer due to mmap() requiring the full reserve, even if the sysadmin knows that only a small fraction of the pages is needed. Such applications are unlikely to exist given the history of hugetlbfs but just in case, Andy Whitcroft developed support for MAP_NORESERVE in hugetlbfs that bypasses the reserve. mmap() will succeed regardless of pool size and if it is too small, the developer gets to keep both pieces :/&lt;br /&gt;
&lt;br /&gt;
The patches are visible &lt;a href=&quot;http://lkml.org/lkml/2008/5/27/278&quot; title=&quot;Private-reserve for hugetlbfs patches&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://marc.info/?l=linux-mm&amp;m=121192992507397&amp;w=2&quot; title=&quot;MAP_NORESERVE support for hugetlbfs patches&quot;&gt;here&lt;/a&gt; for those that want to take a closer look. They have been merged to -mm for wider testing and should make working with hugetlbfs a more positive experience. 
    </content:encoded>

    <pubDate>Mon, 02 Jun 2008 17:12:11 +0100</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/6-guid.html</guid>
    
</item>
<item>
    <title>Wakeups between Debian Etch and Debian Testing</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/5-Wakeups-between-Debian-Etch-and-Debian-Testing.html</link>
            <category>Power</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/5-Wakeups-between-Debian-Etch-and-Debian-Testing.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=5</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=5</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    Since &lt;a href=&quot;http://www.lesswatts.org/projects/powertop/&quot;&gt;PowerTOP&lt;/a&gt; was released, I noticed that the number of wakeups on my Thinkpad T60p was excessively high. Usually they were around the 350 mark, even without much running, leading to about 3.5 hours battery life or about an hour less than Windows XP. Even with disabling non-essential services, hardware and following other suggestions dotted around sites like &lt;a href=&quot;http://www.lesswatts.org&quot;&gt;lesswatts.org&lt;/a&gt; and &lt;a href=&quot;http://www.thinkwiki.org&quot;&gt;thinkwiki.org&lt;/a&gt;, power usage was roughly 23-26 watts (might be off, I didn&#039;t record data in detail). I decided to take a closer look. Basically, almost all processes using X were waking up constantly with strace patterns looking vaguely like:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI},
        {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, 
        {fd=14, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN}, 
        {fd=11, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN}], 9, 499) = 0
gettimeofday({1200178170, 948253}, NULL) = 0
ioctl(3, FIONREAD, [0])                 = 0
gettimeofday({1200178170, 948443}, NULL) = 0
poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, 
        {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, 
        {fd=14, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN}, 
        {fd=11, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN}], 9, 0) = 0
write(3, &quot;\211\1\1\0&quot;, 4)               = 4
read(3, &quot;\1\0\340\0\0\0\0\0\1\0\0\0\0\0\0\0\4\0\0\0(\0\0\0\4\0\0&quot;..., 32) = 32
write(3, &quot;\211\7\1\0&quot;, 4)               = 4
read(3, &quot;\1\0\341\0\0\0\0\0\0\0\1\0\0\0\0\0\4\0\0\0(\0\0\0\4\0\0&quot;..., 32) = 32
ioctl(3, FIONREAD, [0])                 = 0
gettimeofday({1200178170, 949286}, NULL) = 0
&lt;/pre&gt;&lt;br /&gt;
&lt;br /&gt;
Similar patterns were seen in the other processes. They were all waking up and examining the same file descriptor, which turned out to be a socket /tmp/.X11-unix/X0 under the control of X. I did not look closely but made the assumption that processes were polling some sort of event queue. Knowing that many fixes of this nature have been worked on, I decided to try out Debian Testing. As I was already running it on my desktop, I was reasonably sure the upgrade would be smooth enough. Font settings got mucked up as well as locales but as dist-upgrades go, it was pretty smooth.&lt;br /&gt;
&lt;br /&gt;
Power-wise, it made a big difference. Even with wireless running, wakeups went from around 350 and 23 watts to 150 and 19 watts. There are still processes waking up in a similar style of loop but they are a lot less frequent (once every 1-4 seconds instead of many times per second). X was showing up high in the list with calls to &lt;code&gt;do_setitimer()&lt;/code&gt; so I applied this &lt;a href=&quot;http://www.bughost.org/pipermail/power/2007-October/001107.html&quot;&gt;patch&lt;/a&gt; to the xserver-xorg-core package and installed it, which made the problem go away. Annoyingly, i8042 was causing a large number of interrupts even when nothing was happening. Adding &lt;code&gt;i8042.nomux i8042.reset&lt;/code&gt; to the kernel boot command-line removed most of these wakeups once the machine sits idle. Wakeups were down to 50 and usage to 18.2 watts, with the most common wakeup being the wireless. With wireless turned off, it&#039;s down to 36 wakeups, almost a tenth of what it was with Debian Etch, and battery life is comparable with Windows XP. There are still a few anomalies but clearly things are going in the right direction. 
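&lt;br /&gt;
&lt;br /&gt;
To confirm after a reboot that the i8042 options mentioned above were actually picked up, the running kernel command line can be checked in /proc/cmdline. The following is only a rough sketch: has_boot_param is a helper name I made up for illustration and the sample string stands in for the real contents of /proc/cmdline.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```shell
# has_boot_param is a hypothetical helper, not part of any package.
# $1 is a kernel command line, $2 is the parameter to look for.
has_boot_param() {
	case " $1 " in
		*" $2 "*) return 0 ;;
		*) return 1 ;;
	esac
}

# Sample command line; on a live system use "$(cat /proc/cmdline)" instead.
CMDLINE="root=/dev/sda1 ro i8042.nomux i8042.reset"
for PARAM in i8042.nomux i8042.reset; do
	if has_boot_param "$CMDLINE" "$PARAM"; then
		echo "$PARAM is set"
	else
		echo "$PARAM is missing"
	fi
done
```
&lt;/pre&gt;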
    </content:encoded>

    <pubDate>Sun, 13 Jan 2008 14:11:10 +0000</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/5-guid.html</guid>
    
</item>
<item>
    <title>Experiences Installing Debian on a PS3</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/4-Experiences-Installing-Debian-on-a-PS3.html</link>
            <category>Kernel</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/4-Experiences-Installing-Debian-on-a-PS3.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=4</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=4</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    Recently I got hold of a PlayStation 3 and pretty much instantly tried to install Linux on it. It is a bit time-consuming but a surprisingly straight-forward affair. Documents, articles and blogs already exist aplenty on how to install Linux on the PS3 so this blog is just to note what I found strange along the way.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Hooking up to a VGA Monitor&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
Considering the number of hits you find when googling for a PS3 to VGA converter, it is surprising there is not an obviously named piece of kit out there already.  If using the HDMI cable, it must be connected to an HDCP-compliant display, which rules out adapters of any sort if you do not own such a device. I don&#039;t, but I found a &lt;a href=&quot;http://www.maplin.co.uk/Module.aspx?ModuleNo=217676&amp;C=Maplin&amp;U=SearchTop&amp;T=tv%20box%201440&amp;doy=15m11&quot;&gt;VGA TV Gamer Box&lt;/a&gt; called the TVBox 1440 on &lt;a href=&quot;http://maplins.co.uk&quot;&gt;maplins.co.uk&lt;/a&gt; that did the job of hooking the PS3 up to a bog-standard monitor.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Installing Linux (Debian)&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
The most straight-forward install guide I found was on &lt;a href=&quot;http://www.ibm.com/developerworks/power/library/pa-linuxps3-1/&quot;&gt;IBM developerWorks&lt;/a&gt;. It is pretty dated but covers most of the basics. Early on, I installed Debian. At the time I tried, the &lt;a href=&quot;http://www.keshi.org/moin/moin.cgi/PS3/Debian/Live&quot;&gt;Debian Live CD&lt;/a&gt; was not able to start X properly but the &lt;a href=&quot;http://www.keshi.org/moin/moin.cgi/PS3/Debian/Installer&quot;&gt;Debian Installer&lt;/a&gt; worked just fine. If going this route, be sure to avoid trying to set up a PReP partition. Not only do you not need it, but the installer makes a shambles of trying and gets seriously confused.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Getting ps3videomode&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
Many guides make reference to running this command to alter video (or getting it right in the first place in some cases). On install, I converted an RPM package from &lt;a href=&quot;http://www.rpmfind.net&quot;&gt;rpmfind.net&lt;/a&gt; although I&#039;ve spotted since that I could have tried the Debian packages linked from &lt;a href=&quot;http://www.louiscandell.com/ps3/&quot;&gt;here&lt;/a&gt; so look around. The actual git tree is git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-utils.git.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Other Post-Install Tasks&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
I found I had to add the ps3 sound module to modules.conf as it was not loaded automatically. Most stuff installed without headache although mileage varied considerably with movie players. xine had a strange echo effect but mplayer appeared to get it right. I did not track down why this is but I found it odd that I experienced a similar problem on my T60P ages back for a brief period of time.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Upgrading the Actual Kernel&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
The most straight-forward guide I found was &lt;a href=&quot;http://julipedia.blogspot.com/2007/03/building-updated-kernel-for-ps3.html&quot;&gt;this one&lt;/a&gt;.  What it missed is that recent kernels require the device-tree compiler. This is not in the stable repository but it&#039;s available from testing so grab it from there.  Initially, I tried installing a stock 2.6.23 but it could not even get past early-boot and without a serial console, I had no idea where it was getting locked up. I suspect Geoff Levand, the maintainer of the PS3 kernel tree, has a developer version of the PS3 with a serial console or he has a machine simulator of some sort.&lt;br /&gt;
&lt;br /&gt;
As described in the linked article and Geoff&#039;s &lt;a href=&quot;http://www.kernel.org/pub/linux/kernel/people/geoff/cell/README&quot;&gt;README&lt;/a&gt;, there is a git tree for patches against the mainline kernel at git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-linux.git . Using the distribution&#039;s config in /boot as a starting point, it was a simple affair to get something booting with kboot and it supported huge pages in the usual manner one would expect. The one gotcha was that the root device had been renamed to &lt;b&gt;/dev/ps3dai&lt;/b&gt; so a slightly different kboot entry is needed.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Wrapping up&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
All in all, getting the machine set up, the CDs burned, the install done and the kernel upgraded took about 4 hours, much of it waiting for downloads.  Between oprofile not working, no early printing support and the lack of a serial console, it is not the best box for kernel development if you are like me and bust up early-boot a lot. However, once beyond early-boot, it has been a decent box to try stuff on as long as the lack of a proper serial console is not a problem for you. I still have to try a few Cell-related things to see what sort of behaviour I get from them so that will either be hella-interesting or a pure waste of time. If the worst comes to the worst, I&#039;ll open those two games!&lt;br /&gt;
 
    </content:encoded>

    <pubDate>Thu, 15 Nov 2007 21:59:41 +0000</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/4-guid.html</guid>
    
</item>
<item>
    <title>Managing Hugepage Pools with Memory Partitioning in Linux</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/3-Managing-Hugepage-Pools-with-Memory-Partitioning-in-Linux.html</link>
            <category>Large Pages</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/3-Managing-Hugepage-Pools-with-Memory-Partitioning-in-Linux.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=3</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=3</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    Hugepages can potentially attain higher performance from the hardware by reducing TLB misses and CPU cache usage[1]. The benefits generally apply to applications that use large amounts of address space although there can be secondary benefits such as improved hardware pre-fetch.&lt;br /&gt;
&lt;br /&gt;
For an application to exploit any of this though, hugepages must be available.  This is not trivial as the vast majority of processors require that the memory be naturally aligned and contiguous, both physically and virtually. In the past, administrators were required to reserve the hugepages at boot-time as memory became too fragmented in a short period of time to allocate the pages normally. Sizing the pool presented a difficult problem for the administrator: too large and memory is wasted on long-lived systems; too small and applications will fail.  This issue alone precludes hugepages from being used in more situations.&lt;br /&gt;
&lt;br /&gt;
However, this restriction has been relaxed somewhat in kernel 2.6.23 due to &lt;i&gt;memory partitions&lt;/i&gt;. Using partitioning, an administrator can grow and shrink the pool during the lifetime of the system instead of making the decisions at boot time. In this article, we will discuss how to make hugepages available with a greater degree of flexibility than was previously available.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Basic Hugepage Information&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
To discover if the kernel supports hugepages, read &lt;kbd&gt;/proc/meminfo&lt;/kbd&gt; and look for the HugePages entries. For example, on my machine I see&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
    mel@arnold:~$ grep Huge /proc/meminfo 
    HugePages_Total:     0
    HugePages_Free:      0
    HugePages_Rsvd:      0
    Hugepagesize:     4096 kB
&lt;/pre&gt;&lt;br /&gt;
&lt;br /&gt;
This means the kernel supports hugepage usage but I have no hugepages in the pool and a hugepage is 4MiB in size. To add hugepages to the pool, a number is simply written to &lt;kbd&gt;/proc/sys/vm/nr_hugepages&lt;/kbd&gt; like so;&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
    arnold:/proc/sys/vm# echo 10 &gt; nr_hugepages 
    arnold:/proc/sys/vm# cat nr_hugepages 
    10
    arnold:/proc/sys/vm# grep Huge /proc/meminfo 
    HugePages_Total:    10
    HugePages_Free:     10
    HugePages_Rsvd:      0
    Hugepagesize:     4096 kB
&lt;/pre&gt;&lt;br /&gt;
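&lt;br /&gt;
As a quick sanity check of those numbers, the memory held by the pool is the page count multiplied by the hugepage size. A small sketch follows; pool_mib is a made-up helper name and the values are the ones from the output above.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```shell
# pool_mib is a hypothetical helper: pages in the pool times hugepage size.
# $1 is HugePages_Total, $2 is Hugepagesize in kB as shown in /proc/meminfo.
pool_mib() {
	echo $(( $1 * $2 / 1024 ))
}

# 10 hugepages of 4096 kB each
echo "Pool size: $(pool_mib 10 4096) MiB"
```
&lt;/pre&gt;&lt;br /&gt;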
&lt;br /&gt;
Discussing how to use hugepages in an application is beyond the scope of this article but information exists if you look around[1][2].&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Where it Goes Wrong&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
In a standard system once memory has been filled (updatedb, untarring lots of data etc), the memory becomes too fragmented and writing a value to &lt;kbd&gt;nr_hugepages&lt;/kbd&gt; may not allocate all of the requested hugepages. The problem is that not all memory can be reclaimed by the kernel. Many pages allocated by the kernel cannot be easily reclaimed on demand and only one badly placed page can cause a hugepage allocation failure.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Solving the Problem with a Memory Partition&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Since 2.6.23, memory can be split in two at boot-time creating a zone called ZONE_MOVABLE (see information on zones by reading &lt;kbd&gt;/proc/zoneinfo&lt;/kbd&gt;).  Only pages that can be reclaimed on demand by the kernel use this zone. Within this zone of memory, hugepage allocations will almost always (See Caveats later) succeed no matter how long the system is running.&lt;br /&gt;
&lt;br /&gt;
So, let us say that an administrator knows that a number of jobs will be running on his system that want to use hugepages. The jobs use between 0 and 256 hugepages and he does not want to waste memory. Previously, the administrator would specify &lt;kbd&gt;hugepages=256&lt;/kbd&gt; on the command-line and waste the memory for jobs that are not using it.&lt;br /&gt;
&lt;br /&gt;
Now instead, the administrator would specify &lt;kbd&gt;movablecore=1024MB&lt;/kbd&gt; on the command line to set up a partition for movable pages 1GiB in size, enough for 256 hugepages. Jobs that require hugepages can now request them by writing to &lt;kbd&gt;/proc/sys/vm/nr_hugepages&lt;/kbd&gt; and have a reasonable expectation of getting those pages.&lt;br /&gt;
&lt;br /&gt;
The difference between the partitioning and configuring the pool at boot-time is that memory in the partition unused by hugepages can still be used for normal pages. This means that memory can be returned after the huge page process completes, and reallocated to small page processes; memory is never wasted.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Configuring the Memory Partition&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
The partition must be configured at boot-time using either &lt;kbd&gt;kernelcore=&lt;/kbd&gt; or &lt;kbd&gt;movablecore=&lt;/kbd&gt; as documented in &lt;kbd&gt;Documentation/kernel-parameters.txt&lt;/kbd&gt;.  &lt;b&gt;movablecore&lt;/b&gt; specifies how much memory should be used for ZONE_MOVABLE. An alternative way of looking at movablecore is&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;Max Hugepages that can be allocated at any time = movablecore / hugepagesize&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
If on the other hand the administrator knows how much memory the rest of the kernel needs and wants as much memory as possible to be available for a varying number of hugepages, kernelcore= can be used instead. In this case, the size of ZONE_MOVABLE is whatever memory is left over or alternatively&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;Max Hugepages that can be allocated at any time = (TotalMem - kernelcore) / hugepagesize&lt;/code&gt;&lt;br /&gt;
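&lt;br /&gt;
Both formulas are simple integer divisions, so a rough sketch in shell is enough to play with the numbers. The helper names are made up, the 2048 MiB machine is an assumed example, and any rounding the kernel applies when sizing the zone is ignored.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```shell
# Both helpers are hypothetical names used only for this illustration.
# All sizes are in MiB.

# movablecore= case: the partition size is given directly
max_hugepages_movablecore() {
	echo $(( $1 / $2 ))
}

# kernelcore= case: the partition is whatever is left of total memory
max_hugepages_kernelcore() {
	echo $(( ($1 - $2) / $3 ))
}

# movablecore=1024MB with 4 MiB hugepages, as in the example earlier
max_hugepages_movablecore 1024 4

# Assumed machine with 2048 MiB of RAM booted with kernelcore=1024MB
max_hugepages_kernelcore 2048 1024 4
```
&lt;/pre&gt;&lt;br /&gt;
With these inputs both calls print 256, matching the 256 hugepages from the earlier example.&lt;br /&gt;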
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Growing The Pool&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Once the partition is set up, the hugepage pool can be easily grown. Depending on system activity the first attempt may not succeed, so try a few times or use a script like this not-very-tested and very rough piece of work:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class=&quot;code_snippit&quot;&gt;
#!/bin/bash
# Attempt to grow the pool to the requested size
# This benchmark checks how many hugepages can be allocated in the hugepage
# pool
#
# Copyright Mel Gorman (c) 2007
# Licensed under GPL V2. See http://www.gnu.org/licenses/gpl-2.0.txt for details

SLEEP_INTERVAL=5
FAIL_AFTER_NO_CHANGE_ATTEMPTS=20
NUM_REQUIRED=0

usage() {
	echo &quot;get_hugepages: Get the requested number of hugepages&quot;
	echo
	echo &quot;	-s	Time to sleep between attempts to grow pool&quot;
	echo &quot;	-f	Give up after failing this number of times&quot;
	echo &quot;	-n	Number of hugepages that should be in the pool&quot;
	echo
	exit $1
}

# Arg processing
while [ &quot;$1&quot; != &quot;&quot; ]; do
	case &quot;$1&quot; in
		-s)	export SLEEP_INTERVAL=$2; shift 2;;
		-f)	export FAIL_AFTER_NO_CHANGE_ATTEMPTS=$2; shift 2;;
		-c)	export MAX_ATTEMPT=$2; shift 2;;
		-n)	export NUM_REQUIRED=$2; shift 2;;
		*)	echo &quot;Unknown option: $1&quot;; usage 1;;
	esac
done

# Check proc entry exists
if [ ! -e /proc/sys/vm/nr_hugepages ]; then
	echo Attempting load of hugetlbfs module
	modprobe hugetlbfs
	if [ ! -e /proc/sys/vm/nr_hugepages ]; then
		echo ERROR: /proc/sys/vm/nr_hugepages does not exist
		exit -1
	fi
fi

# Check a number was requested
if [ &quot;$NUM_REQUIRED&quot; = &quot;&quot; -o $NUM_REQUIRED -lt 0 ]; then
	echo ERROR: You must specify a number of hugepages to alloc
	usage -2
fi

# Record existing hugepage count
STARTING_COUNT=`cat /proc/sys/vm/nr_hugepages`
echo Starting page count: $STARTING_COUNT

# Ensure we have permission to write by writing the current count back
echo $STARTING_COUNT 2&gt; /dev/null &gt; /proc/sys/vm/nr_hugepages || {
	echo ERROR: Do not have permission to adjust nr_hugepages count
	exit -3
}

# Start attempt to grow pool
CURRENT_COUNT=$STARTING_COUNT
LAST_COUNT=$STARTING_COUNT
NOCHANGE_COUNT=0
ATTEMPT=0

while [ $NOCHANGE_COUNT -ne $FAIL_AFTER_NO_CHANGE_ATTEMPTS ] &amp;&amp; [ $CURRENT_COUNT -ne $NUM_REQUIRED ]; do
	ATTEMPT=$((ATTEMPT+1))
	echo $NUM_REQUIRED &gt; /proc/sys/vm/nr_hugepages

	CURRENT_COUNT=`cat /proc/sys/vm/nr_hugepages`
	PROGRESS=
	if [ $CURRENT_COUNT -eq $LAST_COUNT ]; then
		NOCHANGE_COUNT=$(($NOCHANGE_COUNT+1))
	else
		NOCHANGE_COUNT=0
		PROGRESS=&quot;Progress made with $(($CURRENT_COUNT-$LAST_COUNT)) pages&quot;
		echo Attempt $ATTEMPT: $CURRENT_COUNT pages $PROGRESS 
		LAST_COUNT=$CURRENT_COUNT
	fi

	# Sleep between attempts, including ones that made no progress
	if [ $CURRENT_COUNT -ne $NUM_REQUIRED ]; then
		sleep $SLEEP_INTERVAL
	fi

done

echo Final page count: $CURRENT_COUNT

# Exit with 0 if number of pages was successfully allocated
if [ $CURRENT_COUNT -eq $NUM_REQUIRED ]; then
	exit 0
else
	exit -4
fi
&lt;/pre&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Caveats&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Allocations &lt;em&gt;almost&lt;/em&gt; always succeed. The one case where they do not is when memory is mlock()ed. Technically these pages could be moved and patches exist to do just that but there has not been demand for the feature to date.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Future&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
In 2.6.24, grouping pages by mobility may mean that the partition does not even have to be set up for the kernel to be able to grow/shrink the pool to a large extent. However, if hugepage availability must be guaranteed at all times, then the partition should be set up.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Summary&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
There you have it. In the past, hugepages had to be allocated at boot-time.  Now, you can set up a partition at boot-time instead, allocate hugepages to the pool when they are needed, and return them to general use when they are not, allowing the memory to be used as normal.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Acknowledgements&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Thanks to Nishanth Aravamudan and Andy Whitcroft for reviewing drafts of this.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;References&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
[1] Leverage transparent huge pages on Linux on POWER&lt;br /&gt;
	http://www.ibm.com/developerworks/systems/library/es-lop-leveragepages/&lt;br /&gt;
&lt;br /&gt;
[2] Kernel source: Documentation/vm/hugetlbpage.txt 
    </content:encoded>

    <pubDate>Fri, 26 Oct 2007 11:39:53 +0100</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/3-guid.html</guid>
    
</item>
<item>
    <title>Huge pages are what now?</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/2-Huge-pages-are-what-now.html</link>
            <category>Large Pages</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/2-Huge-pages-are-what-now.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=2</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=2</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    Just to clear up what a huge page is!&lt;br /&gt;
&lt;br /&gt;
Any architecture supporting virtual memory is required to map virtual addresses to physical addresses through an address translation mechanism. Recent translations are stored in a cache called a &lt;em&gt;Translation Lookaside Buffer (TLB)&lt;/em&gt;. &lt;em&gt;TLB Coverage&lt;/em&gt; is defined as memory addressable through this cache without having to access the master tables in main memory. When the master table is used to resolve a translation, a &lt;em&gt;TLB Miss&lt;/em&gt; is incurred. This can have as significant an impact on &lt;em&gt;Clock cycles Per Instruction (CPI)&lt;/em&gt; as CPU cache misses[1]. To compound the problem, the percentage of memory covered by the TLB has decreased from about 10% of physical memory in early machines to approximately 0.01% today. As a means of alleviating this, modern processors support multiple page sizes, usually up to several megabytes, but gigabyte pages are also possible. The downside is that processors commonly require that physical memory for a page entry be contiguous.&lt;br /&gt;
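&lt;br /&gt;
To put rough numbers on TLB coverage, it is just the number of TLB entries multiplied by the page size. The sketch below assumes a 64-entry TLB purely for illustration (real sizes vary by processor) and tlb_coverage_kb is a made-up helper name.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```shell
# tlb_coverage_kb is a hypothetical helper: entries times page size in kB.
tlb_coverage_kb() {
	echo $(( $1 * $2 ))
}

ENTRIES=64   # assumed TLB size, purely illustrative
echo "4 kB pages:    $(tlb_coverage_kb $ENTRIES 4) kB covered"
echo "4096 kB pages: $(tlb_coverage_kb $ENTRIES 4096) kB covered"
```
&lt;/pre&gt;&lt;br /&gt;
Under that assumption, coverage grows from 256 kB with base pages to 256 MiB with 4 MiB huge pages.&lt;br /&gt;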
&lt;br /&gt;
So, that is what a huge page is. Linux supports huge pages but the support is a bit primitive in places and bolted on in others. Work is on-going to make it better although the public plans on what is happening are a bit spotty at best[2].&lt;br /&gt;
&lt;br /&gt;
[1] Book: Computer Architecture a Quantitative Approach&lt;br /&gt;
[2] Some pretty minimal information at http://linux-mm.org/HugePages 
    </content:encoded>

    <pubDate>Thu, 25 Oct 2007 12:02:12 +0100</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/2-guid.html</guid>
    
</item>
<item>
    <title>Have to start somewhere</title>
    <link>http://www.csn.ul.ie/~mel/blog/index.php?/archives/1-Have-to-start-somewhere.html</link>
            <category>General</category>
    
    <comments>http://www.csn.ul.ie/~mel/blog/index.php?/archives/1-Have-to-start-somewhere.html#comments</comments>
    <wfw:comment>http://www.csn.ul.ie/~mel/blog/wfwcomment.php?cid=1</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.csn.ul.ie/~mel/blog/rss.php?version=2.0&amp;type=comments&amp;cid=1</wfw:commentRss>
    

    <author>nospam@example.com (Mel Gorman)</author>
    <content:encoded>
    This is the beginning of what I intend to be a semi-technical blog on what is happening in the Linux kernel with the virtual memory manager (VMM). At the time of writing, I work with IBM&#039;s Linux Technology Center out of a house in Limerick, Ireland. I guess the usual disclaimer is that on this blog, I speak on my own behalf and in no way represent IBM&#039;s position, strategies or whatever else occurs to you related to the company. While I am far from being the most influential VMM developer in Linux, there is not a whole pile of information out there on what is happening and something is better than nothing. Most of the time, I work on improving large page support so I imagine many posts will deal with that subject.  Depending on the mood, general activity and spare time, I&#039;ll blabber on about other subjects. 
    </content:encoded>

    <pubDate>Thu, 25 Oct 2007 11:31:45 +0100</pubDate>
    <guid isPermaLink="false">http://www.csn.ul.ie/~mel/blog/index.php?/archives/1-guid.html</guid>
    
</item>

</channel>
</rss>