Anatomy of Linux Kernel Shared Memory

Anatomy of Linux Kernel Shared Memory
Memory de-duplication in the Linux kernel

Skill Level: Intermediate
M. Tim Jones (mtj@mtjones.com)
Independent Author
07 Apr 2010
Linux as a hypervisor includes a number of innovations, and one of the more
interesting changes in the 2.6.32 kernel is Kernel Shared Memory (KSM). KSM
allows the hypervisor to increase the number of concurrent virtual machines by
consolidating identical memory pages. Explore the ideas behind KSM (such as
storage de-duplication), its implementation, and how you manage it.
Connect with Tim
Tim is one of our most popular and prolific authors. Browse all of
Tim's articles on developerWorks. Check out Tim's profile and
connect with him, other authors, and fellow readers in My
developerWorks.
Software engineering tends to be an evolutionary process. Problems are addressed

with solutions that can create new problems and subsequently new solutions.
Ideally, the new problems that are created justify the original solution. The
technology discussed here is one of the secondary solutions to a problem created by
server virtualization. Before delving into KSM, however, let's take a quick look at the
original solution and how KSM is applied here and elsewhere.
Server virtualization
Extreme server consolidation
Although enterprise servers can consolidate 10 or more servers
using virtualization, a single IBM System z server can support for
thousands of Linux guests on a single logical partition.

Copyright IBM Corporation 2010. All rights reserved.
Trademarks
Page 1 of 10
developerWorks
ibm.com/developerWorks
Virtualization as a technology has been around since the 1960s, popularized by the
IBM System/360 mainframe. Five decades later, virtualization has exploded,
making it possible to share a single server with multiple operating systems and
applications. This particular use, called server virtualization, is transforming data
centers, as a single physical machine can be used to host typically 10 or more virtual
machines (VMs), as shown in Figure 1. This virtualization enables the infrastructure
to be much more dynamic, power efficient, and (therefore) cost-efficient. See
Resources for more information on server virtualization and its benefits.
Figure 1. Server consolidation through virtualization
With only 10-15% of a typical server in use at any given time, consolidating virtual
servers onto a single physical server makes perfect sense. But what about
resources? The major resources available include CPU, memory, and networking
bandwidth. CPU and networking are typically underutilized, so the real issue
becomes available memory. Each operating system makes its own unique demands
on available memory resources. But the question that arises is, how unique are
those demands?
Memory sharing
It turns out that if you virtualize many of the same operating system and application
sets, many memory pages are identical. This makes sense, given that operating
system and application code and constant data are identical among VMs. And when
pages are unique, they can be consolidated, which frees memory for use in other
applications. Figure 2 illustrates memory sharing and shows the benefit of more free
memory available when pages are shared among VMs whose contents are the
same.
Figure 2. Memory sharing across VMs

Trademarks
Page 2 of 10
developerWorks
Feature naming
The feature described here is very new; therefore, the name has
undergone some changes. You'll find this Linux kernel feature
called Kernel Shared Memory as well as Kernel Samepage
Merging.
What you'll soon discover is that although memory sharing in Linux is advantageous
in virtualized environments (KSM was originally designed for use with the
Kernel-based Virtual Machine [KVM]), it's also useful in non-virtualized
environments. In fact, KSM was found to be beneficial even in embedded Linux
systems, indicating the flexibility of the approach. Now, let's explore the Linux
approach to memory sharing and how you can use it to increase the memory density
of a server and therefore increase its ability to host additional applications or VMs.
Parallels to other technologies

A recent advancement in storage technologies called de-duplication is a precursor to

Trademarks
Page 3 of 10
developerWorks
memory sharing in Linux and other hypervisors. De-duplication is a technology in

which stored data is reduced by removing redundant data (either on a block basis or
on larger segments of data such as files). Common segments are merged (in a
copy-on-write [CoW] fashion), freeing space for other uses. Storage de-duplication is
an important technology moving forward, as it optimizes the capacity of storage by
removing duplicate data. In this way, storage is cheaper, because less is ultimately
required. Given the rate of data growth, this functionality is very relevant.
KSM operation
KSM exists as a daemon in the kernel (called ksmd) that periodically performs page
scans to identify duplicate pages and merges duplicates to free pages for other
uses. It does this in a way that's transparent to the user. For example, duplicate
pages are merged (and subsequently marked read only), but if one of the users of
the page changes it for any reason, that user will receive his or her own copy (in a
CoW fashion). You can find the full implementation of the KSM kernel module within
the kernel source at ./mm/ksm.c.
KSM relies on a higher-level application to provide instruction on which memory
regions are candidates for merging. Although it would be possible for KSM to simply
scan all anonymous pages in the system, it would be wasteful of CPU and memory
(given the space necessary to manage the page-merging process). Therefore,
applications can register the virtual areas that are likely to contain duplicate pages.
The KSM application programming interface (API) is implemented through the
madvise system call (see Listing 1) and a new advice parameter called
MADV_MERGEABLE (which indicates that the defined region is mergeable). A region
can be removed from the mergeable state through the MADV_UNMERGEABLE
parameter (which immediately un-merges any merged pages from that region). Note
that removing a region of pages through madvise can result in an EAGAIN error, as
the operation could run out of memory during the un-merge process, potentially
leading to even greater trouble (out-of-memory conditions).
Listing 1. The madvise system call
#include <sys/mman.h>
int madvise( void *start, size_t length, int advice );
Once a region is defined as mergeable, KSM adds this region to its list of working
memory. When KSM is enabled, it searches for identical pages, keeping one page in
a write-protected CoW fashion and freeing one for other uses.
KSM uses a approach that's different from the approach used in storage
de-duplication. In traditional de-duplication, objects are hashed, and the hash value

Trademarks
Page 4 of 10
developerWorks
is used as the initial check for similarity. When hashes are identical, the next step is
a comparison of the actual objects (in this case, a memory comparison) to formally
determine whether the objects are identical. KSM used this approach in its first
implementation, but a more straightforward approach was developed to simplify it.
In the current KSM, pages are managed by two red-black trees, one of which is
ephemeral. The first tree, call the unstable tree, is used to store new pages that are
not yet understood to be stable. In other words, pages that are candidates for
merging (unchanged for some period of time) are stored in the unstable tree. Pages
in the unstable tree are not write protected. The second tree, called the stable tree,
stores those pages that have been found to be stable and merged by KSM. To
identify whether a page is volatile or non-volatile, KSM uses a simple 32-bit
checksum. When a page is scanned, its checksum is calculated and stored with the
page. On a subsequent scan, if a newly computed checksum is different from the
previously generated checksum, the page is changing and is therefore not a good
candidate for merging.
The first step in handling a single page using KSM's process is to see whether the
page can be found in the stable tree. The process of searching the tree is interesting
in that each page is treated as a very large number (the contents of the page). A
memcmp (memory compare) operation is performed on the page and node's page of
consideration. If the memcmp returns 0, the pages are equal and a match is found.
Otherwise, memcmp can return -1 (which means that the candidate page is less than
the current node's page) or 1 (the candidate page is greater than the current node's
page). In each case, the operation leads the search down the red-black tree (to the
left if less than and to the right if greater than). Although comparing 4KB pages
seems fairly heavyweight, in most cases, the memcmp will end prematurely once a
difference is found. Therefore, the process itself is fast and efficient. See Figure 3 for
a visualization of this process.
Figure 3. Search process for a page in the tree

Trademarks
Page 5 of 10
developerWorks
If the candidate page is found in the stable tree, then it is merged and the candidate
page freed. This code can be found in ksm.c/stable_tree_search() (which is
called by ksm.c/cmp_and_merge_page()). Otherwise, if the candidate page is not
found, you move on to the unstable tree (see ksm.c/unstable_tree_search()).
The first step in the unstable tree search is to recalculate the checksum over the
page. If it's found to be different than the original checksum, the page is discarded
from further search in this scan (because if it changes, it's not worthwhile to track). If
the checksum has not changed, the unstable tree is searched for the candidate
page. The unstable tree processing is a bit different than the stable tree. First, if the
search does not find the page in the unstable tree, you add a new node in the
unstable tree for this page. But if the page is found in the unstable tree, you merge
the pages, and then migrate the node to the stable tree.
When the scan is complete (performed through ksm.c/ksm_do_scan()), the stable
tree is kept, but the unstable tree is removed and rebuilt at the time of the next scan.
This process simplifies things quite a bit, as the organization of the unstable tree can
change based upon changes to the pages (recall that pages in the unstable tree are
not write protected). Because all pages in the stable tree are write protected, a page
fault is generated when a page tries to be written, allowing the CoW process to
un-merge the page for the writer (see ksm.c/break_cow()). Orphaned pages in the
stable tree are subsequently removed (unless two or more users of the page exist,
indicating that the page is still shared).

Trademarks
Page 6 of 10
developerWorks
As discussed, KSM uses red-black trees for managing pages for fast lookup. Linux
actually includes red-black trees as a reusable data structure, and they can be used
generically. Red-black trees are also used by the Completely Fair Scheduler (CFS)
for storage of tasks in time order. You can find the implementation of red-black trees
in the kernel at ./lib/rbtree.c.
KSM configuring and monitoring

Managing and monitoring KSM occurs through sysfs (at the root
/sys/kernel/mm/ksm). In this sysfs subdirectory, you'll find a collection of files, some
used for control and others used for monitoring.
The first file, run, is used to enable or disable paging merging with KSM. By default,
KSM is disabled (0) but can be enabled by writing a 1 to the file to enable the KSM
daemon (for example, echo 1 > sys/kernel/mm/ksm/run). From the run state,
you can disable the daemon (but keep the current set of merged pages) by writing a
0. Also from the run (1) state, you can stop KSM and request that all merged pages
become un-merged by writing a 2.
While KSM runs, you can control it with three parameters (files within sysfs). The
sleep_millisecs file defines how many milliseconds ksmd should sleep before
performing another page scan. The max_kernel_pages file defines the maximum
number of pages that ksmd can use (the default is 25% of the available memory, but
you can write a 0 to specify no limit). Finally, the pages_to_scan file defines how
many pages can be scanned in a given scan. Any user can view these files, but the
user must have root privileges to change them.
There are also five monitorable files exported through sysfs that indicate ksmd's
operation and effectiveness (these are read only). The full_scans file indicates the
number of full area scans that have been performed. The remaining four files
indicate the page-level statistics of KSM:
pages_shared: The number of unswappable kernel pages that KSM is
using
pages_sharing: An indication of memory savings
pages_unshared: The number of unique pages repeatedly checked for
merging
pages_volatile: The number of pages that are changing too often
The KSM authors define that a high ratio of pages_sharing to pages_shared
indicates efficient sharing of pages (whereas the reverse is indicative of wasted
effort).

Trademarks
Page 7 of 10
developerWorks
Going further
Linux is not alone in using page sharing to improve memory efficiency, but it is
unique in its implementation as an operating system feature. VMware's ESX server
hypervisor provides this feature under the name Transparent Page Sharing (TPS),
while XEN calls it Memory CoW. But whatever the name or implementation, the
feature provides better memory utilization, allowing the operating system (or
hypervisor, in the case of KVM) to over-commit memory to support greater numbers
of applications or VMs. You can find KSMand many other interesting featuresin
the latest 2.6.32 Linux kernel.

Trademarks
Page 8 of 10
developerWorks
Resources
Learn
The authors of KSM at Red Hat wrote a great paper on KSM entitled
"Increasing memory density by using KSM" (PDF). This paper introduces KSM
and delves into the details of its implementation.
Virtualization is growing dramatically as a way to consolidate for power and cost
savings but also as a key element of cloud computing architectures. Learn
about virtualization and its role in green IT in "Growing green with virtualization:
Virtualization as a backbone of green IT" (developerWorks, August 2009).
This article demonstrates an advantage of using Linux as a hypervisor. You can
learn more about this and other topics in "Anatomy of a Linux Hypervisor"
(developerWorks, May 2009), "Discover the Linux Kernel Virtual Machine"
(developerWorks, April 2007), and "Virtual Linux" (developerWorks, December
2006). Finally, read about another kernel subsystem that uses red-black trees in
"Inside the 2.6 Completely Fair Scheduler" (developerWorks, December 2009).
You can read more about memory de-duplication from the 2.6.32 kernel release
notes. These notes provide useful materials on KSM (including a link to the
kernel documentation) in addition to the various other changes that made their
way into 2.6.32. In the KSM section, you'll find an interesting reference to
running 52 Windows XP VMs on a single server with 16GB of RAM (where
each was allocated its own 1GB RAM segment but shared with KSM).
Learn more about de-duplication of storage systems and the approaches in the
IT Analysis article, "De-dupe for big storage savings?".
Wikipedia provides a great introduction to de-duplication, including the various
implementation strategies (such as source and target).
Although KSM is an automatic mechanism for reducing memory footprint in
virtualized servers, application developers have had a similar manual capability
for some time. Linux dynamic (or shared) libraries permit a binary to use a
common library object instead of one statically compiled with the application.
Read more about shared libraries in "Anatomy of Linux dynamic libraries"
(developerWorks, August 2008).
A red-black tree (or symmetric binary B-tree) is a self-balancing binary tree
invented by Rudolf Bayer. It's a useful tree representation that has good
worst-case time for operations such as insert, search, and delete. You can find
red-black trees used in a variety of applications, including the construction of
associative arrays.
In the developerWorks Linux zone, find hundreds of how-to articles and
tutorials, as well as downloads, discussion forums, and a wealth other
resources for Linux developers and administrators.
Trademarks
Page 9 of 10
developerWorks
Stay current with developerWorks technical events and webcasts focused on a

variety of IBM products and IT industry topics.
Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM
products and tools as well as IT industry trends.
Watch developerWorks on-demand demos ranging from product installation and
setup demos for beginners, to advanced functionality for experienced
developers.
Follow developerWorks on Twitter.
Get products and technologies
Evaluate IBM products in the way that suits you best: Download a product trial,
try a product online, use a product in a cloud environment, or spend a few hours
in the SOA Sandbox learning how to implement Service Oriented Architecture
efficiently.
Discuss
Get involved in the My developerWorks community. Connect with other
developerWorks users while exploring the developer-driven blogs, forums,
groups, and wikis.
About the author

M. Tim Jones
M. Tim Jones is an embedded firmware architect and the author of
Artificial Intelligence: A Systems Approach, GNU/Linux Application
Programming (now in its second edition), AI Application Programming
(in its second edition), and BSD Sockets Programming from a
Multilanguage Perspective. His engineering background ranges from
the development of kernels for geosynchronous spacecraft to
embedded systems architecture and networking protocols
development. Tim is a Senior Architect for Emulex Corp. in Longmont,
Colorado.

Trademarks
Page 10 of 10

Anatomy of Linux Kernel Shared Memory

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Anatomy of Linux Kernel Shared Memory

Uploaded by

Copyright:

Available Formats

Anatomy of Linux Kernel Shared Memory

Memory de-duplication in the Linux kernel

Software engineering tends to be an evolutionary process. Problems are addressed