ESXi supports memory overcommitment in order to provide higher memory utilization
and higher consolidation ratios. To support memory overcommitment effectively,
the hypervisor provides efficient host memory reclamation techniques.
ESXi uses several techniques to reclaim virtual machine memory, which are:
Ballooning
Memory compression
Hypervisor swapping
Do check the links for a detailed discussion of each of these techniques.
Now the question is: when do these techniques run? Always? At a specific
threshold? So let's explore that too.
Which memory reclamation technique is active depends on which memory state is
currently active.
The following are the possible memory states in vSphere:
High
Soft
Hard
Low
NOTE: As we all know, from vSphere 6 onward TPS is turned OFF by default.
However, if you enable it, TPS runs continuously and tries to share memory pages
as in older versions of ESXi, but this applies only to small
memory pages, i.e. 4KB pages.
When available free memory is less than the High state but more than the Clear state,
as in the chart above, ESXi starts preemptively breaking up large pages so that
TPS (if enabled in vSphere 6) can collapse them at its next run cycle.
If the amount of available free memory drops a bit below the Min.FreePct threshold,
as in the chart above, the VMkernel applies ballooning to reclaim memory.
Compression helps to avoid hitting the Low state without impacting virtual
machine performance, but if memory demand is higher than the VMkernel's
ability to reclaim, the drastic measure of hypervisor swapping is taken to avoid
memory exhaustion.
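The state-driven selection above can be sketched in a few lines of Python. This is purely illustrative: the state names come from this article, but the percentage cutoffs and the per-state technique lists are placeholder assumptions, not the real VMkernel thresholds.

```python
# Hypothetical sketch of how a reclamation technique is chosen per memory
# state. The cutoff percentages below are illustrative placeholders only.

def memory_state(free_pct):
    """Map a free-memory percentage to a vSphere memory state (illustrative cutoffs)."""
    if free_pct >= 6.0:
        return "high"
    if free_pct >= 4.0:
        return "soft"
    if free_pct >= 2.0:
        return "hard"
    return "low"

def active_techniques(state):
    """Techniques the article associates with each state (assumed mapping)."""
    return {
        "high": [],                                        # no reclamation needed
        "soft": ["ballooning"],
        "hard": ["ballooning", "compression", "swapping"],
        "low":  ["ballooning", "compression", "swapping"], # most aggressive state
    }[state]

state = memory_state(3.5)
print(state, active_techniques(state))  # hard ['ballooning', 'compression', 'swapping']
```

The point of the sketch is only that the host's free-memory level, not a per-VM setting, decides which reclamation techniques are in play.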
Virtual machines on a host often run the same guest operating system, have the
same applications, or contain the same user data. Because of this, there is a good
chance that memory pages created by different virtual machines are identical in
terms of content. So instead of keeping multiple identical pages in host memory for each
virtual machine, TPS is used to perform memory page sharing.
In vSphere 6, intra-VM TPS is enabled by default and inter-VM TPS is disabled by
default, due to some security concerns as described in VMware KB 2080735.
With page sharing, the hypervisor reclaims the redundant copies and keeps only one
copy, which is shared by multiple virtual machines in the host physical memory. As a
result, the total virtual machine host memory consumption is reduced and a high
memory over commitment is possible.
How does TPS work?
ESXi scans the content of guest physical memory for sharing opportunities. Instead of
comparing each byte of a candidate guest physical page to other pages, ESXi uses
hashing to identify potentially identical pages.
Image: VMware
A hash value is generated based on the content of the virtual machine's physical
page (GA) and stored in a global hash table. Each entry in the global hash table
includes a hash value and the physical page number of a shared page.
The hash value is used to look up the global hash table. If the hash value of the
virtual machine's physical page matches an existing entry in the hash table, a bit-by-bit
comparison of the page contents is performed to exclude any false match.
Once the content of the virtual machine's physical page matches the content of an
existing shared host physical page, the guest physical (GA) to host physical
(HA) mapping of the virtual machine's physical page is changed to the shared host
physical page, and the redundant host memory copy (the page pointed to by the
dashed arrow in the image above) is reclaimed.
This remapping is invisible to the virtual machine and inaccessible to the guest
operating system. Because of this invisibility, sensitive information cannot be
leaked from one virtual machine to another.
Image: VMware
Any attempt to write to the shared pages will generate a minor page fault. In the
page fault handler, the hypervisor will transparently create a private copy of the
page for the virtual machine and remap the affected guest physical page to this
private copy. A standard copy-on-write (CoW) technique is used to handle writes
to the shared host physical pages.
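The whole cycle, hash lookup, bit-by-bit verification, remapping, and copy-on-write, can be sketched as below. This is a toy model under stated assumptions: the dictionaries, function names, and SHA-1 choice are all illustrative, and real ESXi operates on machine pages, not Python objects.

```python
# Toy sketch of TPS-style page sharing: hash lookup, bit-by-bit check,
# remap to the shared copy, and copy-on-write on modification.
import hashlib

hash_table = {}   # hash value -> shared page content (stands in for the global hash table)
mappings = {}     # (vm, guest_page_no) -> key of the backing host page

def share_page(vm, gpn, content):
    """Try to share a guest page; return True if it was collapsed onto an existing page."""
    h = hashlib.sha1(content).hexdigest()
    if h in hash_table and hash_table[h] == content:  # bit-by-bit check excludes false matches
        mappings[(vm, gpn)] = h   # remap GA->HA to the shared page; redundant copy reclaimed
        return True
    hash_table[h] = content       # first copy becomes the shared page
    mappings[(vm, gpn)] = h
    return False

def write_page(vm, gpn, new_content):
    # Copy-on-write: a write to a shared page transparently gets a private copy.
    mappings[(vm, gpn)] = ("private", new_content)

shared = share_page("vm1", 0, b"\x00" * 4096)         # first zero page: stored as shared copy
print(share_page("vm2", 7, b"\x00" * 4096))           # True: identical content is shared
write_page("vm2", 7, b"\x01" * 4096)                  # vm2's write lands on a private copy
```

Note how the write in the last line changes only vm2's mapping; vm1 keeps pointing at the shared page, which is the essence of the CoW handling described above.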
In hardware-assisted memory virtualization (Intel EPT and AMD RVI) systems, ESXi will
not share large pages because:
The probability of finding two large pages having identical contents is low
The overhead of doing a bit-by-bit comparison for a 2MB page is much larger
than for a 4KB page
Since ESXi will not swap out large pages, a large page (2MB) is broken into small
pages (4KB) during host swapping so that the pre-generated hashes can be used to
share the small pages before they are swapped out.
What is Salting in TPS?
Salting is used to allow more granular management of the virtual machines participating
in TPS. Salting is enabled after the ESXi updates mentioned below are deployed.
ESXi 5.0 Patch ESXi500-201502001
ESXi 5.1 Update 3
ESXi 5.5, Patch ESXi550-201501001
ESXi 6.0
By default, salting is set to Mem.ShareForceSalting=2 and each virtual machine has a
different salt. This means page sharing does not occur across virtual machines
(inter-VM TPS) and only happens inside a virtual machine (intra-VM).
When salting is enabled (Mem.ShareForceSalting=1 or 2), in order to share a page
between two virtual machines, both the salt and the content of the page must be
the same. A salt value is a configurable vmx option for each virtual machine. You can manually
specify the salt values in the virtual machine's vmx file with the new vmx option
sched.mem.pshare.salt. If this option is not present in the virtual machine's vmx file,
then the value of vc.uuid vmx option is taken as the default value. Since the vc.uuid is
unique to each virtual machine, by default TPS happens only among the pages
belonging to a particular virtual machine (Intra-VM).
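The salting rule above boils down to one comparison, sketched below. The dictionary keys mirror the vmx options named in the article; everything else (the function, the sample UUIDs) is invented for illustration.

```python
# Sketch of the TPS salting rule: two pages are shareable only when both the
# salt and the page content match. By default a VM's salt is its vc.uuid,
# which is unique per VM, so sharing stays intra-VM.

def can_share(vm_a, vm_b, page_a, page_b):
    """Return True if a page can be shared between two VMs under salting."""
    salt_a = vm_a.get("sched.mem.pshare.salt") or vm_a["vc.uuid"]  # explicit salt wins
    salt_b = vm_b.get("sched.mem.pshare.salt") or vm_b["vc.uuid"]
    return salt_a == salt_b and page_a == page_b

vm1 = {"vc.uuid": "uuid-1"}                                        # default salt: vc.uuid
vm2 = {"vc.uuid": "uuid-2"}
vm3 = {"vc.uuid": "uuid-3", "sched.mem.pshare.salt": "groupA"}     # manually salted
vm4 = {"vc.uuid": "uuid-4", "sched.mem.pshare.salt": "groupA"}

print(can_share(vm1, vm2, b"x", b"x"))  # False: identical content, different default salts
print(can_share(vm3, vm4, b"x", b"x"))  # True: same explicit salt and same content
```

Setting the same sched.mem.pshare.salt on a group of VMs is therefore how inter-VM sharing is selectively re-enabled while salting stays on.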
How can I enable or disable salting?
Memory Reclamation: Ballooning
For example, when I open MS Outlook for the first time on my computer, it takes some
amount of time to load all pages of that program. Now let's just say I close Outlook,
but after a couple of minutes I try to re-open it again; this time I may not need to wait
the same amount of time, in fact it will be quicker. So what happened in the back end?
Well, when I started the application the first time, it loaded all the required pages of that
program into memory, which we call Active Pages or the MRU list. But when I closed the
application, the memory pages of that application that were loaded into the MRU list were not
deleted from memory; rather, the operating system kept those pages on the LRU, or Idle,
list, considering that the application may require those pages if a request comes in again,
as in my example when I started the application again.
Now this is a really good approach to managing memory pages and ensuring
performance by keeping pages on the LRU list. But this approach is good for physical
systems. The challenge that we face in a virtual machine due to this approach is as below.
The hypervisor has no visibility into the Free list, LRU, and MRU memory pages that are
managed by the operating system of a virtual machine.
So if multiple VMs demand memory resources and later keep memory
pages on the LRU list even after the workload is no longer present, this results in
unnecessary consumption of ESXi host memory, which can cause
memory contention when multiple VMs place high demand on memory resources.
On the other hand, the operating system of a virtual machine is also not aware that
the ESXi server is under memory contention, as the guest operating system
also has no visibility into ESXi memory consumption and cannot detect the
host's memory shortage.
In Figure (A), four guest physical pages are mapped in the host physical memory. Two
of the pages are used by the guest application and the other two pages (marked by
stars) are in the guest operating system free list. Note that since the hypervisor cannot
identify the two pages in the guest free list, it cannot reclaim the host physical pages
that are backing them. Assuming the hypervisor needs to reclaim two pages from the
virtual machine, it will set the target balloon size to two pages.
After obtaining the target balloon size, the balloon driver allocates two guest physical
pages inside the virtual machine and pins them, as shown in Figure (B). Here, pinning
is achieved through the guest operating system interface, which ensures that the pinned
pages cannot be paged out to disk under any circumstances.
Once the memory is allocated, the balloon driver notifies the hypervisor of the page
numbers of the pinned guest physical memory so that the hypervisor can reclaim the
host physical pages that are backing them. In Figure (B), these pages are shown in
RED and GREEN.
The hypervisor can safely reclaim this host physical memory because neither the
balloon driver nor the guest operating system relies on the contents of these pages.
If any of these pages are re-accessed by the virtual machine for some reason, the
hypervisor will treat it as normal virtual machine memory allocation and allocate a new
host physical page for the virtual machine.
OK. Now, the description above is as per the VMware documentation. In order to
understand this in simple terms, let's discuss the same process further.
When the ESXi host is under memory contention, it sets a target for the
balloon driver.
Per the target, the balloon driver inside the virtual machine poses as just another
application and demands memory from the operating system of the virtual machine.
Treating the request as coming from a (fake) application, the VM operating system starts
allocating memory pages to the balloon driver from the Free list and LRU, and if required
from the MRU as well, in case that is needed to satisfy the demand.
As soon as the balloon driver receives memory pages from the operating system of the VM,
it starts inflating from its initial size, just like what happens with an actual balloon
when we pump air into it.
Memory pages that are consumed by the balloon driver are pinned (the red and green
pages in the figure above) so that they are not swapped out.
The balloon driver communicates with the hypervisor through a private channel and
informs the hypervisor about the pinned pages.
The hypervisor then reclaims these pages. Later, by setting a lower target, it causes the
balloon driver to deflate back to its initial state, just like an actual balloon: if we let the
air out of it, it comes back to its initial state.
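The inflate/deflate cycle above can be sketched as a tiny simulation. Everything here is invented for illustration (class names, page counts, the allocate/free interface); it only mirrors the steps listed above, not any real driver API.

```python
# Illustrative sketch of the balloon inflate/deflate cycle.

class GuestOS:
    """Stand-in for the guest OS memory allocator."""
    def __init__(self, free_pages):
        self.free = list(range(free_pages))
    def allocate_page(self):
        return self.free.pop()        # allocates from the free/LRU lists first
    def free_page(self, page):
        self.free.append(page)

class BalloonDriver:
    def __init__(self):
        self.pinned = []              # guest pages handed over to the hypervisor

    def set_target(self, guest_os, target_pages):
        """Inflate or deflate until the pinned-page count matches the target."""
        while len(self.pinned) < target_pages:
            page = guest_os.allocate_page()  # driver poses as a normal application
            self.pinned.append(page)         # pinned: guest cannot page it out
        while len(self.pinned) > target_pages:
            guest_os.free_page(self.pinned.pop())  # lower target -> deflate
        return list(self.pinned)      # page numbers reported to the hypervisor

guest = GuestOS(free_pages=8)
driver = BalloonDriver()
print(len(driver.set_target(guest, 2)))  # 2 -> hypervisor can reclaim 2 backing pages
print(len(driver.set_target(guest, 0)))  # 0 -> balloon deflated, pages back to the guest
```

The key property the sketch preserves is that the hypervisor never touches guest memory directly; it only moves the target, and the guest OS itself decides which pages to give up.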
Image: VMware
The ESXi host will try to reclaim memory from virtual machines per the target received. How
much memory is reclaimed from each VM is calculated with the help of memory taxing
(Mem.IdleTax).
Just as you pay more tax if you earn more money, a VM holding a larger number of idle
memory pages is taxed more heavily and has more memory reclaimed from it.
Memory Reclamation: Compression
Do check my previous articles on TPS and ballooning in this series on VMware memory
reclamation, as compression starts after TPS and ballooning.
ESXi provides a memory compression cache to improve virtual machine performance
when you use memory overcommitment.
If a virtual machine's memory usage reaches the level at which host-level swapping
would be required, ESXi uses memory compression to reduce the number of memory
pages it needs to swap out. Because decompression latency is much smaller
than swap-in-from-disk latency, compressing memory pages has significantly less
impact on performance than swapping out those pages.
Let's see how compression helps improve the performance of virtual machines that are
running on an overcommitted ESXi host. The video below from VMware covers an old
version of ESXi; however, it gives us an idea of the impact of memory compression on
VM performance.
Note 1: ESXi does not directly compress 2MB large pages; rather, 2MB large pages are
first chopped down to 4KB pages, which are then compressed to 2KB pages.
Note 2: If a page's compression ratio is larger than 75%, ESXi will store the
compressed page using a 1KB quarter-page space.
There are a couple of conditions a page must meet to be considered for compression.
Only memory pages meeting both of the criteria below are compressed:
1. Memory pages that are already marked for swapping out to disk, AND
2. Memory pages that can be compressed by at least 50%.
Any page not meeting the above criteria will be swapped out to disk.
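The two criteria, plus the quarter-page rule from Note 2, can be expressed as one small decision function. This is a sketch under assumptions: zlib merely stands in for whatever compressor ESXi actually uses, and the function and label names are invented.

```python
# Sketch of the per-page compression decision: only swap candidates are
# considered; a 4KB page must compress by at least 50% to be cached (2KB
# half page), and a ratio of 75% or more fits a 1KB quarter page.
import os
import zlib

PAGE = 4096  # 4KB small page

def classify(page_bytes, marked_for_swap):
    if not marked_for_swap:
        return "keep in memory"               # criterion 1: must be a swap candidate
    compressed = zlib.compress(page_bytes)    # zlib is a stand-in compressor
    ratio = 1 - len(compressed) / PAGE        # fraction of the page saved
    if ratio >= 0.75:
        return "compress to 1KB quarter page"
    if ratio >= 0.50:
        return "compress to 2KB half page"    # criterion 2: at least 50%
    return "swap to disk"                     # incompressible pages still get swapped

print(classify(b"\x00" * PAGE, True))    # zero page compresses extremely well
print(classify(os.urandom(PAGE), True))  # random data barely compresses
```

A zero-filled page easily clears the 75% bar, while random data fails the 50% test and falls through to disk, matching the rule that compression only replaces swapping when it actually saves space.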
Let's understand how compression works with an example.
Image: VMware
Let's assume that ESXi needs to reclaim 8KB of physical memory (two 4KB pages) from
virtual machines. With host swapping, the two swap candidate pages, A and
B, are swapped directly to disk (Image A).
With compression, a swap candidate page is compressed and stored using 2KB of
space in a per-VM compression cache. Hence, each compressed page yields 2KB of
memory for ESXi to reclaim. In order to reclaim 8KB of physical memory, four
swap candidate pages need to be compressed (Image B).
If a memory request comes in to access a compressed page, the page is decompressed
and pushed back to guest memory. The page is then removed from the compression
cache.
What is the Per-VM Compression Cache?
The memory for the compression cache is not allocated separately as extra overhead
memory. The compression cache size starts at zero when host memory is
undercommitted and grows when virtual machine memory starts to be swapped out.
If the compression cache is full, one compressed page must be replaced in order to
make room for a new compressed page. The page that has not been accessed for the
longest time is decompressed and swapped out. ESXi will not swap out
compressed pages.
If the pages belonging to compression cache need to be swapped out under severe
memory pressure, the compression cache size is reduced and the affected compressed
pages are decompressed and swapped out.
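The replacement rule described above is a least-recently-used policy over the per-VM cache, which can be sketched as follows. The class, its capacity, and the page numbers are all illustrative; OrderedDict simply provides the LRU bookkeeping.

```python
# Sketch of a per-VM compression cache with the described replacement rule:
# when full, the least recently accessed compressed page is decompressed
# and swapped out to make room for the new one.
from collections import OrderedDict

class CompressionCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # page_no -> compressed bytes, oldest first
        self.swapped_out = []        # pages evicted (decompressed, then swapped)

    def insert(self, page_no, compressed):
        if len(self.cache) >= self.capacity:
            victim, _ = self.cache.popitem(last=False)  # least recently used entry
            self.swapped_out.append(victim)
        self.cache[page_no] = compressed

    def access(self, page_no):
        # Decompression on access pushes the page back to guest memory,
        # so the entry is removed from the cache entirely.
        return self.cache.pop(page_no, None)

cache = CompressionCache(capacity=2)
cache.insert(1, b"a")
cache.insert(2, b"b")
cache.insert(3, b"c")        # cache full: page 1 is evicted and swapped out
print(cache.swapped_out)     # [1]
print(sorted(cache.cache))   # [2, 3]
```

Note that an accessed page leaves the cache rather than being refreshed in place, mirroring the article's statement that a decompressed page is removed from the compression cache.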
The maximum compression cache size is important for maintaining good VM
performance. Since the compression cache is accounted against the VM's guest memory
usage, a very large compression cache may waste VM memory and unnecessarily
create host memory pressure.
In vSphere 5.0, the default maximum compression cache size is conservatively set to
10% of the configured VM memory size. This value can be changed through Advanced
Settings by changing the value of Mem.MemZipMaxPct.
VMware Memory Reclamation: Hypervisor Swapping
ESXi employs the methods below to address the limitations mentioned above and
improve hypervisor swapping performance: