
I have been receiving a lot of questions lately about ESX memory management.

Things that are very obvious to me seem to be not so obvious at all to some other people, so I'll try to explain these things from my point of view. First let's have a look at the virtual machine settings available to us. On the VM settings page we have several options we can configure for memory assignment:

1. Allocated memory: This is the amount of memory we assign to the VM and is also the amount of memory the guest OS will see as its physical memory. This is a hard limit and the VM cannot exceed it, even if it demands more memory. It is configured on the Hardware tab of the VM's settings.

2. Reservations: A reservation is a guaranteed amount of memory assigned to the VM. This is a way of ensuring that the VM gets a minimal amount of memory. When this reservation cannot be met, you will be unable to start the VM. This is known as Admission Control. Reservations are set on the Resources tab of the VM's settings and by default there is no reservation set.

3. Limits: A limit is a restriction on the VM, so it cannot use more memory than this limit. If you set this limit lower than the allocated memory value, the balloon driver will start to inflate as soon as the VM demands more memory than the limit. Limits are set on the Resources tab of the VM's settings and by default the limit is set to unlimited.

Now that we know of limits and reservations, we need to have a quick look at the VMkernel swap file. This swap file is used by the VMkernel to swap out the VM's memory as a last resort to free up memory when the host is running out of it. When we set a reservation, that memory is guaranteed and cannot be swapped out to disk. So whenever a VM starts up, the VMkernel creates a swap file with a size of the limit minus the reservation. For example, say we have a VM with a 1024MB limit and a reservation of 512MB. The swap file created will be 1024MB - 512MB = 512MB. If we set the reservation to 1024MB, no swap file is created at all. Remember that by default there are no reservations and no limits set, so the swap file created for each VM will be the same size as the allocated memory.

4. Shares: With shares you set a relative importance on a VM. Unlike limits and reservations, which are fixed, shares can change dynamically. Remember that the share system only comes into play when memory resources are scarce and contention is occurring. Shares are set on the Resources tab of the VM's settings and can be set to low, normal, high or a custom value:

low = 5 shares per 1MB allocated to the VM
normal = 10 shares per 1MB allocated to the VM
high = 20 shares per 1MB allocated to the VM

It is important to note that the more memory you assign to a VM, the more shares it receives.

Let's look at an example to show how this share system works (a short sketch of the arithmetic follows below). Say you have 5 VMs, each with 2,000MB of memory allocated and the share value set to normal. The ESX host only has 4,000MB of physical machine memory available for virtual machines.

Each VM receives 20,000 shares according to the normal setting (10 * 2,000). The sum of all shares is 5 * 20,000 = 100,000. Every VM will receive an equal share of 20,000/100,000 = 1/5th of the resources available = 4,000/5 = 800MB.

Now we change the shares setting on one VM to high, which results in this VM receiving 40,000 shares instead of 20,000. The sum of all shares is now increased to 120,000. This VM will receive 40,000/120,000 = 1/3rd of the resources available, thus 4,000/3 = 1,333MB. All the other VMs will receive only 20,000/120,000 = 1/6th of the available resources = 4,000/6 = 667MB each.

Instead of configuring these settings on a per-VM basis, it is also possible to configure them on a resource pool. A VMware ESX resource pool is a pool of CPU and memory resources; I always look at a resource pool as a group of VMs. This concludes the memory settings we can configure on a VM. Next time I will go into ESX memory management techniques.
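To make the swap file sizing and the share arithmetic above concrete, here is a minimal Python sketch. This is purely my own illustration (nothing that ships with ESX); the VM names and layout are made up to match the example:

```python
# Shares granted per MB for each share setting, as described above.
SHARES_PER_MB = {"low": 5, "normal": 10, "high": 20}

def swap_file_mb(limit_mb, reservation_mb=0):
    """VMkernel swap file size = limit minus reservation."""
    return limit_mb - reservation_mb

def distribute(host_mb, vms):
    """Split host machine memory proportionally to each VM's shares.

    vms: dict of name -> (allocated_mb, share_setting)
    """
    shares = {name: alloc * SHARES_PER_MB[setting]
              for name, (alloc, setting) in vms.items()}
    total = sum(shares.values())
    return {name: host_mb * s / total for name, s in shares.items()}

# The example from the text: 5 VMs with 2,000MB each, one set to high.
vms = {f"vm{i}": (2000, "normal") for i in range(1, 6)}
vms["vm1"] = (2000, "high")
print(swap_file_mb(limit_mb=1024, reservation_mb=512))  # 512
print(distribute(4000, vms))  # vm1 ~1,333MB, the others ~667MB each
```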

In part 1, I explained the virtual machine memory settings that are available to us. This caused quite some controversy about what would happen when a limit is hit that's configured lower than the actual allocated memory. Scott Herold did an excellent follow-up on this. Be sure to read his article too, because it contains valuable information. This time I'll explain how the ESX kernel assigns memory to a VM. To explain how ESX memory management works, we first need to clarify some definitions, which are often unclear and used interchangeably in many writings:

Machine memory: the physical memory installed in the ESX host, managed by the ESX kernel.
Physical memory: the memory allocated to the guest, which the guest OS sees as its physical memory (since the guest OS is unaware that it's running virtualized!). This memory is managed by the guest OS.
Virtual memory: the memory as seen by the applications running inside the guest OS.

Figure 1: Virtual machine memory usage

Now let's take a look at how memory is assigned to the guest. Whenever an application needs a memory page for storing data, it requests it through a system call to the operating system. The operating system keeps track of which memory pages are in use and which are free. Imagine this as two simple lists, one for the free memory pages and one for the allocated pages. So when a system call is made by the application, the operating system will look at its list of free memory pages and assign a page to the application. The operating system then moves this page from the free list to the allocated list. At the hypervisor layer, memory is allocated on demand. When the VM accesses its physical memory for the first time, the hypervisor assigns a machine memory page to the VM. The hypervisor keeps a record of this assignment in a physical-page-number to machine-page-number (PPN-to-MPN) mapping table for every VM, so the hypervisor knows which page is in use and where it is located in machine memory. This process is illustrated in Figure 1.

So we now know how memory is assigned, but what happens when the application no longer needs the memory page? In this case the application will free the memory page again through a system call to the operating system. The operating system will then move this memory page from the allocated list to the free list. Because the operating system is unaware that it's actually virtualized, there is no interaction with the hypervisor, so the underlying hypervisor is not aware that the memory page is now actually free. While memory pages are constantly allocated and freed by the guest operating system inside the VM, the hypervisor can only allocate memory to the VM; it cannot reclaim memory through guest frees.

When we take a quick look at the summary tab of our VM in vCenter, we see host memory usage and guest memory usage. Host memory usage is the amount of machine memory in MB allocated to the guest as previously described. Remember, however, that this value also includes virtualization overhead (e.g. the PPN-to-MPN mapping table).

On a host that is not memory overcommitted, this value represents a high-water mark of the guest's memory usage; on an overcommitted host, host memory usage is based on shares. On the same page we also see guest memory usage. This is the amount of memory in MB that is actively used by the guest OS and its applications. But didn't I just mention that the hypervisor is unaware of the actual state of a guest's physical memory pages? That's correct: this value is actually calculated through statistical sampling. The hypervisor selects a random subset of the guest's memory and monitors how many of those pages are accessed within a certain interval. This value is then extrapolated to the full size of the guest's memory and shown as guest memory usage. Why is there a difference between this guest memory usage value from ESX and the one reported inside the guest? Because the guest OS has much better visibility of what memory is actively used; the ESX hypervisor can only use sampling to estimate the active memory size.
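To illustrate both ideas, here is a small, purely illustrative Python model (the real VMkernel is of course far more sophisticated): machine pages are mapped on first touch and recorded in a PPN-to-MPN table, guest frees never reach the hypervisor, and active memory is estimated by sampling:

```python
import random

class ToyHypervisor:
    """Toy model: on-demand PPN-to-MPN mapping plus a sampled activity estimate."""

    def __init__(self):
        self.ppn_to_mpn = {}   # guest physical page -> machine page
        self.next_mpn = 0
        self.touched = set()   # PPNs accessed during the current sample interval

    def guest_touch(self, ppn):
        # A machine page is assigned only on the first access to a PPN.
        if ppn not in self.ppn_to_mpn:
            self.ppn_to_mpn[ppn] = self.next_mpn
            self.next_mpn += 1
        self.touched.add(ppn)

    # Deliberately no guest_free(): when the guest OS frees a page it only
    # updates its own free list; the hypervisor never hears about it, so
    # the PPN-to-MPN entry stays in place.

    def estimate_active_pages(self, total_pages, sample_size=200):
        # Sample random PPNs, check how many were touched, and extrapolate,
        # just like the statistical sampling behind "guest memory usage".
        sample = random.sample(range(total_pages), sample_size)
        hits = sum(1 for ppn in sample if ppn in self.touched)
        return hits / sample_size * total_pages

hv = ToyHypervisor()
for ppn in range(400):                 # guest touches 400 of its 1,024 pages
    hv.guest_touch(ppn)
print(len(hv.ppn_to_mpn))              # 400 machine pages mapped
print(hv.estimate_active_pages(1024))  # roughly 400, estimated by sampling
```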

Figure 2: Guest memory states

Figure 2 schematically represents the guest memory states. Of the total memory allocated to the guest (the guest's physical memory), memory can be either allocated to the OS and applications (and subsequently allocated from machine memory) or unallocated. The allocated memory can be either active or inactive. For now this concludes part 2. Now that we know how memory is assigned, next time we will look at how memory can be reclaimed by the ESX kernel.

In part 1, I explained the virtual machine settings that are available to us regarding memory; in part 2, I explained how the ESX kernel assigns memory to a VM; and in this part I will dive into ESX memory reclaiming techniques.

The ESX kernel uses transparent page sharing, ballooning and swapping to reclaim memory. Ballooning and swapping are used only when the host is running out of machine memory or a VM limit is hit (see also the limits I discussed in part 1).

Transparent Page Sharing (TPS)

One great feature of ESX is that it supports memory overcommitment. This means that the aggregated size of the guests' physical memory can be greater than the actual size of the physical machine memory. This is accomplished because assigned memory that is never accessed by the VM isn't mapped to machine memory, and by a feature called Transparent Page Sharing, or simply TPS. TPS is a technique that makes it possible for VMs to share identical guest physical pages, so only one copy is kept in the host's machine memory. The ESX kernel scans VM memory pages regularly and generates a hash value for every scanned page. This hash value is then compared against a global hash table which contains entries for all scanned pages. If a match is found, a full comparison of both pages is made to verify that the pages really are identical. If they are, both physical pages (guest) are mapped to the same machine page (host) and the physical pages are marked read-only. Whenever a VM wants to write to such a physical page, a private copy of the machine page is made and the PPN-to-MPN mapping is changed accordingly. Remember: TPS is always on. You can, however, disable memory sharing for a particular VM by setting the advanced parameter sched.mem.pshare.enable to False.

To view information about TPS you can use the esxtop command from the COS (see Figure 1). From the COS, issue the command esxtop and then press m to display the memory statistics page. In the top right you see the overcommit level averages over 1 min, 5 min and 15 min. The value of 0.77 in Figure 1 is a percentage and means 77% memory overcommitment: there is 77% more memory allocated to virtual machines than there is machine memory available.

Figure 1

The counter to look for is PSHARE. The shared value is the total amount of guest physical memory that is being shared, the common value is the amount of machine memory actually used to back that shared guest physical memory, and the saving value is the amount of machine memory saved due to page sharing.
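As an illustration of the scan, hash, compare and copy-on-write cycle described above, here is a simplified Python sketch. This is my own toy model, not how the VMkernel is actually implemented:

```python
import hashlib

class ToyPageSharing:
    """Toy TPS: hash pages, verify with a full compare, share on a match."""

    def __init__(self):
        self.hash_table = {}    # page hash -> machine page contents
        self.saved_pages = 0

    def scan(self, page: bytes) -> bytes:
        """Return the (possibly shared) machine page backing this guest page."""
        digest = hashlib.sha1(page).digest()
        candidate = self.hash_table.get(digest)
        # A hash match alone is not enough: a full comparison verifies the
        # pages really are identical before they are shared.
        if candidate is not None and candidate == page:
            self.saved_pages += 1          # both PPNs now map to one MPN
            return candidate               # shared page, read-only in ESX
        self.hash_table[digest] = page
        return page

    def write(self, shared_page: bytes, offset: int, value: int) -> bytes:
        """Copy-on-write: writing to a shared page creates a private copy."""
        private = bytearray(shared_page)   # new private machine page
        private[offset] = value            # PPN-to-MPN mapping now points here
        return bytes(private)

tps = ToyPageSharing()
tps.scan(bytes(4096))      # first VM: zero page is registered
tps.scan(bytes(4096))      # second VM: identical page, now shared
print(tps.saved_pages)     # 1 machine page saved
```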

Ballooning

When the ESX host's machine memory is scarce or when a VM hits a limit, the kernel needs to reclaim memory and prefers ballooning over swapping. The balloon driver is installed inside the guest OS as part of the VMware Tools installation and is also known as the vmmemctl driver. When the ESX kernel wants to reclaim memory, it instructs the balloon driver to inflate. The balloon driver then requests memory from the guest OS. When there is enough memory available, the guest OS will return memory from its free list. When there isn't enough memory, the guest OS will have to use its own memory management techniques to decide which particular pages to reclaim and, if necessary, page them out to its swap or page file. In the background, the ESX kernel frees up the machine memory page that backs the guest physical memory page allocated to the balloon driver. When enough memory has been reclaimed, the balloon driver will deflate after some time, returning physical memory pages to the guest OS again. This process will also decrease the host memory usage parameter (discussed in part 2). Ballooning is only effective if the guest has available space in its swap or page file, because used memory pages need to be swapped out in order to allocate the page to the balloon driver. Ballooning can therefore lead to heavy guest memory swapping. This is guest OS swapping inside the VM and is not to be confused with ESX host swapping, which I will discuss later on.

To view balloon activity we use the esxtop utility again from the COS (see Figure 2). From the COS, issue the command esxtop and then press m to display the memory statistics page. Now press f and then i to show the vmmemctl (ballooning) columns.

Figure 2

At the top (see Figure 2) we see the MEMCTL counter, which shows us the overall ballooning activity. The curr and target values are the accumulated values of MCTLSZ and MCTLTGT as described below. We have to look at the MCTL columns to view ballooning activity on a per-VM basis:

MCTL?: indicates whether the balloon driver is active (Y) or not (N)
MCTLSZ: the amount (in MB) of guest physical memory that is actually reclaimed by the balloon driver
MCTLTGT: the amount (in MB) of guest physical memory that is going to be reclaimed (targeted memory). If this counter is greater than MCTLSZ, the balloon driver inflates, causing more memory to be reclaimed. If MCTLTGT is less than MCTLSZ, the balloon will deflate. This deflating process runs slowly unless the guest requests memory.

MCTLMAX: the maximum amount of guest physical memory that the balloon driver can reclaim. The default is 65% of the memory assigned to the VM. (A toy model of this inflate/deflate logic follows below.)
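The sketch below is a simplified model of the inflate/deflate behaviour that these counters expose. The attribute names mirror the esxtop counters, but the code itself is purely illustrative:

```python
class ToyBalloon:
    """Toy model of the vmmemctl driver, mirroring the esxtop MCTL counters."""

    def __init__(self, vm_memory_mb, max_pct=0.65):
        self.mctlsz = 0                         # MB currently reclaimed
        self.mctltgt = 0                        # MB the kernel wants reclaimed
        self.mctlmax = vm_memory_mb * max_pct   # default cap: 65% of VM memory

    def set_target(self, mb):
        # The ESX kernel raises the target when it needs memory back.
        self.mctltgt = min(mb, self.mctlmax)

    def step(self, mb_per_step=16):
        # Inflate while target > size; deflate while target < size.
        if self.mctltgt > self.mctlsz:
            self.mctlsz = min(self.mctlsz + mb_per_step, self.mctltgt)
        elif self.mctltgt < self.mctlsz:
            self.mctlsz = max(self.mctlsz - mb_per_step, self.mctltgt)

balloon = ToyBalloon(vm_memory_mb=2048)
balloon.set_target(512)                 # kernel asks for 512MB back
while balloon.mctlsz < balloon.mctltgt:
    balloon.step()                      # balloon inflates step by step
print(balloon.mctlsz)                   # 512
```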

You can limit the maximum balloon size by specifying the sched.mem.maxmemctl parameter in the .vmx file of the VM. This value must be in MB.

Swapping

When ballooning isn't possible (for example because the balloon driver isn't installed) or is insufficient, the ESX kernel falls back to swapping. Swapping is used by the ESX kernel as a last resort when the other techniques fail to satisfy the memory demand. This swapping mechanism pages machine memory pages which are in use by the VM out to the VM's swap file (.vswp file) on disk. As I explained in part 1, this swap file has the size of the VM's limit minus its reservation. ESX kernel swapping is done without any guest involvement, which means it can page out active guest memory. Do not confuse ESX kernel swapping with VM guest OS swapping. To view swap activity we use the esxtop utility again from the COS (see Figure 3). From the COS, issue the command esxtop and then press m to display the memory statistics page. Now press f and then j to show the swap columns.

Figure 3

At the top (see Figure 3) we see the SWAP counter, which shows us the overall swap activity. The curr and target values are the accumulated values of SWCUR and SWTGT as described below. We have to look at the SW columns to view swap activity on a per-VM basis:

SWCUR: the current amount (in MB) of guest physical memory that is swapped out to the ESX kernel VM swap file.
SWTGT: the amount (in MB) of guest physical memory that is going to be swapped (targeted memory). If this counter is greater than SWCUR, the ESX kernel will start swapping; if it is less than SWCUR, the ESX kernel will stop swapping.

SWR/s: the rate at which memory is being swapped in from disk. A physical memory page only gets swapped in when it is accessed by the guest OS.
SWW/s: the rate at which memory is being swapped out to disk. (A small helper for reading these counters together follows below.)
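As a quick illustration of how these four counters can be read together, here is a small hypothetical helper; the wording of the messages is mine, not VMware guidance:

```python
def interpret_swap(swcur, swtgt, swr_per_s, sww_per_s):
    """Illustrative reading of the esxtop swap counters for one VM."""
    notes = []
    if swtgt > swcur:
        notes.append("kernel is swapping out: target is above current")
    elif swtgt < swcur:
        notes.append("swapping has stopped; swapped pages return only on access")
    if sww_per_s > 0:
        notes.append(f"writing {sww_per_s}MB/s to the .vswp file")
    if swr_per_s > 0:
        # Swap-in only happens when the guest touches a swapped-out page,
        # so a sustained read rate means the VM is short on memory.
        notes.append(f"guest is touching swapped pages ({swr_per_s}MB/s)")
    return notes

print(interpret_swap(swcur=256, swtgt=512, swr_per_s=4, sww_per_s=12))
```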

ESX memory state

There is one other parameter you should know about, and that is the ESX memory state. The ESX memory state determines which mechanisms are used to reclaim memory if necessary. In the high and soft states, ballooning is favored over swapping. In the hard and low states, swapping is favored over ballooning. To view this counter we use the esxtop utility again from the COS (see Figure 4). From the COS, issue the command esxtop and then press m to display the memory statistics page.

Figure 4

The state counter displays the ESX free memory state. Possible values are high, soft, hard and low. If the state is high, there's enough free machine memory available and there's nothing to worry about. When the state is soft, the ESX kernel actively reclaims memory through ballooning and falls back to swapping only when ballooning is not possible. In the hard state, the ESX kernel relies on swapping to reclaim memory, and in the low state the ESX kernel continues to use swapping to forcibly reclaim memory and additionally blocks the execution of VMs that are above their target allocations.

Idle memory tax

In part 1, I explained how the proportional share system works and that the more memory you assign to a VM, the more shares it receives. This could mean that a VM which has a large amount of memory assigned can hoard a lot of idle memory while a VM that has less memory assigned is screaming for memory but, according to its share value, is not entitled to use it. This is rather unfair, and therefore ESX throws in a referee called the idle memory tax. The idle memory tax rate specifies the maximum amount of idle memory that can be reclaimed from a VM and defaults to 75%. Let me first quote a phrase from the Resource Management Guide: "The default tax rate is 75 percent, that is, an idle page costs as much as four active pages." Well, that makes perfect sense, right? NOT. I wonder how many people have read this phrase without understanding it. First let me state that the phrase is completely correct, but it omits the formula that explains why a 75% tax rate equals a four times higher cost for an idle memory page. The idle page cost is defined as:

cost = 1 / (1 - τ), where τ is the idle memory tax rate.

If we fill in the default of 75%, we get the following result:

cost = 1 / (1 - 0.75) = 4

This explains why the phrase from the Resource Management Guide states that an idle memory tax of 75% results in a four times higher cost for an idle page. To determine how to divide memory between VMs, a shares-per-page ratio is calculated, and idle pages are charged extra at this idle page cost. This results in a lower shares-per-page ratio for VMs with lots of idle memory, so the available memory is distributed more fairly amongst all VMs (a small sketch of this calculation follows below). Best practice is to avoid VMs hoarding idle memory by allocating only the memory that a VM really needs, before thinking about changing the default idle memory tax value.
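To close, here is a small Python sketch of that calculation. The idea of charging idle pages at the idle page cost when computing shares per page follows the description above; the VM numbers are made up for illustration:

```python
# Illustrative sketch of the idle memory tax: an idle page is charged
# 'cost' times as much as an active page when computing shares per page.

def idle_page_cost(tax_rate=0.75):
    return 1 / (1 - tax_rate)          # 75% tax -> a cost of 4

def shares_per_page(shares, pages, active_fraction, tax_rate=0.75):
    """Shares-per-page ratio with idle pages charged at the idle page cost."""
    k = idle_page_cost(tax_rate)
    active = pages * active_fraction
    idle = pages * (1 - active_fraction)
    return shares / (active + k * idle)

# Two hypothetical VMs with identical shares and memory, one mostly idle:
busy = shares_per_page(shares=20000, pages=2000, active_fraction=0.9)
idle = shares_per_page(shares=20000, pages=2000, active_fraction=0.1)
print(round(busy, 2), round(idle, 2))  # ~7.69 vs ~2.70: the idle VM's ratio
# is much lower, so memory is reclaimed from the idle VM first.
```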
