A Few Memory Performance Metrics Do Not Always Indicate a Performance Problem

This post is the result of an internal discussion about which metrics to monitor to understand whether memory is under pressure when performance degrades.

A lot of the time we assume that a popular metric is the right thing to watch for memory performance. However, many of these metrics actually indicate something else and are not, by themselves, indicative of a memory performance issue. Combined with other signals they “may” point to performance degradation, but not always.

In particular, we should not rely on these two metrics alone to decide whether memory is under pressure:

  • Mem.consumed
  • Mem.vmmemctl

Let me show you what they essentially indicate.

Mem.consumed is the amount of memory consumed by one or all virtual machines, calculated as memory granted minus memory saved by page sharing. So why should we not use it? Because a VM’s memory allocation varies dynamically with its entitlement. What matters most is that a VM’s entitlement stays greater than its demand, and a high consumed value tells us nothing about that.
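
To make the distinction concrete, here is a minimal sketch in plain Python with made-up numbers. The helper names are mine; only the granted-minus-sharing relationship comes from the metric’s definition.

```python
# Minimal illustration only -- values are made up, units are MB.

def mem_consumed(granted_mb, saved_by_sharing_mb):
    """Mem.consumed: memory granted minus memory saved by page sharing."""
    return granted_mb - saved_by_sharing_mb

def entitlement_covers_demand(entitlement_mb, demand_mb):
    """What actually matters for performance: entitlement should exceed demand."""
    return entitlement_mb >= demand_mb

# A VM can report a high consumed value and still be perfectly healthy,
# as long as its entitlement stays above its demand.
print(mem_consumed(granted_mb=8192, saved_by_sharing_mb=1024))         # 7168
print(entitlement_covers_demand(entitlement_mb=8192, demand_mb=4096))  # True
```
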
Similarly, Mem.vmmemctl is the amount of ballooned memory, and it does not indicate a performance problem either. The guest operating system decides whether it needs to page out guest physical memory to satisfy the balloon driver’s allocation requests, and reclaiming unused or untouched pages through the balloon is harmless; the balloon simply deflates and releases them later. However, when ballooning is combined with host swapping, that is a strong indicator that the VMs and hosts are under pressure. Before it gets that far, the balloon driver may also force the guest to page out active memory pages, which is itself an indicator of performance degradation.
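
A rough way to express that “ballooning plus host swapping” rule of thumb in code (a hypothetical function; the metric values would come from whatever monitoring tool you use):

```python
# Hypothetical sketch: ballooning by itself is benign; ballooning combined
# with host-level swap activity is what signals real memory pressure.

def balloon_verdict(ballooned_mb, host_swap_in_rate_kbps):
    if ballooned_mb > 0 and host_swap_in_rate_kbps > 0:
        return "pressure: ballooning combined with host swapping"
    if ballooned_mb > 0:
        return "ballooning only: memory being reclaimed, not necessarily a problem"
    return "no ballooning"

print(balloon_verdict(ballooned_mb=2048, host_swap_in_rate_kbps=0))    # reclaim only
print(balloon_verdict(ballooned_mb=2048, host_swap_in_rate_kbps=512))  # investigate
```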

While Mem.consumed has no direct relationship to the consolidation ratio, Mem.vmmemctl does leave a footprint on it. We rely on ballooning to achieve memory overcommitment, and that is not a bad thing.

About Prasenjit Sarkar

Prasenjit Sarkar is a Product Manager at Oracle for their Public Cloud, with a primary focus on Cloud Strategy, Oracle OpenStack, PaaS, Cloud Native Applications and API Platform. He drives Oracle’s Cloud Computing business with commercial and public sector customers, helping to shape and deliver on a strategy to build broad use of Oracle’s Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings such as Compute, Storage, Java as a Service, and Database as a Service. He is also responsible for developing public/private cloud integration strategies, customers’ Cloud Computing architecture vision, future state architectures, and implementable architecture roadmaps in the context of the public, private, and hybrid cloud computing solutions Oracle can offer.

5 thoughts on “A Few Memory Performance Metrics Do Not Always Indicate a Performance Problem”

  1. I would say they are important, but not likely to be the cause of the overall perf issue. In my experience, 99.99% of VM perf issues are tied to incorrect reservations/limits on memory/CPU set by the user at the VM level or on the resource pool in the cluster.

  2. Here’s how I understand the metrics around swapping.

    Host Swap-Out Rate (KB/s): rate at which stuff is being swapped out to disk, written into .vswp files
    Host Swap-In Rate (KB/s): rate at which stuff is being swapped in from .vswp files back into memory
    Host Swapped (KB): amount of data that is in .vswp file(s) at the moment (sitting there, not being transferred)

    Stuff gets swapped out when the host has less pRAM that’s free (not already allocated) than it prefers. The host proactively frees more memory to be safe. Once data is written to .vswp, it is left there indefinitely–it is only “paged-in on-demand” i.e. when the guest OS (or app in it) requests that data. In other words, data doesn’t get proactively paged-in once there’s more free pRAM.

    Swap-Out Rate > 0 means that the host is currently under memory pressure i.e. some pRAM must be freed up. It does NOT mean that there is a performance problem happening however. Guest OSes are not slowed down by this.

    Swap-In Rate > 0 means that there almost certainly is a performance problem happening now. This happens only on-demand, so whatever data is being read from disk was requested by something in the guest OS. Presumably that app will hate that the data’s coming from slow disk instead of fast RAM (which the app was expecting–the fact that it’s on disk in .vswp is hidden from the GOS/app’s view). This is one of the most solid indicators that someone’s gonna be unhappy with their performance.

    Swapped > 0 means that sometime in the past the host was under memory pressure (when Swap-Out Rate was > 0). It doesn’t say anything at all about whether there’s memory pressure now nor whether any apps are running poorly now. (I’ve seen a cluster with 17GB of data in .vswp files and the VMs were running fine.) Of course, the more stuff is in .vswp files, the greater the chance that some of it will be wanted in the future, and then there would be a performance problem. But you can’t really tell how much of the stuff in .vswp is junk data vs. stuff that will be wanted in the future. So, this is a mighty ambiguous metric.
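
    If I had to boil all of that down into alert logic, it would look roughly like this (a plain-Python sketch with made-up values, not tied to any particular tool):

    ```python
    # Rough verdict logic for the three swap metrics described above.
    def swap_verdict(swap_out_kbps, swap_in_kbps, swapped_kb):
        if swap_in_kbps > 0:
            return "problem NOW: guest is waiting on data read back from .vswp"
        if swap_out_kbps > 0:
            return "host memory pressure NOW, but guests not necessarily slowed yet"
        if swapped_kb > 0:
            return "pressure in the PAST; ambiguous -- may or may not hurt later"
        return "no swap activity"

    print(swap_verdict(0, 0, 17 * 1024 * 1024))  # 17GB sitting in .vswp: ambiguous
    print(swap_verdict(0, 300, 500000))          # active swap-in: someone is unhappy
    ```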

  3. Craig, I totally agree with you. However, swapped-out memory only impacts a VM’s performance when it is swapped back in, so we cannot rely on the Swapped value alone to tell whether there is a VM memory performance impact.

    I would say %Latency gives us a somewhat better indicator, since it tells us how much time the VM spends waiting while swap-in happens in the background.

    • I think we agree on all counts. Swapped does not tell you whether a performance problem is happening now. I think Swapped doesn’t even really tell you how likely a future performance problem is: it’s too possible that all the swapped stuff is junk data that will never again be requested. There’s too much uncertainty for it to be a useful predictive metric in general. (It could be useful if you find empirically that in your environment it does tend to correlate with eventual performance problems. But until you have that history, I would not set up any alarms on it.)

      And yes: %Latency is an even better indicator of current performance problems. It’s more direct than swap-in rate because it measures time, not just KB/s. Time spent in the swapping-in state (or decompressing state, which %latency also includes) should correlate even better with people being unhappy about the speed of their app.

      IIRC %latency is new with 5.0 (or maybe even 5.1?), but the others go back to at least esx 3.5. It can be a challenge in real environments to pick metrics for alerting and troubleshooting runbooks when certain metrics are not available for all vSphere versions or all types of hardware (storage particularly).

      Also, to emphasize a point in the original post: ballooning (aka memctl) is not a sign of performance problems. In fact it’s often preventing performance problems!
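
      Coming back to %Latency, a check like this is roughly how I think about it (the 5% threshold is just an example, not an official guideline):

      ```python
      # %Latency is time-based: the share of time the VM spent waiting on
      # swapped or compressed memory. The threshold below is illustrative only.
      def latency_verdict(mem_latency_pct):
          if mem_latency_pct >= 5.0:
              return "users will likely notice: VM is stalled on swap-in/decompression"
          if mem_latency_pct > 0.0:
              return "some waiting; keep an eye on it"
          return "no swap/compression wait time"

      print(latency_verdict(0.0))   # fine
      print(latency_verdict(12.3))  # performance problem now
      ```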

  4. You are absolutely right, Craig. That is the reason I said in this post that this is not indicative of a performance problem. In fact, the title itself says “A Few Memory Performance Metrics Do Not Always Indicate a Performance Problem”.