How do you test how your applications behave under contention? Wrong answers only. I go first (well, technically second).
One of our customers wanted to test an in-house developed app “while host CPU is high” and apparently tried to do this directly on the ESXi shell using openssl speed -multi <number_of_cores> but that failed after a couple of seconds. On my system, I could work around the error by specifying another cipher (des) for the integrated benchmark, but that doesn’t solve the limited runtime and, as a result, a short drop in utilization between iterations when running this in a loop.
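For illustration, the looped workaround could look something like this; note that this is just my reconstruction of the idea, not the customer’s actual invocation:

while true; do openssl speed -multi <number_of_cores> des; done

The gap between one openssl run finishing and the next one starting is exactly where the short utilization drop comes from.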
On Linux, if you are in a pinch and don’t have e.g. stress-ng installed, a common way just to keep CPUs busy is to compress /dev/urandom directly into /dev/null with the desired number of threads. Let’s try that on ESXi and check esxtop (-u) in a second session:
[root@esxi:~] dd count=1000000 if=/dev/urandom | bzip2 -9 > /dev/null
[root@esxi:~] esxtop -u

     ID     GID NAME             NWLD   %USED    %RUN   %SYS   %WAIT (..)
 224769  224769 bzip2.2133778       1  131.76   88.15   0.00   11.62 (..)
 224760  224760 dd.2133777          1   13.69   15.78   0.00   83.99 (..)
That seems to work well enough. Now let’s run one instance for every core, which we’ll get from the output of sched-stats -t ncpus:
[root@esxi:~] sched-stats -t ncpus
        64 PCPUs
        32 cores
         2 LLCs
         2 packages
         2 NUMA nodes
[root@esxi:~] sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p'
32
Let’s (roughly) walk through that sed (a quick sanity check follows the list):
- -n basically means “don’t print what isn’t explicitly meant to be printed” but if you care, it is better explained e.g. here.
- the beginning of the line (^) may be followed by none or many whitespaces (*), e.g. in case of double digit (10-18) PCPUs and enabled HT / half the cores, since the numerical values are right aligned
- followed by one or many (escaped +) instances of the defined “character class” ([0-9])
- (escaped) parentheses enclose the substring to be printed later, more on escaping
- followed by “ cores” at the end of the line ($)
- p prints the first (\1) remembered part of the match
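A quick way to convince yourself that the expression does what the list above claims is to feed it a hand-crafted line mimicking the right-aligned sched-stats output (purely illustrative):

[root@esxi:~] echo " 32 cores" | sed -n 's/^ *\([0-9]\+\) cores$/\1/p'
32

The leading whitespace and the trailing “cores” are stripped, only the captured number remains.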
I’ll probably have to put a disclaimer somewhere that I’m, as you’ve probably already guessed, by no means a wizard when it comes to regular expressions, sed, awk or anything else shell related. I memorized some common patterns and notations and know enough about the nomenclature to successfully find the rest on e.g. Stack Exchange or sometimes, laboriously, combine close but not exact results via trial and error. It’s not magic, it’s basically the equivalent of remembering how to bottom deal and making the rest up as you go along while smiling confidently.
Putting this into a script could look something like this:
count=1000000
# number of cores, parsed from sched-stats as shown above
cores=$(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p')
for i in $(seq 1 ${cores})
do
    # one background compression pipe per core
    dd count=${count} if=/dev/urandom | bzip2 -9 > /dev/null &
done
wait
- seq will print a sequence from 1 to the number of cores, at the default increment of 1
- we don’t use $i in the loop, we just want it to run once per available core
- & causes each instance to run in the background so they’ll run (approximately) in parallel
- wait makes sure the script doesn’t exit immediately after the loop completes (and kill the still running background tasks too early)
That, however, doesn’t quite get us the result we hoped for …
[root@esxi:~] esxtop -u

     ID     GID NAME              NWLD   %USED    %RUN   %SYS   %WAIT (...)
 226227  226227 bzip2.2134035        1   26.31   20.16   0.00   79.67 (...)
 226425  226425 bzip2.2134057        1   25.76   21.20   0.00   78.42 (...)
 226191  226191 bzip2.2134031        1   25.73   20.02   0.00   79.53 (...)
(...)
 226659  226659 bzip2.2134083        1   25.15   19.87   0.00   80.01 (...)
 226335  226335 bzip2.2134047        1   25.11   19.49   0.00   80.37 (...)
 226209  226209 bzip2.2134033        1   25.09   19.51   0.00   80.44 (...)
 226740  226740 esxtop.2134092       1    9.30    7.14   0.00   93.05 (...)
 226452  226452 dd.2134060           1    4.62    3.55   0.00   96.12 (...)
 226416  226416 dd.2134056           1    4.60    3.54   0.00   96.13 (...)
 226362  226362 dd.2134050           1    4.59    3.62   0.00   96.07 (...)
(...)
 226488  226488 dd.2134064           1    4.12    3.28   0.00   96.36 (...)
 226524  226524 dd.2134068           1    4.10    3.30   0.00   96.33 (...)
 226236  226236 dd.2134036           1    3.94    3.29   0.00   96.43 (...)
/dev/urandom shouldn’t block, so what are the instances of bzip2 waiting on? Can dd not provide enough data fast enough, e.g. is there contention around a common resource when running multiple instances, maybe just in combination with bzip2? Using gzip was even worse, so maybe it is just the method of compression?
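One quick way to sanity check whether dd on its own is the limiting factor would be to run it without a compressor in the pipe and watch it in esxtop again (purely a hypothetical cross-check, not something I dug into here):

[root@esxi:~] dd count=1000000 if=/dev/urandom of=/dev/null &

If a lone dd keeps a PCPU considerably busier than the ~4-5 %USED we see above, the bottleneck sits in the pipe / compressor combination rather than in the random source.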
While that might be interesting to get to the bottom of, I have a feeling it would just turn into troubleshooting busybox, which provides most of the common *NIX utilities on ESXi.
[root@esxi:~] ls -l /bin | egrep "(bzip2|dd)"
lrwxrwxrwx    1 root     root            35 Jan 12  2022 bzip2 -> /usr/lib/vmware/busybox/bin/busybox
lrwxrwxrwx    1 root     root            35 Jan 12  2022 dd -> /usr/lib/vmware/busybox/bin/busybox
[root@esxi:~] find /bin -type l -exec ls -l {} \; | grep busybox | wc -l
102
So can we create load with any of the non-busybox executables?
[root@esxi:~] find /bin \( -type l -or -type f \) -exec ls -l {} \; | grep -v busybox | wc -l
226
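To actually see the candidates instead of just counting them, you could drop the wc -l and grep for the usual compressor names; just an illustrative variation of the command above:

[root@esxi:~] find /bin \( -type l -or -type f \) -exec ls -l {} \; | grep -v busybox | egrep "xz|zip"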
Luckily for us, ESXi ships with xz. Just replacing bzip2 in the script isn’t enough though, because while on my system it will spawn 7 instances, the others will fail:
[root@esxi:~] sh /tmp/stress.sh
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
(...)
Maybe that was somewhat predictable, given that all of this is running under the ssh resource pool:
[root@esxi:~] tail /var/log/vmkernel.log
(...)
(...)Admission failure in path: host/vim/vimuser/terminal/ssh:xz.2269327:uw.2269327
(...)UserWorld 'xz' with cmdline 'xz'
(...)uw.2269327 (1069758) extraMin/extraFromParent: 16417/16417, ssh (601) childEmin/eMinLimit: 197234/204800
To know whether the memory requirements of xz are a deal breaker, we have to check what the current parent resource pool has to offer; for that we need its Group ID (GID):
[root@esxi:~] vsish -e set /sched/groupPathNameToID host/vim/vimuser/terminal/ssh
601
Now I’m just realizing that it gets logged too, but I’m leaving the above in since it is handy to know. The resource pool structure of ESXi is a topic in and of itself; at some point I’ll get around to “translating” a couple of older slides or at least to re-recording them.
Looking at the capacity stats before and after calling the stress script, we see that there are ~70 MB left unreserved, so the per-world requirement of xz is larger than that.
[root@esxi:~] vsish -e get /sched/groups/601/stats/capacity | grep mem-unreserved
mem-unreserved:801712 KB
[root@esxi:~] sh /tmp/stress.sh
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
(...)
[root@esxi:~] vsish -e get /sched/groups/601/stats/capacity | grep mem-unreserved
mem-unreserved:70144 KB
Another way to look at that in detail is via memstats:
[root@esxi:~] memstats -r group-stats -T -g601 -l2 \
> -s gid:name:min:max:conResv:availResv -u mb |
> sed -n '$d;/^-\+/,/.*\n/{//!p}'
---------------------------------------------------------------------------------
      gid  name               min   max  conResv  availResv
---------------------------------------------------------------------------------
      601  ssh                  0   800      732         69
  1022154  sshd.2263398         0    -1        4         69
  1022199  sh.2263403           0    -1        2         69
  1028886  sshd.2264414         0    -1        3         69
  1028931  sh.2264419           0    -1        2         69
  1041729  sshd.2266002         0    -1        3         69
  1041774  sh.2266008           0    -1        2         69
  1077036  dd.2270364           0    -1        2         69
  1077045  xz.2270365           0    -1      101         69
  1077054  dd.2270366           0    -1        2         69
  1077063  xz.2270367           0    -1      101         69
  1077072  dd.2270368           0    -1        2         69
  1077081  xz.2270369           0    -1      102         69
  1077090  dd.2270370           0    -1        2         69
  1077099  xz.2270371           0    -1      102         69
  1077108  dd.2270372           0    -1        2         69
  1077117  xz.2270373           0    -1      101         69
  1077126  dd.2270374           0    -1        2         69
  1077135  xz.2270375           0    -1      101         69
  1077144  dd.2270376           0    -1        2         69
  1077153  xz.2270377           0    -1      101         69
  1077603  memstats.2270441     0    -1        3         69
  1077612  sed.2270442          0    -1        2         69
We won’t cover the tool / columns in detail this time around, but we can see that each pair of xz and dd requires about 103 MB which aren’t available; only 69 are, since the group is limited to 800 MB. I’m also not going to explain the sed, not only because I forgot where I copied and pasted it together from, but also because it behaves differently on ESXi compared to Linux (the first dashed line shouldn’t be printed) and I feel like I would want to explain that? I mean instead of just saying: “busybox utilities are only based on, not copies of, the original ones and can differ in their functionality, tough luck”.
There are multiple options from here on, but my preference would be to reduce the memory consumption if at all possible, so off we go to read man pages. There we find the following two passages:
The memory usage of xz varies from a few hundred kilobytes to several gigabytes depending on the compression settings.
(…)
Select a compression preset level. The default is -6.
If we re-run the script with xz -1, it no longer fails; the memory reservation per instance (32 of them) is down to ~19 MB, quite a few “hundred kilobytes” but still a massive improvement.
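For reference, the only change to the script is the compressor at the end of the pipe; something along these lines:

count=1000000
cores=$(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p')
for i in $(seq 1 ${cores})
do
    # xz -1 = lowest preset, smallest memory footprint per instance
    dd count=${count} if=/dev/urandom | xz -1 > /dev/null &
done
wait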
[root@esxi:~] memstats -r group-stats -T -g601 -l2 \
> -s gid:name:min:max:conResv:availResv -u mb |
> sed -n '$d;/^-\+/,/.*\n/{//!p}'
---------------------------------------------------------------------------------
      gid  name               min   max  conResv  availResv
---------------------------------------------------------------------------------
      601  ssh                  0   800      608        193
  1022154  sshd.2263398         0    -1        4        193
  1022199  sh.2263403           0    -1        2        193
  1028886  sshd.2264414         0    -1        3        193
  1028931  sh.2264419           0    -1        2        193
  1041729  sshd.2266002         0    -1        3        193
  1041774  sh.2266008           0    -1        2        193
  1078062  sh.2270548           0    -1        2        193
  1078107  dd.2270553           0    -1        2        193
  1078116  xz.2270554           0    -1       17        193
  1078125  dd.2270555           0    -1        2        193
(...)
  1078665  dd.2270615           0    -1        2        193
  1078674  xz.2270616           0    -1       17        193
  1078692  memstats.2270619     0    -1        3        193
  1078701  sed.2270620          0    -1        2        193
Thinking back, I realize that I was running bzip2 with -9 because I thought I’d make it “work harder” with more aggressive compression. If I really just need to retire cycles though, then as long as /dev/urandom never blocks (i.e. runs out of randomness) and dd can push it, it should always keep the compression going. Re-testing the initial version of the script with bzip2 -1 didn’t bring any improvement; the single world (thread) test also never really goes above ~85% CPU utilization (%RUN) either. xz -1 beats that: a single world caps a PCPU on its own, but not at scale (32 in my case), which seems to reduce the average to ~80%.
[root@esxi:~] esxtop -u

      ID      GID NAME            NWLD   %USED    %RUN   %SYS   %WAIT (...)
 1081473  1081473 xz.2271014         1  101.99   80.59   0.00   18.81 (...)
 1081437  1081437 xz.2271010         1  101.93   79.49   0.00   20.77 (...)
 1081203  1081203 xz.2270984         1  101.67   80.52   0.00   18.98 (...)
 1081221  1081221 xz.2270986         1  101.41   79.15   0.00   20.93 (...)
 1081257  1081257 xz.2270990         1  101.40   79.66   0.00   21.43 (...)
 1080969  1080969 xz.2270958         1  101.29   79.33   0.00   20.81 (...)
(...)
Since we already had the man page open, we might be able to reduce the number of parallel dd executions, which could be the limiting factor if they are competing for shared and limited busybox resources. There is an option to define the number of worker threads (-T); if that works, we’d only need one instance of dd and, assuming our hypothesis has some merit …
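The reworked pipe could look something like the following sketch; the thread count is again taken from sched-stats and the count value is just a placeholder, not necessarily what produced the output below:

cores=$(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p')
# a single dd feeding one multi-threaded xz instead of one pipe per core
dd count=1000000 if=/dev/urandom | xz -1 -T ${cores} > /dev/null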
[root@esxi:~] esxtop -u

      ID      GID NAME            NWLD    %USED     %RUN   %SYS    %WAIT (...)
 1081941  1081941 xz.2271204        33  3719.24  3076.77   0.00   172.66 (...)
 1081932  1081932 dd.2271203         1   102.49    89.03   0.00     9.40 (...)

Group to expand/rollup (gid): 1081941

      ID      GID NAME            NWLD   %USED    %RUN   %SYS   %WAIT (...)
 2271204  1081941 xz                 1   35.26   34.17   0.00   64.22 (...)
 2271205  1081941 xz                 1  114.46   97.80   0.00    0.94 (...)
 2271206  1081941 xz                 1  117.73   94.61   0.00    4.11 (...)
(...)
 2271235  1081941 xz                 1   91.14   77.50   0.00   21.25 (...)
 2271236  1081941 xz                 1  118.94   98.20   0.00    0.53 (...)
 1081932  1081932 dd.2271203         1  104.66   95.05   0.00    3.69 (...)
(...)
That looks a lot better! Given that we are running multi-threaded, we have to expand the xz group to see the utilization of the individual worlds, which is ~95% on average. The next bottleneck for further scale is probably dd, given that it is nearly capped. Maybe we have to look at increasing the compression level and find some more memory for xz … this would also be necessary for xz -1 if you have more than ~40 cores (at ~19 MB per world, the 800 MB limit of the ssh pool runs out at around 42).
There are many ways to make ESXi cough up the bytes for xz: we could increase the limit of the ssh resource pool, attach xz to an existing, larger one, or create a new pool altogether. While all of this is so-totally-unsupported™, there is still a gradient of “hacky-ness” and likelihood of affecting the host in a negative manner. I’d argue that you shouldn’t touch anything below the host/system or host/vim resource pools, but given that you have read this far, you’re probably also not too concerned about doing things by the book. I’d say the compromise is pointing it at host/user for now.
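If you want to double check the GID of host/user on your own host, the same vsish trick from earlier works; the result should line up with the -g4 used in the memstats calls below:

[root@esxi:~] vsish -e set /sched/groupPathNameToID host/user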
So does increasing the compression preset give us more Core- / Utilization? Let’s test it and monitor the resulting memory consumption with the following couple of lines:
for i in $(seq 1 9)
do
    dd count=20000000 if=/dev/urandom 2>&1 | xz ++group=host/user -T 32 -${i} > /dev/null &
    sleep 120
    memstats -r group-stats -T -g4 -l2 -s name:conResv -u mb | grep xz
    wait
done
- I timed the runtime for xz -1 and selected the count value so I could capture 9 iterations within the 60 minute real time duration, assuming that higher presets will only take slightly longer
- ++group=host/user is what forces the cartel to run in a specific pool; note that you can’t just point this anywhere and we’ll skip creating your own for now
- the sleep before calling memstats is there to allow for any potential ramp-up time; 120 seconds seemed like the maximum necessary after testing at xz -9
[root@esxi:~] dd count=20000000 if=/dev/urandom 2>&1 |
> xz ++group=host/user -T 32 -9 > /dev/null &
[root@esxi:~] while true;
> do memstats -r group-stats -T -g4 -l2 -s name:conResv -u mb | grep xz;
> sleep 10;
> done
  86224  xz.2111427      14056
  86224  xz.2111427      16658
  86224  xz.2111427      19260
  86224  xz.2111427      21862
  86224  xz.2111427      24464
  86224  xz.2111427      27066
  86224  xz.2111427      29668
  86224  xz.2111427      32270
  86224  xz.2111427      34005
  86224  xz.2111427      36607
  86224  xz.2111427      39209
  86224  xz.2111427      40076
  86224  xz.2111427      40076
  86224  xz.2111427      40076
(...)
CTRL-C
Anyhow, the preset trial loop will print:
  29092  xz.2101827        590
  29218  xz.2101880       1120
  29335  xz.2101933       2178
  29398  xz.2101980       2691
  29497  xz.2102031       5320
  30982  xz.2102238       5320
  31090  xz.2102290      10578
  31216  xz.2102347      21095
  31324  xz.2102402      40076
So yeah, approximately double the memory consumption per level, and is it worth it for our use case?
Arguably not. There might be a tiny increase in utilization, but not 40 GB worth, maybe not even 1.5 GB; I feel that xz -4 is a sweet spot of marginally higher utilization at still civilized memory requirements. The behavior of Core Utilization in the 6th iteration is odd though: given that Utilization doesn’t change, it would indicate that some of the xz worlds share cores for a significant amount of time and only slowly ramp up to being scheduled by themselves. Interesting … but not interesting enough to chase user world behavior while misappropriating ESXi for … well, this :-). If you want to know more about CPU Usage and Utilization, you might want to watch the VMworld 2021 Performance Best Practices session from 15 minutes onwards.
TL;DR
All of the above combined leaves us with a simple one-liner:
dd count=100000000 if=/dev/urandom 2>&1 | xz ++group=host/user -T <number_of_worlds> -4 > /dev/null
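If you don’t want to fill in <number_of_worlds> by hand, you could reuse the sched-stats / sed snippet from the beginning, e.g.:

dd count=100000000 if=/dev/urandom 2>&1 | xz ++group=host/user -T $(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p') -4 > /dev/null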
Now don’t ever use it
Why? Because it isn’t the same as contending with other VMs for resources. If you just want to test how your application behaves under contention, give it a CPU and / or memory limit. That isn’t dynamic of course, and the impact is different, so if you want to stress the host’s CPU, caches, memory etc. properly, just use a live Linux in a VM and run stress-ng, please.
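For completeness, a minimal stress-ng invocation inside such a VM could look like this (standard stress-ng options; 0 CPU workers means one per online CPU, adjust the rest to taste):

stress-ng --cpu 0 --vm 2 --vm-bytes 75% --timeout 10m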