How do you test how your applications behave under contention? Wrong answers only. I go first (well, technically second).
One of our customers wanted to test an in-house developed app “while host CPU is high” and apparently tried to do this directly on the ESXi shell using openssl speed -multi <number_of_cores> but that failed after a couple of seconds. On my system, I could work around the error by specifying another cipher (des) for the integrated benchmark, but that doesn’t solve the limited runtime and, as a result, a short drop in utilization between iterations when running this in a loop.
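For illustration, the looped workaround could look something like this; note that this is just my reconstruction of the idea, not the customer’s actual invocation:

while true; do openssl speed -multi <number_of_cores> des; done

The gap between one openssl run finishing and the next one starting is exactly where the short utilization drop comes from.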
On Linux, if you are in a pinch and don’t have e.g. stress-ng installed, a common way just to keep CPUs busy is to compress /dev/urandom directly into /dev/null with the desired number of threads. Let’s try that on ESXi and check esxtop (-u) in a second session:
[root@esxi:~] dd count=1000000 if=/dev/urandom | bzip2 -9 > /dev/null
[root@esxi:~] esxtop -u

     ID     GID NAME             NWLD   %USED    %RUN   %SYS   %WAIT (..)
 224769  224769 bzip2.2133778       1  131.76   88.15   0.00   11.62 (..)
 224760  224760 dd.2133777          1   13.69   15.78   0.00   83.99 (..)
That seems to work well enough. Now let’s run one instance for every core, which we’ll get from the output of sched-stats -t ncpus:
[root@esxi:~] sched-stats -t ncpus
        64 PCPUs
        32 cores
         2 LLCs
         2 packages
         2 NUMA nodes
[root@esxi:~] sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p'
32
Let’s (roughly) walk through that sed (a quick sanity check follows the list):
- -n basically means “don’t print what isn’t explicitly meant to be printed” but if you care, it is better explained e.g. here.
- the beginning of the line (^) may be followed by none or many whitespaces (*), e.g. in case of double digit (10-18) PCPUs and enabled HT / half the cores, since the numerical values are right aligned
- followed by one or many (escaped +) instances of the defined “character class” ([0-9])
- (escaped) parentheses enclose the substring to be printed later, more on escaping
- followed by “ cores” at the end of the line ($)
- p prints the first (\1) remembered part of the match
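A quick way to convince yourself that the expression does what the list above claims is to feed it a hand-crafted line mimicking the right-aligned sched-stats output (purely illustrative):

[root@esxi:~] echo " 32 cores" | sed -n 's/^ *\([0-9]\+\) cores$/\1/p'
32

The leading whitespace and the trailing “cores” are stripped, only the captured number remains.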
I’ll probably have to put a disclaimer somewhere that I’m, as you’ve probably already guessed, by no means a wizard when it comes to regular expressions, sed, awk or anything else shell related. I memorized some common patterns and notations and know enough about the nomenclature to successfully find the rest on e.g. Stack Exchange or sometimes, laboriously, combine close but not exact results via trial and error. It’s not magic, it’s basically the equivalent of remembering how to bottom deal and making the rest up as you go along while smiling confidently.
Putting this into a script could look something like this:
count=1000000
# number of cores, parsed from sched-stats as shown above
cores=$(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p')
for i in $(seq 1 ${cores})
do
    # one background compression pipe per core
    dd count=${count} if=/dev/urandom | bzip2 -9 > /dev/null &
done
wait
- seq will print a sequence from 1 to the number of cores, at the default increment of 1
- we don’t use $i in the loop, we just want it to run once per available core
- & causes each instance to run in the background so they’ll run (approximately) in parallel
- wait makes sure the script doesn’t exit immediately after the loop completes (and kill the still running background tasks too early)
That, however, doesn’t quite get us the result we hoped for …
[root@esxi:~] esxtop -u

     ID     GID NAME              NWLD   %USED    %RUN   %SYS   %WAIT (...)
 226227  226227 bzip2.2134035        1   26.31   20.16   0.00   79.67 (...)
 226425  226425 bzip2.2134057        1   25.76   21.20   0.00   78.42 (...)
 226191  226191 bzip2.2134031        1   25.73   20.02   0.00   79.53 (...)
(...)
 226659  226659 bzip2.2134083        1   25.15   19.87   0.00   80.01 (...)
 226335  226335 bzip2.2134047        1   25.11   19.49   0.00   80.37 (...)
 226209  226209 bzip2.2134033        1   25.09   19.51   0.00   80.44 (...)
 226740  226740 esxtop.2134092       1    9.30    7.14   0.00   93.05 (...)
 226452  226452 dd.2134060           1    4.62    3.55   0.00   96.12 (...)
 226416  226416 dd.2134056           1    4.60    3.54   0.00   96.13 (...)
 226362  226362 dd.2134050           1    4.59    3.62   0.00   96.07 (...)
(...)
 226488  226488 dd.2134064           1    4.12    3.28   0.00   96.36 (...)
 226524  226524 dd.2134068           1    4.10    3.30   0.00   96.33 (...)
 226236  226236 dd.2134036           1    3.94    3.29   0.00   96.43 (...)
/dev/urandom shouldn’t block, so what are the instances of bzip2 waiting on? Can dd not provide enough data fast enough, e.g. is there contention around a common resource when running multiple instances, maybe just in combination with bzip2? Using gzip was even worse, so maybe it is just the method of compression?
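One quick way to sanity check whether dd on its own is the limiting factor would be to run it without a compressor in the pipe and watch it in esxtop again (purely a hypothetical cross-check, not something I dug into here):

[root@esxi:~] dd count=1000000 if=/dev/urandom of=/dev/null &

If a lone dd keeps a PCPU considerably busier than the ~4-5 %USED we see above, the bottleneck sits in the pipe / compressor combination rather than in the random source.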
While that might be interesting to get to the bottom of, I have a feeling it would just turn into troubleshooting busybox, which provides most of the common *NIX utilities on ESXi.
[root@esxi:~] ls -l /bin | egrep "(bzip2|dd)"
lrwxrwxrwx    1 root     root            35 Jan 12  2022 bzip2 -> /usr/lib/vmware/busybox/bin/busybox
lrwxrwxrwx    1 root     root            35 Jan 12  2022 dd -> /usr/lib/vmware/busybox/bin/busybox
[root@esxi:~] find /bin -type l -exec ls -l {} \; | grep busybox | wc -l
102
So can we create load with any of the non-busybox executables?
[root@esxi:~] find /bin \( -type l -or -type f \) -exec ls -l {} \; | grep -v busybox | wc -l
226
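To actually see the candidates instead of just counting them, you could drop the wc -l and grep for the usual compressor names; just an illustrative variation of the command above:

[root@esxi:~] find /bin \( -type l -or -type f \) -exec ls -l {} \; | grep -v busybox | egrep "xz|zip"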
Luckily for us, ESXi ships with xz. Just replacing bzip2 in the script isn’t enough though, because while on my system it will spawn 7 instances, the others will fail:
[root@esxi:~] sh /tmp/stress.sh
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
(...)
Maybe that was somewhat predictable, given that all of this is running under the ssh resource pool:
[root@esxi:~] tail /var/log/vmkernel.log
(...)
(...)Admission failure in path: host/vim/vimuser/terminal/ssh:xz.2269327:uw.2269327
(...)UserWorld 'xz' with cmdline 'xz'
(...)uw.2269327 (1069758) extraMin/extraFromParent: 16417/16417, ssh (601) childEmin/eMinLimit: 197234/204800
To know whether the memory requirements of xz are a deal breaker, we have to check what the current parent resource pool has to offer; for that we need its Group ID (GID):
[root@esxi:~] vsish -e set /sched/groupPathNameToID host/vim/vimuser/terminal/ssh
601
Now I’m just realizing that it gets logged too, but I’m leaving the above in since it is handy to know. The resource pool structure of ESXi is a topic in and of itself; at some point I’ll get around to “translating” a couple of older slides or at least to re-recording them.
Looking at the capacity stats before and after calling the stress script, we see that there are ~70 MB left unreserved, so the per-world requirement of xz is larger than that.
[root@esxi:~] vsish -e get /sched/groups/601/stats/capacity | grep mem-unreserved
mem-unreserved:801712 KB
[root@esxi:~] sh /tmp/stress.sh
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
xz: (stdin): Cannot allocate memory
(...)
[root@esxi:~] vsish -e get /sched/groups/601/stats/capacity | grep mem-unreserved
mem-unreserved:70144 KB
Another way to look at that in detail is via memstats:
[root@esxi:~] memstats -r group-stats -T -g601 -l2 \
> -s gid:name:min:max:conResv:availResv -u mb |
> sed -n '$d;/^-\+/,/.*\n/{//!p}'
---------------------------------------------------------------------------------
      gid  name               min   max  conResv  availResv
---------------------------------------------------------------------------------
      601  ssh                  0   800      732         69
  1022154  sshd.2263398         0    -1        4         69
  1022199  sh.2263403           0    -1        2         69
  1028886  sshd.2264414         0    -1        3         69
  1028931  sh.2264419           0    -1        2         69
  1041729  sshd.2266002         0    -1        3         69
  1041774  sh.2266008           0    -1        2         69
  1077036  dd.2270364           0    -1        2         69
  1077045  xz.2270365           0    -1      101         69
  1077054  dd.2270366           0    -1        2         69
  1077063  xz.2270367           0    -1      101         69
  1077072  dd.2270368           0    -1        2         69
  1077081  xz.2270369           0    -1      102         69
  1077090  dd.2270370           0    -1        2         69
  1077099  xz.2270371           0    -1      102         69
  1077108  dd.2270372           0    -1        2         69
  1077117  xz.2270373           0    -1      101         69
  1077126  dd.2270374           0    -1        2         69
  1077135  xz.2270375           0    -1      101         69
  1077144  dd.2270376           0    -1        2         69
  1077153  xz.2270377           0    -1      101         69
  1077603  memstats.2270441     0    -1        3         69
  1077612  sed.2270442          0    -1        2         69
We won’t cover the tool / columns in detail this time around, but we can see that each pair of xz and dd requires about 103 MB which aren’t available; only 69 are, since the group is limited to 800 MB. I’m also not going to explain the sed, not only because I forgot where I copied and pasted it together from, but also because it behaves differently on ESXi compared to Linux (the first dashed line shouldn’t be printed) and I feel like I would want to explain that? I mean instead of just saying: “busybox utilities are only based on, not copies of, the original ones and can differ in their functionality, tough luck”.
There are multiple options from here on, but my preference would be to reduce the memory consumption if at all possible, so off we go to read man pages. There we find the following two passages:
The memory usage of xz varies from a few hundred kilobytes to several gigabytes depending on the compression settings.
(…)
Select a compression preset level. The default is -6.
If we re-run the script with xz -1, it no longer fails; the memory reservation per instance (32 of them) is down to ~19 MB, quite a few “hundred kilobytes” but still a massive improvement.
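For reference, the only change to the script is the compressor at the end of the pipe; something along these lines:

count=1000000
cores=$(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p')
for i in $(seq 1 ${cores})
do
    # xz -1 = lowest preset, smallest memory footprint per instance
    dd count=${count} if=/dev/urandom | xz -1 > /dev/null &
done
wait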
[root@esxi:~] memstats -r group-stats -T -g601 -l2 \
> -s gid:name:min:max:conResv:availResv -u mb |
> sed -n '$d;/^-\+/,/.*\n/{//!p}'
---------------------------------------------------------------------------------
      gid  name               min   max  conResv  availResv
---------------------------------------------------------------------------------
      601  ssh                  0   800      608        193
  1022154  sshd.2263398         0    -1        4        193
  1022199  sh.2263403           0    -1        2        193
  1028886  sshd.2264414         0    -1        3        193
  1028931  sh.2264419           0    -1        2        193
  1041729  sshd.2266002         0    -1        3        193
  1041774  sh.2266008           0    -1        2        193
  1078062  sh.2270548           0    -1        2        193
  1078107  dd.2270553           0    -1        2        193
  1078116  xz.2270554           0    -1       17        193
  1078125  dd.2270555           0    -1        2        193
(...)
  1078665  dd.2270615           0    -1        2        193
  1078674  xz.2270616           0    -1       17        193
  1078692  memstats.2270619     0    -1        3        193
  1078701  sed.2270620          0    -1        2        193
Thinking back, I realize that I was running bzip2 with -9 because I thought I’d make it “work harder” with more aggressive compression. If I really just need to retire cycles though, then as long as /dev/urandom never blocks (i.e. runs out of randomness) and dd can push it, it should always keep the compression going. Re-testing the initial version of the script with bzip2 -1 didn’t bring any improvement; the single world (thread) test also never really goes above ~85% CPU utilization (%RUN) either. xz -1 beats that: a single world caps a PCPU on its own, but not at scale (32 in my case), which seems to reduce the average to ~80%.
[root@esxi:~] esxtop -u

      ID      GID NAME            NWLD   %USED    %RUN   %SYS   %WAIT (...)
 1081473  1081473 xz.2271014         1  101.99   80.59   0.00   18.81 (...)
 1081437  1081437 xz.2271010         1  101.93   79.49   0.00   20.77 (...)
 1081203  1081203 xz.2270984         1  101.67   80.52   0.00   18.98 (...)
 1081221  1081221 xz.2270986         1  101.41   79.15   0.00   20.93 (...)
 1081257  1081257 xz.2270990         1  101.40   79.66   0.00   21.43 (...)
 1080969  1080969 xz.2270958         1  101.29   79.33   0.00   20.81 (...)
(...)
Since we already had the man page open, we might be able to reduce the number of parallel dd executions, which could be the limiting factor if they are competing for shared and limited busybox resources. There is an option to define the number of worker threads (-T); if that works, we’d only need one instance of dd and, assuming our hypothesis has some merit …
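The reworked pipe could look something like the following sketch; the thread count is again taken from sched-stats and the count value is just a placeholder, not necessarily what produced the output below:

cores=$(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p')
# a single dd feeding one multi-threaded xz instead of one pipe per core
dd count=1000000 if=/dev/urandom | xz -1 -T ${cores} > /dev/null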
[root@esxi:~] esxtop -u

      ID      GID NAME            NWLD    %USED     %RUN   %SYS    %WAIT (...)
 1081941  1081941 xz.2271204        33  3719.24  3076.77   0.00   172.66 (...)
 1081932  1081932 dd.2271203         1   102.49    89.03   0.00     9.40 (...)

Group to expand/rollup (gid): 1081941

      ID      GID NAME            NWLD   %USED    %RUN   %SYS   %WAIT (...)
 2271204  1081941 xz                 1   35.26   34.17   0.00   64.22 (...)
 2271205  1081941 xz                 1  114.46   97.80   0.00    0.94 (...)
 2271206  1081941 xz                 1  117.73   94.61   0.00    4.11 (...)
(...)
 2271235  1081941 xz                 1   91.14   77.50   0.00   21.25 (...)
 2271236  1081941 xz                 1  118.94   98.20   0.00    0.53 (...)
 1081932  1081932 dd.2271203         1  104.66   95.05   0.00    3.69 (...)
(...)
That looks a lot better! Given that we are running multi-threaded, we have to expand the xz group to see the utilization of the individual worlds, which is ~95% on average. The next bottleneck for further scale is probably dd, given that it is nearly capped. Maybe we have to look at increasing the compression level and find some more memory for xz … this would also be necessary for xz -1 if you have more than ~40 cores (at ~19 MB per world, the 800 MB limit of the ssh pool runs out at around 42).
There are many ways to make ESXi cough up the bytes for xz: we could increase the limit of the ssh resource pool, attach xz to an existing, larger one, or create a new pool altogether. While all of this is so-totally-unsupported™, there is still a gradient of “hacky-ness” and likelihood of affecting the host in a negative manner. I’d argue that you shouldn’t touch anything below the host/system or host/vim resource pools, but given that you have read this far, you’re probably also not too concerned about doing things by the book. I’d say the compromise is pointing it at host/user for now.
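If you want to double check the GID of host/user on your own host, the same vsish trick from earlier works; the result should line up with the -g4 used in the memstats calls below:

[root@esxi:~] vsish -e set /sched/groupPathNameToID host/user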
So does increasing the compression preset give us more Core- / Utilization? Let’s test it and monitor the resulting memory consumption with the following couple of lines:
for i in $(seq 1 9)
do
    dd count=20000000 if=/dev/urandom 2>&1 | xz ++group=host/user -T 32 -${i} > /dev/null &
    sleep 120
    memstats -r group-stats -T -g4 -l2 -s name:conResv -u mb | grep xz
    wait
done
- I timed the runtime for xz -1 and selected the count value so I could capture 9 iterations within the 60 minute real time duration, assuming that higher presets will only take slightly longer
- ++group=host/user is what forces the cartel to run in a specific pool; note that you can’t just point this anywhere and we’ll skip creating your own for now
- the sleep before calling memstats is there to allow for any potential ramp-up time; 120 seconds seemed like the maximum necessary after testing at xz -9
[root@esxi:~] dd count=20000000 if=/dev/urandom 2>&1 |
> xz ++group=host/user -T 32 -9 > /dev/null &
[root@esxi:~] while true;
> do memstats -r group-stats -T -g4 -l2 -s name:conResv -u mb | grep xz;
> sleep 10;
> done
  86224  xz.2111427      14056
  86224  xz.2111427      16658
  86224  xz.2111427      19260
  86224  xz.2111427      21862
  86224  xz.2111427      24464
  86224  xz.2111427      27066
  86224  xz.2111427      29668
  86224  xz.2111427      32270
  86224  xz.2111427      34005
  86224  xz.2111427      36607
  86224  xz.2111427      39209
  86224  xz.2111427      40076
  86224  xz.2111427      40076
  86224  xz.2111427      40076
(...)
CTRL-C
Anyhow, the preset trial loop will print:
  29092  xz.2101827        590
  29218  xz.2101880       1120
  29335  xz.2101933       2178
  29398  xz.2101980       2691
  29497  xz.2102031       5320
  30982  xz.2102238       5320
  31090  xz.2102290      10578
  31216  xz.2102347      21095
  31324  xz.2102402      40076
So yeah, approximately double the memory consumption per level, and is it worth it for our use case?
Arguably not. There might be a tiny increase in utilization, but not 40 GB worth, maybe not even 1.5 GB; I feel that xz -4 is a sweet spot of marginally higher utilization at still civilized memory requirements. The behavior of Core Utilization in the 6th iteration is odd though: given that Utilization doesn’t change, it would indicate that some of the xz worlds share cores for a significant amount of time and only slowly ramp up to being scheduled by themselves. Interesting … but not interesting enough to chase user world behavior while misappropriating ESXi for … well, this :-). If you want to know more about CPU Usage and Utilization, you might want to watch the VMworld 2021 Performance Best Practices session from 15 minutes onwards.
TL;DR
All of the above combined leaves us with a simple one-liner:
dd count=100000000 if=/dev/urandom 2>&1 | xz ++group=host/user -T <number_of_worlds> -4 > /dev/null
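If you don’t want to fill in <number_of_worlds> by hand, you could reuse the sched-stats / sed snippet from the beginning, e.g.:

dd count=100000000 if=/dev/urandom 2>&1 | xz ++group=host/user -T $(sched-stats -t ncpus | sed -n 's/^ *\([0-9]\+\) cores$/\1/p') -4 > /dev/null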
Now don’t ever use it
Why? Because it isn’t the same as contending with other VMs for resources. If you just want to test how your application behaves under contention, give it a CPU and / or memory limit. That isn’t dynamic of course, and the impact is different, so if you want to stress the host’s CPU, caches, memory etc. properly, just use a live Linux in a VM and run stress-ng, please.
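For completeness, a minimal stress-ng invocation inside such a VM could look like this (standard stress-ng options; 0 CPU workers means one per online CPU, adjust the rest to taste):

stress-ng --cpu 0 --vm 2 --vm-bytes 75% --timeout 10m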