Of course vMotion is NUMA aware.
At least for how I’d assume the question is most commonly interpreted, that is:
Assuming the source and destination host have the same NUMA topology and each node has enough available memory, does vMotion precopy the memory to the same NUMA client(s) on the destination and is the VM 100 % NUMA local when resumed, esp. for wide VMs?
I’m not even sure what an alternative interpretation could be, maybe whether the stream worlds are local to the NUMA clients during precopy? I guess that depends more on how the scheduler places those worlds and I’d think they would end up closer to the NIC. But worth checking … at some future, undefined point in time, maybe.
The pearl-clutch-worthy question in the title was asked by one of our customers, who wanted to verify documentation from a 3rd party that stated it wasn’t. While I was pretty sure that this was wrong (or at least severely outdated), testing it is fairly straightforward, so I’m saving my luck for gambling on answers that are harder to verify. Let’s walk through the verification step by step, starting by establishing the state of the source host:
[root@esxi_source:~] for numaOption in $(sched-stats -h |
> sed -n 's/^[ \t]\+: \{4\}\(n.*\)$/\1/p');
> do echo;
> sched-stats -t ${numaOption};
> done

 24 PCPUs
 12 cores
 2 LLCs
 2 packages
 2 NUMA nodes

groupName    groupID  clientID  homeNode  affinity  nWorlds  vmmWorlds  localMem  remoteMem  currLocal  cummLocal
vm.1466269   2913542         0         0         3       10         10  20905984          0        100        100
vm.1466269   2913542         1         1         3       10         10  20967424          0        100        100

groupName    groupID  clientID  balanceMig  loadMig  localityMig  longTermMig  monitorMig  loadSwap  localitySwap  pageMigRate
vm.1466269   2913542         0           0        0            0            0           0         0             0            0
vm.1466269   2913542         1           0        0            0            0           0         0             0            0

groupName    groupID  clientID  nodeID    time  timePct    memory  memoryPct  anonMem  anonMemPct  avgEpochs  memMigHere
vm.1466269   2913542         0       0  128131      100  20905984        100    43524          50          9           0
vm.1466269   2913542         0       1       0        0         0          0    42812          49          0           0
vm.1466269   2913542         1       0       0        0         0          0    43524          50          0           0
vm.1466269   2913542         1       1  128131      100  20967424        100    42812          49          9           0

nodeID  used   idle  entitled  owed  loadAvgPct  nVcpu   freeMem  totalMem
     0   122  11878         0     0           0     10  10495200  33456872
     1   149  11851         0     0           0     10  11182620  33554432

The format for the stats is corrected/hop/slit
NodeId       0          1
     0   212/ 0/ 0  297/ 0/ 0
     1   297/ 0/ 0  212/ 0/ 0
I like this one-liner because it gives me most of the information about a host and the current state of the VMs from a NUMA perspective. In ESXi 7.0, the sched-stats option numa-global (the sum of all past NUMA migrations on the host) was dropped in favor of numa-latency, so instead of hard-coding a list of options that differs between current and previous builds, I decided to match against what’s available in the sched-stats help (-h) output.
[root@esxi_source:~] sched-stats -h | sed -n 's/^[ \t]\+: \{4\}\(n.*\)$/\1/p'
ncpus
numa-clients
numa-migration
numa-cnode
numa-pnode
numa-latency
Some of this has been and will be repeated in other articles, but let’s dissect the sed (a tiny demo follows after the list):

- -n basically means “don’t print what isn’t explicitly meant to be printed” but if you care, it is better explained e.g. here.
- the beginning of the line (^) is followed by a bunch (escaped +) of whitespaces ([ \t]), a character class of SPACEs and TABs
- followed by a : and more whiteSPACEs, exactly four of them (\{4\})
- followed by anything starting with an n (n.*) until the end of the line ($), with the (escaped) parentheses enclosing the substring to be printed later
- p prints the \1st remembered part of the match
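To make the capture group concrete, here is a minimal demo on a fabricated input line (the string below is made up, not actual sched-stats -h output); only the part inside the escaped parentheses survives:

echo '     :    numa-clients' | sed -n 's/^[ \t]\+: \{4\}\(n.*\)$/\1/p'
# prints: numa-clients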
Without covering too much of the actual sched-stats output: there is currently one VM running on the host (no vCLS VM since the hosts aren’t in a cluster). It is wide, i.e. it has two NUMA clients (PPDs / Physical Proximity Domains, two clientIDs for the same groupID). Both clients are 100% NUMA local (currLocal), which says nothing about actual guest OS or application level locality, only that our scheduling abstraction (a range of vCPUs and their memory with regard to each other) is local. The amount of free physical memory (freeMem) on the two nodes is approximately the same. Don’t forget to scroll to the right for all the output. While some other stuff in there is interesting too, we’ll cover that another time; for now, let’s look at that VM:
[root@esxi_source:~] vmdumper -l | cut -d \/ -f 2-5 | while read path;
> do egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|
> numa:.*|numaHost:.*|Log for VMware ESX.*" "/$path/vmware.log";
> echo -e;
> done

Log for VMware ESX pid=1466269 version=7.0.3 build=build-20036589 option=Release
DICT numvcpus = "20"
DICT memSize = "40960"
DICT displayName = "test-vm"
DICT numa.autosize.cookie = "200102"
DICT numa.autosize.vcpu.maxPerVirtualNode = "10"
DICT cpuid.coresPerSocket = "10"
numaHost: NUMA config: consolidation= 1 preferHT= 1 partitionByMemory = 0
numa: Resuming from checkpoint using VPD = 10
numaHost: 20 VCPUs 2 VPDs 2 PPDs
numaHost: VCPU 0 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 1 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 2 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 3 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 4 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 5 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 6 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 7 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 8 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 9 VPD 0 PPD 0 NodeMask ffffffffffffffff
numaHost: VCPU 10 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 11 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 12 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 13 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 14 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 15 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 16 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 17 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 18 VPD 1 PPD 1 NodeMask ffffffffffffffff
numaHost: VCPU 19 VPD 1 PPD 1 NodeMask ffffffffffffffff
This one-liner is sometimes called the “vmdumper command”, which is a personal pet peeve of mine since vmdumper -l really just lists information for the running (as opposed to e.g. merely registered) VMs, from which we can then deduce the working directory:
[root@esxi_source:~] vmdumper -l
wid=1466270 pid=-1 cfgFile="/vmfs/volumes/5cded272-3c5304fc-2308-109836041d9b/test-vm/test-vm.vmx" uuid="56 4d d6 9a 24 57 89 38-b8 cb 3f 21 b7 db 55 82" displayName="test-vm" vmxCartelID=1466269

[root@esxi_source:~] vmdumper -l | cut -d \/ -f 2-5
vmfs/volumes/5cded272-3c5304fc-2308-109836041d9b/test-vm
cut on the (escaped) delimiter / and print the 2nd until the 5th field (-f)
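The same field selection on a made-up path, just to illustrate (field 1 is the empty string before the leading slash, which is why the output starts at vmfs):

echo "/vmfs/volumes/datastore1/some-vm/some-vm.vmx" | cut -d \/ -f 2-5
# prints: vmfs/volumes/datastore1/some-vm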
The name of the binary implies the actual major use case: dumping of vmm / vmx memory and associated debugging tasks like sending NMIs (Non-Maskable Interrupts) to the VM, enabling IP (Instruction Pointer) logging, or printing relevant information from VMs whose memory can be dumped, i.e. from those that are running …
[root@esxi_source:~] vmdumper -h
vmdumper: [options] <unsync|sync|vmx|vmx_force|samples_on|samples_off|nmi|backtrace>
        -f: ignore vsi version check
        -h: print friendly help message
        -l: print information about running VMs
        -g: log specified text to the vmkernel log
You could replace the part before the while loop with something else that lists the .vmx file or working directory of running VMs:
[root@esxi_source:~] esxcli vm process list | grep "Config" | cut -d \/ -f 2-5
vmfs/volumes/5cded272-3c5304fc-2308-109836041d9b/test-vm
The rest is really just an egrep against all running VMs’ vmware.log (which is why over-zealous disabling or reducing of VM logging can cause that instrumental information to not be available, see KB 8182749 to check whether you are doing that). egrep is the same as grep -E (extended RegExp) and, for me, mostly a matter of muscle memory; -i means case insensitive, -o only prints the match, nothing before or after, instead of the whole line.
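A minimal illustration of -o and -i together, on a fabricated line rather than a real vmware.log entry; the lowercase pattern still matches the uppercase DICT and only the matching part is printed:

echo 'some prefix DICT numvcpus = "20" some suffix' | grep -Eoi 'dict.*numvcpus.*= .*"'
# prints: DICT numvcpus = "20"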
We want to match a couple of options in the DICTionary, i.e. the non-default vmx options the VM was started with. That is why a grep against the vmware.log is more accurate than against the .vmx, which might have changed since the power on (although the circumstances for that are limited, another long story). You can graph regular expressions in web apps like Regexper or Debuggex btw.
Anyhow, the key facts in the above output for this one VM are:
- more vCPU than cores in a single physical NUMA node
  - refer back to the sched-stats -t ncpus output
- more memory than available in a single physical NUMA node
  - not that it matters for autosizing
- no advanced settings besides setting coresPerSocket to the VPD / PPD size
  - which is a good thing in 95% of cases, long story
- according to the log, preferHT is true, despite not fulfilling any of the conditions:
  - numa.vcpu.preferHT = TRUE (vmx option) / /Numa/PreferHT = 1 (host advanced option)
    - not visible in vmware.log, I checked on the host (see the sketch after this list)
  - vCPUs > cores per host
  - (some other internal advanced settings that aren’t relevant)

I’m pretty sure that is a bug and I’ll update this another day
- edit: if you paid attention, unlike me who thought this was on a different, bigger host, you’ll have noticed that the VM indeed has more vCPUs than the host has cores. Why leave this in before even initially publishing? To keep you on your feet; let this be a reminder that you shouldn’t believe everything you read on the internet.
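For reference, this is roughly how one can check the two explicit settings (a sketch; the .vmx path is the working directory we deduced earlier, and on this host the advanced option was still at its default of 0):

# host-wide advanced option; the Int Value should still be the default (0)
esxcli system settings advanced list -o /Numa/PreferHT

# per-VM option; no output here means numa.vcpu.preferHT is not set
grep -i "preferht" /vmfs/volumes/5cded272-3c5304fc-2308-109836041d9b/test-vm/test-vm.vmx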
What did we want to do again? Ah yes, vMotion the VM to another host. Let’s check the state there too; we just care about some of the options though:
[root@esxi_destination:~] for numaOption in ncpus numa-clients numa-pnode;
> do echo;
> sched-stats -t ${numaOption};
> done

 24 PCPUs
 12 cores
 2 LLCs
 2 packages
 2 NUMA nodes

groupName  groupID  clientID  homeNode  affinity  nWorlds  vmmWorlds  localMem  remoteMem  currLocal  cummLocal

nodeID  used   idle  entitled  owed  loadAvgPct  nVcpu   freeMem  totalMem
     0    18  11982         0     0           0      0  31990120  33456872
     1     6  11993         0     0           0      0  32561460  33554432
Same topology, no VM running and hence more than enough free memory on each node. Before we kick off the vMotion and check the NUMA locality, let’s make sure the VM’s memory isn’t just a bunch of zeros, or even unaccessed from the guest’s point of view, by filling it with random data. There are of course many ways, some more precise than others; here we don’t need to worry too much, so let’s look at the first Google result and use the first solution, mostly because I really like stress-ng (although a custom program filling memory from e.g. /dev/urandom would be more “surgical”, esp. if you create two instances, one for each vNUMA node, see the sketch below).
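Purely as a sketch of that “surgical” variant, and not what I actually ran below: assuming numactl is installed in the guest and the guest really exposes two NUMA nodes, you could pin one writer per node (the 18g sizes are placeholders; pick a bit less than half of MemAvailable each):

# one stress-ng instance per guest NUMA node, bound to that node's CPUs and memory
numactl --cpunodebind=0 --membind=0 stress-ng --vm 1 --vm-bytes 18g --vm-keep --vm-method rand-set &
numactl --cpunodebind=1 --membind=1 stress-ng --vm 1 --vm-bytes 18g --vm-keep --vm-method rand-set &
wait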
The link does explain the MemAvailable metric in /proc/meminfo but if you want to know the details, check out the actual commit (via StackExchange). The initial question asked for 90% (of free) but I’d say, given that we aren’t planning to run anything else, we should upgrade that to 95% (of available).
root@test-vm:~# cat /proc/meminfo
MemTotal:       41186192 kB
MemFree:        40797100 kB
MemAvailable:   40619916 kB
Buffers:               0 kB
Cached:            39608 kB
SwapCached:            0 kB
(...)

root@test-vm:~# echo $((40619916 / 1024))
39667

root@test-vm:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           40220         342       39839           0          38       39666
Swap:              0           0           0

root@test-vm:~# awk '/MemAvailable/{printf "%d\n", $2 * 0.95;}' < /proc/meminfo
38585135

root@test-vm:~# echo $((38585135 / 1024))
37680
We could go higher but I want to avoid constant swapping / IO at all cost. Well, not at all cost apparently, 5% of available memory would be the precise figure here.
Before we kick it off, let’s look at the stress-ng man page for the test we are planning to run. You’ll see that if you don’t also specify --vm-method, it will cycle through all available ones, i.e. what workload is running at a given point isn’t exactly predictable. That level of determinism might not be necessary here but it could be some other time, so let’s specify --vm-method rand-set out of an abundance of caution. And just to avoid any confusion, lower case vm here means virtual memory.
root@test-vm:~# stress-ng --vm-bytes \
> $(awk '/MemAvailable/{printf "%d\n", $2 * 0.95;}' < /proc/meminfo)k \
> --vm-keep --vm-method rand-set --vm 1
stress-ng: info: [739] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [739] dispatching hogs: 1 vm
You can kind of tell from the logging that this isn’t just going to fill the memory and stop; it will continuously stress the memory. We specified 1 worker thread, so it will be some time until all memory is touched / filled. And would you believe it, I actually prepared something on the ESXi host before I started stress-ng to showcase this:
[root@esxi_source:~] memstats -r vm-stats -s name:touched -u mb
 VIRTUAL MACHINE STATS: Sun Aug 14 13:50:08 2022
 -----------------------------------------------
   Start Group ID   : 0
   No. of levels    : 12
   Unit             : MB
   Selected columns : name:touched
   --------------------------
          name     touched
   --------------------------
    vm.1466269         410
   --------------------------
         Total         410
   --------------------------

[root@esxi_source:~] for i in $(seq 1000);
> do memstats -r vm-stats -s name:touched -u mb |
> awk -v date=$(date -Iseconds) '$1 ~ /vm.[0-9]+/ {print date","$2}';
> sleep 1;
> done
2022-08-14T14:01:36+0000,410
2022-08-14T14:01:37+0000,410
2022-08-14T14:01:38+0000,410
2022-08-14T14:01:39+0000,410
2022-08-14T14:01:40+0000,820
2022-08-14T14:01:41+0000,1229
2022-08-14T14:01:42+0000,1639
2022-08-14T14:01:43+0000,1639
2022-08-14T14:01:44+0000,2458
2022-08-14T14:01:45+0000,3687
2022-08-14T14:01:46+0000,3687
2022-08-14T14:01:47+0000,4506
2022-08-14T14:01:48+0000,4916
2022-08-14T14:01:49+0000,6144
2022-08-14T14:01:51+0000,6554
2022-08-14T14:01:52+0000,6554
2022-08-14T14:01:53+0000,7373
2022-08-14T14:01:54+0000,7373
2022-08-14T14:01:55+0000,7783
2022-08-14T14:01:56+0000,8192
2022-08-14T14:01:57+0000,8602
2022-08-14T14:01:58+0000,9831
2022-08-14T14:01:59+0000,10650
2022-08-14T14:02:00+0000,12288
(...)
2022-08-14T14:03:15+0000,33588
2022-08-14T14:03:16+0000,33588
(...)
2022-08-14T14:11:52+0000,36864
2022-08-14T14:11:53+0000,36864
2022-08-14T14:11:54+0000,36864
2022-08-14T14:11:55+0000,37274
2022-08-14T14:11:56+0000,37274
2022-08-14T14:11:57+0000,37274
2022-08-14T14:11:58+0000,37274
CTRL-C
- -s name:touched will select the “name” (vm.cartel ID) and the one stat we are interested in
- -u will change the default unit from KB
  - to MB here
  - gb is available too since 7.0
- the output of date (ISO8601’ish) is passed as a variable to awk via -v
- we match (~ /.*/) the first column ($1) for everything starting with vm followed by … one instance of any character (.) instead of the literal “.” character, since I forgot to escape it with a backslash, followed by a bunch (+) of numbers ([0-9]) (a minimal demo follows right after this list)
- while true loops can sometimes be hard to exit, so I opted to do “more than enough” iterations with a 1 second sleep
  - seq will print a sequence from 1 to 1000, at the default increment of 1
  - we don’t use $i in the loop, we just want it to run (a maximum of) 1000 times
- I do like printing the date and values comma separated; if there is a chance I might want to graph it later, I’ll already have it as .csv
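Here is the awk part in isolation, fed with a single fabricated memstats-style row (name and value made up); it prints the current ISO8601 timestamp, a comma and the second column:

echo " vm.1466269        410" | awk -v date=$(date -Iseconds) '$1 ~ /vm.[0-9]+/ {print date","$2}'
# prints something like: 2022-08-14T14:01:36+0000,410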
touched, the metric formerly known as active, is a sample-based (100 random small pages / min) heuristic to assess how, well, active a VM is, based on the % of those pages being touched (read or written) or dirtied (written) over each iteration. Touched by default includes dirtied, but the latter is available separately in esxtop as TCHD_W and in the vSphere Client’s Advanced Performance Charts as Active Write.
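As a quick sanity check on the numbers above: the VM has 40960 MB configured and the loop plateaus around 37274 MB touched, i.e. roughly 91 % of the configured memory, which lines up with stressing ~95 % of MemAvailable inside the guest. A throwaway calculation with just the figures from this article:

echo $((37274 * 100 / 40960))
# prints: 91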
Running it for ~12 minutes shows that pretty much all mapped memory has recently been touched, statistically speaking. This means that the VM’s memory should be full of random data, and it is actually continuing to be filled at whatever rate a single worker thread manages.
[root@esxi_source:~] esxtop -u
      ID      GID NAME     NWLD   %USED    %RUN  %SYS   %WAIT %VMWAIT  %RDY   %IDLE %OVRLP %CSTP %MLMTD %SWPWT
 2913542  2913542 test-vm    36  109.16   99.54  0.01 3450.18    3.59  0.15 1869.14   0.08  0.00   0.00   0.00
When I looked at the VM’s performance charts in the UI, I did notice an interesting difference between Active and Active Write. Since the latter isn’t available in memstats, I had to get a .csv export from the vSphere Client. Why no screenshot? Meh, hard to get right for light and dark mode. For anyone who likes a good ASCII graph (and who doesn’t?), here are the plotted, vSphere Client exported metrics:
root@foo:~# plot -y 9:0 -d 30:73 -b 0:40000000 -s ascii \
> -i <(cut -d , -f 3 /tmp/exported.csv) \
> -i <(cut -d , -f 2 /tmp/exported.csv)

[ASCII plot, y-axis 0 to 40000000 (KB): both metrics ramp up from 0 at the start; Active then stays fairly flat around 37-39 million KB, while Active Write oscillates roughly between 22 and 34 million KB.]
Active Write is the lower, more volatile one; it seems the workload does a good bit of just reading and can’t dirty the whole VM’s memory over a one minute sample … this didn’t change with two worker threads either, so the bottleneck is somewhere else. Hmmm … but no, let’s not make this article ADHD incarnate, I mean more than it already is.
Whether with one or two worker threads, the continued changing of memory might affect the switchover time in our little lab, and we already got what we wanted from stress-ng (which is filling the memory with random values) …
root@test-vm:~# ps | grep stress
 1300 root     stress-ng --vm-bytes 38586385k --vm-keep --vm-method rand-set --vm 1
 1301 root     {stress-ng-vm} stress-ng --vm-bytes 38586385k --vm-keep --vm-method rand-set --vm 1
 1302 root     {stress-ng-vm} stress-ng --vm-bytes 38586385k --vm-keep --vm-method rand-set --vm 1
 1324 root     grep stress

root@test-vm:~# kill -TSTP 1300 1301 1302

[root@esxi_source:~] esxtop -u
      ID      GID NAME     NWLD   %USED    %RUN  %SYS   %WAIT %VMWAIT  %RDY   %IDLE %OVRLP %CSTP %MLMTD %SWPWT
 2913542  2913542 test-vm    36    0.71    0.67  0.01 3546.22   39.83  0.10 1930.25   0.01  0.00   0.00   0.00
In the past I would have used kill -STOP to suspend the processes, but I checked and learned that there is a “politer” method. And it feels good to be nice, even if it is just about suspending a couple of threads gently.
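For completeness, the difference in one line each: SIGTSTP can be caught or ignored by the process while SIGSTOP cannot, and SIGCONT resumes either way (PIDs as in the ps output above):

kill -TSTP 1300 1301 1302   # polite: the process may catch or ignore it
kill -STOP 1300 1301 1302   # blunt: cannot be caught or ignored
kill -CONT 1300 1301 1302   # resume, should we want the stress to continue later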
Time to prepare the migration in the vSphere Client, but just before we hit “Finish” in the dialog, let’s kick off some monitoring on the destination:
[root@esxi_destination:~] for i in $(seq 200);
> do echo -en "$(date -Iseconds),";
> sched-stats -t numa-pnode |
> awk '$8 ~ /[0-9]+/ {print int($8 / 1024)}' |
> sed 'N;s/\n/,/';
> sleep 1;
> done
2022-08-14T20:56:36+0000,30774,31336
2022-08-14T20:56:37+0000,30774,31336
2022-08-14T20:56:38+0000,30772,31334
2022-08-14T20:56:39+0000,30772,31334
2022-08-14T20:56:40+0000,30773,31334
2022-08-14T20:56:41+0000,30773,31334
2022-08-14T20:56:42+0000,30768,31330
2022-08-14T20:56:43+0000,30768,31330 <--- vMotion start
2022-08-14T20:56:45+0000,10257,10818 <--- brief allocation check?
2022-08-14T20:56:46+0000,10257,10818
2022-08-14T20:56:47+0000,28783,31276 <--- free space on node 0 is reducing
2022-08-14T20:56:48+0000,28783,31276
2022-08-14T20:56:49+0000,26523,31284
2022-08-14T20:56:50+0000,26523,31284
2022-08-14T20:56:51+0000,24250,31268
2022-08-14T20:56:52+0000,24250,31268
2022-08-14T20:56:53+0000,22031,31268
2022-08-14T20:56:54+0000,22031,31268
2022-08-14T20:56:55+0000,19821,31274
2022-08-14T20:56:56+0000,19821,31274
2022-08-14T20:56:57+0000,17608,31278
2022-08-14T20:56:58+0000,17608,31278
2022-08-14T20:56:59+0000,15390,31283
2022-08-14T20:57:00+0000,15390,31283
2022-08-14T20:57:01+0000,13172,31281
2022-08-14T20:57:02+0000,13172,31281
2022-08-14T20:57:03+0000,10952,31281
2022-08-14T20:57:04+0000,10952,31281
2022-08-14T20:57:05+0000,10257,29680 <--- free space on node 1 is reducing
2022-08-14T20:57:06+0000,10257,29680
2022-08-14T20:57:08+0000,10255,27456
2022-08-14T20:57:09+0000,10253,25234
2022-08-14T20:57:10+0000,10253,25234
2022-08-14T20:57:11+0000,10250,23014
2022-08-14T20:57:12+0000,10250,23014
2022-08-14T20:57:13+0000,10249,20793
2022-08-14T20:57:14+0000,10249,20793
2022-08-14T20:57:15+0000,10246,18572
2022-08-14T20:57:16+0000,10246,18572
2022-08-14T20:57:17+0000,10243,16350
2022-08-14T20:57:18+0000,10243,16350
2022-08-14T20:57:19+0000,10241,14131
2022-08-14T20:57:20+0000,10241,14131
2022-08-14T20:57:21+0000,10239,11907
2022-08-14T20:57:22+0000,10239,11907
2022-08-14T20:57:23+0000,10191,10733 <--- vMotion end
2022-08-14T20:57:24+0000,10191,10733
2022-08-14T20:57:25+0000,10194,10737
2022-08-14T20:57:26+0000,10194,10737
2022-08-14T20:57:27+0000,10201,10745
2022-08-14T20:57:28+0000,10201,10745
2022-08-14T20:57:29+0000,10206,10750
2022-08-14T20:57:31+0000,10206,10750
2022-08-14T20:57:32+0000,10210,10753
2022-08-14T20:57:33+0000,10210,10753
2022-08-14T20:57:34+0000,10210,10754
CTRL-C
- the one-liner is similar enough to the other ones here; the KB to MB calculation results in a remainder and the “cast” to int gets rid of that
- also, no printing of the date via awk; I fought with formatting the two rows but gave up and just echoed it first, removed the newline (-n) and then replaced the remaining one between the two sched-stats rows via sed (a small demo follows after this list)
  - I’m fully aware of my utter loss of shell street cred and the sneers of awk purists (or really just anyone half-competent) that will rightfully be drawn my way; the shame of cutting corners just because “it does what I need” and not looking as smart as I could on the Internet will haunt me for the foreseeable future
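For the curious, the sed trick in isolation on two fabricated value lines; N appends the next line to the pattern space and the embedded newline is then replaced with a comma:

printf '30774\n31336\n' | sed 'N;s/\n/,/'
# prints: 30774,31336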
You might ask, why look at the NUMA node’s freeMem (column $8 in numa-pnode) instead of the local memory of the VM’s NUMA clients? That is what I did first, but it turns out those values aren’t populated until the VM resumes, i.e. it basically looks like this the whole time during the precopy:
[root@esxi_destination:~] sched-stats -t numa-clients
groupName    groupID  clientID  homeNode  affinity  nWorlds  vmmWorlds  localMem  remoteMem  currLocal  cummLocal
vm.1177915    943151         0         0         3        0          0         0          0          0          0
vm.1177915    943151         1         1         3        0          0         0          0          0          0
But the moment the VM resumes on the destination:
[root@esxi_destination:~] sched-stats -t numa-clients
groupName    groupID  clientID  homeNode  affinity  nWorlds  vmmWorlds  localMem  remoteMem  currLocal  cummLocal
vm.1177915    943151         0         0         3       10         10  20905984          0        100        100
vm.1177915    943151         1         1         3       10         10  20967424          0        100        100
freeMem has no such hang-ups, besides the little blip which is probably an allocation check; here are the plotted values for both nodes:
root@foo:~# plot -y 6:0 -m -d 39 -b 0:32000 -s ascii \
> -i <(cut -d , -f 3 /tmp/plot.txt) \
> -i <(cut -d , -f 2 /tmp/plot.txt)

[ASCII plot, y-axis 0 to 32000 (MB): both nodes start at roughly 31000 MB free; node 0 drops in steps to ~10200 MB first, then node 1 follows down to ~10700 MB, matching the staggered precopy visible in the raw numbers above.]
Now that the vMotion is done, we could check how many of the pages were zero / unaccessed via /vmkModules/migrate/migID/*/worldID/*/stats, but I’m afraid I’d never finish if I did. So let’s end it here.
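In case you want to poke at that yourself, that path is a vsish node, so something along these lines should get you there (a hedged sketch; I haven’t spelled out the exact node layout and the IDs are placeholders you would take from the ls output):

vsish -e ls /vmkModules/migrate/migID/
vsish -e ls /vmkModules/migrate/migID/<migID>/worldID/
vsish -e get /vmkModules/migrate/migID/<migID>/worldID/<worldID>/stats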
P.S.
I think the source of the confusion is this 10+ year old KB, and of course not re-validating claims in documentation on a semi-regular basis.
P.P.S.
If you are wondering whether it should be precopy, pre-copy, preCopy, PreCopy or other variations, I did too, and given that these seem to be used interchangeably, I checked how often each is used in strings / log messages and comments. The clear leader is precopy, followed by pre-copy at maybe a 10:1 ratio; the remainder are one-offs.