AMD Comments on Threadripper 2 Performance and Windows Scheduler
by Ian Cutress on January 14, 2019 9:00 AM EST
Users may have been following Wendell from Level1Techs as he researched why some benchmarks show regressed performance on quad-die Threadripper 2 compared to dual-die configurations. Through his research, he found that the problem was limited to Windows, as cross-platform software did not show the issue on Linux, and that it was not limited to Threadripper 2: quad-die EPYCs were also affected.
At the time, most journalists and analysts noted that the performance was lower, and that the Linux/Windows differences existed, but pointed the finger at the reduced memory performance of the large Threadripper 2 CPUs. Wendell then discovered that removing CPU 0 from the thread pool after the program had started running regained all of the lost performance on Windows.
After some discussions about what exactly the issue was, I helped Wendell with some additional testing by running our CPU suite with an affinity mask that removed CPU 0 from the available options from launch. The results were negative, suggesting that the key was removing CPU 0 while the software was already running.
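For readers unfamiliar with affinity masks: a mask is a bit field in which bit i being set means the process may run on logical CPU i. The following is a minimal Python sketch of how a mask that excludes CPU 0 could be built. The function names are hypothetical and chosen for illustration; this is not the tooling used in the testing described above.

```python
def affinity_mask_without_cpu0(logical_cpus: int) -> int:
    """Return a bit mask allowing every logical CPU except CPU 0."""
    if logical_cpus < 2:
        raise ValueError("need at least two logical CPUs to exclude CPU 0")
    all_cpus = (1 << logical_cpus) - 1  # bits 0..n-1 set
    return all_cpus & ~1                # clear bit 0 (CPU 0)


def allowed_cpus(mask: int) -> list[int]:
    """Expand a mask back into the list of CPU indices it allows."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Example: a part exposing 64 logical CPUs (e.g. 32 cores with SMT).
mask = affinity_mask_without_cpu0(64)
print(hex(mask))  # 0xfffffffffffffffe
```

A hexadecimal value of this form is the kind of mask that the Windows `start /affinity` command accepts at launch, which is the launch-time approach, as opposed to changing affinity while the program is running.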
After this, Wendell repeated his testing on an EPYC 7551 processor, one of the big four-die parts, and confirmed that this was not limited to Threadripper. The problem wasn’t memory; it was almost certainly the Windows scheduler.
'Best NUMA Node' and Windows Hotfix for 2-NUMA
The conclusion was that in a NUMA environment, Windows’ scheduler assigns a ‘best NUMA node’ to each piece of software. The scheduler is programmed to move that software’s threads to that node as often as possible, and will freely kick out other threads that have the same ‘best NUMA node’ setting. When running a single binary that spawns 32/64 threads, every thread from that binary is assigned the same ‘best NUMA node’, and these threads are continually pushed onto that node, kicking out threads that already want to be there. This leads to core contention, and a fully multi-threaded program could spend half of its time shuffling threads around to comply with this ‘best NUMA node’ situation.
The ‘best NUMA node’ mechanism was originally meant for running VMs, such that each VM would run in its own runtime and be assigned a different ‘best NUMA node’ depending on what else was currently on the system.
One would expect this issue to come up in any NUMA environment, such as dual processors or dual-die AMD processors. It turns out that Microsoft has a hotfix in place in Windows for dual-NUMA environments that disables this ‘best NUMA node’ situation. At some point there were enough dual-socket workstation platforms on the market that this made sense, which pushed the ‘best NUMA node’ problem out to environments with three or more NUMA nodes. This is why we see it in quad-die Threadripper and EPYC, and not dual-die Threadripper.
Wendell has been working with Jeremy from Bitsum, creator of the CorePrio software, on developing a way of soft-fixing this issue. The CorePrio software now has an option called ‘NUMA Disassociator’, which probes which software is active every few seconds and adjusts the thread affinity while the software is running (rather than applying an affinity mask at launch, which has no effect).
This is a good temporary solution for sure, but the issue needs to be fixed in the Windows scheduler.
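To illustrate the general idea of adjusting affinity on a timer while software is already running, here is a minimal Python sketch. It is an assumption-laden illustration, not CorePrio's actual implementation, whose internals are not described here: the function name `reapply_affinity`, the injected `set_affinity` callback, the 5-second default interval, and the choice to simply exclude CPU 0 are all hypothetical.

```python
import time
from typing import Callable, Iterable


def reapply_affinity(
    pids: Iterable[int],
    set_affinity: Callable[[int, list[int]], None],
    logical_cpus: int,
    rounds: int,
    interval_s: float = 5.0,  # assumed value for "every few seconds"
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-apply a 'no CPU 0' affinity to running processes, `rounds` times.

    Returns the number of set_affinity calls that succeeded.
    """
    cpus = list(range(1, logical_cpus))  # every logical CPU except CPU 0
    targets = list(pids)
    applied = 0
    for round_no in range(rounds):
        for pid in targets:
            try:
                set_affinity(pid, cpus)
                applied += 1
            except Exception:
                # The callback failed (process exited, access denied, ...);
                # skip this process for the current round.
                continue
        if round_no < rounds - 1:
            sleep(interval_s)  # wait before probing again
    return applied
```

The platform-specific call is deliberately left to the caller. With the third-party psutil package installed, for example, `set_affinity` could be `lambda pid, cpus: psutil.Process(pid).cpu_affinity(cpus)`.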
AMD Comments On The Findings
There have been questions about how much AMD/Microsoft know about this issue, who they are in contact with, and what is being done. AMD was happy to make some comments on the record.
AMD stated that they have support and update tickets open with Microsoft’s Windows team on the issue. They believe they know what the issue is, and commend Wendell for coming very close to it (they declined to go into detail). They are currently comparing notes with Bitsum, and actually helped Bitsum develop the original affinity-masking tool; the ‘NUMA Disassociator’, however, is new.
The timeline for a fix will depend on a number of factors between AMD and Microsoft, but there will be announcements when the fix is ready, along with details of exactly how it will affect performance. Other improvements to help optimize performance will also be included. AMD is still very pleased with Threadripper 2 performance, and is keen to stress that in the most popular performance-related tests, reviews show rendering performance still well above the competition. The company is also working with software vendors to push that performance even further.