Intel Lists Knights Mill Xeon Phi on ARK: Up to 72 cores at 320W with QFMA and VNNI

by Ian Cutress on December 19, 2017 7:00 AM EST

75 Comments | Add A Comment

75 Comments

Today it was noticed, without any fanfare from Intel, that Knights Mill information has been uploaded to Intel’s specification database ARK. There are three known SKUs so far, with up to 72 cores and a 320W TDP. The Xeon Phi 'Knights Mill' products are a new iteration on the older 'Knights Landing' family, with a silicon change for additional AVX-512 instructions. As far as we can tell, these parts will only be available as socketable hosts and not as PCIe add-in cards.

In Intel’s approach to AI and compute and the data center, according to their talk at Hot Chips earlier this year, has multiple solutions depending on the workload. The Xeon Scalable Platform is meant for the widest variety of workloads, Xeon Phi for parallel acceleration and AI training, FPGAs for deep learning inference efficiency and the Crest family with dedicated deep learning by design. Each segment has its own roadmap, and the Xeon Phi roadmap has been somewhat confusing of late.

Moving from the 45nm Knights Ferry in 2010, Intel released 22nm Knights Corner in 2012 and 14nm Knights Landing in 2016. Last year, Intel had a contract with the Argonne National Laboratory to build the Aurora supercomputer at using 10nm Knights Hill (KNH) by 2020, which was the original successor to Knights Landing, however the KNH project was officially scrapped last month (Nov 2017) for various reasons, some citing Intel’s delay in releasing 10nm silicon. Intel still intends to follow through with the Aurora contract by 2021, it should be noted.

The new successor to Knights Landing (KNL) is Knights Mill (KNM), which was detailed back at Hot Chips. For the most part, the KNL and KNM chips are almost identical in core counts, frequencies, 36 PCIe 3.0 lanes, using 16GB of high-bandwidth MCDRAM and having six DRAM channels, but KNM offers a small design tweak to allow for better utilization of its AVX-512 units through supporting new sets of AVX-512 instructions as well as a small amount of silicon redesign. For pure performance metrics, this change means that Knights Mill offers only half of the double-precision performance of Knights Landing, but up to 2x in single precision and 4x with variable precision.

Intel Xeon Phi x205 Family (Knights Mill)
	Cores	Base	Turbo	L2	TDP	DRAM	TCASE
7295	72 / 288	1.50 GHz	1.60 GHz	36 MB	320W	DDR4-2400	77ºC
7285	68 / 272	1.30 GHz	1.40 GHz	34 MB	250W	DDR4-2400	72ºC
7235	64 / 256	1.30 GHz	1.40 GHz	32 MB	250W	DDR4-2133	72ºC

The three parts are officially members of the Xeon Phi x205 family, and match their x200 family members where the product numbers align (eg 7290 and 7295). The 7295 has the most cores and threads, with a higher frequency, more total L2 cache, support for DDR4-2400 and the higher temperature support. The 7285 is the middle of the pack, while the low end 7235 gives up some memory support, with only up to DDR4-2133.

The previous generation parts had some integrated OmniPath fabric versions (such as the 7290F) which do not seem to have migrated over to Knights Mill yet, though could potentially in the future. Each of the parts are similar to KNL, using 36 tiles of two cores, all connected with the 2D mesh, and each tile sharing 1MB of L2 cache with one VPU per core.

On the specification side, everything seems pretty much by the book for core counts and frequencies. The one number that stands out is the 320W for the Xeon Phi 7295. This is a high power consumption number for a socketed processor, and if my memory serves me correctly, this is the highest TDP ever assigned to an Intel socketed CPU. This isn't the highest for a socketed processor though - IBM quoted 300-350W for its z14 hardware earlier this year, although it can go up to 500W on consumer request. The reason for this dramatic increase in Intel’s TDP number (we suspect) is three-fold: one, to catch more processors off the production line into the relevant voltage/frequency window, and two, the new instructions offer better utilization of the silicon, and three: double-pumped execution.

AVX-512 Support Propogation by Various Intel CPUs
	Xeon, Core X	General	Xeon Phi
Skylake-SP	AVX512BW AVX512DQ AVX512VL	AVX512F AVX512CD	AVX512ER AVX512PF	Knights Landing
Cannon Lake	AVX512VBMI AVX512IFMA	AVX512F AVX512CD	AVX512_4FMAPS AVX512_4VNNIW	Knights Mill
Ice Lake	AVX512_VNNI AVX512_VBMI2 AVX512_BITALG AVX512+VAES AVX512+GFNI AVX512+VPCLMULQDQ	AVX512_VPOPCNTDQ	AVX512_4FMAPS AVX512_4VNNIW	Knights Mill
Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (pages 12 and 13)

The two headline changes on instructions for the new parts revolve around support for Quad FMA (QFMA, or 4FMAPS) for 32-bit floating point, and Vector Neural Network Instructions (VNNI) for 16-bit integers. To do this required changing the execution ports on the core.

Previously each VPU per core would offer identical 512-bit interfaces to the AVX-512 unit, supporting single precision (SP) and double precision (DP) math, with any half-precision (HP) falling under an SP block. In the new design, the ports are asymmetrical: the SP/DP block is split into a separate DP and SP/VNNI blocks, with one DP block removed (likely due to the size of the SP/VNNI block). This is best shown in the diagram above, though it means that DP performance is halved (2 blocks to 1), but SP performance is doubled (2 blocks to 4) and VNNI performance is quadrupled (2 SP blocks to 4 HP blocks).

The QFMA implementation allows for sequential fused-multiply-add to accumulate over four sets of calculations with a single instruction. Technically this adds latency, so the calculation needs enough ILP to hide the latency, but offers a single target for the vector accumulator and allows packing together of 12 aligned cycles in DRAM. Each FMA takes about 3 cycles.

For VNNI, the core takes two 16-bit integer inputs to give one 32-bit integer output for a horizontal dot product operation. The reason for using integer math here is two-fold: one, it enables 31-bits of INT prevision vs 24-bits of mantissa in FP, but also because IEEE standards are easier to implement in INT math.

Support for these new instructions and the silicon redesign means two things: one, the decoder potentially can work less, with more math being packed into a single instruction, but also caters to what Intel thought was a bottleneck in AVX-512 utilization: not enough codes were using all the AVX-512 unit all the time. This is designed to help increase utilization, and therefore would increase power consumption.

The other part of the equation is double pumping the AVX-512 unit, ensuring that the unit is fed at all times while reducing decode power.

Overall the KNM core is looks almost identical to the KNL core, aside from the VPU port changes. It still decodes 2-ops per cycle but gives out-of-order execution with 72-inflight micro-ops. To hide a lot of the latency, 4-way hyperthreading is enabled, hence why a 64-core part gives 256 threads.

Our KNL 7210 Sample

Intel’s major partners already have Knights Mill available for customers, although list pricing is currently not in Intel's Price List. It’s interesting that the data being added to Intel’s ARK platform wasn’t accompanied by any announcement, and seems to be a low-key affair.

Source: Wikichip on Twitter, Intel ARK

Additional

Something we missed in our first look at the specifications. Over the older Knights Landing parts, the new Knights Mill parts also support Intel virtualization technologies, specifically VT-d, VT-x and EPT.

75 Comments

View All Comments

Wilco1 - Tuesday, December 19, 2017 - link
GLIBC uses vector instructions indeed, not REP MOV. For small copies it's significantly slower to use REP MOV, and even with improvements it's unlikely going to beat existing memcpy implementations. It hasn't for the last 30 years...
HStewart - Tuesday, December 19, 2017 - link
I am not unix developer but please research your information

The following is the GLIBC library source available online for MemCpy function

https://github.com/lattera/glibc/blob/master/strin...

In that code it uses a Macro called BYTE_COPY_FWD
I don't have unix source but at least from the following link it use REP MOV

http://justinyan.me/post/1689

My guess the macro is used for different instruction sets but for x86 it used REP MOV.

Please provide proof that on Intel that GLIBC does not used REP MOV - I am not unix developer - so I could be wrong on that platform.
Wilco1 - Tuesday, December 19, 2017 - link
I have already researched this which I exactly why I said what I said. Quoting random ancient bits of source code that isn't used doesn't help your case at all. GLIBC uses an AVX memcpy on modern cores. I've benchmarked REP MOV myself - it's ridiculously slow which is why nobody uses it as a memcpy. The best way to be convinced of this is to benchmark it yourself and compare with an optimized AVX implementation (like in GLIBC). There are benchmarks in GLIBC that you can use, but it's not difficult to do something similar on Windows.
mode_13h - Wednesday, December 20, 2017 - link
I know glibc is vectorized after looking at enough backtraces where somebody passed in a bad pointer, thus resulting in a segfault.

:-/
mode_13h - Wednesday, December 20, 2017 - link
memcpy() is so commonly used that it's a good bet it's well-optimized for whatever your platform. I would never advise people to use anything but memcpy(), if they're writing C. In C++, compilers are usually smart enough to insert memcpy() when possible.
Wilco1 - Tuesday, December 19, 2017 - link
DO you really need to spread false information about how great REP MOV is in every possible article? It's not like many people, including me, have explained why REP MOV is a horribly slow and inefficient instruction which isn't used either by compilers or libraries.
HStewart - Tuesday, December 19, 2017 - link
Have you every look at code that generate by compiler, I shown before the compiler uses REP MOV a lot - and it is used by compilers a lot - at least on x86 based platforms. I am not sure what the ARM equivalent is. But it appears to be "vldl.u8" instruction

https://stackoverflow.com/questions/11161237/fast-...

but this does not look like non standard stuff.

I think we are talking two different worlds here - x86 vs ARM. I have not been around Intel assembly for long time, but Intel has vector instruction - unfortunately it limited to certain process and primary used for large transfer. this instruction enhancement is for Ice Lake and like is designed for smaller transfers and REP MOV is used in that case on those platforms.
Wilco1 - Tuesday, December 19, 2017 - link
No, modern compilers don't use REP MOV. Some compilers can use it when optimizing for size, but never when optimizing for performance. Try building a large amount of code with GCC with -O2 for x64 and count the number of REP instructions in the binary. As I said before, you will be educated when you actually do these experiments yourself.
HStewart - Wednesday, December 20, 2017 - link
This is useless - we are talking two different platforms - I research the code and actually look at unassembled code on x86 platform and REP MOV is used. Please show me compiler output for x64 that does not used REP MOV. I have a lot more knowledge in this area - but I do not used GCC compiler - I used Microsoft compilers including latest Visual Studio 2015 compiler.
Wilco1 - Wednesday, December 20, 2017 - link
Really, try doing the research yourself. I disassembled a random large binary in my system (MicrosoftRawCodec.dll), and this is the full list of rep mov instances:

0000000180077695: F3 BE 3F 7B 83 BF rep mov esi,0BF837B3Fh
000000018007AC94: F3 A5 rep movs dword ptr [rdi],dword ptr [rsi]
000000018007E0A2: F3 3E 8B C1 rep mov eax,ecx
0000000180135035: F3 48 A5 rep movs qword ptr [rdi],qword ptr [rsi]
0000000180135362: F3 48 A5 rep movs qword ptr [rdi],qword ptr [rsi]
00000001802368C4: F3 8E 14 00 rep mov ss,word ptr [rax+rax]

A quickly looked at each, they all look like data being disassembled, for example the best candidate to be a real rep mov goes like this:

000000018007E0A1: 2E
000000018007E0A2: F3 3E 8B C1 rep mov eax,ecx
000000018007E0A6: F3 3E 9B rep wait
000000018007E0A9: 54 push rsp
000000018007E0AA: F4 hlt

So there are ZERO rep mov instances in a big DLL containing well over 700000 instructions.

Intel Lists Knights Mill Xeon Phi on ARK: Up to 72 cores at 320W with QFMA and VNNI

Additional

Related Reading

Post Your Comment

75 Comments

View All Comments

Wilco1 - Tuesday, December 19, 2017 - link

HStewart - Tuesday, December 19, 2017 - link

Wilco1 - Tuesday, December 19, 2017 - link

mode_13h - Wednesday, December 20, 2017 - link

mode_13h - Wednesday, December 20, 2017 - link

Wilco1 - Tuesday, December 19, 2017 - link

HStewart - Tuesday, December 19, 2017 - link

Wilco1 - Tuesday, December 19, 2017 - link

HStewart - Wednesday, December 20, 2017 - link

Wilco1 - Wednesday, December 20, 2017 - link

Log in

Don't have an account? Sign up now