As part of today’s International Supercomputing 2021 (ISC) announcements, Intel is showcasing that it will be launching a version of its upcoming Sapphire Rapids (SPR) Xeon Scalable processor with high-bandwidth memory (HBM). This SPR-HBM version will come later in 2022, after the main launch of Sapphire Rapids, and Intel has stated that it will be part of its general availability offering to all, rather than a vendor-specific implementation.

Hitting a Memory Bandwidth Limit

As core counts have increased in the server processor space, the designers of these processors have to ensure that there is enough data for the cores to enable peak performance. This means developing large, fast caches per core so that enough data is close by at high speed, building high-bandwidth interconnects inside the processor to shuttle data around, and providing enough main memory bandwidth from data stores located off the processor.


Our Ice Lake Xeon Review system with 32 DDR4-3200 Slots

Here at AnandTech, we have been asking processor vendors about this last point, about main memory, for a while. There is only so much bandwidth that can be achieved by continually adding DDR4 (and soon DDR5) memory channels. Current eight-channel DDR4-3200 memory designs, for example, have a theoretical maximum of 204.8 gigabytes per second, which pales in comparison to GPUs, which quote 1000 gigabytes per second or more. GPUs are able to achieve higher bandwidths because they use GDDR soldered onto the board, which allows for tighter tolerances at the expense of a modular design. Very few main processors for servers have ever had main memory integrated at such a level.
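As a back-of-the-envelope check, peak theoretical DDR bandwidth is just channels × transfer rate × bus width. A minimal sketch, assuming the standard 64-bit (8-byte) bus per DDR channel:

```python
# Theoretical peak memory bandwidth: channels * MT/s * bytes per transfer.
# Assumes a 64-bit (8-byte) data bus per DDR channel; figures are illustrative.

def peak_bandwidth_gbps(channels: int, megatransfers: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s for a given channel count and transfer rate."""
    return channels * megatransfers * bus_bytes / 1000

# Eight-channel DDR4-3200: 8 * 3200 MT/s * 8 B = 204.8 GB/s
print(peak_bandwidth_gbps(8, 3200))  # 204.8
```

The same formula shows why adding channels alone cannot close the gap to GPU-class memory: even doubling to sixteen channels of DDR4-3200 would only reach 409.6 GB/s.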


Intel Xeon Phi 'KNL' with 8 MCDRAM Pads in 2015

One of the processors that used to be built with integrated memory was Intel’s Xeon Phi, a product discontinued a couple of years ago. The basis of the Xeon Phi design was lots of vector compute, controlled by up to 72 basic cores, but paired with 8-16 GB of on-board ‘MCDRAM’, connected via 4-8 on-board chiplets in the package. This allowed for 400 gigabytes per second of cache or addressable memory, paired with 384 GB of main memory at 102 gigabytes per second. However, since Xeon Phi was discontinued, no main server processor (at least for x86) announced to the public has had this sort of configuration.

New Sapphire Rapids with High-Bandwidth Memory

Until next year, that is. Intel’s new Sapphire Rapids Xeon Scalable with High-Bandwidth Memory (SPR-HBM) will be coming to market. Rather than hide it away for use with one particular hyperscaler, Intel has stated to AnandTech that it is committed to making HBM-enabled Sapphire Rapids available to all enterprise customers and server vendors as well. These versions will come out after the main Sapphire Rapids launch, and enable some interesting configurations. We understand that this means SPR-HBM will be available in a socketed configuration.

Intel states that SPR-HBM can be used with standard DDR5, offering an additional tier in memory caching. We understand the HBM can be addressed directly or left as an automatic cache, which would be very similar to how Intel's Xeon Phi processors could access their high-bandwidth memory.

Alternatively, SPR-HBM can work without any DDR5 at all. This reduces the physical footprint of the processor, allowing for a denser design in compute-dense servers that do not rely much on memory capacity (these customers were already asking for quad-channel design optimizations anyway).

Intel did not disclose the memory capacity, the bandwidth, or the technology. At the very least, we expect the equivalent of up to 8-Hi stacks of HBM2e at up to 16 GB each, with 1-4 stacks onboard leading to 64 GB of HBM. At a theoretical top speed of 460 GB/s per stack, this would mean 1840 GB/s of bandwidth, although we can imagine something more akin to 1 TB/s for yield and power, which would still give a sizeable uplift. Depending on demand, Intel may fit different memory configurations into different processor options.
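The arithmetic behind those estimates is straightforward; the per-stack capacity and bandwidth below are our assumptions for HBM2e, not Intel-confirmed figures:

```python
# Rough math behind the HBM2e estimates above. Stack count, capacity, and
# per-stack bandwidth are assumptions, not disclosed specifications.

HBM2E_STACK_CAPACITY_GB = 16      # one 8-Hi HBM2e stack
HBM2E_STACK_BANDWIDTH_GBPS = 460  # theoretical top speed per stack

for stacks in (1, 2, 4):
    capacity = stacks * HBM2E_STACK_CAPACITY_GB
    bandwidth = stacks * HBM2E_STACK_BANDWIDTH_GBPS
    print(f"{stacks} stack(s): {capacity} GB @ {bandwidth} GB/s")
# The four-stack case gives 64 GB at 1840 GB/s, matching the figures above
```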

One of the key elements to consider here is that on-package memory will have an associated power cost within the package. So for every watt that the HBM requires inside the package, that is one less watt for computational performance on the CPU cores. That being said, server processors often do not push the boundaries on peak frequencies, instead opting for a more efficient power/frequency point and scaling the cores. However, HBM in this regard is a tradeoff: if HBM were to take 10-20 W per stack, four stacks would easily eat into the power budget for the processor (and that power budget has to be managed with additional controllers and power delivery, adding complexity and cost).
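To illustrate the tradeoff, a toy calculation; the package TDP and per-stack wattage here are hypothetical figures for the sake of the example, not disclosed numbers:

```python
# Illustrative only: how on-package HBM eats into a fixed socket power budget.
# The package TDP and per-stack wattage are assumptions, not Intel figures.

def cpu_power_budget(package_tdp_w: float, stacks: int, watts_per_stack: float) -> float:
    """Watts left for the CPU cores after on-package HBM takes its share."""
    return package_tdp_w - stacks * watts_per_stack

# A hypothetical 350 W package with four stacks at 15 W each
print(cpu_power_budget(350.0, 4, 15.0))  # 290.0 W remain for the cores
```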

One thing that was confusing about Intel’s presentation (I asked about this during the virtual briefing, but my question was ignored) is that Intel keeps putting out different package images of Sapphire Rapids. In the briefing deck for this announcement, there were already two variants: the one above (which actually looks like an elongated Xe-HP package that someone put a logo on) and this one (which is more square and has different notches):

There have been some unconfirmed leaks online showcasing SPR in a third, different package, making it all the more confusing.

 

Sapphire Rapids: What We Know

Intel has been teasing Sapphire Rapids for almost two years as the successor to its Ice Lake Xeon Scalable family of processors. Built on 10nm Enhanced SuperFin, SPR will be Intel’s first processor family to use DDR5 memory, offer PCIe 5.0 connectivity, and support CXL 1.1 for next-generation connections. Also on the memory front, Intel has stated that Sapphire Rapids will support Crow Pass, the next generation of Intel Optane memory.

For core technology, Intel (re)confirmed that Sapphire Rapids will be using Golden Cove cores as part of its design. Golden Cove will be central to Intel's Alder Lake consumer processor later this year; however, Intel was quick to point out that Sapphire Rapids will offer a ‘server-optimized’ configuration of the core. Intel has done this in the past with both its Skylake Xeon and Ice Lake Xeon processors, wherein the server variant often has a different L2/L3 cache structure than the consumer processors, as well as a different interconnect (ring on consumer, mesh on servers).

Sapphire Rapids will be the core processor at the heart of the Aurora supercomputer at Argonne National Labs, where two SPR processors will be paired with six Intel Ponte Vecchio accelerators, which will also be new to the market. Today's announcement confirms that Aurora will be using the SPR-HBM version of Sapphire Rapids.

As part of this announcement today, Intel also stated that Ponte Vecchio will be widely available, in OAM and 4x dense form factors:

Sapphire Rapids will also be the first Intel processor to support Advanced Matrix Extensions (AMX), which we understand will help accelerate matrix-heavy workloads such as machine learning, alongside BFloat16 support. This will be paired with updates to Intel’s DL Boost software and oneAPI support. As Intel processors are still very popular for machine learning, especially training, Intel wants to capitalize on any future growth in this market with Sapphire Rapids. SPR will also be updated with Intel’s latest hardware-based security.

It is highly anticipated that Sapphire Rapids will also be Intel’s first multi-compute-die Xeon where the silicon is designed to be integrated (we’re not counting Cascade Lake-AP hybrids). There are unconfirmed leaks suggesting this is the case, but nothing Intel has yet verified.

The Aurora supercomputer is expected to be delivered by the end of 2021, and is anticipated to be not only the first official deployment of Sapphire Rapids, but also of SPR-HBM. We expect a full launch of the platform sometime in the first half of 2022, with general availability soon after. The exact launch of SPR-HBM beyond HPC workloads is unknown, however given those time frames, Q4 2022 seems fairly reasonable, depending on how aggressively Intel wants to attack the launch in light of any competition from other x86 vendors or Arm vendors. Even with SPR-HBM being offered to everyone, Intel may decide to prioritize key HPC customers over general availability.

Related Reading


  • SanX - Wednesday, June 30, 2021 - link

    hahahaha While many of you switched into Waiting Mode for the new "miracle" design, counting days till the mid 2022, delayed till end 2022, and finally getting it in mid 2023... i will tell that i am not interested exactly right now. My apps which use parallel algebra will have exactly 0 (=zero) benefits from this DRAM memory no matter how fast and what size
  • mode_13h - Thursday, July 1, 2021 - link

    > My apps which use parallel algebra will have exactly 0 (=zero) benefits from this DRAM memory

    It's true that some apps will not benefit from it.

    The key questions are: "how many will?" and "by how much?" I'm sure the precise config details will also have a lot to do with it.
  • Tomatotech - Friday, July 2, 2021 - link

    My apes also will not benefit from this. This new chip is useless for them.

    They prefer to swing from tree branches and eat bananas.
  • raddude9 - Thursday, July 1, 2021 - link

    So why no mention of latency in the article? Sure, HBM will improve bandwidth, but memory latency is and always has been the Achilles heel of HBM technologies. HBM 1.0 ran the memory at just 500MHz, and although newer versions have improved the speed, they are still well behind DDR4 and DDR5 when it comes to clock speed and latency.
  • mode_13h - Friday, July 2, 2021 - link

    > HBM 1.0 ran the memory at just 500Mhz, and although newer version have improved the speed

    Yeah, so why are you talking about HBM 1.0? In version 2.0, they at least doubled it, and then there's HBM2E and now HBM3.

    > they are still well behind DDR4 and DDR5 when it comes to ... latency

    I'd like to see some data supporting that claim.
  • raddude9 - Saturday, July 3, 2021 - link

    Oh sure, HBM has dramatically improved the latency situation with the more recent versions, but due to the power constraints of stacking multiple memory dies they generally run at lower clock speeds than regular memory, which seems to be why the latency suffers.
    As for supporting info, there are a few studies out there eg:
    https://arxiv.org/pdf/2005.04324.pdf
    A quote from that article: "Shuhai identifies that the latency of HBM is 106.7 ns while the latency of DDR4 is 73.3 ns,"

    It's going to come down to the type of application being run, some will prefer the high bandwidth of HBM but others will suffer from higher latency.
    If only someone could combine HBM and Stacked SRAM...
  • mode_13h - Sunday, July 4, 2021 - link

    > "Shuhai identifies that the latency of HBM is 106.7 ns while the latency of DDR4 is 73.3 ns,"

    That's hardly night-and-day. Also, the DDR4 is a lot better than what I've seen in AnandTech's own memory latency benchmarks. I think their test isn't accounting for the added latency of the DDR4 memory sitting on an external DIMM.

    Finally, the latency of HBM should scale better at higher queue depths. And a 56-core / 112-thread server CPU is going to have some very deep queues in its memory hierarchy.
  • raddude9 - Monday, July 5, 2021 - link

    No, there's not a massive difference between HBM and DDR4 any more but barring some kind of breakthrough HBM will continue to have higher latency.

    I think it's going to come down to the application being run more than things like queue depths. One of the downsides of the HBM approach now is that many of the workloads that would have taken advantage of that approach have already migrated over to GPU's and won't be returning any time soon.

    Still, I'm sure it'll only be a few years before some company gives us a combination of stacked SRAM and HBM on chip with DDR5 for further memory expansion. Can't wait
  • TheJian - Sunday, July 4, 2021 - link

    IAN, any info on Intel 3nm chips coming 2022 from TSMC (in hand according to Linus TT vid 24hrs ago - already working, so not some joke, I knew gelsinger was holding back...LOL)? LInus said niche servers and laptop stuff at least. Probably due to having to leave out superfin stuff, so niche where that missing part won't matter perhaps. Otherwise, I'd just buy all 3nm I could and produce the crap out of those servers that fit (without giving up intel secrets, a 3nm TSMC server from Intel isn't going to lose to 5nm AMD TSMC, even without Intel special sauce), or gpus until the cows come home. Or simply revive your dead HEDT platform and soak up as much 3nm as you can for 2022 and 2023. Every wafer not sold to AMD/NV is a win for Intel and you can make MASS income on gpus right now.

    AMD is so short on wafers they're having to fire up 12nm chips again. So yeah, pull an apple and buy up more than you need if possible and even bid on 2nm ASAP. As long as you can keep your fabs 100%, take every wafer you can from TSMC and flood the gpu market with 3nm gpus. Price how you like, they will sell out completely anyway. You can price to kill AMD/NV NET INCOME, or price to take all that EBAY scalper money by just selling direct for 3x normal launch price etc. :)

    I don't know why anyone thinks AMD is winning. It doesn't matter if your chip is the best if you can't make enough because you keep making consoles for 500mm^2 on the best nodes and making $10-15 on a $100 soc. Those should be SERVER/HEDT/PRO GPU. You'd be making BILLIONS per year instead of ~500mil or losses (see last 15yrs). No, one time tax breaks don't count as a 1B+ NET INCOME Q. They are 4yrs into this victory dance and still can't crack Q4 2009 1B+ NET INCOME Q. Yet the stock has went up 10-15x from then, while shares outstanding have doubled (meaning worth half whatever stock price was then, $2-10 for a decade), and assets have dropped basically in half (though it is coming back slowly). Their stock crashed the same way when Q's dropped back and the people punished the stock from $10-2 again 2009+. You are looking at the same story now people. IF AMD can't get wafers to make chips they're stuck with great tech that can't be sold. Nothing illegal about Intel buying all the 3nm they can to enter the gpu market for round 2 (round 1 was 6nm, and it killed AMD warhol IMHO...LOL).

    Intel can write billions in checks for wafers from TSMC and make money on NEW products (discrete gpu for example). They pissed away 4B+ a year for 4-5yrs on mobile with contra revenue which is why the fabs ended up where they are today (that 20B should have been in 10/7nm and we'd be in a whole other game today). Either way, AMD can't stop Intel's checks and the pissed away 4yrs chasing share instead of INCOME. Now it's time to pay the BIG CHECKS, and, well, only Apple, Intel, NV etc, have them in that order (others in there after Intel, but you get the point).

    3nm Intel HEDT chips could do the exact thing to threadripper that it did to Intel HEDT (doesn't exist today really). Priced right, threadrippers would become hard to sell and with Intel volumes surely 3nm cheaper for them already. Time to turn the tables, and it's even legal this time, and much easier with so many options to slow AMD down, cause issues for NV too, and take a shot at apple's bow while at it, since they're coming directly at everyone anyway (cpu and gpu and GAMING big time). 3nm TSMC won't need Intel's superfin stuff to beat AMD/NV 5nm stuff, so no risk there giving it up to china theft. Business is war, surely Pat is on this angle.
