A Stacked CPU: Intel’s Foveros

Previous designs from Intel, AMD and Qualcomm are what we call monolithic designs – everything on the processor happens on a single piece of physical silicon. Keeping everything on one piece of silicon makes data management inside the processor simpler, makes the processor easier to design, and streamlines manufacturing and assembly, since there is only one physical element to deal with.

However, the industry has started to move away from these single monolithic designs, as the alternatives offer genuine points of differentiation within a product portfolio. Splitting a design up allows different parts of the processor to be optimized individually, it can be advantageous on cost, and it can push silicon products beyond traditional manufacturing limits.

Monolithic vs Chiplets

You may be aware that recent AMD desktop processors are built on a ‘chiplet’ design. This is where multiple pieces of silicon are connected through wires in the green PCB in order to create a single ‘processor’. By using separate chiplets, each individual chiplet can either be focused on a single task (and be manufactured in the most efficient way for that task) or it can be one of several repeated units designed to scale out compute performance.

For example, a processor core that contains logic circuits might aim for performance, and thus might require a very speed-optimized layout. This has different manufacturing requirements compared to something like a USB controller, which is built to a set of specifications defined by the USB standard.

Under a traditional monolithic regime, the single piece of silicon uses one manufacturing process that has to cater for both situations – the processor core logic and the USB controller alike. By separating different parts of the overall design into different pieces of silicon, each one can be manufactured on the process best suited to it. This only works as long as the connectivity between the chips holds up, but it potentially enables a better mix of performance where you need it and better efficiency (or lower cost) where you need that instead.

Of course, there are trade-offs: additional connectivity is required, and each chiplet needs to be able to connect to other chiplets. The total physical design area of the combined chiplets is often greater than what a single piece of silicon would offer because of these connectivity additions, and assembly can become costly depending on how many parts are involved (and whether those parts are manufactured in different locations). Ultimately, if some chiplets are built on an expensive manufacturing process and others on a cheaper one, we get the benefits of the expensive process (power, performance) where they matter, without having to pay to build everything on that process, saving money overall.
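To make that cost argument concrete, here is a minimal back-of-the-envelope sketch comparing a hypothetical monolithic design built entirely on an expensive leading-edge node against a chiplet design that moves the peripheral logic to a cheaper node. All of the die areas and per-square-millimetre costs below are invented for illustration – they are not real foundry figures.

```python
# Hypothetical cost comparison: monolithic vs chiplet split.
# All areas and per-mm^2 costs are assumptions for illustration only.

LEADING_EDGE_COST_PER_MM2 = 0.25   # assumed $/mm^2 on the expensive node
MATURE_NODE_COST_PER_MM2 = 0.08    # assumed $/mm^2 on the cheaper node

compute_area_mm2 = 300     # assumed area of the performance-critical logic
io_area_mm2 = 150          # assumed area of the I/O and peripheral logic
chiplet_overhead_mm2 = 30  # assumed extra area spent on die-to-die links

# Everything on the expensive node
monolithic_cost = (compute_area_mm2 + io_area_mm2) * LEADING_EDGE_COST_PER_MM2

# Compute stays on the expensive node, peripherals move to the cheap node,
# and both chiplets pay an area penalty for the interconnect
chiplet_cost = (
    (compute_area_mm2 + chiplet_overhead_mm2) * LEADING_EDGE_COST_PER_MM2
    + (io_area_mm2 + chiplet_overhead_mm2) * MATURE_NODE_COST_PER_MM2
)

print(f"Monolithic (all leading edge): ${monolithic_cost:.2f}")
print(f"Chiplet split (mixed nodes):   ${chiplet_cost:.2f}")
```

A real cost model would also factor in yield, packaging and assembly, which this sketch ignores, but it shows why moving the less demanding logic to a cheaper node can pay off even with the extra interconnect area.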

One other benefit a chiplet approach can bring is the total silicon area of the product. Standard monolithic silicon designs, due to the manufacturing process technologies we use today, have an upper bound on how big a single piece of silicon can be. With chiplets, that upper limit only applies to each individual chiplet, so the combined silicon area can exceed anything a single monolithic die could offer. An example of this is AMD’s Rome CPUs, which total an area of over 1000 square millimetres, while the single largest monolithic silicon die is NVIDIA’s A100 GPU, coming in at 826 square millimetres.


NVIDIA's A100 GPU, with one big monolithic die and six high bandwidth memory dies.
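To put the size argument in rough numbers, the sketch below adds up an approximate chiplet breakdown for Rome and compares it against a single-die limit. The per-chiplet areas and the reticle limit are approximate public figures used for illustration, not official die measurements.

```python
# Approximate silicon area of a chiplet product vs a monolithic die.
# Per-chiplet areas and the reticle limit are rough estimates for illustration.

RETICLE_LIMIT_MM2 = 858      # approx. upper bound for a single exposure (~26mm x 33mm)

rome_compute_die_mm2 = 74    # approx. area of one 7nm compute chiplet (CCD)
rome_io_die_mm2 = 416        # approx. area of the 14nm IO die
rome_total = 8 * rome_compute_die_mm2 + rome_io_die_mm2

a100_mm2 = 826               # NVIDIA A100, a single monolithic die

print(f"Rome total silicon:  ~{rome_total} mm^2 across 9 chiplets")
print(f"A100 monolithic die: {a100_mm2} mm^2")
print(f"Reticle limit:       ~{RETICLE_LIMIT_MM2} mm^2 per die")
print(f"Rome exceeds a single-die build by ~{rome_total - RETICLE_LIMIT_MM2} mm^2")
```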


Already in Market: AMD Chiplets

To put this into context of a modern design, AMD’s Ryzen processors use one or more ‘compute chiplets’ combined with a single ‘peripheral’ chiplet (often called an IO die). The compute chiplets are built on TSMC’s high-performance 7nm manufacturing node which extracts peak performance and power from the design. The ‘peripheral’ chiplet, which is not so peak performance focused but more tuned to standards like SATA, PCIe and USB, can be built on a manufacturing node where efficiency is more important, and also the cost can be lower, such as GlobalFoundries’ cheaper 14nm manufacturing node. Put together, these chiplets form a singular product.


AMD's Rome with 1 big IO die and eight compute dies

AMD had to overcome a lot of hurdles to get here: developing a chip-to-chip connectivity standard (known as Infinity Fabric), managing the power of that connectivity, and solving physical manufacturing issues, such as ensuring all of the individual chiplets sit at the same height so the heatspreader and cooler that go on top make proper contact.

One of the benefits of AMD going this route, according to the company, is that it allows them to scale the parts of their design that are easiest to scale for performance (the compute cores), and also manage where they think the future of compute is going. The other big benefit is that the total die area of one of AMD’s server CPUs is larger than what can be manufactured as a single piece of silicon.

This sort of chiplet-based approach also leads people to believe that AMD could swap out a compute chiplet for a graphics chiplet or an AI-focused chiplet, and thus in the future offer different variants of its products depending on the workloads a customer needs to run.

Coming to Market: Intel Chiplets

For Lakefield, Intel also goes down the chiplet route. But instead of placing the chiplets physically alongside each other like AMD, the chiplets are stacked on top of one another. This creates a physically smaller processor package in the x-y dimensions, which is critical for the laptop and small form factor mobile designs that Lakefield is aiming at.

This stacked design trades the problem of physical space for one of cooling: with two high-powered pieces of silicon placed on top of each other, managing thermals becomes more of an issue. Nonetheless, the physically smaller floorplan in the x-y directions (along with a design that pulls more control onto the processor itself) helps build thinner and lighter systems.

For the two stacked chiplets in the middle, the top chiplet is built on Intel’s high-performance 10nm+ manufacturing node and contains the 1+4 compute core configuration, as well as the graphics and the memory controller. The bottom chiplet contains the ‘peripheral’ components that are not as performance critical, such as the security controller, USB ports, and PCIe lanes. This is built on Intel’s cheaper 22nm manufacturing node.

Because this chiplet is on the bottom and has connections for power to pass through, Intel technically calls the lower chiplet an ‘active interposer’. An interposer is a term commonly used when chiplets are connected through a base piece of silicon, rather than through a green package PCB, because it allows communication between chiplets to be faster and more efficient, but it is a more expensive implementation.

What makes it an active interposer, rather than the passive interposers we have seen on some GPUs in recent years, is that it contains functional logic, such as the USB ports, the security, the chipset functions, and others. The passive interposers are just connection passthroughs, taking advantage of faster signaling. Active interposers include functional logic and have an associated power consumption that goes along with that.

The reason I bring this up is because there is some debate as to whether an active interposer is true 3D stacking as traditionally interpreted, or more akin to 2.5D stacking, which is what we commonly call a passive interposer. For those users who read more about Lakefield beyond AnandTech, you are likely to see both used.

Getting Stacked: DRAM and NAND vs Lakefield

The use of stacking is not necessarily new to the world of semiconductors. Both computer random access memory, such as DRAM, and storage components, such as NAND Flash, have implemented multi-layer technology for many years. What makes these elements different is the way they are stacked, as well as the power of the components involved.

The two main ways of stacking silicon together are through simple wire bonding, where the layers are not directly connected, or with Through Silicon Vias (TSVs), which are vertical channels running directly through the silicon layers.

This is an image of Samsung’s NAND wire bonding technique, where multiple layers have separate connections to a base die. There is no direct connection between layers other than the act of physically coming together.

This is ‘Through Silicon Via’ (TSV) stacking, whereby each layer has a vertical channel that connects to the die above and below it. It allows for direct connection through the stack for fast access, which is useful when NAND has 64 or more layers. It can be quite difficult to do as well, but NAND manufacturers are experts in this methodology.

However, DRAM and NAND Flash are not the high-powered elements of a computer. Even the densest memory configurations contribute only single-digit milliwatts of power per layer when in use. Applying these techniques to high-powered computer chips is a bit more complex.

Stacking with Lakefield

What Intel is doing with Lakefield’s stacking is putting multiple layers of high-performance compute into a single product. And while most DRAM and NAND Flash implementations stack silicon on top of each other and then use external wire bonding or TSVs to provide connectivity, in Lakefield the connectivity goes through the silicon itself, as with a traditional interposer (as mentioned above), using die-to-die bonding to provide the communications.

Intel calls its stacking technology ‘Foveros’. It uses a novel design of die-to-die connectivity.

At the bottom is the base packaging material that connects all the signals going out into the system (power, USB, display). On top of this is the base silicon peripheral die, the active interposer, containing things like the USB control, storage control, security, and such.

Between the base silicon peripheral die and the top logic compute die is a method of connecting the two: in this case, a solder ball array with a 50 micron pitch. This is essentially a ‘balls-on-balls’ technique, but with two silicon dies built on different process nodes.

These connections come in three flavors: structural, data, and power. Creating these bumps and ensuring they deliver what is intended is a hard problem – electrical issues, such as capacitance, and computational issues, such as maintaining a clock frequency, have to be managed, along with hitting targets for data bandwidth as well as power.
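For a sense of the density involved, here is a quick sketch of how many bump sites fit in a square millimetre at the 50 micron pitch quoted above, assuming a simple regular grid. The die area used at the end is purely illustrative, and the result is an upper bound since structural, data and power bumps all share that budget.

```python
# Bump density for a die-to-die array at a 50 micron pitch,
# assuming a simple square grid (an idealisation for illustration).

pitch_um = 50
bumps_per_mm = 1000 / pitch_um       # bump sites along one millimetre
bumps_per_mm2 = bumps_per_mm ** 2    # bump sites in one square millimetre

die_area_mm2 = 82                    # illustrative die area, not an official figure
total_bumps = bumps_per_mm2 * die_area_mm2

print(f"{bumps_per_mm2:.0f} bump sites per mm^2 at a {pitch_um} um pitch")
print(f"~{total_bumps:,.0f} bump sites over an {die_area_mm2} mm^2 die (upper bound)")
```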

Here is the main introduction slide that Intel presented at the ISSCC conference regarding the die-to-die interface. Unfortunately this was the quality of the pictures as presented (the unreadable aspect ratios are also native to the presentation).

As mentioned, maintaining clock coherency at speed and low power is a concern, and here’s what Intel did, with each connection operating at 500 mega-transfers per second. The key point on this slide is the power: 0.2 picojoules of energy consumed per bit transferred. If we extrapolate this out to a memory bandwidth of 34 GBps (the maximum memory bandwidth of Lakefield), this equates to 54 milliwatts of power for the data transfer.
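That 54 mW figure falls straight out of the numbers quoted: energy per bit multiplied by the bit rate. A quick check of the arithmetic:

```python
# Reproducing the die-to-die link power estimate from the figures in the text:
# 0.2 pJ per bit at Lakefield's 34 GB/s peak memory bandwidth.

energy_per_bit_pj = 0.2
bandwidth_gbytes_per_s = 34

bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8          # bytes -> bits
power_watts = bits_per_second * energy_per_bit_pj * 1e-12   # pJ -> J

print(f"{power_watts * 1000:.1f} mW for die-to-die transfer at peak bandwidth")
# -> 54.4 mW, matching the figure above
```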

0.2 pJ/bit is one of the benefits of keeping the transmission of the data ‘inside’ the silicon, but moving between the two layers. This is an order of magnitude better than the numbers quoted by AMD for its first generation EPYC server processors, which used data transfer links within the CPU package – AMD quoted 2 pJ/bit transfer by comparison.

Here’s a slide from Intel’s 2018 Hot Chips talk about new data transfer and connectivity suggestions. On the left is ‘on-board’ data transfer through a PCB, which runs at around 20 pJ/bit. In the middle is on-package data transfer, akin to what AMD did with 1st Gen EPYC, at around 1-5 pJ/bit (depending on the technique), and then we get on-silicon data movement, which is more like 0.1 pJ/bit. Intel’s Foveros die-to-die interconnect is closest to the latter.
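Extending the same arithmetic across those classes of interconnect shows why pJ/bit matters so much at high bandwidth. The sketch below reuses the 34 GB/s figure from above purely for comparison, with the rough pJ/bit values quoted in the text.

```python
# Power cost of moving 34 GB/s across different classes of interconnect,
# using the approximate pJ/bit figures quoted in the text.

BANDWIDTH_BITS_PER_S = 34 * 1e9 * 8   # 34 GB/s expressed in bits per second

links = {
    "On-board (PCB trace)":      20.0,  # pJ/bit
    "On-package (1st Gen EPYC)":  2.0,  # pJ/bit
    "Foveros die-to-die":         0.2,  # pJ/bit
    "On-silicon (same die)":      0.1,  # pJ/bit
}

for name, pj_per_bit in links.items():
    power_mw = BANDWIDTH_BITS_PER_S * pj_per_bit * 1e-12 * 1000
    print(f"{name:28s} {power_mw:8.0f} mW")
```

At PCB-class energies the same transfer would cost several watts, which is why keeping the link inside the package, or inside the silicon stack, is such a big win for a low-power design like Lakefield.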

Comments

  • ichaya - Sunday, July 5, 2020 - link

    The chart shows <10% power for <30% perf, and <20% power for <50% perf. That seems like a 2-3x perf/watt difference as well. The A13 has a total of 28MB of cache shared between the CPU+GPU, whereas this seems to have 6MB for the 4+1 CPU cores sans L1 caches.

    I'd love to see an Anandtech article on how Apple's large caches help with the code density differences between x86-64/ARM and with lower clock speeds, power consumption.
  • Wilco1 - Sunday, July 5, 2020 - link

    The code density of AArch64 is significantly better than x86_64, so even at same cache sizes Arm has an advantage.
  • ichaya - Wednesday, July 8, 2020 - link

    Source? Everything I've read says x86-64 still has a diminishing but slight advantage in code density. If anything, lower clock speeds are helping Apple by avoiding memory pressure issues at higher clock speeds. I highly doubt AArch64 could perform the same as x86-64 with equal caches at any clock speed. uArch differences could outweigh these differences, but I've seen evidence of this given how large Apple's caches have been.
  • ichaya - Wednesday, July 8, 2020 - link

    * I've seen no evidence of this given how large Apple's caches have been.

    Correcting the last sentence in post above.
  • Wilco1 - Wednesday, July 8, 2020 - link

    No, x86 has never had good code density, 32-bit x86 is terrible compared to Thumb-2. x86_64 has worse code density than 32-bit x86, and it gets really bad if you use SIMD instructions.

    Try building a large binary on both systems using the same compiler and compare the .text sizes. For example I use all of SPEC2017 built with identical GCC version and options. AArch64 code is generally 10-15% smaller.

    Many AArch64 cores already have higher IPC - yes that absolutely means they are faster than x86 cores at the same clock frequency using similar sized caches.

    This https://images.anandtech.com/graphs/graph15578/115... shows Neoverse N1 has ~28% higher IPC than EPYC 7571 and ~21% higher IPC than Xeon Platinum 8259 on SPECINT2017. While Naples has 2x8MB LLC on each chiplet, the Xeon has 36MBytes, more than the 32MB in Graviton 2 (both also have 1MB L2 per core).

    Recent cores like Cortex-A78 and Cortex-X1 are 30-50% faster than Neoverse N1. Do the math and see where this is going. 2020 is the year when AArch64 servers outperform the fastest x86 servers, 2021 may be the year when AArch64 CPUs outperform the fastest x86 desktops.
  • ichaya - Saturday, July 11, 2020 - link

    If you compare with -march=x86-64 or with a specific uArch like -march=haswell you'll get comparable code sizes to -march=armv8.4-a. But from the runtime code density differences I've seen, x86-64 still seems to have a slight advantage.

    From the article you linked the image from (https://www.anandtech.com/show/15578/cloud-clash-a... "If we were to divide the available cache on a per-thread basis, the Graviton2 leads the set at 1.5MB, ahead of the EPYC’s 1.25MB and the Xeon’s 1.05MB." ARM's system-level cache is a good idea, as is the shared L2 in Apple's A* chips. But the per-thread cache advantages in Graviton and A* seem to signal it's not the uArch making the difference. Similar cores to Graviton's cores with less cache do a lot worse. Not being able to clock higher than 2.5GHz also seems to signal that the uArch/interconnects cannot keep up with memory pressure.

    To the extent that die sizes of these chips (Graviton 2 is 7nm, Epyc 7571 and Intel Xeon 8259CL are 14nm) are comparable, it's features like AVX2/SMT that seem to have been replaced with cache in the benchmarks in the article. I'll be looking forward to A* chips to see how they might stack up in Laptops and Desktops, but these are the doubts I still have.
  • ichaya - Saturday, July 11, 2020 - link

    Correct link in post above: https://www.anandtech.com/show/15578/cloud-clash-a...
  • Wilco1 - Saturday, July 11, 2020 - link

    Runtime code density? Do you mean accurately counting total bytes fetched from L1I and MOP cache? x86 won't look good because of the inefficiency of byte-aligned instructions, needing 2 extra predecode bits per byte and MOPs being very wide on x86 (64 bits in SandyBridge)... It clearly shows why byte-sized instructions are a bad idea.

    The graph I posted is for single-threaded performance, so the amount of cache per-thread is not relevant at all. Arm's IPC is higher and thus it is a better micro architecture than Skylake and EPYC 1. IPC is also ~12% better than EPYC 7742 based on https://www.anandtech.com/show/14694/amd-rome-epyc...

    In terms of all-core throughput the fastest EPYC 7742 does only ~30% better than Graviton 2 on INTrate2006. That's pretty awful considering it has 8 times the L3 cache (yes eight times!!!), twice the threads, runs at up to 3.4GHz and uses twice the power...

    In terms of die size, EPYC 7742 is ~3 times larger in 7nm, so it's extremely area inefficient compared to Graviton 2. So any suggestion that cache is used to make a weak core look better should surely be directed at EPYC?

    Graviton 2 is a very conservative design to save cost, hence the low 2.5GHz frequency. Ampere Altra pushes the limits with 80 Neoverse N1 cores at 3.3GHz base (yes that's base, not turbo!). Next year it will have 128 cores, competing with 128 threads in EPYC 3. Guess how that will turn out?
  • ichaya - Sunday, July 12, 2020 - link

    Code density and decoding instructions are separate things. Here's an older paper on code density of a particular program: http://web.eece.maine.edu/~vweaver/papers/iccd09/l...

    Single threaded workloads are obviously going to do better with a shared system-level cache and, in Apple's case, shared L2 caches. Sharing caches is something that Intel is closer to than AMD. You cannot compare INTrate2006 or any single threaded benchmark running on an ARM chip where all system-level caches are available to one thread with an EPYC 7742 where only 1 CCX's L3 caches are available to one thread. That would be 32MB on Graviton 2 vs 16MB on an AMD EPYC 2 CCX. So AMD is 30% faster with half the cache while clocked 30% higher than Graviton 2.

    I will definitely give credit to efficient shared system/L2 cache usage to Graviton 2, A*, and other ARM chips, but comparing power usage when there are 64 cores of AVX2 on chip when there's nothing comparable on another is an irrelevant comparison if there ever was one.
  • Wilco1 - Sunday, July 12, 2020 - link

    The complexity and overhead of instruction decoding is closely related with the ISA. Byte-aligned instructions have a large cost, and since they don't give a code density advantage, it's an even larger cost! Again if you want to study code density, compare all of SPEC or a whole Linux distro. Code density of huge amounts of compiled code is what matters in the real world, not tiny examples that are a few hundred bytes!

    Well EPYC 7742 is only 21% faster single threaded while being clocked 36% faster. Sure Graviton 2 has twice the L3 available, but the difference between 16 and 32MBytes is hardly going to be 12%. If every doubling gave 10% then the easiest way to improve performance was to keep doubling caches!

    AVX isn't used much, surely not in SPEC, so it contributes little to total power consumption (unless you're trying to say that x86 designers are totally incompetent?). At the end of the day getting good perf/W matters to data centers, not whether a core has AVX or not.
