Next Generation Arm Server: Ampere’s Altra 80-core N1 SoC for Hyperscalers against Rome and Xeon
by Dr. Ian Cutress on March 3, 2020 11:30 AM EST - Posted in
- CPUs
- Arm
- SoCs
- GIGABYTE Server
- Neoverse
- Neoverse N1
- Ampere
- Packet
- Graviton2
- Altra
Several years ago, at a local event detailing a new Arm microarchitecture core, I recall a conversation with a number of executives: the goal at the time was to get Arm into 25% of servers by 2020. That lofty goal hasn't quite been reached, but after the initial run of Arm-based server designs the market is starting to hit its stride, with Arm's N1 core for data centers getting its first outings. Of the companies announcing an N1-based SoC, Ampere is leading the pack with a new 80-core design aimed at cloud providers and hyperscalers. The new Altra family of products aims to offer competitive performance, performance per watt, scalability up to 210 W, and plenty of IO for enterprise deployments.
The Arm Server Market: 2010-2019 (Abridged)
We’ve seen companies such as Broadcom/Cavium/Marvell, Calxeda, Huawei, Fujitsu, Phytium, Annapurna/Amazon, AppliedMicro/Ampere, and even AMD put Arm-based IP into silicon and subsequently into the server market. Up until recently, most designs have been fairly lackluster – with companies either developing their own core on an Arm architecture license and not getting a performance lift, or using the standard Arm cores and not finding the right mix of performance, power, and software uptake needed to drive home the design. As a result, we’ve seen multiple companies fall by the wayside, be acquired, or limit their activities to specific customers and keep very hush-hush.
First Generation Ampere eMAG, Built on Applied Micro designs
A big example of the 'be acquired' type of company was Annapurna, which Amazon bought and whose work eventually led to the Graviton2 processor released in recent months. This chip has 64 cores based on Arm's N1 design, which is the leading microarchitecture layout for Arm server chips at this point. To that end, Ampere (which acquired Applied Micro's X-Gene server assets) is now set to release its second generation product, with 80 of the N1-based cores, and it now has a name: Altra.
Ampere Altra
Ampere already gave away a number of details about Altra in an announcement late last year; this time around, however, we have concrete details and the company has performance projections. On the back of its first generation eMAG product, Ampere is looking to offer better-than-Graviton2 performance to any cloud provider or hyperscaler who isn't called Amazon, given that Graviton2 is built by Amazon and only available to Amazon. In that regard, Ampere has taken Arm's full recommendations for its N1 design, building a chip with the maximum number of cores that the N1 design supports.
As with other N1-based products, Altra will be single-threaded, ensuring that each thread has its own core and its own resources, and removing any potential core-sharing thread security issues of the kind that have surfaced recently. The Altra SoC is built with containers in mind, ensuring high levels of quality of service with multiple customers on the same chip, and it includes additional RAS features to help ensure consistent performance.
The N1 core is by design what we covered when Arm detailed the microarchitecture last year. Each core has 4-cycle, 64 KB L1 instruction and data caches, along with a 9-11 cycle, 1 MB private L2. This is partnered with 32 MB of system-wide LLC distributed through the SoC mesh, and all of these caches support ECC with SECDED operation. It's worth noting that 32 MB across 80 cores is less per core than Amazon's Graviton2, which has 32 MB for 64 cores. 32 MB is also half of what Arm recommends: in Arm's presentation it stated that it would expect a 64-core design to have 64 MB.
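To put those cache figures in perspective, here is a quick back-of-the-envelope Python sketch (our arithmetic, not Ampere's) comparing last-level cache per core; the first two rows come straight from the numbers above, while the "Arm N1 guidance" row is our reading of Arm's 1 MB/core suggestion for a 64-core design.

```python
# Rough LLC-per-core comparison using the figures quoted in the article.
configs = {
    "Ampere Altra":     {"cores": 80, "llc_mb": 32},
    "Amazon Graviton2": {"cores": 64, "llc_mb": 32},
    "Arm N1 guidance":  {"cores": 64, "llc_mb": 64},  # Arm's ~1 MB/core suggestion
}

for name, c in configs.items():
    per_core = c["llc_mb"] / c["cores"]
    print(f"{name:16}: {c['llc_mb']} MB / {c['cores']} cores = {per_core:.2f} MB per core")
```

Altra lands at 0.40 MB of shared LLC per core versus 0.50 MB on Graviton2, although each core also keeps its full 1 MB of private L2.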
On top of the 80 cores, the SoC will also have eight DDR4-3200 memory channels with ECC support, for up to 4 TB per socket. There are also 128 PCIe 4.0 lanes, of which the CPU can use 32 to hook up to another CPU for dual socket operation. A dual socket system then has a total of 192 PCIe 4.0 lanes between the two processors, as well as support for up to 8 TB of memory. We are told that it's actually the CCIX protocol that runs over these socket-to-socket lanes, at 25 GT/s per lane rather than PCIe 4.0's 16 GT/s. That's good for roughly 50 GB/s in each direction per x16 link.
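As a sanity check on that bandwidth figure, the short Python sketch below works out per-direction bandwidth for an x16 link; it assumes CCIX's Extended Speed Mode keeps PCIe 4.0's 128b/130b encoding, and the helper function is our own illustration rather than anything Ampere has published.

```python
# Per-direction bandwidth of an x16 link, assuming 128b/130b encoding
# (as used by PCIe 4.0; we assume CCIX ESM retains it).
def x16_bandwidth_gbps(gt_per_s: float, lanes: int = 16) -> float:
    encoding_efficiency = 128 / 130
    bits_per_second = gt_per_s * 1e9 * lanes * encoding_efficiency
    return bits_per_second / 8 / 1e9  # GB/s per direction

print(f"PCIe 4.0 x16 (16 GT/s): {x16_bandwidth_gbps(16):.1f} GB/s per direction")
print(f"CCIX ESM x16 (25 GT/s): {x16_bandwidth_gbps(25):.1f} GB/s per direction")
```

At roughly 49 GB/s per direction per x16 link, the quoted ~50 GB/s figure lines up, and with 32 lanes dedicated to the socket-to-socket connection the aggregate would be about double that.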
Each x16 PCIe link can be bifurcated down to x8/x4/x2, and the different variants of the Altra SoC will only be segmented on core count and frequency: all CPUs will support 4 TB of memory and 128 lanes of PCIe 4.0. Each CPU can also support up to four CCIX-attached accelerators.
Altra is built on TSMC's 7nm process, and while it is technically an Armv8.2 design, it does borrow a couple of features from Armv8.3 and Armv8.5, namely hardware-based mitigations for side channel attacks and a couple of other small microarchitectural features.
Each of the 80 cores is designed to run at 3.0 GHz all-core, and Ampere was consistent in its messaging that the top SKU is designed to hold 3.0 GHz at all times, even when both 128-bit SIMD units per core are in use (in effect, an unlimited turbo at 3.0 GHz). The CPU range will vary from 45 W to 210 W and vary in core count; we suspect these SKUs will be derived from a single silicon design, with demand and binning determining what comes out of the fabs. Exact SKUs will be announced later this year.
On manageability and security, Ampere was keen to point out that its new SoC will have two control processors: an SM Pro and a PM Pro. These allow for server manageability up to SBSA Level 4, as well as Secure Boot, RAS error reporting, and advanced power management/temperature control.
Ampere will be launching with two reference designs for Altra: a single socket design called Mt. Snow, and a dual socket design called Mt. Jade. Each design will be available in 1U and 2U form factors, with PCIe 4.0 and CCIX attach, and up to 16 memory modules per socket. We know that the partner for the single socket design is the GIGABYTE Server team, however the dual socket partner has not been announced yet. We have been told that the CPUs are socketed, which makes mass scale production and testing (at least on our side) a little easier.
Projected Performance
Ampere has shared some performance numbers, which as always we take with a grain of salt. These include 2.23x the performance of a single 28-core Intel Xeon Platinum 8280 on SPECrate2017_int, and 1.04x that of a single 64-core AMD EPYC 7742. This is then extended into a number of claims about improved TCO. Ampere didn't provide similar numbers for SPECrate2017_fp, because the company states that the SoC was developed with integer workloads in mind. Exact power/performance numbers were not given; the comparisons are based purely on TDP, which can be an unreliable metric. We'll wait to run our own numbers in due course.
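To illustrate just how rough a TDP-only comparison is, here is a small Python sketch combining Ampere's claimed performance ratios with nominal TDPs: 205 W for the Xeon Platinum 8280 and 225 W for the EPYC 7742 are those parts' published ratings, and pairing them with the 210 W top of Altra's announced range is our assumption, not something Ampere has disclosed.

```python
# Very rough perf-per-TDP-watt comparison derived from Ampere's claimed ratios.
# Actual power draw under SPEC load will differ from TDP, so treat this as an
# illustration of the claim, not a measurement.
altra_tdp = 210  # W, top of the announced 45-210 W range
claims = {
    "Xeon Platinum 8280": {"ratio": 2.23, "tdp": 205},  # Altra perf / Xeon perf
    "EPYC 7742":          {"ratio": 1.04, "tdp": 225},  # Altra perf / EPYC perf
}

for rival, c in claims.items():
    perf_per_watt_gain = c["ratio"] * (c["tdp"] / altra_tdp)
    print(f"vs {rival}: {c['ratio']:.2f}x perf claim -> "
          f"~{perf_per_watt_gain:.2f}x perf per TDP-watt")
```

Even taken at face value, these numbers speak to integer throughput per rated watt rather than measured efficiency, which is why we want to test wall power ourselves.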
Developing a Roadmap: 2021, 2022
One of the key questions going into our briefings with Ampere is how closely they are working with Arm on the next generation enterprise server core designs for upcoming SoCs. They weren’t keen to position themselves as Arm’s key partner in this venture (which might be Amazon, given they were first), but did state that there is a lot of collaboration and feedback that goes into the future designs. As a result, Ampere is able to formally declare a long-term roadmap for its product portfolio.
In this instance, Ampere is stating that today it has the 80-core Altra design on 7nm. In 2021, it will launch its Mystique product, which is currently in development (and which, when asked, Ampere told us will share the same socket as Altra). In 2022, Ampere will launch Siryn; at this point that product has been defined and is awaiting development.
Having a sustained product cadence has been critical to a number of processor designs over the last couple of decades – it tells potential ODM partners and customers that the company is in for the long haul and committed to future developments with targets to meet. Obviously, tying itself to Arm's roadmap helps Ampere in those product definition stages. It's the lack of such a roadmap that has crippled previous Arm designs coming to market – without one, customers are unwilling to invest in a one-generation wonder and provide long-term support for it. There's always the question of whether investment funding might run out, so Ampere's goal with Altra is to be the obvious answer to Graviton2 for the other hyperscalers. With that large market on offer, the aim is to become profitable and self-sustaining as quickly as possible, which in turn gives potential customers even more confidence.
Next Stage for Altra
At this point, Ampere has stated to us that Altra is currently sampling with its key customers who are looking to deploy the hardware. From previous experience, the key customers who are involved early tend to get priority for deployment, and in that respect Ampere has stated that an official SKU list will come to market mid-year, along with pricing, and with official SPEC submissions. Hopefully at that time we will also get instance pricing from the companies intending to deploy the new chip.
We’re currently in talks with Ampere to obtain Altra for in-house testing when they feel it is ready. We have a version of Ampere’s previous generation eMAG workstation that just arrived in the labs, which should help us establish a good baseline from the previous design to the new one. Stay tuned for our coverage of eMAG and Altra!
Related Reading
- 80-Core N1 Next-Gen Ampere, ‘QuickSilver’: The Anti-Graviton2
- Arm Server CPUs: You Can Now Buy Ampere’s eMAG in a Workstation
- Ampere Computing: Arm is Now an Investor
- Ampere eMAG in the Cloud: 32 Arm Core Instance for $1/hr
Comments
Threska - Tuesday, March 3, 2020 - link
Machine Learning.
The_Assimilator - Wednesday, March 4, 2020 - link
Their "Processor Complex" slide specifically calls out fp16 for ML, yet the manufacturer doesn't provide any FP performance numbers. Translation: it's bad. Really, really bad. "I wouldn't use it even if I found it dumpster diving" bad.The wannabe Arm server vendors had an open window of opportunity when Intel was the only game in server town and set their pricing to exorbitant levels of greed as a consequence. Then AMD launched Zen, the cost of server x86 CPUs dropped back down to acceptable levels, and the window slammed firmly shut.
We'll probably continue to see a few more in-house Arm server CPU designs from the cloud providers for their bottom-of-the-barrel pricing tiers, but nobody else is going to waste their money on putting trash Arm chips in your server when they can get a real x86 CPU that can actually do real work like floating point operations and AVX. I give Ampere another two years before the realities of the market relegate it to the annals of history; at best they can hope to be purchased by someone like Microsoft or Oracle.
CiccioB - Wednesday, March 4, 2020 - link
You may be impressed by the number of jobs that do NOT require floating point or AVX calculations to produce a useful result. This chip is intended for cloud and edge work, not for scientific simulations.
Wanting numbers for tasks it was not designed for is stupid.
Believing that the bulk of datacenter work is scientific simulations is also stupid.
However cheap AMD server chips are, they are not cheap or performant enough for certain jobs.
Wilco1 - Wednesday, March 4, 2020 - link
Neoverse N1 has incredibly high FP performance for its tiny size, so I would not be surprised if someone used these as HPC nodes, for transcoding, or in renderfarms etc. Even if it doesn't beat Rome on FP, it should be very close. Higher density and better perf/Watt and perf/dollar than AMD are the major selling points.
CiccioB - Thursday, March 5, 2020 - link
It depends on whether you can distribute the work efficiently across parallel cores, which comes down to how fast communication between cores is and how good memory access is for each of them.
As said, having lots of FPUs does not mean you are good at scientific simulations. You just score well on single core FP synthetic benchmarks.
bananaforscale - Wednesday, March 4, 2020 - link
Am I imagining things or are they using superheroes as CPU names?
littlemoule - Wednesday, March 4, 2020 - link
failed product with no software support. enough said.
GreenReaper - Sunday, March 8, 2020 - link
I was running on an ARM server over four years ago; it has plenty of support for server operations: https://inkbunny.net/j/202221
Whether it makes sense for hosts to deploy is another matter. The momentum here is with AMD, based on what I've seen recently with the cloud products at Oracle, Google, etc.
abufrejoval - Thursday, March 5, 2020 - link
For me the main question remains open: why is this chip economically viable?
From what I understand we get performance on a similar scale as a Rome 64 core, but AMD jumped through hoops to break up the silicon into chiplets for many good reasons.
Yet this seems very much a monolithic chip with plenty of I/O that won't scale well to 7nm: It can't be all that small, yet unless it is quite a bit smaller than the equivalent x86, why does it stand a chance?
Is it perhaps a much better energy proportional compute performance curve under partial and bursty loads?
Trading LLC SRAM area for cores seems pretty significant, too: It hints at workloads with predictable latencies, trading spin-up delay against peak performance.
I am very hopeful a deeper dive into that will come, but until then I am quite puzzled.
Wilco1 - Thursday, March 5, 2020 - link
A Neoverse N1 core with 1MB L2 is ~2.5x as dense as a Zen 2 core with 512KB L2. You can fit all 80 cores plus 32MB L3 in 2 chiplets! The similar Graviton 2 is roughly estimated as 340mm^2. That's less than one-third of the die area of Rome...