AppliedMicro's X-Gene 3 SoC Begins Sampling: A Step in ARM's 2017 Server Ambitions
by Johan De Gelas on March 15, 2017 8:15 AM EST- Posted in
- ARMv8
- Xeon
- AppliedMicro
- X-Gene
- SoCs
- Enterprise CPUs
- 16nm
- MACOM
- X-Gene 3
There has been a lot of recent movement in the ARM server SoC space, which has three major players. The third player, AppliedMicro, has been acquired by MACOM. MACOM has announced that X-Gene 3, its third-generation 16-nanometer FinFET Server-on-a-Chip (SoC) solution, is sampling to "lead customers". With all the ARMv8 products launched so far, the ARM server world continues to mature and to move forward.
The AppliedMicro X-Gene 3
Back in 2015, we reviewed the 40 nm 8-core X-Gene 1 (2.4 GHz, 45W), which found a home in HP's Moonshot servers. Performance-wise the SoC was on par with the Atom C2750 (8 cores @ 2 GHz), but it consumed twice as much power, which led to an overall negative conclusion in our review. The power consumption issue was understandable: the chip was baked on a very old 40 nm process. But the performance was rather underwhelming, as we expected more from a 4-issue superscalar processor at 2.4 GHz. The Atom core, by comparison, was only a dual-issue design and offered similar performance at a lower frequency.
Moving forward, we got the X-Gene 2. This was a refresh of the first design, but built on 28 nm. It was still at 2.4 GHz, but with a lower power consumption (35 W TDP) and a smaller die size of around 100 mm². Despite the relatively lackluster CPU performance, the overall efficiency increase meant that the X-Gene 2 did find a home in several appliances where CPU performance was not the top priority, such as switches and storage devices.
MACOM, the new owners of the X-Gene IP, claim that the new X-Gene 3 is a totally different beast. The main performance claim is that it should be >6 times faster in SPECintRate than X-Gene 1 or 2. That performance increase is mostly because the new SoC has 4 times as many cores: 32 rather than 8. Besides the 32 ARMv8-A 64-bit cores in X-Gene 3, it will also include eight ECC capable DDR4-2667 memory channels, supporting up to 16 DIMMs (max. 1 TB), and 42 PCIe Gen 3.0 lanes.
MACOM's reference X-Gene 3 platform has everything working at near full speed: all 32 cores are functional and run as fast as 3.3 GHz. The SoC design provides 32 MB of L3 cache through a coherent network, which we are told also runs 'at full speed'. PCIe, USB and integrated SATA ports all work at full speed as well. Memory is initially limited to 2400 MT/s instead of 2667 MT/s, but considering that the current memory market only offers buffered DDR4 DIMMs at 2400, that is not an immediate issue.
That set of specifications is impressive, but if the X-Gene 3 really wants to be a "Cloud SoC", performance has to be competitive. We look forward to testing.
The ARM Competition
The other two players are Cavium and Qualcomm.
Cavium has been on a buying spree as of late, acquiring Broadcom's Vulcan IP and also Qlogic, a network/storage vendor. If Cavium can inject all that IP into its Thunder-X server SoC line, its next generation could be a very powerful contender.
Qualcomm will have its 48-core Centriq-2400 SoC ready by the second half of this year, and it will run Windows Server.
Predicted Performance Analysis: Xeon-D Alternative
The only performance figures for X-Gene 3 we have seen so far come from a Linley Group white paper, which states:
Based on testing of the current configuration of 3.0GHz CPU frequency and DDR4-2400, the company expects the chip to deliver a SPECint_rate2006 (peak) score of at least 500 when running at its peak speed of 3.3GHz and DDR4-2667 and with some additional hardware and compiler tuning.
That benchmark value is the basis for the claim of "6x more powerful than the predecessor". We can somewhat predict how this can be possible, since SPECInt_Rate2006 scales almost perfectly with core count: 32 cores instead of 8 already give us a 4 times increase. To get an overall 6x bump in performance, each core must then be about 50% faster overall, frequency included.
Most of the performance boost will come from the frequency: the X-Gene 3 can boost to 3.3 GHz versus 2.4 GHz for the X-Gene 2, which translates to a 37.5% increase. The rest of the gains are most likely related to IPC improvements in branch prediction and TLB architecture. All in all, 6 times higher performance is not an outrageous claim, but there are a few snakes in the grass to consider.
Firstly, MACOM extrapolates from numbers measured at 3 GHz to 3.3 GHz. The final frequency of the parts therefore still depends on further tweaking and optimization, and reaching it may push the TDP above 125W. Also of note is that "some additional hardware and compiler tuning" is required, which is a general term for expected software improvements. While that might turn out to be true, other companies have made similar promises and been unable to deliver, so until there is some proof, the claim is hard to judge.
Last year APM estimated that the new X-Gene 3 would achieve 550 SPECInt_Rate2006 at 3 GHz. That claim has been revised to 500 at 3.3 GHz.
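The arithmetic behind these claims can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the figures quoted above. It assumes that SPECint_rate scales linearly with both core count and clock speed, which is a simplification, and the function name `decompose_speedup` is ours, not something taken from MACOM or the Linley paper.

```python
# Back-of-the-envelope check of the X-Gene 3 scaling claims.
# All inputs are figures quoted in the article. Linear scaling with
# core count and frequency is an assumption, not measured behaviour.

def decompose_speedup(total, cores_old, cores_new, freq_old_ghz, freq_new_ghz):
    """Split a throughput speedup into core-count, per-core, frequency
    and residual (IPC, memory, compiler) factors."""
    core_factor = cores_new / cores_old
    per_core_factor = total / core_factor
    freq_factor = freq_new_ghz / freq_old_ghz
    residual_factor = per_core_factor / freq_factor
    return core_factor, per_core_factor, freq_factor, residual_factor


core, per_core, freq, residual = decompose_speedup(
    total=6.0, cores_old=8, cores_new=32, freq_old_ghz=2.4, freq_new_ghz=3.3
)
print(f"core count factor:        {core:.2f}x")
print(f"required per-core factor: {per_core:.2f}x")
print(f"frequency factor:         {freq:.3f}x")
print(f"left for IPC and tuning:  {residual:.3f}x")

# Revised estimate: 550 at 3.0 GHz (last year) versus 500 at 3.3 GHz (now).
old_per_ghz = 550 / 3.0
new_per_ghz = 500 / 3.3
revision_drop = 1 - new_per_ghz / old_per_ghz
print(f"estimated score per GHz fell by {revision_drop:.1%}")
```

Under these assumptions, frequency alone covers most of the required per-core gain, leaving roughly 9% to come from IPC and tuning. The same normalization shows that the revised estimate is about 17% lower per GHz than last year's figure.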
The graph above also seems to show SPEC scores run with GCC, as most published scores place the Xeon E5-2580v4 at 669. While we too favor results obtained with GCC, as they are more realistic, experience makes us wary that the graph above could paint a rosy picture of X-Gene's performance.
The Linley Group states:
“X-Gene 3 can handle a broad range of cloud workloads, including scale-up and scale-out applications. The processor excels on big data, particularly in-memory databases, because of its high memory bandwidth."
The 8-channel 32 core X-Gene 3 achieves 67 GB/s. It is weird that the paper, written in March 2017, still mentions the use of DDR4-2133. If we compare the results to the typical Xeon scores we have measured in previous reviews, we get the following:
Our testing methodology is described here.
For those of you who are not familiar with Stream: the CPU does not matter much. When there are enough cores/threads to generate sufficient demand on the memory subsystem, the peak bandwidth numbers are observed regardless of additional cores (see testing done by Dell). In some circumstances adding more cores actually results in a net decrease. So although we included Intel's top model in the graph above, its extra cores bring no bandwidth benefit.
The 8-channel X-Gene 3 achieves, with 32 cores, somewhere between 24% (compared to the best result of the Xeon in ICC) and 63% better bandwidth than a similar single Xeon system with DDR4 at the same speed. But an Intel system with the same number of memory channels would still be better. For comparison, but not listed on the chart, our test of a single CPU Power8 system achieved 91 GB/s thanks to its memory subsystem (using Centaur chips), despite our relatively simple GCC settings and the use of DDR3-1333. The X-Gene 3 bandwidth numbers are vastly superior to those of the X-Gene 1 (19 GB/s, see here), but it is worth noting that X-Gene 1 had only 4 channels using DDR3-1600.
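To put these Stream numbers in context, it helps to compare them with the theoretical peak of each memory configuration. The sketch below assumes the standard 64-bit (8-byte) data bus per DDR3/DDR4 channel, and it assumes that the 67 GB/s figure was measured with the DDR4-2133 mentioned in the paper. The helper name `peak_bandwidth_gbs` is ours.

```python
# Theoretical peak memory bandwidth versus the Stream figures quoted
# in the article. Assumption: each DDR3/DDR4 channel has a 64-bit
# (8-byte) data bus, and 1 GB = 1e9 bytes.

def peak_bandwidth_gbs(channels, transfers_mt_s, bus_bytes=8):
    """Peak bandwidth in GB/s for a given channel count and transfer rate."""
    return channels * transfers_mt_s * bus_bytes / 1000


# X-Gene 3: 8 channels, assumed DDR4-2133 for the 67 GB/s result.
xgene3_peak = peak_bandwidth_gbs(channels=8, transfers_mt_s=2133)
xgene3_efficiency = 67 / xgene3_peak

# X-Gene 1: 4 channels of DDR3-1600, 19 GB/s measured.
xgene1_peak = peak_bandwidth_gbs(channels=4, transfers_mt_s=1600)
xgene1_efficiency = 19 / xgene1_peak

generational_gain = 67 / 19

print(f"X-Gene 3 peak {xgene3_peak:.1f} GB/s, Stream reaches {xgene3_efficiency:.0%}")
print(f"X-Gene 1 peak {xgene1_peak:.1f} GB/s, Stream reaches {xgene1_efficiency:.0%}")
print(f"X-Gene 3 delivers {generational_gain:.1f}x the bandwidth of X-Gene 1")
```

If those assumptions hold, the X-Gene 3 reaches roughly half of its theoretical peak, a better ratio than the X-Gene 1 managed, so the generational gain is not only a matter of doubling the channel count.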
The X-Gene 3 results are more than respectable, but the official quotes from the Linley paper that 'the processor excels on big data' would seem to come across as an exaggeration without any direct benchmark data to back it up. As always at AnandTech, we like to make conclusions on hard data, and look forward to being able to verify the claims.
Conclusion
From the announcement and released data, the new X-Gene 3 would appear to be the fastest ARMv8-A server SoC announced in the market so far. The engineers behind X-Gene deserve some applause for their tenacity, and for gradually improving their product to the point where it is a serious threat to the lower to mid-end of the Xeon E5 range. But those numbers need to be externally verified.
There are still a number of uncertainties, for sure. The bandwidth numbers are good, but not impressive. The power usage has not been tested, and publishing only SPECInt_rate2006 estimates (that have already been revised downwards) does not by itself guarantee good overall server performance for the platform.
One thing is interesting: the arrival of the X-Gene 3 puts a lot of pressure on Intel's decision to artificially curtail the Xeon D* platform. Intel's fastest Xeon D (D-1587) offers a lot of performance with 16 cores and 32 threads at 2.3 GHz, all inside a low 65W TDP - but the Xeon D has only 2 memory channels, can support only 128 GB of memory, and costs $1754 list price.
*We say curtail because Xeon-D is still based on Broadwell rather than having been updated to the latest microarchitecture. Intel's recent release of new networking-focused Broadwell-based Xeon-D parts suggests that an update to the platform might be far off in the distance.
From what we can tell, the X-Gene 3 is rumored to cost less than $1200. At that price, it offers much more memory bandwidth and capacity, given its 8 channels and support for up to 1 TB. So although we have some reservations, we welcome the X-Gene 3 as the cat among the Xeon D pigeons.
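The platform comparison can be made concrete with a small per-core calculation. This is a rough sketch only: the X-Gene 3 price is a rumor, so $1200 is treated as an upper bound, 1 TB is taken as 1024 GB, and the `Platform` class is a hypothetical helper of ours. It says nothing about per-core performance or power, which remain untested.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    cores: int
    memory_channels: int
    max_memory_gb: int
    price_usd: float  # list price, or rumored upper bound

    def price_per_core(self) -> float:
        return self.price_usd / self.cores

    def memory_per_core_gb(self) -> float:
        return self.max_memory_gb / self.cores

    def cores_per_channel(self) -> float:
        return self.cores / self.memory_channels


# Figures from the article. The X-Gene 3 price is a rumor ("less than
# $1200"), used here as an upper bound; 1 TB is taken as 1024 GB.
xeon_d = Platform("Xeon D-1587", cores=16, memory_channels=2,
                  max_memory_gb=128, price_usd=1754)
xgene3 = Platform("X-Gene 3", cores=32, memory_channels=8,
                  max_memory_gb=1024, price_usd=1200)

for p in (xeon_d, xgene3):
    print(f"{p.name}: ${p.price_per_core():.2f} per core, "
          f"{p.memory_per_core_gb():.0f} GB per core, "
          f"{p.cores_per_channel():.0f} cores per memory channel")
```

On these numbers the X-Gene 3 has half as many cores competing for each memory channel and four times the memory capacity per core, at roughly a third of the price per core if the rumored price holds.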
Additional Images
While at MWC, Anton was able to score some images of an X-Gene 3 system being demonstrated at the show. Although MWC is a mobile show, given the size of ARM's presence it is perhaps not unexpected to see some ARM-based systems on display. The unit was at the Kontron booth, and the date code on the heatspreader puts the manufacturing timing at 2016, week 53.
Related Reading
- Applied Micro's X-Gene: The First ARMv8 SoC
- ARM Challenging Intel in the Server Market: An Overview
- X-Gene 1, Atom C2000 and Xeon E3: Exploring the Scale-Out Server World
24 Comments
deltaFx2 - Monday, March 20, 2017
Spec int rate, 2006 at least, is a bit of a dopey benchmark in that, beyond a certain number of cores, it turns into a bandwidth benchmark. i.e. memory bandwidth matters more than core performance or even memory latency. That's how Cavium is able to post half-decent scores in that benchmark. Their single-thread performance is worse than ARM A53 but they throw a lot of cores at it and a good number of memory channels.
Krysto - Friday, March 24, 2017
Good to see it's quite competitive at 14nm, too. It will be interesting to see how it compares against Qualcomm's 10nm Centriq 2400.
However, I do think that it would be smart of them to release X-Gene 4 on 7nm, 2nd half of 2019 at the latest.
deltaFx2 - Saturday, March 25, 2017
@Krysto: MACOM is trying to get rid of the xgene line. At a price, if possible. I think that's smart, as was Broadcom's move to sell off Vulcan to Cavium. Only Qualcomm has the resources (mostly $$) to aggressively push ARM into the server space (maybe, if their shareholders are willing to bear the losses). And I'm confident it's going to be a hard slog. ARM has to bring something fundamental to the table that x86 does not, and cannot, and I haven't seen it yet. Other than a nebulous idea of 'competition'. Honestly, if I wanted to change ISAs, I'd go with POWER. They have the performance, and have demonstrated they know what they're doing (they invented the field, pretty much).
Chakravaka - Tuesday, April 4, 2017
It is funny how they don't post the TDP of this xgene 3 chip. If you ask any data center operator the most important thing is performance per watt.
From my personal experience, when ARM processors attempt to catch-up with the Intel SPECint rate per thread performance their power consumption shoots up way over Intel CPU's power consumption(atleast 2X). Intel Broadwell TDP is very difficult to beat and in data centers this becomes a huge problem for ARM processors. In addition when you pack 32+ cores in single chip and all the cores start doing memory transfers, the memory channels will saturate in only 16 cores. So there is no real advantage of having 32 core chip for memory bandwidth hungry apps. In fact it is bad thing to have so many cores in a chip for those apps.
Another important things these ARM vendors never tell you is floating point performance of all ARM cpus is very bad compared to Intel. In most of the real life workloads the floating point performance matters and Intel is unbeatable in that category so far. The HPC customers may augment their systems with GPUs and FPGAs but most of the data center customers don't do that although it has started to change. Even if GPUs and FPGAs are used in increasing number, the ARM CPUs don't provide any advantage over Intel. So the only thing Intel has to do to maintain its market share is lower its price. This will surely drive all ARM vendors out of business because their only offering so far is lower price.