ARM moves at an aggressive pace, pushing out new processor IP on a yearly cadence. It needs to move fast partly because it has so many partners across so many industries to keep happy and partly because it needs to keep up with the technology its IP comes into contact with, everything from new process nodes to higher quality displays to artificial intelligence. To keep pace, ARM keeps multiple design teams in several different locations all working in parallel.

At its annual TechDay event last year, held at one such facility in Austin, Texas, ARM introduced the Mali-G71 GPU—the first to use its new Bifrost GPU architecture—and the Cortex-A73 CPU—a new big core to replace the A72 in mobile. Notably absent, however, was a new little core.

Another year, another TechDay, and another ARM facility (this time in Cambridge, UK) can only mean new ARM IP. Over the span of several days, we got an in-depth look at its latest technologies, including DynamIQ, the Mali-G72 GPU, the Cortex-A75, and (yes, finally) the successor to the A53: Cortex-A55.

The A53 was announced alongside the A57 and has been in use for several years, either on its own or as the little core in a big.LITTLE configuration. It’s been hugely successful, with more than 40 licensees and 1.7 billion units shipped in just 3 years. But during this time ARM introduced new big cores on a yearly cadence, moving from A57 to A72 to A73. The A53 remained unchanged, however, even as the performance gap between the big and little cores continued to grow.

Predictably then, the focus for A55 was on improving performance. The A53’s dual-issue, in-order core, which serves as the starting point for A55, already delivers good throughput, so ARM focused on improving the memory system. A new data prefetcher, an integrated L2 cache that reduces latency by 50%, and an extra level of L3 cache (among other changes) give the A55 significantly better memory performance, quantified by a nearly 2x improvement in the LMBench memory copy test. The numbers provided by ARM also show an 18% performance gain in SPECint 2006 and an even bigger 38% gain in SPECfp 2006 relative to the A53. These numbers, like the others shown in the chart, compare the A55 and A53 at the same frequency, with the same L1/L2 cache sizes, the same compiler, etc., and are meant to be a fair comparison. The actual gains should be a little higher, because partner SoCs will benefit from adding the L3 cache, which these numbers do not include.

The additional performance does not come for free, however. Power consumption is up 3% relative to the A53 (iso-process, iso-frequency), but power efficiency still improves by 15% when running SPECint 2006 because of its higher performance.
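The efficiency figure follows from the other two numbers if efficiency is taken to mean performance per watt. The short sketch below is only a sanity check under that assumption; it uses the 18% SPECint 2006 gain and the 3% power increase quoted above, and the function name is made up for illustration.

```python
def efficiency_gain(perf_gain: float, power_increase: float) -> float:
    """Relative change in performance per watt.

    Both arguments are fractional changes versus the baseline core,
    e.g. 0.18 for +18% performance and 0.03 for +3% power.
    """
    return (1.0 + perf_gain) / (1.0 + power_increase) - 1.0


# Figures quoted in the article for A55 vs. A53 (iso-process, iso-frequency).
a55_gain = efficiency_gain(perf_gain=0.18, power_increase=0.03)
print(f"{a55_gain:.1%}")  # prints 14.6%
```

The result, roughly 14.6%, rounds to the 15% improvement ARM quotes.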

The A55 also includes several new features that will help it expand into new markets. Virtualization Host Extensions (VHE) are very important for the automotive market, and the advanced safety and reliability features, including architectural RAS support and ECC/parity for all levels of cache, are critical for many applications, including automotive and industrial. There are new features for infrastructure applications too, including a new Int8 dot product instruction (useful for accelerating neural networks). Because A55 is compatible with DynamIQ, it also gets cache stashing and access to a 256-bit AMBA 5 CHI port.
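The article does not spell out what the Int8 dot product instruction computes. The sketch below models the usual form of such an operation, in which each 32-bit accumulator lane adds the sum of four signed 8-bit products. It is a plain Python illustration of the arithmetic, not ARM's definition: the function name is hypothetical, and 32-bit wraparound is ignored.

```python
def int8_dot_accumulate(acc, a, b):
    """Model of a 4-way int8 dot product with accumulate.

    acc: list of integer accumulator lanes (one per group of four inputs)
    a, b: lists of signed 8-bit values, four per accumulator lane
    Returns a new list of accumulator lanes; overflow is not modelled.
    """
    if len(a) != len(b) or len(a) != 4 * len(acc):
        raise ValueError("expected four int8 inputs per accumulator lane")
    for value in (*a, *b):
        if not -128 <= value <= 127:
            raise ValueError("inputs must fit in a signed 8-bit integer")

    result = []
    for lane, total in enumerate(acc):
        base = 4 * lane
        # Four multiplies and the additions collapse into one operation per lane.
        total += sum(a[base + i] * b[base + i] for i in range(4))
        result.append(total)
    return result


lanes = int8_dot_accumulate(
    acc=[0, 10],
    a=[1, 2, 3, 4, -1, -2, -3, -4],
    b=[5, 6, 7, 8, 1, 1, 1, 1],
)
print(lanes)  # prints [70, 0]
```

Quantized neural network layers are dominated by exactly this multiply-and-accumulate pattern, which is why folding it into a single instruction helps inference throughput.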

When ARM announced the A73 last year, it talked a lot about improving sustained performance and working within a tight thermal envelope. In other words, the A73 was all about improving power efficiency. The A75 goes in a different direction: Taking advantage of the A73’s thermal headroom, ARM focused on improving performance while maintaining the same efficiency as the A73.

Our previous performance testing revealed mixed results when comparing the A73 to the A72—not too surprising given the significant differences in microarchitecture—with the A73 generally outpacing the A72 by a small margin for integer tasks but falling behind the older CPU in floating point workloads. Things look better for the A75, at least based on ARM’s numbers, which show noticeable gains over the A73 in both integer and floating-point workloads as well as memory streaming.

The graph above shows that the A75 operating at 3GHz on a 10nm node achieves better performance and the same efficiency as an A73 operating at 2.8GHz on a 10nm node, which means the A75 consumes more power. How much more is difficult to tell based on this one simple graph. We know that the A73 is thermally limited when using 4 cores (albeit less so than the A72), so the A75 definitely will be as well. This is not a common scenario, however. Most mobile workloads only fire up 1-2 cores at a time and usually only in short bursts. ARM obviously felt comfortable enough using the A73’s extra thermal headroom to boost performance without negatively impacting sustained performance.

ARM also wants to push the A75 into larger form-factor devices with power budgets beyond mobile’s 750mW/core by pushing frequency higher. Something like a Chromebook or a 2-in-1 ultraportable comes to mind. At 1W/core the A75 delivers 25% higher performance than the A73, and at 2W/core the A75’s advantage bumps up to 30% when running SPECint 2006. If anything, these numbers highlight why it’s not a good idea to push performance with frequency alone, as dynamic power rises much faster than linearly with frequency once voltage has to increase as well.
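To see why extra power buys so little extra frequency, consider the standard first-order model for dynamic power, P = C·V²·f. The sketch below is not based on ARM's data. It assumes that voltage has to rise in proportion to frequency (the `voltage_exponent` parameter is a made-up knob for that assumption) and it ignores leakage entirely.

```python
def relative_power(freq_ratio: float, voltage_exponent: float = 1.0) -> float:
    """Dynamic power relative to baseline under P = C * V^2 * f.

    Assumes V scales as freq_ratio ** voltage_exponent; with the default
    exponent of 1.0 this reduces to power growing with the cube of frequency.
    """
    voltage_ratio = freq_ratio ** voltage_exponent
    return voltage_ratio ** 2 * freq_ratio


def freq_ratio_for_power(power_ratio: float, voltage_exponent: float = 1.0) -> float:
    """Invert relative_power: frequency ratio reachable for a given power ratio."""
    return power_ratio ** (1.0 / (2.0 * voltage_exponent + 1.0))


# Doubling frequency costs 8x the dynamic power under the cubic assumption.
print(relative_power(2.0))  # prints 8.0

# Doubling the per-core power budget buys only about a quarter more frequency.
print(f"{freq_ratio_for_power(2.0) - 1.0:.0%}")  # prints 26%
```

Real voltage/frequency curves are not this tidy, but the shape of the result matches the point made above: moving from 1W/core to 2W/core leaves only modest room for higher clocks.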

ARM targeted the A73 specifically at mobile by focusing on power efficiency and simplifying the design, which meant dropping some features useful for other applications: there is no ECC on the L1 cache and no option for a 256-bit AMBA 5 CHI port. With A75, there’s now a clear upgrade path from A72. For the server and infrastructure markets, A75 supports ECC/parity for all levels of cache and AMBA 5 CHI for connecting to larger CCI, CCN, or CMN fabrics, and for automotive and other safety-critical applications there’s architectural RAS support, protection against data poisoning, and improved error management.

On the next few pages, we’ll dive deeper into the technical details and features of ARM’s new IP, including DynamIQ (the next iteration of big.LITTLE), Cortex-A75, and Cortex-A55.


  • Meteor2 - Monday, May 29, 2017 - link

  • Paul A. Clayton - Monday, May 29, 2017 - link

    The A55 design is constrained not merely by area and power but also by configurability. Being able to vary the L1 cache sizes from 16 KiB to 64 KiB means that the pipeline structure and cycle time is not optimized for one size. Targeting multiple processes and design factors (e.g., SRAM libraries can be tuned for different performance/area/power tradeoffs) also constrains optimization.

    While ARM might have had in mind a particular implementation for optimization (for which it might provide hard cores), it is still limited to providing acceptable designs for other implementations. Some microarchitectural optimizations might strongly depend on implementation details which are outside of ARM's control.

    There are probably also higher-risk design possibilities that were not explored simply because the resources were not available. Having multiple design teams with similar targets typically would mean wasting effort, but such provides a potential for a better design. It would be difficult for ARM to charge for the cost of unused designs given that other designs are available.

    Targeting a broad range of workloads also means a design will tend to be worse than a design targeting a narrower range of workloads.
  • Kevin G - Monday, May 29, 2017 - link

    Of course they could, but would those changes have permitted it to still be within the design constraints of the A55? Small die size and lower power are two characteristics that are not compromised for the A55. Faster is easy to do with more power but considering that the A55 is the little core, higher power consumption is to be avoided. Similarly a faster core might be done with a larger die area. There are trade-offs here but the pressure from ARM's customers is to keep this as small as possible.

    Considering those constraints, I consider any improvements to be rather impressive. If there is a silver bullet that ARM could have used to make it faster/smaller/consume less power in these designs without violating the constraints they have in place, I'd like to know what it was.
  • tipoo - Monday, May 29, 2017 - link

    Alas, still waiting to find out how different Apple's Zephyr is from standard little cores like it. It's nearly twice as big.
  • jjj - Monday, May 29, 2017 - link

    "What will be the goal for the next core, which will be coming from ARM’s Austin team that produced the A72? "

    That was my main question too but my hope was that the next core is aiming for much higher IPC. They need it for server and dual big core configs in mobile on 7nm.
    Or maybe they don't quite need it really, A75 is really fast and if the next core adds 15-20% higher IPC combined with higher clocks enabled by the process, that's quite a lot and rather amazing from a perf density perspective.

    Not much talk about area, any clue how A75 + DynamIQ compare to previous solutions - ofc the cache part is easy to factor in.

    It is interesting that A75 scales better with higher clocks, any guesses for clocks at 2W? A laptop with 4b4L would be rather nice.

    A55 not targeting higher clocks seems a bit odd, would mean that power goes down if folks move from A53 on 16FF to A55 on 7nm so maybe ARM has another update before 7nm.
  • Meteor2 - Monday, May 29, 2017 - link

    I think these are still mobile CPUs. It's up to Cavium et al to do ARM ISA-compatible designs for servers. ARM's not that bothered; the mobile market is far larger.
  • jjj - Monday, May 29, 2017 - link

    ARM is very eager to go server and just a year ago ARM was targeting 25% share in server by 2020. This gen does highlight infrastructure as they call it, a large segment where they've been gaining share and the next step is server.
    7nm is where it starts really, TSMC has the HPC version of the process and ARM needs to be ready too with the core that follows A75.
    What is unclear is the strategy. A75 is already desktop class so they could just increase IPC some more but maybe they can aim higher. It seems that the Austin team got an extra year to work on the next core so that's 3 years, could be an entirely new design.
  • Kevin G - Monday, May 29, 2017 - link

    ARM in the server space is sounding much like the hype of Linux on the desktop: always 'next year'.

    The challenge ARM designs have had has been to simply get out to market. AMD's Seattle chip is indeed out but suffered two years of delays and most of the design wins have evaporated due to it. AMD's K12 efforts are MIA right now. Similarly Cavium's ThunderX line is interesting but not the game changer it was hyped to be. Broadcom has exited the ARM server market after promoting an interesting design (SMT on ARM!). Applied Micro's efforts for ARM servers have been lost to corporate mergers. Calxeda folded years ago.

    The one interesting ray of hope is that there are indeed some customers like Microsoft, Facebook, Google and Amazon who are interested in ARM's low power nature for certain workloads. Microsoft has a version of Windows Server running on ARM but is not releasing it publicly, rather keeping it tied to their Azure cloud services. I have yet to hear where MS has gotten their ARM hardware from though. Google has dipped their toe into chip development for their deep learning efforts and it would be a straightforward process to piece together their own server designs from licensed IP blocks now that they have the in-house expertise to do it (saying they can and them doing it are two different things). In the end, the big cloud providers who could have spurred the ARM server space for everyone may keep the ARM server idea private to themselves while the rest of the market gets to deal with x86. Considering that x86 is perceived as higher power and higher cost, this serves the cloud providers well as it gives incentive for companies to migrate to their cloud solutions instead of looking at ARM alternatives.

    The other difficulty for ARM in the marketplace right now is that Intel preemptively released their response: the Xeon D. Intel was doing a performance/watt play there and it paid off for the low end server market. In most cases, the Xeon D for a pure single socket server was a better choice than the Xeon E5 1xxxx or Xeon E3 line up. I suspect that Intel management sees Xeon D as 'too good' and thus hasn't been quick to bring an updated Skylake version to market.
  • Wilco1 - Monday, May 29, 2017 - link

    Please read: - it says both Vulcan and XGene are alive. You forgot to mention QC's Centric (48 cores on 10nm, available this year). There are also 64-core/256GB DRAM beasts made by HiSilicon.
  • jjj - Monday, May 29, 2017 - link

    If you assume a 15-20% IPC gain over A75 for ARM's 7nm core and clock it past 4GHz for server, that's somewhat the worst case scenario for where ARM is in server in 2018-2019. We can assume DynamIQ evolves a bit by then too.
    That wouldn't be bad at all and ARM has extraordinary perf density. They might deliver more than that, we'll see.
