ARM told us to expect some of the first 64-bit ARMv8 based SoCs to ship in 2014, and it looks like we're seeing just that. Today Qualcomm is officially announcing its first 64-bit SoC: the Snapdragon 410 (MSM8916). 

Given that there's no 64-bit Android available at this point, most of the pressure to move to 64-bit in the Android space is actually coming from OEMs, who view 64-bit support as a necessary checkbox feature thanks to Apple's move with the A7. Combine that with the fact that ARM's most production-ready 64-bit IP is the Cortex A53 (successor to the Cortex A5/A7 line), and all of a sudden it makes sense why Qualcomm's first 64-bit mobile SoC is aimed at the mainstream market (Snapdragon 400 instead of 600/800).

I'll get to explaining ARM's Cortex A53 in a moment, but first let's look at the specs of the SoC:

Qualcomm Snapdragon 410
Internal Model Number: MSM8916
Manufacturing Process: 28nm LP
CPU: 4 x ARM Cortex A53 @ 1.2GHz+
GPU: Qualcomm Adreno 306
Memory Interface: 1 x 64-bit LPDDR2/3
Integrated Modem: 9x25 core, LTE Category 4, DC-HSPA+

At a high level we're talking about four ARM Cortex A53 cores, likely running at around 1.2 - 1.4GHz. Unfortunately, having four cores still seems to be a requirement for OEMs in many emerging markets, although I'd personally much rather see two higher clocked A53s. Qualcomm said the following about 64-bit in its Snapdragon 410 press release:

"The Snapdragon 410 chipset will also be the first of many 64-bit capable processors as Qualcomm Technologies helps lead the transition of the mobile ecosystem to 64-bit processing."

Keep in mind that Qualcomm presently uses a mix of ARM and custom-developed cores in its lineup. The Snapdragon 400 line already includes both ARM (Cortex A7) and Krait based designs, so the move to Cortex A53 in the Snapdragon 410 isn't unprecedented. It will be very interesting to see what happens in the higher-end SKUs. I can't imagine Qualcomm will want to maintain a split between 32 and 64-bit designs, which means we'll either see a 64-bit Krait successor this year or we'll see more designs that leverage ARM IP in the interim.

As you'll see from my notes below, however, ARM's Cortex A53 looks like a really good choice for Qualcomm. It's an extremely power efficient design that should be significantly faster than the Cortex A5/A7 cores we've seen Qualcomm use in this class of SoC in the past.

The Cortex A53 CPU cores are paired with an Adreno 306 GPU, a variant of the Adreno 305 used in Snapdragon 400 based SoCs (MSM8x28/8x26).

The Snapdragon 410 also features an updated ISP compared to previous 400 offerings, adding support for up to a 13MP primary camera (no word on max throughput however).

Snapdragon 410 also integrates a Qualcomm 9x25 based LTE modem block (also included in the Snapdragon 800/MSM8974), featuring support for LTE Category 4, DC-HSPA+ and the usual legacy 3G air interfaces.

All of these IP blocks sit behind a single-channel 64-bit LPDDR2/3 memory interface.

The SoC is built on a 28nm LP process and will be sampling in the first half of 2014, with devices shipping in the second half of 2014. Given its relatively aggressive schedule, the Snapdragon 410 may be one of the first (if not the first) Cortex A53 based SoCs in the market. 

A Brief Look at ARM's Cortex A53

ARM's Cortex A53 is a dual-issue in-order design, similar to the Cortex A7. Although the machine width is unchanged, the A53 is far more flexible in how instructions can be co-issued compared to the Cortex A7 (e.g. branch, data processing, load-store, & FP/NEON all dual-issue from both decode paths). 

The A53 is fully ISA compatible with the upcoming Cortex A57, making the A53 the first ARMv8 LITTLE processor, intended for use in big.LITTLE configurations alongside the A57.

The overall pipeline depth hasn't changed compared to the Cortex A7. We're still dealing with an 8-stage pipeline (a 3-stage fetch pipeline plus 5 stages of decode/execute for integer, or 7 for NEON/FP). The vast majority of instructions will execute in one cycle, leaving branch prediction as a big lever for increasing performance. ARM significantly increased branch prediction accuracy with the Cortex A53, so much so that the same predictor was actually leveraged in the dual-issue, out-of-order Cortex A12. ARM also improved the back end a bit, increasing datapath throughput.

The result of all of this is a dual-issue design that's pushed pretty much as far as you can go without going out-of-order. Below are some core-level performance numbers, all taken in AArch32 mode, comparing the Cortex A53 to its Cortex A5/A7 predecessors:

Core Level Performance Comparison (all cores running at 1.2GHz)
Core                  DMIPS    CoreMark    SPECint2000
ARM Cortex A5         1920     -           350
ARM Cortex A7         2280     3840        420
ARM Cortex A9 r4p1    -        -           468
ARM Cortex A53        2760     4440        600

Even ignoring any uplift from new instructions or 64-bit, the Cortex A53 is going to be substantially faster than its predecessors. I threw in hypothetical SPECint2000 numbers for a 1.2GHz Cortex A9 to put A53's performance in even better perspective. You should expect to see better performance than a Cortex A9r4 at the same frequencies, but the A9r4 is expected to hit much higher frequencies (e.g. 2.3GHz for Cortex A9 r4p1 in NVIDIA's Tegra 4i). 
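Taking the table at face value, the per-core uplifts work out as follows. This is plain arithmetic on the published figures, not a new measurement:

```python
# SPECint2000 scores at 1.2GHz, from the table above.
scores = {
    "A5":  350,
    "A7":  420,
    "A9":  468,  # hypothetical 1.2GHz Cortex A9 r4p1 reference
    "A53": 600,
}

# Cortex A53 relative to each reference core (same clock, AArch32 mode)
a53_vs_a7 = scores["A53"] / scores["A7"]
a53_vs_a5 = scores["A53"] / scores["A5"]
a53_vs_a9 = scores["A53"] / scores["A9"]

print(round(a53_vs_a7, 2))  # ~1.43x the Cortex A7
print(round(a53_vs_a5, 2))  # ~1.71x the Cortex A5
print(round(a53_vs_a9, 2))  # ~1.28x the Cortex A9 r4p1
```

So even before any ISA-level gains, the A53 lands well ahead of the cores it replaces in this segment.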

ARM included a number of power efficiency improvements and is targeting 130mW single-core power consumption at 28nm HPM (running SPECint 2000). I'd expect slightly higher power consumption at 28nm LP but we're still talking about an extremely low power design.

I'm really excited to see what ARM's Cortex A53 can do. It's a potent little architecture, one that I wish we'd see taken to higher clock speeds and maybe even used in higher-end devices at the same time. The most obvious fit for these cores, however, is something like the Moto G, which presently uses the 32-bit Cortex A7. Given Qualcomm's schedule, I wouldn't be surprised to see a Moto G update late next year with a Snapdragon 410 inside. Adding LTE and four Cortex A53s would really make it the value smartphone to beat.

Comments

  • michael2k - Monday, December 9, 2013 - link

    Do you know how computers work?
    Registers actively store the work in progress of the CPU. Adding two numbers takes three registers. Adding 12 sets of numbers in parallel takes 36 registers.

    Increasing your register count 10 fold allows you 10x improvement in performance, assuming 10 available execution units to do work. The way register files work, you can also work on more bits at a time too! Instead of adding 2 ints you can add 20 ints, 10 doubles, or 5 floats at a time.

    Your car analogy is entirely baseless here.
  • xdrol - Monday, December 9, 2013 - link

    I'm pretty sure it's you who doesn't know how registers work.

    You don't get 10x performance from 10x registers. You can maybe, if you are very-very lucky, get 1/10th usage of the main memory. That is faster, but even if your program has 50% memory operations (unrealistically high) and 50% other, then you get from 50%+50% -> 5%+50% execution time from the 10x registers, that is 1.8x speedup, all things being super optimal. In exchange, the registers themselves use more power.

    You get 10x performance from 10x execution units. That will give you *more than* 10x power (due to how pipelines work) too. Even Haswell has only 7 execution units..
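The speedup arithmetic in the comment above can be checked with a quick sketch, using xdrol's hypothetical 50/50 split between memory operations and everything else (the 10x figure is the best-case assumption from the thread, not a measurement):

```python
# Amdahl's-law-style estimate: 10x the registers only speeds up the
# memory-bound fraction of the program; the rest is unaffected.
mem_fraction = 0.5       # hypothetical share of time in memory ops
other_fraction = 0.5     # everything else
register_benefit = 10    # assume 10x fewer memory accesses, best case

old_time = mem_fraction + other_fraction                      # 1.0
new_time = mem_fraction / register_benefit + other_fraction   # 0.55
speedup = old_time / new_time

print(round(speedup, 2))  # ~1.82x, not 10x
```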
  • michael2k - Tuesday, December 10, 2013 - link

    I explained my post perfectly. 10x registers and 10 available execution units = 10x improvement in performance. I apologize if my hyperbole threw you for a loop, I was trying to explain a concept.

    10 execution units with only 3 registers = 1 add per clock. 10 execution units with 30 registers = 30 adds per clock. I also hinted at parallel processing; if you have 5 floats that need to be added to 5 floats (or multiplied, or accumulated, or whatever), you can do that in one clock cycle now.

    That's all theoretical, to be sure; the reality is that ARMv8 has 32 128bit registers, and is useful for SIMD (single instruction, multiple data) operations:

    Anand already covered this. AES saw an 8x improvement, for example, and DGEMM nearly 2x.
  • Exophase - Tuesday, December 10, 2013 - link

    AES saw a huge improvement because AArch64 has instructions specifically to assist AES acceleration which Geekbench is leveraging. If DGEMM uses double precision then it'd have seen a big improvement due to AArch64 adding support for double precision SIMD. The smaller improvements (and one notable regression) in the integer tests could be from the increased register count but possibly also from other factors, like for example if Cyclone is more efficient with conditional select in AArch64 than predication in AArch32.

    As for register counts, AArch64 actually has 31 64-bit general purpose registers + 1 64-bit stack pointer and 32 128-bit SIMD registers.
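To put numbers on the widths Exophase mentions: a 128-bit SIMD register splits into fixed-size lanes, which sketches why AArch64's double-precision SIMD roughly doubles DGEMM-style throughput versus scalar double math. This is lane arithmetic only, an upper bound, not a claim about any particular chip's pipelines:

```python
# Lane counts for a single 128-bit NEON register.
# AArch32 NEON lacked double-precision SIMD; AArch64 adds it, so
# double-heavy code like DGEMM can move from scalar doubles
# (1 at a time) to 2-lane vectors.
REG_BITS = 128

# elements per register for common lane widths (in bits)
lanes = {bits: REG_BITS // bits for bits in (8, 16, 32, 64)}
print(lanes)  # {8: 16, 16: 8, 32: 4, 64: 2}

dgemm_upper_bound = lanes[64] / 1.0  # vector doubles vs scalar doubles
print(dgemm_upper_bound)  # 2.0, consistent with the "nearly 2x" result
```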
  • Arbee - Tuesday, December 10, 2013 - link

    More registers = the compiler can generate better, more efficient code. This is why some software runs up to 20% faster on x64 vs. x86 with just a recompile.

    As for lower power, listen to some AT podcasts about the "race to sleep" concept. All other things being equal, a phone that finishes a task faster can use less battery.
  • Wilco1 - Tuesday, December 10, 2013 - link

    Actually "race to sleep" uses more power because you are running the CPU at a higher frequency and voltage. It's always better to spread tasks across multiple CPUs and run at a lower frequency and voltage, even if that means it takes longer to complete.
  • WeaselITB - Tuesday, December 10, 2013 - link

    Um, no. Not in the slightest. Race to sleep is the best fit we've yet come up with given our current technologies (i.e., constant running power and fixed performance-to-sleep transition requirements).
  • Wilco1 - Tuesday, December 10, 2013 - link

    LOL. Not all CPUs are 130W extreme edition i7's on an extremely leaky process!!!

    A < 5W mobile core on a modern low power process has very little leakage (unlike the i7), so it is always better to scale the clock and voltage down as much as possible to reduce power consumption. big.LITTLE takes that one step further by moving onto a slower, even more efficient core. Running as fast as possible on the fastest core is the only sure way to run your batteries down fast.
  • michael2k - Tuesday, December 10, 2013 - link

    Anand has been talking about 'race to sleep' as it applies to mobile CPUs since 2010 now. So, no, it isn't always better to scale the clock down, or it hasn't been in practice.
  • Wilco1 - Tuesday, December 10, 2013 - link

    That link doesn't say anything about "race to idle". Basically if you look at the first graph it shows that the 3 devices have different idle consumption simply due to using different hardware (the newest hardware wins as you'd expect). Anand concludes the device with the lowest idle consumption uses less energy over a long enough timeframe even though it may use far more power when being active. True of course, but that has nothing to do with "race to idle".

    Let me show you a link that explains why running slower uses less energy:

    Look at the right side of the graph "Heterogeneous CPU operation". That shows performance vs the amount of power consumed. As you can see it is not linear at all, and the more performance you require, the more the graph curves to the right (which means less efficient). To paraphrase Anand: "Based on this graph, it looks like it takes more than 3x the power to get 2x the performance of the A7 cluster using the Cortex A15s." So if you did "run to idle" on the A15, you'd use at least 50% more energy to execute the same task on the A7. Of course the A7 runs slower and so returns to idle later than the A15, but it still uses less energy overall.
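The energy arithmetic in the comment above follows directly from the quoted figures (3x the power for 2x the performance; both ratios come from the comment, not from any new measurement):

```python
# Energy = power x time. If the A15 cluster draws 3x the power but
# finishes in half the time (2x performance), it still burns more
# energy than the A7 cluster for the same task.
power_ratio = 3.0   # A15 power / A7 power (quoted figure)
perf_ratio = 2.0    # A15 speed / A7 speed (quoted figure)

time_ratio = 1.0 / perf_ratio            # A15 finishes in half the time
energy_ratio = power_ratio * time_ratio  # energy used vs the A7 cluster

print(energy_ratio)  # 1.5 -> 50% more energy on the A15
```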
