Given the timing of yesterday's Cortex A53 based Snapdragon 410 announcement, our latest Ask the Experts installment couldn't be better. Peter Greenhalgh, lead architect of the Cortex A53, has agreed to spend some time with us and directly answer any burning questions you might have about ARM.

Peter has spent 13 years in ARM's processor division, working on the Cortex R4, Cortex A8 and Cortex A5 (as well as the ARM1176JZF-S and ARM1136JF-S). He was also lead architect of the Cortex A7 and of ARM's big.LITTLE technology.

Later this month I'll be doing a live discussion with Peter via Google Hangouts, but you guys get first crack at him. If you have any questions about Cortex A7, Cortex A53, big.LITTLE or pretty much anything else ARM related, fire away in the comments below. Peter will be answering your questions personally over the next week.

Please help make Peter feel at home here on AnandTech by impressing him with your questions. Do a good job here and I might even be able to convince him to give away some ARM powered goodies...


  • name99 - Wednesday, December 11, 2013 - link

    It's far worse. This is not a secret and not worth asking about. The two target completely different spaces --- expensive and high performance vs dirt cheap and adequate performance.
  • sverre_j - Tuesday, December 10, 2013 - link

    I have just been through the extensive ARMv8 instruction set (there must be several hundred instructions in total), so my question is whether ARM believes that compilers, such as gcc, can be set up to take advantage of most of the instruction set, or whether one will still depend on assembly coding for a lot of the advanced stuff.
  • Peter Greenhalgh - Wednesday, December 11, 2013 - link

    Hi Sverre,

    The AArch64 instruction set in the ARMv8 architecture is simpler than the ARMv7 instruction set. The AArch32 instruction set in the ARMv8 architecture is an evolution of ARMv7 with a few extra instructions. From this perspective, just as compilers such as GCC can produce optimised code for the billions of ARMv7 devices on the market, I don’t see any new challenge for ARMv8 compilers.
  • SleepyFE - Tuesday, December 10, 2013 - link

    Now that you have a 64-bit ISA, are you planning something bigger (size-wise)? So far ARM CPUs are built into SoCs, but I would like to know if you are going to make an A1000 core: four large cores with a Mali 600 that will compete for a space in the desktop. It makes sense, since all major systems (Linux, Windows RT, iOS) are already running on ARM CPUs.
    This is less of a question and more of a request.
  • name99 - Wednesday, December 11, 2013 - link

    The time is not yet right.
    The top-of-the-line ARM ISA CPU, Cyclone, has IPC comparable with Intel --- which is great --- BUT at a third of Intel's frequency. Apple (and the rest of ARM) have to get at least close to Intel's frequency while not losing that IPC.
    Not impossible, but not trivial; and until that happens, the CPUs are just not that interesting for the desktop.

    The first step (which I expect Apple to take with the A8) would be an architecture like Sandy Bridge and later:
    - smallish high bandwidth per core L2's
    - unified large L3 shared by all cores and graphics [Cyclone has something that plays this role, but it's effectively an "off-chip" cache as far as the cores are concerned, being about 150 cycles away from the cores]
    - ring (or something similar) tying together the cores, L3 slices, graphics and memory controller

    Done right, I expect this gets Apple to same IPC as before, but 2x the frequency, in 20nm FinFET.

    Of course that's still not good enough. Then for the A9 they have to add a new µArch to either ramp up the IPC significantly, or improve the circuits and physical design enough to turbo up to near 4GHz for reasonably long periods of time...
    As I said, not impossible, but there is still plenty of work to do.
  • SleepyFE - Wednesday, December 11, 2013 - link

    No one said the first CPU has to be perfect. For low-end PCs and laptops it's a good idea to start selling to OEMs. That way you can get the ball rolling on software development. Also, the GPU does not have to be good, since you would use a discrete one (which might finally force AMD and Nvidia to write good Linux drivers).
  • mercury555 - Wednesday, December 11, 2013 - link

    Though that seems like a logical progression for ARM, single-threaded performance is nowhere close to Intel's.
  • MrSpadge - Tuesday, December 10, 2013 - link

    Topic: ultimate Fusion

    Given the flexibility ARM has with the instruction set (compared to x86), I would like to know where ARM sees itself going mid- to long-term. The specific question is: how can we get strong single-threaded performance (like in a fat Intel core) and a massive number of energy-efficient number crunchers for parallel tasks (like GPU cores)? The current state of treating them as co-processors (CUDA, OpenCL etc.) and trying to bring them closer to the cores (HSA) ultimately seems like a crutch to me, because it still takes significant effort on the software side to actually use those units.

    What I imagine as the "ultimate Fusion" of these resources is a group of fat integer cores (like in AMD's modules, Haswells with 2/4-way HT, with big.LITTLE... whatever you want) sharing a large pool of GPU-shader-like number crunchers, presented like regular floating-point units are now. Dispatching instructions to these units should be as simple as using the FPU from the software side. Sure, latency would go up (hence some faster scalar local units might still be needed), but throughput could go up by orders of magnitude. Even a single thread might get access to all of them, or in the case of many threads there'd be excellent load-balancing. The GPU and maybe other functions would use them as well. The number of integer / FP cores / execution units could relatively easily be scaled depending on the application (server, HPC, all-round).

    Intel and AMD have the hardware building blocks, but apart from the next version of SSE/AVX I don't think there is any chance of implementing such functionality efficiently in x86. And it surely wouldn't be backwards compatible, so it would take years or tens of years to trickle down the software stacks. The ARM software ecosystem is much younger and more agile, as Apple's quick and almost completely seamless transition to 64-bit iOS has shown. I'd even say: if anyone could pull something like this off, it's ARM. What do you think?
  • name99 - Wednesday, December 11, 2013 - link

    I wonder if this sort of fusion is ultimately a bad idea.

    Even at the basic HW level, tying the GPU in with the CPU is tough because the two are so different, and it doesn't help to destroy the primary value of the GPU in this quest.
    Specifically, using the same memory space clearly has value (in performance and programmer ease). Which means using the same virtual address space and TLBs.
    Again not in itself too problematic.
    But then what if we decide that we use that TLB to support VM on the GPU side? Now life gets really tough because GPUs are not set up for precise exceptions...
    (Using the TLB to track privilege violations is less of a problem because no-one [except debuggers!] cares if the exception generated bubbles up to the OS hundreds of instructions away from its root cause.)

    WRT the more immediate issue, the implication seems to be that a unified instruction set could be used to drive both the CPU and GPU. While this sounds cool, I fear that it's the same sort of issue --- a TREMENDOUS amount of pain to solve a not especially severe problem.
    The issue is that the processing model of the GPU is just different from a CPU's --- that's a fact. Making it the same as a CPU is to throw away the value of the GPU. But since these models are so different, the only feasible instructions would seem to be some sort of "create a parameter block then execute it" instructions --- at which point, how is this any more efficient or useful than the current scheme of using the existing CPU instructions to do this?

    I think we can gauge the value of this idea, to some extent, by the late Larrabee. Intel seem (as far as I can tell) to have started with a plan vaguely like what's described --- let's make the GPU bits more obviously part of the CPU, using more or less standard CPU concepts --- and it flat out did not work. It's mutated into the Knights SomethingOrOther series which, regardless of their value or not as HPC accelerators cards, no longer look like any part of the future of GPUs or desktop CPUs.

    I've talked about this before. CS engineers are peculiarly susceptible to the siren song of virtualization and masquerading because the digital world is so malleable. But not all virtualization is a good idea. The 90s spent god knows how much money on the idea of process and network transparent objects in various forms, from OLE to CORBA, but it all went basically nowhere; what won in that space was the totally non-transparent HTTP/HTML combo, I would say because they actually mapped onto the problem properly, rather than trying to make the problem look like a pre-existing solution.
  • MrSpadge - Wednesday, December 11, 2013 - link

    Some valid concerns, for sure. And I didn't say it would be easy :) But I think I can address at least some of them.

    First, my idea is not to fuse the CPU and GPU into each other. It's about sharing that pool of shaders, which eats a major share of the transistor and power budget in both chips and ultimately limits their performance (provided you can feed and cool the beasts). In current AMD APUs, 2 cores in a module share the 2 FPUs because these units are simply huge. Intel is already on the way to 512-bit AVX, requiring even more transistors and area. Yet their throughput pales in comparison to GPUs. And to use them all we have to go fully multi-threaded, with all its software and synchronization issues. If what I have in mind works perfectly, a single CPU core could easily get access to the entire pool of shaders/FPUs, if needed. It just fires off the instructions to these massively parallel, high-latency FPUs instead of the local scalar one and gets massive throughput. That's the ultimate load-balancing and a very efficient use of those transistors, if it works well.

    The hard-wired logic in the GPU cores (TMUs, ROPs, rasterizer etc.) would still remain. At the point where they'd usually dispatch instructions to their shaders they would now also go into that "sea of FPUs".

    Sure, internal and external bandwidth, registers and such would all need to scale to hide the increased latency from putting the execution units further away from the CPU/GPU cores. But if these costs become too large one could segment the whole thing again, like combining 1 to 4 GCN compute units with one CPU module. The amount of raw FPU horsepower available to the CPU could still increase tremendously, while the "fast path" local scalar FPU could be reduced from 2x128 bit (or more) to one double precision unit again.

    You see, I'd not necessarily want or need a unified instruction set for the CPU and GPU, just the same micro-ops (or whatever you want to call them) to access the shaders/FPUs. Larrabee is almost a "traditional many-core CPU" in comparison ;) (if there is such a thing).
