Today at the annual Hot Chips conference, AMD’s new CTO Mark Papermaster unveiled the first details about the Steamroller x86 CPU core.

Steamroller is the third instantiation of AMD’s Bulldozer architecture, first conceived in the mid-2000s and finally brought to market in late 2011. Committed to Bulldozer for at least one more design after Steamroller, AMD has settled on roughly yearly updates to the architecture. For 2012 we have the introduction of Piledriver, the optimized Bulldozer derivative that formed the CPU foundation for AMD’s Trinity APU. By the end of the year we’ll also see a Piledriver-based high-end desktop CPU without processor graphics.

Piledriver saw a switch to hard edge flip flops, which allowed for a considerable decrease in power consumption at the cost of more careful design and validation work. Performance didn’t change, but AMD saw a 10% - 20% reduction in active power. Piledriver also brought some scheduling efficiency improvements; better prefetching and branch prediction were its two other major design improvements.

Steamroller is designed to keep the ball rolling. It takes fundamentals from the Bulldozer/Piledriver architectures and offers a healthy set of evolutionary improvements on top of them. In Intel speak, Steamroller wouldn’t be a tick, as it isn’t accompanied by a significant process change (28nm bulk is pretty close to 32nm SOI), but it’s not a tock either, as the architecture is enhanced rather than redesigned. Steamroller falls somewhere between the two.

Front End Improvements

One of the biggest issues with the front end of Bulldozer and Piledriver is the shared fetch and decode hardware. This table from our original Bulldozer review helps illustrate the problem:

Front End Comparison

                                  AMD Phenom II         AMD FX            Intel Core i7
Instruction Decode Width          3-wide                4-wide            4-wide
Single Core Peak Decode Rate      3 instructions        4 instructions    4 instructions
Dual Core Peak Decode Rate        6 instructions        4 instructions    8 instructions
Quad Core Peak Decode Rate        12 instructions       8 instructions    16 instructions
Six/Eight Core Peak Decode Rate   18 instructions (6C)  16 instructions   24 instructions (6C)

Steamroller addresses this by duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. Don’t expect a doubling of performance since it’s rare that a 4-issue front end sees anywhere near full utilization, but this is easily the single largest performance improvement from all of the changes in Steamroller. 
The penalties are pretty obvious: area goes up, as does power consumption. However, the tradeoff is likely worth it, and both downsides can be offset in other areas of the design, as you’ll soon see.
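
The peak decode figures in the table above come down to simple arithmetic, which a short Python sketch can reproduce. This is only a back-of-the-envelope model: the `peak_decode_rate` function and its `cores_per_decoder` parameter are our own names, the model assumes active cores are packed into as few modules as possible, and the eight-core Steamroller figure is hypothetical, since AMD has announced no such part.

```python
import math

def peak_decode_rate(cores, decode_width, cores_per_decoder=1):
    """Peak instructions decoded per cycle across `cores` active cores."""
    # Each decoder serves `cores_per_decoder` cores; round up for partial modules.
    decoders = math.ceil(cores / cores_per_decoder)
    return decoders * decode_width

core_counts = (1, 2, 4, 8)

# Bulldozer/Piledriver: one 4-wide decoder shared by both cores in a module.
shared = [peak_decode_rate(n, 4, cores_per_decoder=2) for n in core_counts]

# Steamroller: each core gets its own 4-wide decoder.
dedicated = [peak_decode_rate(n, 4) for n in core_counts]

print(shared)     # [4, 4, 8, 16]
print(dedicated)  # [4, 8, 16, 32]
```

These are peak numbers only. Since a 4-issue front end is rarely fully utilized, the real-world gain will be much smaller than the doubling the arithmetic suggests.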

Steamroller inherits the perceptron branch predictor from Piledriver, but in an improved form for better performance (mostly in server workloads). The branch target buffer is also larger, which contributes to a reduction in mispredicted branches by up to 20%. 
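
For readers unfamiliar with the idea, here is a minimal Python sketch of a textbook perceptron branch predictor. It is not AMD's design, whose details haven't been disclosed: the class name, table size, history length and weight limit are all illustrative assumptions, and the training threshold is the one commonly quoted in the academic literature on perceptron predictors.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor (illustrative, not AMD's design)."""

    def __init__(self, history_len=8, table_size=64, weight_limit=127):
        self.table_size = table_size
        self.weight_limit = weight_limit
        # Training threshold commonly used for perceptron predictors (assumption).
        self.theta = int(1.93 * history_len + 14)
        # Per entry: one bias weight plus one weight per history bit.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history, most recent first: +1 = taken, -1 = not taken.
        self.history = [-1] * history_len

    def _output(self, pc):
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Predict taken when the weighted sum is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the actual outcome, then shift it into the history."""
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % self.table_size]
            inputs = [1] + self.history  # the bias input is always 1
            for i, x in enumerate(inputs):
                w[i] = max(-self.weight_limit,
                           min(self.weight_limit, w[i] + t * x))
        self.history = [t] + self.history[:-1]


# Usage: a branch that alternates taken / not taken is learned quickly.
predictor = PerceptronPredictor()
correct = 0
for i in range(200):
    taken = (i % 2 == 0)
    if i >= 100 and predictor.predict(0x400) == taken:
        correct += 1
    predictor.update(0x400, taken)
print(correct)  # 100
```

A larger table (or, for targets, a larger branch target buffer) reduces the chance that two branches collide on the same entry, which is one way added capacity can translate into fewer mispredictions.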

Execution Improvements

AMD streamlined the large, shared floating point unit in each Steamroller module. There’s no change in the execution capabilities of the FPU, but there’s a reduction in overall area. The MMX unit now shares some hardware with the 128-bit FMAC pipes. AMD wouldn’t offer many specifics, saying only that the shared hardware is used for mutually exclusive MMX/FMA/FP operations and thus shouldn’t carry a performance penalty.
The reduction of pipeline resources is supposed to deliver the same throughput at lower power and area: in essence, a smarter implementation of the Bulldozer/Piledriver FPU.

There’s no change to the integer execution units themselves, but other changes should improve integer performance.
The integer and floating point register files are bigger in Steamroller, although AMD isn’t being specific about how much they’ve grown. Load operations (two operands) are also compressed so that they only take a single entry in the physical register file, which helps increase the effective size of each RF. 
The scheduling windows also increased in size, which should enable greater utilization of existing execution resources. 
Store to load forwarding sees an improvement. Steamroller is better than its predecessors at detecting interlocks, cancelling the load and getting the data from the store instead.
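
As a rough illustration of what store to load forwarding means, the Python sketch below models a toy store queue. It is a generic textbook-style model, not a description of Steamroller's load/store unit: the `StoreQueue` class, its methods and the addresses used are all hypothetical, and real hardware also has to deal with partial overlaps, unknown addresses and timing.

```python
class StoreQueue:
    """Toy model of store-to-load forwarding (illustrative, not AMD's design)."""

    def __init__(self, memory):
        self.memory = memory   # committed memory: address -> value
        self.pending = []      # in-flight stores, oldest first

    def store(self, addr, value):
        # The store waits in the queue until it is committed to memory.
        self.pending.append((addr, value))

    def load(self, addr):
        # Search youngest to oldest so the most recent matching store wins.
        for st_addr, st_value in reversed(self.pending):
            if st_addr == addr:
                return st_value, "forwarded"
        # No in-flight store to this address: read committed memory.
        return self.memory.get(addr, 0), "memory"

    def drain(self):
        # Commit pending stores to memory in program order.
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()


sq = StoreQueue({0x10: 1})
sq.store(0x10, 7)
sq.store(0x10, 9)
print(sq.load(0x10))  # (9, 'forwarded')
print(sq.load(0x20))  # (0, 'memory')
sq.drain()
print(sq.load(0x10))  # (9, 'memory')
```

The point of the model is the first `load` call: the data comes straight from the youngest matching store rather than waiting for that store to reach memory.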

  • CeriseCogburn - Wednesday, August 29, 2012 - link

    Another amd liar, and here's the proof:

    More BS from the amd bs artists of the web.
  • GaMEChld - Thursday, August 30, 2012 - link

    Wait, I'm confused, who was lying about what? I'm not sure what that toms hardware link was supposed to prove, since both of those guys were talking about BF3 on Ultra settings, and Ultra was not tested on that page you linked. Better dial back the blind AMD hatred, since you were attacking people who were arguing about FPS and price, not Intel and AMD.
  • Spunjji - Thursday, August 30, 2012 - link

    Cerise is a special kind of chimp.
  • Galidou - Thursday, August 30, 2012 - link

    Last time I said something like that to Cerise, he told me I was in a crysis and I had to take midol, careful about what you say around him.
  • CeriseCogburn - Friday, October 12, 2012 - link

    Mr dupemeister got the rez wrong, the framerate wrong, then his cpu recommendation wrong, then he couldn't comprehend when he went to the link, as it clearly shows his crap cpu pick losing to the cheaper Intel chip, after he claimed his crap amd pick was the best bang and 20 bucks more. LOL
    But you raging amd fans who cannot stand an insult expect us all to stand your constantly insulting lies, then smile pretty and thank you for your stupid treachery and lies.
    Right ?
    Okay, thank you so much for having the midol disability that prevents you from being able to think clearly or get anything correct.
  • CeriseCogburn - Friday, October 12, 2012 - link

    You're both blind, mind numbed, idiot doofy bats. Here is his quote idiot #2

    " that said, the best bang for the bug gaming cpu is the AMD FX4100 for about $140. Why go weak i3 dual core when you can go mid range quad from AMD for $20 more."

    Look at the link again, brain dead core amd fan.
  • Spunjji - Thursday, August 30, 2012 - link

    Bahahahaha, you're such a tool. xD
  • Galidou - Thursday, August 30, 2012 - link

    All that counter-offensive for nothing... He never said he runs it on ULTRA. Boy people nowadays thinks you have to play the games on ultra or else you're just not playing it at all. A radeon 6870 or a 550 ti runs the game at that resolution on high details and it's BEAUTIFUL with over 50 fps...
  • Galidou - Thursday, August 30, 2012 - link

    Well actually he said ultra but there must be some options not enabled like msaa, I believe it's totally possible to get that on a 140$ card. I built a pc for one of my friends that with the case cost me around 350$ total, without hard drive and power supply, and that runs Battlefield 3 easily on 1600*900, not ultra but almost.
  • Origin64 - Thursday, August 30, 2012 - link

    Unfortunately 60 fps and full hd are the standard these days, so picking custom fps targets and lower resolutions doesn't really count as far as im concerned
