Today at the annual Hot Chips conference, AMD’s new CTO Mark Papermaster unveiled the first details about the Steamroller x86 CPU core.

Steamroller is the third instantiation of AMD’s Bulldozer architecture, first conceived in the mid-2000s and finally brought to market in late 2011. Committed to this architecture for at least one more design after Steamroller, AMD has settled on roughly yearly updates to it. For 2012 we have the introduction of Piledriver, the optimized Bulldozer derivative that formed the CPU foundation for AMD’s Trinity APU. By the end of the year we’ll also see a Piledriver-based high-end desktop CPU without processor graphics.

Piledriver saw a switch to hard edge flip flops, which allowed for a considerable decrease in power consumption at the cost of more careful design and validation work. Performance didn’t change, but AMD saw a 10% - 20% reduction in active power. Piledriver also brought some scheduling efficiency improvements, with better prefetching and branch prediction as its two other major design improvements.

Steamroller is designed to keep the ball rolling. It takes fundamentals from the Bulldozer/Piledriver architectures and offers a healthy set of evolutionary improvements on top of them. In Intel speak, Steamroller wouldn’t be a tick, as it isn’t accompanied by a significant process change (28nm bulk is pretty close to 32nm SOI), but it isn’t a tock either, as the architecture is enhanced yet largely unchanged. In terms of the scale of its changes, Steamroller fits somewhere between those two extremes.

Front End Improvements

One of the biggest issues with the front end of Bulldozer and Piledriver is the shared fetch and decode hardware. This table from our original Bulldozer review helps illustrate the problem:

Front End Comparison

                                 AMD Phenom II         AMD FX           Intel Core i7
Instruction Decode Width         3-wide                4-wide           4-wide
Single Core Peak Decode Rate     3 instructions        4 instructions   4 instructions
Dual Core Peak Decode Rate       6 instructions        4 instructions   8 instructions
Quad Core Peak Decode Rate       12 instructions       8 instructions   16 instructions
Six/Eight Core Peak Decode Rate  18 instructions (6C)  16 instructions  24 instructions (6C)

Steamroller addresses this by duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. Don’t expect a doubling of performance, since it’s rare that a 4-issue front end sees anywhere near full utilization, but this is easily the single largest performance improvement of all the changes in Steamroller.
The penalties are pretty obvious: area goes up, as does power consumption. However, the tradeoff is likely worth it, and both of these downsides can be offset in other areas of the design as you’ll soon see.
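
The arithmetic behind the front end comparison is easy to reproduce. The sketch below is a simplification rather than AMD's own model: it assumes the peak decode rate is simply the number of active decoders multiplied by the decode width, and that active cores fill modules in pairs. The function name `peak_decode_rate` and its parameters are ours, chosen for illustration only.

```python
def peak_decode_rate(cores, decode_width, cores_per_decoder=1):
    """Theoretical peak instructions decoded per cycle for `cores` active cores.

    cores_per_decoder=2 models a Bulldozer/Piledriver module, where two cores
    share one decoder. cores_per_decoder=1 models a dedicated decoder per core.
    Assumption: active cores are packed into modules in pairs.
    """
    decoders = -(-cores // cores_per_decoder)  # ceiling division
    return decoders * decode_width


for cores in (1, 2, 4, 8):
    shared = peak_decode_rate(cores, decode_width=4, cores_per_decoder=2)
    dedicated = peak_decode_rate(cores, decode_width=4, cores_per_decoder=1)
    print(f"{cores} cores: shared decoder = {shared}, dedicated decoders = {dedicated}")
```

With `cores_per_decoder=2` the results match the AMD FX column of the table. With a dedicated decoder per core, the theoretical peak for a two-core module doubles from 4 to 8 instructions per cycle, which is a ceiling that real code rarely approaches.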

Steamroller inherits the perceptron branch predictor from Piledriver, but in an improved form for better performance (mostly in server workloads). The branch target buffer is also larger, which contributes to a reduction of up to 20% in mispredicted branches.
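
For readers unfamiliar with the idea, a perceptron predictor keeps a small vector of weights per branch and predicts "taken" when the weighted sum of recent branch outcomes is non-negative. The Python sketch below is the textbook form of that technique, not AMD's implementation; the table size, history length, training threshold and all names are illustrative assumptions.

```python
class PerceptronPredictor:
    """Toy perceptron branch predictor (generic textbook form, not AMD's design)."""

    def __init__(self, history_len=8, table_size=64, threshold=29):
        self.table_size = table_size
        self.threshold = threshold  # illustrative value: keep training while |sum| is small
        # One weight vector per table entry: a bias weight plus one weight per history bit.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history of recent outcomes: +1 = taken, -1 = not taken (newest last).
        self.history = [1] * history_len

    def _output(self, pc):
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Return True if the branch at `pc` is predicted taken."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the actual outcome, then shift it into the history."""
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction, or when the sum is not yet confidently large.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]


# Usage: train on a branch that strictly alternates taken / not taken.
predictor = PerceptronPredictor()
outcome = True
for _ in range(200):
    predictor.update(0x40, outcome)
    outcome = not outcome
print("next outcome:", outcome, "predicted:", predictor.predict(0x40))
```

A simple counter-based predictor struggles with a strictly alternating branch, while the perceptron learns that the next outcome is the opposite of the most recent one. A hardware predictor would also use saturating weights of a few bits each, which this sketch omits.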

Execution Improvements

AMD streamlined the large, shared floating point unit in each Steamroller module. There’s no change in the execution capabilities of the FPU, but there’s a reduction in overall area. The MMX unit now shares some hardware with the 128-bit FMAC pipes. AMD wouldn’t offer many specifics, saying only that the shared hardware applies only to mutually exclusive MMX/FMA/FP operations and thus wouldn’t result in a performance penalty.
The reduction in pipeline resources is supposed to deliver the same throughput at lower power and area: basically a smarter implementation of the Bulldozer/Piledriver FPU.

There’s no change to the integer execution units themselves, but other changes in the design improve integer performance.
The integer and floating point register files are bigger in Steamroller, although AMD isn’t being specific about how much they’ve grown. Load operations (two operands) are also compressed so that they only take a single entry in the physical register file, which helps increase the effective size of each RF. 
The scheduling windows also increased in size, which should enable greater utilization of existing execution resources. 
Store-to-load forwarding also sees an improvement. Steamroller is better than its predecessors at detecting interlocks, cancelling the load and getting the data from the store instead.
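
To illustrate what store-to-load forwarding means in general, here is a toy Python model. It is not how Steamroller implements it: the `StoreQueue` class, its method names and the exact-address matching rule are all simplifying assumptions made for this sketch. A load checks the in-flight stores, youngest first, and takes its data from a matching store instead of going to memory.

```python
from collections import deque


class StoreQueue:
    """Toy model of in-flight stores with store-to-load forwarding (hypothetical)."""

    def __init__(self, memory):
        self.memory = memory      # dict of address -> value (committed state)
        self.pending = deque()    # in-flight stores as (address, value), oldest first

    def store(self, addr, value):
        """Queue a store that has executed but not yet been written to memory."""
        self.pending.append((addr, value))

    def load(self, addr):
        """Return (value, forwarded). Search the youngest matching store first."""
        for st_addr, value in reversed(self.pending):
            if st_addr == addr:
                return value, True            # data forwarded from the store
        return self.memory.get(addr, 0), False  # no match: read committed memory

    def retire_oldest(self):
        """Write the oldest in-flight store to memory."""
        addr, value = self.pending.popleft()
        self.memory[addr] = value


# Usage: a load that follows two stores to the same address.
queue = StoreQueue(memory={0x10: 1})
queue.store(0x10, 5)
queue.store(0x10, 7)
print(queue.load(0x10))  # served from the youngest in-flight store
print(queue.load(0x20))  # no in-flight store matches, so memory is read
```

Real hardware also has to handle partial overlaps, differing access sizes and stores whose addresses are not yet known, which is where detecting the interlock becomes difficult.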

  • just4U - Thursday, August 30, 2012 - link

    Overall.. the FX4100 is a better chip in a multi purpose computer. (in my opinion) Normally you will get a better Motherboard (more feature rich) than what you'd get from paying for the Intel boards as well.. so it's all a factor. As to this $469 dream machine.. well hell..

    A standard gaming rig that I'd be comfortable building (without an OS and at cost) will run .. $574.00
    - FX 4100
    - 8 G PC 1280
    - Gigabyte 970A -D3
    - Radeon 7770 1G
    - 1 TB Western Digital Blue
    - LG DVDRW
    - Antec One Casing
    - Corsair Builder 500W

    Now going intel you could get an I3 and a H61 at a similar price.. or you could go for a comparable MB to the Amd one for approx $10 more over this system.

    To get it down into the $400 range I'd have to hmmm.. No video look for a $30 PSU and a $25 Case (save $50..) that would get it around the $450 range.. Drop down to 4G of ram would bring it into the 430.. (get where im going with this?)
  • just4U - Thursday, August 30, 2012 - link

    That was thru price match here in Canada btw.. I know you can still get some things cheaper down south via combo's and such.. but putting together a half ways competent gaming machine with new computer parts.. in the $450 range.. Good luck with that. Unless you got some killer sale on it's not going to happen.. and forget about the rebates as we all know how those work out 75% of the time. At the end of the day you'd seriously have to compromise with your hardware selection just to get it into such a budget.
  • Spunjji - Thursday, August 30, 2012 - link

    It's cool dude, the people here who aren't mentally defective already know that what you're saying is broadly accurate, pissing matches over AMD/Intel aside. There's not much sense demonstrating it to the rest... :/
  • Spunjji - Thursday, August 30, 2012 - link

    You have a funny definition of "STOMP".
  • Galidou - Sunday, September 2, 2012 - link

    3 fps more in one benchmark out of 8 and the other 7 are equal, quite a nice stomping of a core i3 vs a fx 4100. Oh and the core i3 can't be overclocked... what an amazing stomping.
  • CeriseCogburn - Friday, October 12, 2012 - link

    It can be overclocked, and dumb dumb, the op liar said $20 more for your amd loser.

    So in this case, the amd fanboys blow their freaking brains out through their backsides again, they actually LOSE, and pay $20 more.

    Thanks, I'll keep that in mind when you idiots all collude in the GPU reviews, and pull the EXACT OPPOSITE in clan idiot mode and fail to notice how stupid you all are even after it is explicitly pointed out, and "coming to grips" "with reality" and admitting you supported the big fat lie, of course will never occur.

    That's the fruitcake liar amd fan. Of course anyone else who takes exception to it, they are in the wrong...

    The amd fanboy mind is a terribly wasted thing, throw it out.
  • Spunjji - Thursday, August 30, 2012 - link

    You couldn't really be a more transparent shill. Nobody mentioned "non competitive consumer screwing" here, yet you post an essay countering said imaginary comments backed up by some hand-waving and supposition which is disproven by easily-obtained facts. You started a whole argument though, so gz on that.
  • nicamarvin - Thursday, August 30, 2012 - link

    15% IPC improvement right out of the box? keep dreaming, Ivy max performance boost is 5% on "some" benches and 1 to 3% on most benches and seeing how it cant OC as much as SandyB I say AMD will catch up with Intel sooner than most of you thought

    keeping in mind that AMD plans to increase IPC by 15% on each of their updated Modules, PileDriver is already doing just that(15% IPC performance boost clock per clock against BD) and that Piledriver module was lacking the L3 cache the BD Module had and still was pulling the 15% performance boost
  • seapeople - Friday, August 31, 2012 - link

    Ivy Bridge also came out with higher clocks for the same price, so add in the 1-5% IPC advantage and you get close to the 10-15% advantage mentioned.

    Note that he didn't say IPC.
  • nicamarvin - Friday, August 31, 2012 - link

    SB can Oc much higher than Ivy, so thats a moot point, whats a 1-5% IPC gain when SB can OC 10% higher than Ivy? I suspect Haswell will not OC as high as the best SB could
