As part of this evening’s AMD Capsaicin event (more on that later), Raja Koduri, AMD’s Chief Architect and SVP of the Radeon Technologies Group, announced a new Radeon Pro card unlike anything else on the market. Dubbed the Radeon Pro Solid State Graphics (SSG), the card includes M.2 slots for adding NAND SSDs, with the goal of vastly increasing the amount of local storage available to the video card.

Details are a bit thin and I’ll update this later this evening, but in short the card utilizes a Fiji GPU and includes two PCIe 3.0 M.2 slots for adding flash drives to the card. These slots are attached to the GPU (it’s unclear whether there’s a PCIe switch involved or if they’re wired directly), giving the GPU an additional tier of storage to work with. I’m told that the card can fit at least 1TB of NAND – likely limited by M.2 MLC SSD capacities – which massively increases the amount of local storage available on the card.

As AMD explains it, the purpose of going this route is to offer another solution to the working set size limitations of current professional graphics cards. Even AMD’s largest card currently tops out at 32GB, and while this is a fair amount, there are workloads that can use more. This is particularly the case for workloads with massive datasets (oil & gas), or, as AMD demonstrated, scrubbing through an 8K video file.

Current cards can spill over to system memory, and while the PCIe bus is fast, it’s still much slower than local memory, plus it is subject to the latency of the relatively long trip and of waiting on the CPU to service requests. Local NAND storage, by comparison, offers much faster round trips, though on paper the bandwidth isn’t as good – a PCIe 3.0 x16 link tops out just shy of 16GB/sec of theoretical bandwidth, versus the roughly 5GB/sec a pair of fast M.2 drives can sustain – so I’m curious to see how it compares in the real world against datasets that spill over to system memory. Meanwhile actual memory management/usage/tiering is handled by a combination of the drivers and developer software, so as things stand developers will need to code specifically for it.

For the moment, AMD is treating the Radeon Pro SSG as a beta product and will be selling developer kits for it directly, with full availability set for 2017. For now developers need to apply for a kit from AMD, and I’m told the first kits are available immediately. Interested developers will need to have saved up their pennies though: a dev kit will set you back $9,999.

Update:

Now that AMD’s presentation is over, we have a bit more information on the Radeon Pro SSG and how it works.

In terms of hardware, the Fiji-based card is outfitted with a PCIe bridge chip – the same PEX8747 bridge chip used on the Radeon Pro Duo, I’m told – with the bridge connecting the two PCIe x4 M.2 slots to the GPU and allowing the GPU and SSDs to share the PCIe system connection. Architecturally the prototype card is essentially a PCIe SSD adapter and a video card on a single board, with no special connectivity in use beyond what the PCIe bridge chip provides.
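For reference, a simplified sketch of the topology as described; the downstream lane widths are my assumption based on the PEX8747 being a 48-lane PCIe 3.0 switch, not something AMD has confirmed:

    Host PCIe 3.0 x16 slot
              |
        PEX8747 switch
        /     |      \
      GPU   M.2 #1   M.2 #2
     (x16)   (x4)     (x4)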

The SSDs themselves are a pair of 512GB Samsung 950 Pros, which are about the fastest thing available on the market today. These SSDs are operating in RAID-0 (striped) mode to provide the maximum amount of bandwidth. Meanwhile it turns out that due to how the card is configured, the OS actually sees the SSD RAID-0 array as well, at least for the prototype design.
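As a rough sketch of why striping helps here: a RAID-0 array interleaves fixed-size stripes across the two drives, so a long sequential read keeps both SSDs busy at once. The mapping below is generic RAID-0 logic rather than anything AMD has detailed, and the 128KB stripe size is purely an assumption for illustration:

    #include <cstdint>
    #include <cstdio>

    // Generic 2-drive RAID-0 address mapping: consecutive stripes
    // alternate between drives, so a large sequential read is
    // serviced by both SSDs in parallel, roughly doubling throughput.
    constexpr uint64_t kStripeSize = 128 * 1024; // assumed stripe size
    constexpr int kDriveCount = 2;

    struct PhysicalAddress {
        int drive;       // which SSD (0 or 1)
        uint64_t offset; // byte offset within that SSD
    };

    PhysicalAddress MapRaid0(uint64_t logical) {
        const uint64_t stripe = logical / kStripeSize;
        return {static_cast<int>(stripe % kDriveCount),  // alternate drives
                (stripe / kDriveCount) * kStripeSize     // stripe base on that drive
                    + logical % kStripeSize};            // plus offset inside the stripe
    }

    int main() {
        // A 1MB sequential read touches eight stripes, four per drive.
        for (uint64_t off = 0; off < 8 * kStripeSize; off += kStripeSize) {
            const PhysicalAddress p = MapRaid0(off);
            std::printf("logical %8llu -> drive %d, offset %8llu\n",
                        (unsigned long long)off, p.drive,
                        (unsigned long long)p.offset);
        }
        return 0;
    }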

To use the SSDs, applications need to be programmed using AMD’s APIs to recognize the existence of the local storage and that it is “special,” being on the same board as the GPU itself. Ultimately the trick for application developers is directly streaming resources from the SSDs, treating them as a tier of cache between the DRAM and system storage. The use of NAND in this manner does not fit into the traditional memory hierarchy very well, as while the SSDs are fast, on paper accessing system memory is faster still. But it should be faster than accessing system storage, even if it’s PCIe SSD storage elsewhere in the system. Similarly, don’t expect to see frame buffers spilling over to NAND any time soon. This is about getting large, mostly static resources closer to the GPU for more efficient resource streaming.
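AMD has not published the API in question, so treat the following as nothing more than a sketch of the tiering logic a developer would express. All names and paths here are hypothetical, and since the prototype exposes the RAID-0 array to the OS, plain file I/O stands in for whatever the real API provides: stage a large, mostly static resource onto the GPU-local tier once, then stream chunks from it on every subsequent use.

    #include <cstddef>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Hypothetical tier locations; the real card would expose its
    // SSDs through AMD's driver/API rather than fixed mount points.
    const fs::path kSystemStore = "/data/assets"; // system SSD (slow path)
    const fs::path kLocalTier   = "/ssg/cache";   // GPU-local SSDs (fast path)

    // Stage a resource onto the GPU-local tier on first use. The
    // one-time copy pays the PCIe/chipset cost once; afterwards the
    // GPU streams from the on-board SSDs without touching the rest
    // of the system. Write-once usage also avoids burning P/E cycles.
    fs::path StageResource(const std::string& name) {
        const fs::path local  = kLocalTier / name;
        const fs::path remote = kSystemStore / name;
        if (!fs::exists(local)) {              // cold miss
            fs::create_directories(kLocalTier);
            fs::copy_file(remote, local);      // one-time bulk transfer
        }
        return local;                          // warm hit thereafter
    }

    // Stand-in for the GPU pulling one tile/frame of a large asset
    // from the local tier into VRAM.
    std::vector<char> ReadChunk(const fs::path& p, std::size_t off, std::size_t len) {
        std::ifstream f(p, std::ios::binary);
        f.seekg(static_cast<std::streamoff>(off));
        std::vector<char> buf(len);
        f.read(buf.data(), static_cast<std::streamsize>(len));
        buf.resize(static_cast<std::size_t>(f.gcount()));
        return buf;
    }

    int main() {
        const fs::path src  = StageResource("scene_8k.raw"); // hypothetical asset
        const auto     tile = ReadChunk(src, 0, 1 << 20);    // first 1MB tile
        std::cout << "streamed " << tile.size() << " bytes from the local tier\n";
        return 0;
    }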

To showcase the potential benefits of this solution, AMD had an 8K video scrubbing demonstration going, comparing performance between using a source file on the SSG’s local SSDs, and using a source file on the system SSD (also a 950 Pro).

The performance differential was actually more than I expected; reading a file from the SSG SSD array ran at over 4GB/sec, while reading that same file from the system SSD averaged under 900MB/sec, which is lower than what we know the 950 Pro can do in sequential reads. After putting some thought into it, I think AMD has hit upon the fact that most M.2 slots on motherboards are routed through the system chipset rather than being directly attached to the CPU. This not only adds another hop of latency, but it means crossing the relatively narrow DMI 3.0 (~PCIe 3.0 x4) link that is shared with everything else attached to the chipset.
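The back-of-the-envelope numbers line up with that theory. Here is a quick calculation of the theoretical PCIe 3.0 link rates involved, modeling only the 128b/130b encoding overhead (real links land somewhat lower):

    #include <cstdio>

    int main() {
        // PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding.
        const double lane_gbps = 8.0 * 128.0 / 130.0 / 8.0; // ~0.985 GB/s per lane

        std::printf("one PCIe 3.0 lane:       %.2f GB/s\n", lane_gbps);
        std::printf("DMI 3.0 (~x4, shared):   %.2f GB/s\n", 4 * lane_gbps);
        std::printf("two dedicated x4 slots:  %.2f GB/s\n", 2 * 4 * lane_gbps);

        // A single 950 Pro is rated for ~2.5GB/sec of sequential reads,
        // so two striped drives on dedicated lanes can reach ~5GB/sec,
        // while a chipset-attached drive shares one ~3.9GB/sec DMI link
        // with everything else hanging off the chipset.
        return 0;
    }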

By and large, though, this is all still at the proof-of-concept stage. The prototype, impressive in some ways in its own right, is really just a means to get developers thinking about the idea and writing their applications to be aware of the local storage. And this includes not just what content to put on the SSG's SSDs, but also how to best exploit the non-volatile nature of its storage, and how to avoid unnecessary thrashing of the SSDs and burning valuable program/erase cycles. The SSG serves an interesting niche, albeit a limited one: scenarios where you have a large dataset, are somewhat sensitive to latency and want to stay off of the shared PCIe bus, but don't need more than 4-5GB/sec of read bandwidth. So it'll be worth keeping an eye on this to see what developers can do with it.

In any case, while AMD is selling dev kits now, expect some significant changes by the time we see the retail hardware in 2017. Given the timeframe I expect we’ll be looking at much more powerful Vega cards, where the overall GPU performance will be much greater and the difference in performance between memory/storage tiers will be even more pronounced.

Source: AMD

Comments

  • osxandwindows - Monday, July 25, 2016

    Interesting stuff. Wonder if this can be used as additional local storage for the entire system?
  • Intel999 - Monday, July 25, 2016

    Yes, they stated that the SSD could be used as local storage too.
  • osxandwindows - Monday, July 25, 2016

    That's nice.
  • ddriver - Tuesday, July 26, 2016

    I honestly don't see the point in this. It might be useful for older mobos which don't have M.2 slots as a way to get fast SYSTEM storage, but as storage for the GPU it doesn't make sense.

    Currently SSDs cannot even max out the bandwidth of a PCI-E 3 slot, and GPUs themselves are interfaced through x16 slots, which means they get much better than SSD bandwidth with system RAM (of which you can have plenty) and can also access a system SSD without any significant penalty, at least relative to the typical bandwidth, IOPS and latency of an SSD. Lastly, it is not like SSDs are ANYWHERE NEAR the bandwidth and latency of GPU memory; 2-3 GB/s in the case of the fastest SSDs is SLOW AS A SLOTH compared to the 200-300 GB/s for a typical upper mid range GPU.
  • Spunjji - Tuesday, July 26, 2016

    It doesn't have to be anywhere near the bandwidth of the GPU memory to be relevant. That's what tiers of storage are about. What it does mean is you can keep storage as close as possible to where it needs to be. It also means you can store way, way more data there than you can fit in system RAM.

    There are probably only limited scenarios where this makes sense, though, because the systems it will be used in will most likely have other SSD storage on the PCIe bus somewhere else in the system.
  • ddriver - Tuesday, July 26, 2016

    The purpose of having it closer is so that it can be accessed faster. But when the medium is so slow, it defeats the purpose. Latency is not a real issue in large data sets, as it can be completely masked out by buffering. Performance gains from this will be minuscule and likely offset by the hit on the thermal budgets as elaborated in the comment below.

    SSD bandwidth is simply dreadfully low for GPU compute scenarios, and moving the SSD closer will not make the SSD any faster than it already is, it is simply going to remove some negligible overheads, at the extra cost of the hardware, the loss of thermal headroom and throttling performance losses and the need to change your code to target that extra functionality.

    This will make more sense for something like Optane, if it lives up to the performance hype and is not overly expensive, but for flash SSDs it is entirely pointless.
  • DanNeely - Tuesday, July 26, 2016

    It'd be useless for realtime rendering like in our games; but the amount of data used in professional rendering for Hollywood is insane. Ars puts it above 64GB per frame (and with frame rates well below 1 per second). In that case using up not just all of the GPUs ram but all of the CPUs ram as well and needing to stream data from SSDs is plausible. If that's the case moving the storage closer to the GPU to cut latency should speed things up; especially in cases where you've got multiple GPUs in your renderbox and don't have enough PCIe lanes to give them all that much dedicated SSD bandwidth.

    That said, I'm interested in whether there are any benchmarks tech sites could run that hit these levels of detail, or if we'll have to hope for posts from VFX studios to see how much these actually help.
  • ddriver - Tuesday, July 26, 2016

    This is a very arbitrary number, and it involves actual assets, and it concerns software rendering done on the CPU in large rendering farms with fairly slow, high-latency interconnects. AMD is demoing it with video, but looking at that 850MB/sec for the non-SSG configuration, it does look like the benchmark was written to give an artificial advantage to showcase their concept in an unrealistic scenario which will fail to deliver in practice.

    With PCIe you have DMA, and with an x16 v3 slot you have ~11 GB/sec of bandwidth (out of ~16 theoretical). Latency is quite low too, in the realm of tens of microseconds. Also, modern GPUs can work asynchronously, meaning that you don't have to waste processing time to do the transfers; you can transfer the next buffer while you process the current one, eliminating all subsequent latency penalties and leaving only the initial one, which is quite literally nothing for such workloads.

    What I am saying is that it is practically 100% possible to get the benefits AMD demos "with SSD" without integrating the SSD onto the GPU. SSDs are far from being even close to saturating a PCIe x16 link, and the link latency is negligible next to that of the SSD. You don't really save much by moving SSDs to the GPU, but you increase the cost, complexity and TDP. So it is not merely redundant; I am highly skeptical that the PROs will be able to outweigh the CONs.
  • close - Wednesday, July 27, 2016

    I bet the patented "ddriver 5.25" HDD" would be much better for this purpose – both faster and cheaper than any SSD or current day HDD according to your back-of-the-napkin calculations. It must be a huge burden to be the best engineer around but with nothing to show for it. ;) [/s]

    At frames in the tens or hundreds of GBs there's only so much you can do with any amount of RAM before it starts spilling. And when it does, you might be better off with an additional storage tier, albeit a slower one, closer to the GPU.
  • eachus - Sunday, July 31, 2016

    "The purpose of having it closer is so that it can be accessed faster. But when the medium is so slow, it defeats the purpose. Latency is not a real issue in large data sets, as it can be completely masked out by buffering."

    Except when it can't. I used to write this type of software professionally, I still do on occasion now that I am retired. A good example of the latency problem would be an implementation of the Simplex algorithm for linear programming problems. Rather than recompute the basis it is normal to take some number of steps creating vectors such that you multiply the basis by a vector at each step. After so many steps you recompute the basis (select a square matrix out of the data and invert it). So 50 (say) steps, then select the new basis and invert it.

    For most problem sizes, the time to collect the new basis dominates the algorithm. (If you go too many steps without recomputing, you start taking wrong steps.) The amount of data in the new basis is (relatively) small so bandwidth is not a problem, but the complete dataset is often more than n Gigabytes, where n is the size you can keep in main memory. ;-)

    You can create problems which need most of main memory to store the basis, but those problems are usually (transportation or assignment) problems that have special solutions.

    I've also dealt with other algorithms that need access to a large dataset, but where the next section to be read in is selected almost at random. (Some graph partitioning problems...)
