Using a PCIe Slot to Install DRAM: New Samsung CXL.mem Expansion Module
by Dr. Ian Cutress on May 11, 2021 4:10 AM EST - Posted in
- Interconnect
- Samsung
- DRAM
- DDR5
- Compute Express Link
- CXL
- PCIe 5.0
- CXL.memory
In the computing industry, we’ve lived with PCIe as a standard for a long time. It is used to add almost any additional feature to a system: graphics, storage, USB ports, more storage, networking, add-in cards, storage, sound cards, Wi-Fi, oh did I mention storage? The one thing we haven’t been able to put into a PCIe slot is DRAM – not DRAM acting as a storage device, but memory that is actually added to the system as usable DRAM. Back in 2019 the new CXL standard was introduced, which uses a PCIe 5.0 link as its physical interface. Part of that standard is CXL.memory – the ability to add DRAM to a system through a CXL/PCIe slot. Today Samsung is unveiling the first DRAM module designed specifically in this way.
CXL: A Refresher
The original CXL standard started off as a research project inside Intel to create an interface that could support accelerators, IO, cache, and memory. It subsequently spun out into its own consortium, now with more than 50 members and support from key players in the industry: Intel, AMD, Arm, IBM, Broadcom, Marvell, NVIDIA, Samsung, SK Hynix, WD, and others. The latest version of the standard is CXL 2.0, finalized in November 2020.
The CXL 1.1 standard covers three protocols, known as CXL.io, CXL.memory, and CXL.cache. These allow for deeper control over connected devices, as well as an expansion of what is possible over the link. The CXL consortium sees three main device types for this:
The first type is a cache/accelerator device, such as an offload engine or a SmartNIC (a smart network controller). Using the CXL.io and CXL.cache protocols, the network controller could sort incoming data, analyze it, and filter what is needed directly into the main processor’s memory.
The second type is an accelerator with its own memory, where the processor gets direct access to the accelerator’s HBM (and the accelerator gets access to system DRAM). The idea is a pseudo-heterogeneous compute design that allows for simpler but denser computational solvers.
The third type is perhaps the one we’re most interested in today: memory buffers. Using CXL.memory, a memory buffer can be attached over a CXL link and its memory pooled directly with system memory. This allows for either increased memory bandwidth or increased memory capacity, on the order of thousands of gigabytes.
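From a software point of view, memory attached this way is generally expected to appear much like memory on a far socket – an extra, CPU-less NUMA node the operating system can allocate from. Here is a minimal sketch of what that could look like on Linux with libnuma, assuming (hypothetically) that the CXL expander is enumerated as NUMA node 1:

```c
/* Minimal sketch: treating a hypothetical CXL memory expander as a NUMA node.
 * Assumes the OS exposes the expander as node 1; adjust for the real topology.
 * Build with: gcc cxl_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA support not available\n");
        return 1;
    }
    int cxl_node = 1;                 /* assumption: expander shows up as node 1 */
    size_t size = 1ull << 30;         /* 1 GiB */
    void *buf = numa_alloc_onnode(size, cxl_node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }
    memset(buf, 0, size);             /* touch the pages so they are actually placed */
    printf("1 GiB backed by memory on node %d\n", cxl_node);
    numa_free(buf, size);
    return 0;
}
```

The point being that, to applications, pooled CXL memory would simply look like slower-but-plentiful system RAM, placed via the usual NUMA policies.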
CXL 2.0 also introduces switching capabilities, support for persistent memory, and additional security features.
It should be noted that CXL uses the same electrical interface as PCIe, which means any CXL device will have what looks like a PCIe physical connector. Beyond that, CXL uses PCIe in its startup process, so any current CXL device also has to negotiate a standard PCIe link, making any CXL controller a PCIe controller by default.
One of the common questions I’ve seen is what would happen if a CXL-only CPU were made. Because CXL and PCIe are intertwined, a CPU can’t be CXL-only – it would have to support PCIe connections as well. From the other direction, if we were to see CXL-based graphics cards, for example, they would also have to at least initialize over PCIe, although full working modes might not be available if CXL isn’t initialized.
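One practical consequence is that a CXL device is discoverable with ordinary PCIe plumbing: it advertises its CXL support through a Designated Vendor-Specific Extended Capability (DVSEC) carrying the CXL consortium’s vendor ID (0x1E98) in standard PCIe configuration space. Below is a rough sketch of scanning for that capability on Linux – the device address is a placeholder, and reading the full 4 KB of config space typically requires root:

```c
/* Rough sketch: check whether a PCIe device advertises a CXL DVSEC capability.
 * The bus/device/function address is a placeholder; adjust for a real device.
 * Build with: gcc cxl_dvsec.c -o cxl_dvsec */
#include <stdio.h>
#include <stdint.h>

#define DVSEC_CAP_ID   0x0023   /* PCIe Designated Vendor-Specific Extended Capability */
#define CXL_VENDOR_ID  0x1E98   /* CXL consortium vendor ID used in the DVSEC header */

int main(void) {
    const char *path = "/sys/bus/pci/devices/0000:3a:00.0/config"; /* placeholder BDF */
    uint8_t cfg[4096] = {0};

    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }
    size_t n = fread(cfg, 1, sizeof(cfg), f);
    fclose(f);
    if (n < 0x104) {
        fprintf(stderr, "could not read extended config space (got %zu bytes)\n", n);
        return 1;
    }

    /* Walk the extended capability list, which starts at offset 0x100. */
    uint16_t off = 0x100;
    while (off >= 0x100 && off <= sizeof(cfg) - 8) {
        uint32_t hdr = cfg[off] | cfg[off+1] << 8 | cfg[off+2] << 16 | (uint32_t)cfg[off+3] << 24;
        if ((hdr & 0xFFFF) == DVSEC_CAP_ID) {
            uint16_t vendor = cfg[off+4] | cfg[off+5] << 8; /* DVSEC Header 1, bits 15:0 */
            if (vendor == CXL_VENDOR_ID) {
                printf("CXL DVSEC found at offset 0x%03x\n", off);
                return 0;
            }
        }
        off = hdr >> 20;  /* next capability offset; 0 terminates the list */
    }
    printf("No CXL DVSEC found (or config space not fully readable)\n");
    return 0;
}
```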
Intel is set to introduce CXL 1.1 over PCIe 5.0 with its Sapphire Rapids processors. Microchip has announced PCIe 5.0 and CXL-based retimers for extending motherboard traces. Samsung’s announcement today is the third such disclosure of CXL-enabled hardware. IBM has a similar technology called OMI (OpenCAPI Memory Interface), however it hasn’t seen wide adoption outside of IBM’s own processors.
Samsung’s CXL Memory Module
Modern processors rely on memory controllers for attached DRAM access. The top-of-the-line x86 processors have eight channels of DDR4, while a number of accelerators have gone down the HBM route. One of the limiting factors in scaling up memory bandwidth is the number of controllers, which also limits capacity; beyond that, memory has to be validated and trained to work with a system. Most systems are not built to simply add or remove memory the way you might do with a storage device.
Enter CXL, and the ability to add memory like a storage device. Samsung’s unveiling today is of a CXL-attached module packed to the max with DDR5. It uses a full PCIe 5.0 x16 link – 32 GT/s per lane, or roughly 64 GB/s of theoretical bandwidth in each direction – with multiple TB of memory behind a buffer controller. In much the same way that companies like Samsung pack NAND into a U.2-sized form factor with sufficient cooling, Samsung does the same here but with DRAM.
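For reference, the back-of-the-envelope math behind that figure uses nothing more than the generic PCIe 5.0 link parameters (not anything Samsung has quoted):

```c
/* Back-of-the-envelope PCIe 5.0 x16 throughput; these are generic PCIe 5.0
 * link parameters, not a Samsung specification. */
#include <stdio.h>

int main(void) {
    const double rate_gt_s = 32.0;            /* per-lane signalling rate, GT/s */
    const double lanes     = 16.0;
    const double encoding  = 128.0 / 130.0;   /* 128b/130b line-code efficiency */

    double gb_per_s = rate_gt_s * lanes * encoding / 8.0;  /* bits -> bytes */
    printf("~%.0f GB/s per direction, ~%.0f GB/s aggregate bidirectional\n",
           gb_per_s, 2.0 * gb_per_s);
    return 0;
}
```

That works out to roughly 63 GB/s each way before protocol overheads, which is in the same ballpark as a handful of DDR5 channels – the win here is capacity more than raw bandwidth.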
The DRAM is still volatile memory, and data is lost if power is lost. (I doubt it is hot swappable either, but weirder things have happened.) Persistent memory can be used, but only with CXL 2.0. Samsung hasn't stated whether this device supports CXL 2.0, but it should be at least CXL 1.1, as the company says it is currently being tested with Intel's Sapphire Rapids platform.
It should be noted that a modern DRAM slot is usually rated for a maximum of around 18 W. The only modules that push that power window are Intel’s Optane DCPMM, while a 256 GB DDR4 module sits in the ~10+ W range. For a 2 TB add-in CXL module like this, I suspect we are looking at around 70-80 W, and so adding that amount of DRAM through the CXL interface would likely require active cooling as well as the big heatsink that these renders suggest.
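As a quick sanity check on that estimate (my own ballpark figures, not Samsung’s), treat the 2 TB as eight 256 GB modules’ worth of DRAM at roughly 10 W apiece:

```c
/* Rough power estimate for 2 TB of DRAM behind a CXL buffer.
 * All figures are ballpark assumptions for illustration, not Samsung specs. */
#include <stdio.h>

int main(void) {
    const double capacity_gb      = 2048.0;  /* 2 TB module */
    const double per_module_gb    = 256.0;   /* large DDR4 RDIMM as a reference point */
    const double per_module_watts = 10.0;    /* ~10+ W for a 256 GB module */

    double modules = capacity_gb / per_module_gb;   /* 8 modules' worth of DRAM */
    double dram_w  = modules * per_module_watts;    /* ~80 W before the controller */
    printf("%.0f modules' worth of DRAM => ~%.0f W, plus the CXL buffer/controller\n",
           modules, dram_w);
    return 0;
}
```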
Samsung doesn’t give many details about the module it is unveiling, except that it is CXL-based and uses DDR5. Not only that, but the ‘photos’ provided look a lot like renders, so it’s hard to say whether there is an aesthetic unit available for photography, or whether there’s simply a working controller in a bring-up lab somewhere that has been validated on a system. Update: Samsung has confirmed these are live shots, not renders.
As part of the announcement Samsung quoted AMD and Intel, indicating which partners it is working with most closely, and what it has today is being validated on Intel’s next-generation servers. Those servers, Sapphire Rapids, are due to launch at the end of the year, in line with the Aurora supercomputer contract, which is set for initial shipments by year end.
Related Reading
- Compute eXpress Link 2.0 (CXL 2.0) Finalized: Switching, PMEM, Security
- CXL Consortium Formally Incorporated, Gets New Board Members & CXL 1.1 Specification
- CXL Specification 1.0 Released: New Industry High-Speed Interconnect From Intel
- Intel Agilex: 10nm FPGAs with PCIe 5.0, DDR5, and CXL
- Synopsys Demonstrates CXL and CCIX 1.1 over PCIe 5.0: Next-Gen In Action
- Microchip Announces PCIe 5.0 And CXL Retimers
- DDR5 Memory Specification Released: Setting the Stage for DDR5-6400 And Beyond
- Here's Some DDR5-4800: Hands-On First Look at Next Gen DRAM
- Insights into DDR5 Sub-timings and Latencies
Comments
flashmozzg - Tuesday, May 11, 2021 - link
LPDDR5 != DDR5
Yojimbo - Tuesday, May 11, 2021 - link
Alder Lake, to be released in 2021 (latest rumors say November), will have PCIe 5. It can support either DDR5 or DDR4 - but not both at once, from what I understand - so we will have to wait and see what consumer options are actually available from motherboard manufacturers.
Santoval - Wednesday, May 12, 2021 - link
Most likely Alder Lake-S and above (the HEDT version, assuming there will be one) will get DDR5, Alder Lake-H will get LPDDR5 or DDR4 and the low power U/Y variants will probably support LPDDR5 (solely or not will depend on LPDDR5's projected prices in the second half of the year).
Tomatotech - Tuesday, May 11, 2021 - link
Interesting idea 2.
For comparison, fast PCIe 4.0 SSDs do about 8GB/s, and presumably PCIe 5.0 SSDs will do about 16-20GB/s in the real world.
This has a theoretical max of 32 GT/s = 256GB/s but will not get anywhere near that. Still, it might get to around 64 or 128 GB/s for bulk transfer. The real advantage of DRAM over PCIe will be the god-like small file and random access speeds eg for databases.
The fastest HDDs got to around 500 KB/s (IIRC) for random. Good SSDs currently get to around 100 MB/s for random single queue, 200MB/s for random with deep queues. Optane gets to 500/700 MB/s for an arm and a leg.
If this can do random access at DRAM speeds, i.e. several GB/s for around the same price as Optane, it could be a winner. As it's over PCIE it doesn't need to be top speed DDR5 RAM. It could use older slower and cheaper DDR4 and still be a winner. Maybe they're using DDR5 for lower heat, lower power use, and higher capacity?
Let's consider possible costs. Mainstream memory runs at about $5 per GB, so 256GB of this would cost around $1200 for the DRAM alone (once DDR5 has become mainstream) then double that for being rare and server level, so it looks like this will cost about $2,500 for 256GB. For a proper server capacity, maybe $10,000 for 1TB.
I'm not so sure this will compete with Optane after all. Optane runs at about $1,000 per TB, considerably cheaper. But there are companies willing to pay for DRAM in the TB levels for their systems.
mode_13h - Tuesday, May 11, 2021 - link
> For a proper server capacity, maybe $10,000 for 1TB.
Think bigger. Modern server CPUs already support multi-TB of direct-attached RAM.
Tomatotech - Tuesday, May 11, 2021 - link
They do yes, but I hear actually installing that much RAM can be a pain. (I don't have any direct experience at that level.) This CXL drive offers a quick way to plug in an extra few TB of almost DRAM level memory.
Calin - Wednesday, May 12, 2021 - link
With servers, you don't plug extra memory. You buy them fully populated, and by the time you need more memory you throw them out and buy new systems (faster processors, faster I/O, lower power for the same performance, higher performance in the same power/cooling budget, more cores for more virtual machines, better security i.e. no/less Spectre/...).
And 512/768GB of RAM would be standard fare for Intel processors without the "extra memory" tax, while 2TB would be accessible by both AMD and Intel.
And the "extra memory" Intel tax, while expensive, pales in comparison to the cost of 2TB of server RAM.
This "CXL" memory is also slower, and behind another level of indirection (OS support, HW support, ...) - so it would need to be either much cheaper to replace normal RAM, or be used above the normal memory.
mode_13h - Wednesday, May 12, 2021 - link
I thought Ice Lake SP didn't have an "extra memory" tax.
schujj07 - Monday, May 17, 2021 - link
All depends on the length of your lease. We started with 512GB RAM in some servers on a multi year lease. A little over a year later our workloads changed and additional RAM was required for these servers. At that point I went and I took out the 16x 32GB RDIMMs and put in 16x 64GB LRDIMMs.
Depending on your Intel CPU there is a RAM tax but that has changed with Ice Lake.
While CXL memory will be slower than DRAM, it will most likely be faster than Optane RAM and available to anyone not just Intel.
Santoval - Wednesday, May 12, 2021 - link
"If this can do random access at DRAM speeds, i.e. several GB/s for around the same price as Optane, it could be a winner."I strongly doubt the IOPS will be anywhere close to that of DRAM from DIMM slots. CXL or not the PCIe link will still be the latency bottleneck. PCIe 5.0 slots will also need to move even closer to the CPU than PCIe 4.0 slots, otherwise their signal integrity will take a severe hit. PCIe 6.0 switches to PAM4 encoding, but PCIe 5.0 will still use NRZ encoding. All else being equal, since PCIe 5.0 just has double the clocks of PCIe 4.0, the signal will be twice as weak (all else will *not* be equal, since PCIe 5.0 motherboards will be optimized for PCIe 5.0, their PCBs will probably have even more layers etc, but you get my point).
CXL.mem devices like the above are intended for capacity and bandwidth (mostly capacity), not latency critical and high IOPS stuff, since latency will be poorer. How much poorer? I have no idea, it might be anywhere between 10 to 100 times worse (I would guess 30 to 50 times worse). What's certain is that since this device uses DDR5 the bottleneck will be in the PCIe link, not the memory. In contrast the bottleneck of non volatile Optane DIMMS lies in Optane itself.