Inside OCZ's Factory: How SSDs Are Made
by Kristian Vättö on May 20, 2015 8:30 AM EST
Now that the hardware side of the drive is ready, it's time to put some intelligence (the firmware) inside.
The firmware download is done on custom PC setups that consist of normal PC hardware (if you look closely, you can see ASUS' logo on a motherboard or two) running some sort of Linux distro with OCZ's custom firmware download tool. If you zoom into the monitor, you can see that in this case the system is applying firmware to 240GB ARC100 drives.
Once the firmware has been loaded, the drives move on to run-in testing. OCZ has developed a custom script that writes and reads all LBAs eight times with the purpose of identifying bad blocks. If a drive has more bad blocks than a preset threshold allows, it will be pulled aside and either fixed or destroyed. The script also tests performance using common benchmarking tools (e.g. AS-SSD and ATTO) to ensure that all drives meet the spec.
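OCZ's actual script isn't public, but the run-in pass described above can be sketched in a few lines: write a known pattern to every LBA, read it back, and flag any LBA whose contents don't match. Everything here (sector size, fill patterns, the reject threshold) is an illustrative assumption, not OCZ's real parameters.

```python
SECTOR = 512          # bytes per LBA (illustrative)
PASSES = 8            # the article says all LBAs are exercised eight times
BAD_BLOCK_LIMIT = 16  # hypothetical reject threshold

def run_in_test(dev_path, total_lbas):
    """Write and read back every LBA PASSES times; return the set of failing LBAs."""
    bad = set()
    for p in range(PASSES):
        # Use a different fill byte each pass so stuck cells aren't masked
        pattern = bytes([p * 0x11]) * SECTOR
        with open(dev_path, "r+b", buffering=0) as dev:
            for lba in range(total_lbas):
                dev.seek(lba * SECTOR)
                dev.write(pattern)
            for lba in range(total_lbas):
                dev.seek(lba * SECTOR)
                if dev.read(SECTOR) != pattern:
                    bad.add(lba)
    return bad

# A drive "passes" run-in if its bad-LBA count stays under the threshold:
# passed = len(run_in_test("/dev/sdX", lba_count)) < BAD_BLOCK_LIMIT
```

A real factory tool would of course talk to the raw block device (or the controller directly) and track bad NAND blocks rather than individual LBAs, but the structure of the loop is the same.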
Currently OCZ has two different test setups. One half of the test systems are regular PCs that are very similar to the firmware download systems, whereas the other half are the custom racks pictured above. OCZ is looking to move all testing to rack-based cabinets, since one cabinet can test 256 drives simultaneously, which is far more efficient than having dozens of PC setups around that can only test a handful of drives each at a time. The test regime is the same in both cases, so it's purely a matter of space and labor efficiency.
At the moment SATA-based drives are tested through the host, which means that the IO commands are sent by the host, similar to how we test SSDs. For PCIe drives, however, OCZ is developing a Manufacturing Self Test (MST), which is essentially a custom firmware loaded into the drive that then reads and writes all LBAs to test for bad blocks. The benefit of MST is that it bypasses the host interface (i.e. all IO commands are generated by the controller/firmware), making the test cycle faster as the host overhead is removed.
Additionally, every month a sample of finished drives goes through a more rigorous set of tests called Ongoing Reliability Testing (ORT) to ensure that nothing has changed in production quality. The tests consist of a Thermal Cycle Test (TCT), where the drive is subjected to thermal shocks to validate the quality of manufacturing, and a Reliability Demonstration Test (RDT), where drives are tested at elevated temperature (~70°C) to demonstrate that the mean time between failures (MTBF) meets the specification.
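The article doesn't detail the math behind RDT, but elevated-temperature reliability demonstrations are typically justified with the Arrhenius acceleration model: each hour at ~70°C counts as several hours at a normal operating temperature, compressing the time needed to demonstrate a given MTBF. A minimal sketch, with an assumed activation energy and use temperature (not OCZ's figures):

```python
import math

K_BOLTZMANN = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev=0.7):
    """Acceleration factor between use and stress temperatures (Arrhenius model).

    ea_ev: assumed activation energy in eV; 0.7 eV is a common placeholder
    for semiconductor wear-out mechanisms, not a measured value.
    """
    t_use = t_use_c + 273.15      # convert to Kelvin
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN) * (1.0 / t_use - 1.0 / t_stress))

# Testing at 70°C versus a hypothetical 40°C use temperature:
af = arrhenius_af(40, 70)
# Each test hour at 70°C then stands in for roughly `af` hours at 40°C,
# so the demonstration takes about 1/af as long as testing at use temperature.
```

With these assumed numbers the acceleration factor comes out to roughly an order of magnitude, which is why a few weeks of hot testing can stand in for a much longer field interval.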
The run-in testing hasn't changed much since Toshiba took over, but Toshiba did help OCZ align with its quality standards. All the processes running today have been inspected by Toshiba and meet the strict standards set by the company. Note that the purpose of run-in testing isn't to screen for firmware bugs, but to ensure that the hardware is functional. Firmware development and validation are done before mass production begins, and since the Toshiba takeover OCZ has modified its development process to increase the quality and reliability of its products.
OCZ's whole philosophy has actually changed since the previous CEO left the company. In the past, OCZ always tried to be first to market at any cost and tried to cover every possible micro-niche, which resulted in too many product lines for the resources OCZ had. Nowadays OCZ is putting a lot of effort into product qualification and no longer has a dozen products in development at the same time, meaning that there are now sufficient resources to properly validate every product before it enters mass production.
The run-in testing may seem light with only eight full LBA read/write passes, but honestly I don't think it's necessary to hammer a drive for days because any apparent hardware flaw should surface very quickly. Basically, the hardware either works or it doesn't, and once the drive leaves the factory it's more likely to fail due to a firmware anomaly than a physical hardware failure.
caleblloyd - Tuesday, May 19, 2015
Pagination links are broken, on mobile at least... Can't navigate to page 2 to see the factory :(
Kristian Vättö - Wednesday, May 20, 2015
On my end everything seems to work fine (even on mobile). What happens if you try to access the second page directly?
close - Wednesday, May 20, 2015
I have to ask, as some things look surprising to me:
1) So every new SSD already has 8 times its capacity in data already written to it? Or is it just QC and batch testing?
2) I always imagined the FW write process as being automated. But this looks like a lot of manual work to connect each drive by hand and write the FW. Again, is this the standard process or only during the initial testing phases?
close - Wednesday, May 20, 2015
And on the same note, I always assumed the labeling process is automated. Either they have really low volume or labor is THAT cheap.
menting - Wednesday, May 20, 2015
i'm not the official answer, but it should already have 8 times the capacity of data written in, and then the firmware should zero out the counts.
close - Wednesday, May 20, 2015
What I'm not sure is if this happens to all drives or to selected drives, assuming that if a few drives are OK the whole batch must be. Also, the testing is done after writing the FW. Is the FW "pre-configured" to ignore the first 8 writes per LBA or do they go through connecting them to PCs all over again to reset the written data counter?
dreamslacker - Wednesday, May 20, 2015
They would do it for every SSD. The actual usable capacity of the modules isn't fixed or a known quantity until you actually test every cell. During this phase, you will also know which cells are 'bad' and whether to discard/repair the SSD if the remaining usable cell count is lower than the set limits.
The usable cells will then be mapped into the table so the controller knows what cells to avoid using.
This procedure is done on mechanical disk drives too since the actual platter capacity isn't a fixed number either.
As for the write or test process, it depends on the volume and the manufacturer. If volumes are high enough, you might not even have workers handling the F/W write or test process. A fully automated robotic arm and conveyor belt system would handle the drives and label them accordingly. Leaving the workers to package the drives.
MikhailT - Wednesday, May 20, 2015
1. Correct, this is what is known as the "burn-in" period. You have to write to every single NAND die or even hard drive platter a few times to make sure it is working. Many companies burn in computers as well: they finish building one and then run a custom automated tool to benchmark it severely for several hours before they can ship it to you.
Think about electronics: 90% of defects (in my experience and that of others I've talked to) are usually found within the first few days of use. That's usually a sign that the company did not properly burn in/test the device before shipping it to you.
2. It depends on the experience of the company. It costs a lot of money (machines are expensive and you have to hire people to figure these things out) to start automating this stuff, and it's actually cheaper initially to do it by hand when you have less volume to work with. As you get more money from your business revenue and volume starts to ramp, you then hire a few folks to figure out how to automate things; if it is cheaper and worth it, you then invest hundreds of thousands of dollars or millions to buy the equipment. That's why the first page talks about committing funds on the order of millions of dollars in phase 3.
close - Thursday, May 21, 2015
I assumed this is done before assembling the product. So you bin the chips, check them for errors, etc. before you solder them to a PCB. This way even if you're not the manufacturer of the NAND you still get to differentiate between chips and put the better ones in better products.
If you do the burn in and checking for defects AFTER they're soldered you're basically guaranteeing that all defects will be remedied at extra cost.
Kristian Vättö - Thursday, May 21, 2015
NAND binning is usually done by the NAND manufacturer or packager, but there may (or actually will) still be bad blocks. The purpose of run-in testing is to identify the bad blocks so that the controller won't use them for storage, as that could potentially lead to performance issues or data loss.