Since NVDIMMs are a relatively recent invention, most people aren't very familiar with them. I'm going to cover a lot of general background information before I start talking about what Panasas is doing.

================================================================

First, some caveats:

- While Panasas has been working on NVDIMMs, I personally have not. What follows is my understanding of the state of the world, but it might not be completely accurate.

- So far, we have been working with "traditional" NVDIMMs: DRAM backed by NAND. **This is not 3D X-Point.** That said, some of the work should be applicable to X-Point DIMMs, when they become available. ("Traditional" is used very loosely here; NVDIMMs were still at the "neat-experiment" stage when I first heard about them in late 2012 or early 2013.)

- The systems we are using can only have a single NVDIMM, due to motherboard/BIOS limitations. Some of what we have done might not work in multi-NVDIMM systems.

- JEDEC has standardized some aspects of identifying DRAM/NAND NVDIMMs and monitoring their power status. However, the motherboard/BIOS we are using does not support that interface; instead, it supports the proprietary interfaces from a few of the NVDIMM vendors. Therefore, that's what our software works with as well.

- Some of our work is upstreamable, but some of it will remain proprietary.

================================================================

Next, a "quick" review of how DRAM/NAND NVDIMMs work:

- Hardware support for NVDIMMs is required in the power supplies and the motherboard.

- The DRAM on the DIMM is connected to the memory bus, as with any other DIMM. However, the BIOS reports the SMAP entries associated with the NVDIMM as having a different type from normal DRAM, so the operating system can handle it differently. In the case of FreeBSD, it is simply ignored by add_smap_entries(), so it doesn't end up in the normal physical memory map. (We aren't using UEFI interfaces at this time; presumably, GetMemoryMap() also flags them with a different type than those used for regular memory, and they are ignored by add_efi_map_entries().)

- During normal operation, the DRAM on the NVDIMM behaves just like regular DRAM. All the usual behaviors WRT CPU caching, memory controller buffers, etc. are the same.

- The power supplies have enough internal energy storage that they can continue to provide good DC output for a few milliseconds after AC input failure is detected. That allows them to assert a power-fail signal to the motherboard and keep it running for a few moments.

- When the motherboard sees the power-fail signal asserted, the CPU is halted and a timer starts; the timer is long enough for the uncommitted write buffers to drain. When the timer expires, the ADR (Asynchronous DRAM Refresh) signal to the DIMM sockets is asserted. Note: while the memory controller's write buffers are drained, the CPU caches are not automatically flushed, so software has to flush them explicitly if it wants its data to be persistent (see the sketch after this list).

- When the NVDIMM controller sees the ADR signal asserted, it disconnects the DRAM from the memory bus and keeps the contents refreshed using a backup power source of some type (a battery, supercap, etc.). After this point, the motherboard can safely lose power.

- While on backup power, the NVDIMM controller copies the contents of the DRAM to the NAND. When the copy is complete, the NVDIMM powers itself off.

- When the system is powered back on, the BIOS eventually queries the NVDIMM for its status. If the NVDIMM indicates that it has a saved memory image, the BIOS can then instruct the NVDIMM to restore the image back into DRAM.

- The BIOS waits for any NVDIMM restore to complete before handing off to the bootloader.
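As noted above, ADR drains the memory controller's write buffers but not the CPU caches, so any code that stores to NVDIMM-backed memory has to push its dirty cache lines out itself before it can consider the data safe. Here is a minimal, illustrative sketch of that flush step (not our code), assuming x86, a 64-byte cache line, and plain CLFLUSH; newer CPUs would use CLFLUSHOPT or CLWB instead:

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHE_LINE_SIZE 64	/* assumed; real code would query CPUID */

/*
 * Flush every cache line covering [addr, addr + len) out to the memory
 * controller, then fence so later stores can't pass the flushes.  After
 * this returns, the ADR sequence is enough to make the data persistent.
 */
static void
flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHE_LINE_SIZE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHE_LINE_SIZE)
		_mm_clflush((const void *)p);
	_mm_sfence();
}
```

This is essentially what pmem_persist() in the pmem.io libraries does, using CLFLUSHOPT or CLWB where the CPU supports them.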
--------------------------------

Also, a quick review of how X-Point NVDIMMs are supposed to work. I don't have any first-hand experience with these, so this is based on publicly-available information:

- As with DRAM/NAND NVDIMMs, they're connected to the memory bus, and the BIOS reports them with yet another different type for SMAP/GetMemoryMap().

- CPU caching, memory controller buffers, etc. are the same as for DRAM. (I think?)

- Unlike DRAM, X-Point is inherently stable, so nothing needs to be done to either save or restore the contents. Once it's in the DIMM, that's it.

================================================================

Now, on to what Panasas has been doing!

The first thing to note is that raw non-volatile memory is not actually all that useful. In particular, consider what happens if you're in the middle of updating a data structure in non-volatile memory when you lose power. When the system comes back up, that structure is only partially updated. So, you need some sort of transactional layer on top.

The folks at pmem.io have lots of stuff related to that. I haven't looked too closely, but my understanding is that it should work (with varying degrees of efficiency) on any storage device node.

Conveniently, Panasas has implemented a basic NVDIMM driver, which exposes the NVDIMM as a device node. It locates the SMAP, looks for entries of the correct (proprietary) type, and maps them into KVA with pmap_mapdev_attr(PAT_WRITE_BACK). It also sets up handlers for open/read/write/mmap/ioctl; a rough sketch of that sort of driver follows below. Though we haven't tested it, I believe that is sufficient to work with the pmem.io libraries.
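To make that a little more concrete, here is roughly what such a device-node wrapper looks like on FreeBSD. This is an illustrative sketch only, not our actual driver: the nvd_* names are invented, it assumes the NVDIMM's physical base and length have already been pulled out of the SMAP (see the sketch at the end of this message), and the open/write/ioctl handlers are omitted for brevity.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <sys/conf.h>
#include <sys/uio.h>
#include <sys/errno.h>
#include <vm/vm.h>
#include <vm/pmap.h>

static vm_paddr_t	nvd_base;	/* physical base of the NVDIMM range */
static vm_size_t	nvd_size;	/* length of the range */
static void		*nvd_kva;	/* write-back kernel mapping */
static struct cdev	*nvd_dev;

static int
nvd_read(struct cdev *dev, struct uio *uio, int ioflag)
{
	size_t n;

	if (uio->uio_offset < 0 || (vm_size_t)uio->uio_offset >= nvd_size)
		return (0);
	n = MIN(uio->uio_resid, nvd_size - uio->uio_offset);
	return (uiomove((char *)nvd_kva + uio->uio_offset, n, uio));
}

static int
nvd_mmap(struct cdev *dev, vm_ooffset_t offset, vm_paddr_t *paddr,
    int nprot, vm_memattr_t *memattr)
{
	if (offset < 0 || (vm_size_t)offset >= nvd_size)
		return (EINVAL);
	*paddr = nvd_base + offset;
	*memattr = VM_MEMATTR_WRITE_BACK;	/* cacheable, like normal DRAM */
	return (0);
}

static struct cdevsw nvd_cdevsw = {
	.d_version =	D_VERSION,
	.d_name =	"nvdimm",
	.d_read =	nvd_read,
	.d_mmap =	nvd_mmap,
	/* d_open, d_write, and d_ioctl omitted for brevity */
};

static int
nvd_modevent(module_t mod, int type, void *arg)
{
	switch (type) {
	case MOD_LOAD:
		/*
		 * Map the NVDIMM into KVA; VM_MEMATTR_WRITE_BACK is the
		 * newer spelling of PAT_WRITE_BACK.
		 */
		nvd_kva = pmap_mapdev_attr(nvd_base, nvd_size,
		    VM_MEMATTR_WRITE_BACK);
		nvd_dev = make_dev(&nvd_cdevsw, 0, UID_ROOT, GID_WHEEL,
		    0600, "nvdimm0");
		return (0);
	case MOD_UNLOAD:
		destroy_dev(nvd_dev);
		pmap_unmapdev((vm_offset_t)nvd_kva, nvd_size);
		return (0);
	default:
		return (EOPNOTSUPP);
	}
}

static moduledata_t nvd_mod = { "nvdimm", nvd_modevent, NULL };
DECLARE_MODULE(nvdimm, nvd_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
```

Because the d_mmap handler hands back the raw physical address with a write-back memory attribute, anything that mmap()s the device node, the pmem.io libraries included, gets cacheable, direct access to the NVDIMM.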
That said, we aren't using the pmem.io code; it's a user-space library, and we need persistent-memory access from the kernel. Instead, we implemented a persistent-memory library which works both in the kernel and in user space, and used that to create a custom persistent-memory filesystem (PMFS). It's even POSIX compliant - one of the things the developer did was use it for builds!

Some known problems, limitations, etc. are:

- Right now, the driver only works with SMAP, so we can't use UEFI. That's certainly doable, but we haven't had a reason to do it yet. Furthermore, it's looking for a proprietary type rather than the JEDEC one; that too is a doable change.

- Getting at the SMAP info is a bit complicated, requiring multiple preload_search_* calls (see the sketch after this list). It would be nice if there were a more convenient way.

- Right now, we don't have anything which monitors the status of the DRAM/NAND NVDIMM's backup power source. This will eventually use the NVDIMM vendor's proprietary API, but again, we can implement the JEDEC version when we have JEDEC-compatible hardware, firmware, and docs.

- We open and map files on PMFS from within the kernel. The kernel doesn't allow page faults to be serviced from mappings within the kernel's vm_map, but does allow them to be serviced from a sub-map. Since memory-mapped files are demand-paged, we have to create and destroy sub-maps. Unfortunately, the KPI for creating a sub-map, vm_map_create(), does not have a corresponding way to destroy them. On top of that, there's a limit of MAX_KMAP=10 objects in the "mapzone" UMA zone. We've put in some dirty hacks to get around this, but it's only needed if you're memory-mapping the files into kernel space.

- We bypass the VM cache, since PMFS is already memory-speed and we want accesses to go directly against the NVDIMM address space. Again, we've hacked around this, but we need to come up with a proper solution that's upstreamable.
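For reference, the preload_search_* dance mentioned above looks roughly like this; it mirrors what the amd64 machdep code does at boot. NVDIMM_SMAP_TYPE is a placeholder for the vendor's proprietary E820 type, and the function is illustrative rather than our actual code:

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/linker.h>
#include <machine/metadata.h>
#include <machine/pc/bios.h>

#define NVDIMM_SMAP_TYPE	0x99	/* placeholder, not a real value */

static void
nvdimm_scan_smap(void)
{
	caddr_t kmdp;
	struct bios_smap *smapbase, *smap, *smapend;
	uint32_t smapsize;

	/* Find the kernel's preloaded metadata. */
	kmdp = preload_search_by_type("elf kernel");
	if (kmdp == NULL)
		kmdp = preload_search_by_type("elf64 kernel");
	if (kmdp == NULL)
		return;

	/* Pull the SMAP blob out of the metadata. */
	smapbase = (struct bios_smap *)preload_search_info(kmdp,
	    MODINFO_METADATA | MODINFOMD_SMAP);
	if (smapbase == NULL)
		return;

	/* The metadata blob is preceded by its length. */
	smapsize = *((uint32_t *)smapbase - 1);
	smapend = (struct bios_smap *)((uintptr_t)smapbase + smapsize);

	for (smap = smapbase; smap < smapend; smap++) {
		if (smap->type == NVDIMM_SMAP_TYPE)
			printf("NVDIMM range: base 0x%jx length 0x%jx\n",
			    (uintmax_t)smap->base, (uintmax_t)smap->length);
	}
}
```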