Since NVDIMMs are a relatively recent invention, most people aren't very familiar with them. I'm going to cover a lot of general background information before I start talking about what Panasas is doing.

================================================================

First, some caveats:

- While Panasas has been working on NVDIMMs, I personally have not. What follows is my understanding of the state of the world, but it might not be completely accurate.

- So far, we have been working with "traditional" NVDIMMs: DRAM backed by NAND. **This is not 3D X-Point.** That said, some of the work should be applicable to X-Point DIMMs, when they become available. ("Traditional" is used very loosely here; NVDIMMs were still at the "neat-experiment" stage when I first heard about them in late 2012 or early 2013.)

- The systems we are using can only have a single NVDIMM, due to motherboard/BIOS limitations. Some of what we have done might not work in multi-NVDIMM systems.

- JEDEC has standardized some aspects of identifying DRAM/NAND NVDIMMs and monitoring their power status. However, the motherboard/BIOS we are using does not support that interface; instead, it supports the proprietary interfaces from a few of the NVDIMM vendors. Therefore, that's what our software works with as well.

- Some of our work is upstreamable, but some of it will remain proprietary.

================================================================

Next, a "quick" review of how DRAM/NAND NVDIMMs work:

- Hardware support for NVDIMMs is required in the power supplies and the motherboard.

- The DRAM on the DIMM is connected to the memory bus, as with any other DIMM. However, the BIOS reports the SMAP entries associated with the NVDIMM as having a different type from normal DRAM, so the operating system can handle it differently. In the case of FreeBSD, it is simply ignored by add_smap_entries(), so it doesn't end up in the normal physical memory map. (We aren't using UEFI interfaces at this time; presumably, GetMemoryMap() also flags them with a different type than those used for regular memory, and they are ignored by add_efi_map_entries().)

- During normal operation, the DRAM on the NVDIMM behaves just like regular DRAM. All the usual behaviors WRT CPU caching, memory controller buffers, etc. are the same.

- The power supplies have enough internal energy storage that they can continue to provide good DC output for a few milliseconds after AC input failure is detected. That allows them to assert a power-fail signal to the motherboard and keep it running for a few moments.

- When the motherboard sees the power-fail signal asserted, the CPU is halted and a timer starts; the timer is long enough for the uncommitted write buffers to drain. When the timer expires, the ADR (Asynchronous DRAM Refresh) signal to the DIMM sockets is asserted. Note: while the memory controller's write buffers are drained, the CPU caches are not automatically flushed, so software has to flush them explicitly if it wants its data to be persistent (see the sketch after this list).

- When the NVDIMM controller sees the ADR signal asserted, it disconnects the DRAM from the memory bus and keeps the contents refreshed using a backup power source of some type (a battery, supercap, etc.). After this point, the motherboard can safely lose power.

- While on backup power, the NVDIMM controller copies the contents of the DRAM to the NAND. When the copy is complete, the NVDIMM powers itself off.

- When the system is powered back on, the BIOS eventually queries the NVDIMM for its status. If the NVDIMM indicates that it has a saved memory image, the BIOS can then instruct the NVDIMM to restore the image back into DRAM.

- The BIOS waits for any NVDIMM restore to complete before handing off to the bootloader.
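As noted above, ADR drains the memory controller's write buffers but not the CPU caches, so any code that stores to NVDIMM-backed memory has to push its dirty cache lines out itself before it can consider the data safe. Here is a minimal, illustrative sketch of that flush step (not our code), assuming x86, a 64-byte cache line, and plain CLFLUSH; newer CPUs would use CLFLUSHOPT or CLWB instead:

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHE_LINE_SIZE 64	/* assumed; real code would query CPUID */

/*
 * Flush every cache line covering [addr, addr + len) out to the memory
 * controller, then fence so later stores can't pass the flushes.  After
 * this returns, the ADR sequence is enough to make the data persistent.
 */
static void
flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHE_LINE_SIZE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHE_LINE_SIZE)
		_mm_clflush((const void *)p);
	_mm_sfence();
}
```

This is essentially what pmem_persist() in the pmem.io libraries does, using CLFLUSHOPT or CLWB where the CPU supports them.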
--------------------------------

Also, a quick review of how X-Point NVDIMMs are supposed to work. I don't have any first-hand experience with these, so this is based on publicly-available information:

- As with DRAM/NAND NVDIMMs, they're connected to the memory bus, and the BIOS reports them with yet another different type for SMAP/GetMemoryMap().

- CPU caching, memory controller buffers, etc. are the same as for DRAM. (I think?)

- Unlike DRAM, X-Point is inherently stable, so nothing needs to be done to either save or restore the contents. Once it's in the DIMM, that's it.

================================================================

Now, on to what Panasas has been doing!

The first thing to note is that raw non-volatile memory is not actually all that useful. In particular, consider what happens if you're in the middle of updating a data structure in non-volatile memory when you lose power. When the system comes back up, that structure is only partially updated. So, you need some sort of transactional layer on top.

The folks at pmem.io have lots of stuff related to that. I haven't looked too closely, but my understanding is that it should work (with varying degrees of efficiency) on any storage device node.

Conveniently, Panasas has implemented a basic NVDIMM driver, which exposes the NVDIMM as a device node. It locates the SMAP, looks for entries of the correct (proprietary) type, and maps them into KVA with pmap_mapdev_attr(PAT_WRITE_BACK). It also sets up handlers for open/read/write/mmap/ioctl; a rough sketch of that sort of driver follows below. Though we haven't tested it, I believe that is sufficient to work with the pmem.io libraries.
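To make that a little more concrete, here is roughly what such a device-node wrapper looks like on FreeBSD. This is an illustrative sketch only, not our actual driver: the nvd_* names are invented, it assumes the NVDIMM's physical base and length have already been pulled out of the SMAP (see the sketch at the end of this message), and the open/write/ioctl handlers are omitted for brevity.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <sys/conf.h>
#include <sys/uio.h>
#include <sys/errno.h>
#include <vm/vm.h>
#include <vm/pmap.h>

static vm_paddr_t	nvd_base;	/* physical base of the NVDIMM range */
static vm_size_t	nvd_size;	/* length of the range */
static void		*nvd_kva;	/* write-back kernel mapping */
static struct cdev	*nvd_dev;

static int
nvd_read(struct cdev *dev, struct uio *uio, int ioflag)
{
	size_t n;

	if (uio->uio_offset < 0 || (vm_size_t)uio->uio_offset >= nvd_size)
		return (0);
	n = MIN(uio->uio_resid, nvd_size - uio->uio_offset);
	return (uiomove((char *)nvd_kva + uio->uio_offset, n, uio));
}

static int
nvd_mmap(struct cdev *dev, vm_ooffset_t offset, vm_paddr_t *paddr,
    int nprot, vm_memattr_t *memattr)
{
	if (offset < 0 || (vm_size_t)offset >= nvd_size)
		return (EINVAL);
	*paddr = nvd_base + offset;
	*memattr = VM_MEMATTR_WRITE_BACK;	/* cacheable, like normal DRAM */
	return (0);
}

static struct cdevsw nvd_cdevsw = {
	.d_version =	D_VERSION,
	.d_name =	"nvdimm",
	.d_read =	nvd_read,
	.d_mmap =	nvd_mmap,
	/* d_open, d_write, and d_ioctl omitted for brevity */
};

static int
nvd_modevent(module_t mod, int type, void *arg)
{
	switch (type) {
	case MOD_LOAD:
		/*
		 * Map the NVDIMM into KVA; VM_MEMATTR_WRITE_BACK is the
		 * newer spelling of PAT_WRITE_BACK.
		 */
		nvd_kva = pmap_mapdev_attr(nvd_base, nvd_size,
		    VM_MEMATTR_WRITE_BACK);
		nvd_dev = make_dev(&nvd_cdevsw, 0, UID_ROOT, GID_WHEEL,
		    0600, "nvdimm0");
		return (0);
	case MOD_UNLOAD:
		destroy_dev(nvd_dev);
		pmap_unmapdev((vm_offset_t)nvd_kva, nvd_size);
		return (0);
	default:
		return (EOPNOTSUPP);
	}
}

static moduledata_t nvd_mod = { "nvdimm", nvd_modevent, NULL };
DECLARE_MODULE(nvdimm, nvd_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
```

Because the d_mmap handler hands back the raw physical address with a write-back memory attribute, anything that mmap()s the device node, the pmem.io libraries included, gets cacheable, direct access to the NVDIMM.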
That said, we aren't using the pmem.io code; it's a user-space library, and we need persistent-memory access from the kernel. Instead, we implemented a persistent-memory library which works both in the kernel and in user space, and used that to create a custom persistent-memory filesystem (PMFS). It's even POSIX compliant - one of the things the developer did was use it for builds!

Some known problems, limitations, etc. are:

- Right now, the driver only works with SMAP, so we can't use UEFI. That's certainly doable, but we haven't had a reason to do it yet. Furthermore, it's looking for a proprietary type rather than the JEDEC one; that too is a doable change.

- Getting at the SMAP info is a bit complicated, requiring multiple preload_search_* calls (see the sketch after this list). It would be nice if there were a more convenient way.

- Right now, we don't have anything which monitors the status of the DRAM/NAND NVDIMM's backup power source. This will eventually use the NVDIMM vendor's proprietary API, but again, we can implement the JEDEC version when we have JEDEC-compatible hardware, firmware, and docs.

- We open and map files on PMFS from within the kernel. The kernel doesn't allow page faults to be serviced from mappings within the kernel's vm_map, but does allow them to be serviced from a sub-map. Since memory-mapped files are demand-paged, we have to create and destroy sub-maps. Unfortunately, the KPI for creating a sub-map, vm_map_create(), does not have a corresponding way to destroy them. On top of that, there's a limit of MAX_KMAP=10 objects in the "mapzone" UMA zone. We've put in some dirty hacks to get around this, but it's only needed if you're memory-mapping the files into kernel space.

- We bypass the VM cache, since PMFS is already memory-speed and we want accesses to go directly against the NVDIMM address space. Again, we've hacked around this, but we need to come up with a proper solution that's upstreamable.
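For reference, the preload_search_* dance mentioned above looks roughly like this; it mirrors what the amd64 machdep code does at boot. NVDIMM_SMAP_TYPE is a placeholder for the vendor's proprietary E820 type, and the function is illustrative rather than our actual code:

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/linker.h>
#include <machine/metadata.h>
#include <machine/pc/bios.h>

#define NVDIMM_SMAP_TYPE	0x99	/* placeholder, not a real value */

static void
nvdimm_scan_smap(void)
{
	caddr_t kmdp;
	struct bios_smap *smapbase, *smap, *smapend;
	uint32_t smapsize;

	/* Find the kernel's preloaded metadata. */
	kmdp = preload_search_by_type("elf kernel");
	if (kmdp == NULL)
		kmdp = preload_search_by_type("elf64 kernel");
	if (kmdp == NULL)
		return;

	/* Pull the SMAP blob out of the metadata. */
	smapbase = (struct bios_smap *)preload_search_info(kmdp,
	    MODINFO_METADATA | MODINFOMD_SMAP);
	if (smapbase == NULL)
		return;

	/* The metadata blob is preceded by its length. */
	smapsize = *((uint32_t *)smapbase - 1);
	smapend = (struct bios_smap *)((uintptr_t)smapbase + smapsize);

	for (smap = smapbase; smap < smapend; smap++) {
		if (smap->type == NVDIMM_SMAP_TYPE)
			printf("NVDIMM range: base 0x%jx length 0x%jx\n",
			    (uintmax_t)smap->base, (uintmax_t)smap->length);
	}
}
```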