The pNFS service is now in FreeBSD-12 and newer. In FreeBSD-13, it also supports NFSv4.2. An MDS server that supports NFSv4.2 mixed with DSs that do not all support NFSv4.2 will not work correctly. When upgrading to FreeBSD-13 (or a pre-release from main/current), upgrade the MDS after upgrading all the DSs. The mounts from the MDS to the DSs must all use NFSv4.2 if the MDS supports NFSv4.2 ("minorversion=2" instead of "minorversion=1" on the mounts).

The remainder of this document assumes FreeBSD-13, which supports NFSv4.2. (For FreeBSD-12, replace all occurrences of "minorversion=2" with "minorversion=1".)

Overall Goal

A pNFS service separates the Read/Write operations from all the other NFSv4.n Metadata operations. It is hoped that this separation allows a pNFS service to be configured that exceeds the limits of a single NFS server for storage capacity and/or I/O bandwidth.

It is possible to configure mirroring within the data servers (DSs) so that the data storage file for an MDS file will be mirrored on two or more of the DSs. When this is used, failure of a DS will not stop the pNFS service, and a failed DS can be recovered once repaired while the pNFS service continues to operate. Although two way mirroring would be the norm, it is possible to set a mirroring level of up to four. This could be increased by recompiling with NFSDEV_MAXMIRRORS set to a larger value. The Metadata server will always be a single point of failure, just as a single NFS server is.

Overview of Plan B

A Plan B pNFS service consists of a single MetaData Server (MDS) and K Data Servers (DSs), all of which are FreeBSD-12 or newer systems. Clients will mount the MDS as they would a single NFS server.

When files are created, the MDS creates a file tree identical to what a single NFS server creates, except that all the regular (VREG) files will be empty. As such, if you look at the exported tree directly on the MDS server (not via an NFS mount), the files will all be of size 0. Each of these files will also have two extended attributes in the system attribute name space:

pnfsd.dsfile - This extended attribute stores the information that the MDS needs to find the data storage file(s) on DS(s) for this file.
pnfsd.dsattr - This extended attribute stores the Size, AccessTime, ModifyTime and Change attributes for the file, so that the MDS doesn't need to acquire the attributes from the DS for every Getattr operation.

For each regular (VREG) file, the MDS creates a data storage file on one (or more if mirroring is enabled) of the DSs, in one of the "dsNN" subdirectories. The name of this file is the file handle of the file on the MDS in hexadecimal, so that the name is unique.

The DSs use subdirectories named "ds0" to "dsN" so that no one directory gets too large. The value of "N" is set via the sysctl vfs.nfsd.dsdirsize on the MDS, with the default being 20. For production servers that will store a lot of files, this value should probably be much larger. It can be increased when the "nfsd" daemon is not running on the MDS, once the "dsK" directories are created on all of the DSs, as sketched below.
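For example, here is a minimal sketch of raising vfs.nfsd.dsdirsize from the default of 20 to 100 (the target value of 100 is just an illustration; it reuses the jot(1) idiom shown in the DS setup below):

On each DS, in the exported storage directory, create the additional directories:
# jot -w ds 80 20 | xargs mkdir -m 700
(this creates "ds20" through "ds99")

Then, on the MDS with the nfsd daemon not running:
# sysctl vfs.nfsd.dsdirsize=100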
For pNFS aware NFSv4.1/4.2 clients, the FreeBSD server will return two pieces of information to the client that allow it to do I/O directly to the DS:

DeviceInfo - This is relatively static information that defines what a DS is. The critical bits of information returned by the FreeBSD server are the IP address of the DS and, for the Flexible File layout, that NFSv4.1/4.2 is to be used to do I/O on the DS and that it is "tightly coupled". There is a "deviceid" which identifies the DeviceInfo.

Layout - This is per file and can be recalled by the server when it is no longer valid. For the FreeBSD server, there is support for two types of layout, called File and Flexible File layout. Both allow the client to do I/O on the DS via NFSv4.1/4.2 I/O operations. The Flexible File layout is a more recent variant that allows specification of mirrors, where the client is expected to do writes to all mirrors to maintain them in a consistent state. The Flexible File layout also allows the client to report I/O errors for a DS back to the MDS, so that the DS can be disabled if it is mirrored.

The Flexible File layout supports two variants referred to as "tightly coupled" vs "loosely coupled". The FreeBSD server always uses the "tightly coupled" variant, where the client uses the same credentials to do I/O on the DS as it would on the MDS. The FreeBSD DSs maintain the same ownership, mode and ACL on the data storage file as the corresponding file on the MDS, so that the DSs can apply the same permission checking as the MDS does.

The FreeBSD server does not do striping and always returns layouts for the entire file. The critical information in a layout is Read vs Read/Write and the DeviceID(s) that identify which DS(s) the data is stored on, along with the file handle(s) for the data storage file on the DS(s).

At this time, for the non-mirrored DS case, the MDS generates File Layout layouts to NFSv4.1/4.2 clients that know how to do pNFS, unless the sysctl vfs.nfsd.default_flexfile is set non-zero, in which case Flexible File layouts are generated. A mirrored DS configuration always generates Flexible File layouts.

For NFS clients that do not support NFSv4.1/4.2 pNFS, all I/O operations are done against the MDS, which acts as a proxy for the appropriate DS(s). When the MDS receives an I/O RPC, it will do the RPC on the DS(s) as a proxy.

Multiple DSs can be on the same FreeBSD server, but the DSs must be on system(s) separate from the MDS. When the MDS is configured, DSs can be used either to store data files for all exported file systems on the MDS or for a specific exported file system on the MDS. The latter configuration allows a system administrator to limit allocation of data storage for an exported file system to specific file system(s) on specific DS(s).

For the case where the DS(s) are assigned to specific MDS exported file systems, a system administrator may find it useful to configure multiple DSs on the same system, using separate file systems on this system for each of the multiple DSs. Each of these DSs must be accessed via a separate IP address, since only one DS mount per IP address is allowed. This can be done via alias addresses assigned to the same network interface or via multiple network interfaces on the system, as sketched below. (This configuration can also be useful for testing.)
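For example, a minimal sketch of adding an alias address for a second DS on the same system (the interface name "em0" and the address are illustrations only; adjust them for your network):

# ifconfig em0 inet 192.168.1.45 netmask 255.255.255.255 alias

or persistently, in /etc/rc.conf:
ifconfig_em0_alias0="inet 192.168.1.45 netmask 255.255.255.255"

The second DS's export would then be mounted from the MDS via this alias address.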
To do testing, you will need to use a FreeBSD-12 or newer system.

Setting up a FreeBSD pNFS server using Plan B

- I have only tested AUTH_SYS, although I do not know a reason why Kerberized mounts won't work. For Kerberized mounts, nfsuserd(8) must be used to map between uid/gid and owner/owner_group names. This can also be done for AUTH_SYS mounts, but I find it easier to set:
vfs.nfs.enable_uidtostring=1
vfs.nfsd.enable_stringtouid=1
in /etc/sysctl.conf, so that the uid/gid numbers go on the wire as strings. (When you do this, there is no reason to run the nfsuserd(8) daemon unless you are using it with the "-manage-gids" option.)
*** Note that you must use one of the two above methods for mapping user/group names. If this mapping is not working correctly, the server will be badly broken, because it will not be able to set attributes on the data files on the DS(s). The default of "nobody" will not work correctly.
- As with other NFS setups, all the servers and clients need to have common user/uid and group/gid databases.

On the DS systems:
(All commands need to be done by root/su.)
The DSs are configured like a normal NFS server, with the following:
- There needs to be an exported directory with empty directories in it with the names:
ds0, ds1, ds2, ds3, ds4, ds5, ds6, ds7, ds8, ds9, ds10, ds11, ds12, ds13, ds14, ds15, ds16, ds17, ds18, ds19
(More subdirectories, with names ds20,..., if vfs.nfsd.dsdirsize has been increased from the default of 20 on the MDS, as above.)
This command, done in each of the exported DS directories, will create them:
# jot -w ds 20 0 | xargs mkdir -m 700
(Replace 20 with the value of vfs.nfsd.dsdirsize if you have increased it.)
- The exported directory must be mountable by the MDS via NFSv4.1/4.2 with the "-maproot=root" option on the /etc/exports line. It must also be exported to the clients, but the "-maproot=root" export option is not required for them. Assuming the exported directory is called "/DSstore", AUTH_SYS is being used and all clients are on the 192.168.1 subnet, a typical /etc/exports on the DSs might be:
/DSstore -sec=sys -maproot=root nfsv4-mds
/DSstore -sec=sys -network 192.168.1.0 -mask 255.255.255.0
V4: /DSstore -network 192.168.1.0 -mask 255.255.255.0
(If multiple DSs are being configured on the system, there needs to be a separate exported directory with empty "dsN" directories in it for each DS. Each of these exported directories would normally be on a separate DS file system, with separate export lines in /etc/exports.)
- The two sysctls:
vfs.nfsd.enable_nobodycheck
vfs.nfsd.enable_nogroupcheck
should both normally be set to 0 on the DS (and maybe the MDS as well). This allows files to be correctly created by unknown users, where the user/group name gets mapped to "nobody"/"nogroup".
In other respects, the DSs are configured the same as a normal NFS server, with no "-p" or "-m" options on the nfsd daemon.
- Since I choose to use uid/gid numbers in the strings and not run the nfsuserd, I set the sysctls listed below.
In /etc/rc.conf I have:
nfs_server_enable="YES"
nfsv4_server_enable="YES"
nfsv4_server_only="YES"
nfs_server_flags="-t -n 32"
- You also need to make sure that mountd_flags has the "-S" option, but that is normally the default.
In /etc/sysctl.conf I have:
vfs.nfsd.enable_stringtouid=1
vfs.nfs.enable_uidtostring=1
vfs.nfsd.enable_nobodycheck=0
vfs.nfsd.enable_nogroupcheck=0
Alternately, you can run the nfsuserd by adding this line to your /etc/rc.conf instead of the first two of the above lines in /etc/sysctl.conf:
nfsuserd_enable="YES"
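Before configuring the MDS, it can be worth sanity checking each DS export by hand. A minimal sketch, assuming a DS called nfsv4-data0 and a scratch mount point /mnt (both names are illustrations), done as root on the MDS:

# mount -t nfs -o nfsv4,minorversion=2 nfsv4-data0:/ /mnt
# ls /mnt
# umount /mnt

The ls should show the "dsNN" directories (ds0 through ds19 by default). If they are not visible, fix the DS export before going further.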
On the MDS system:
The MDS is set up like a normal NFS server, plus the following:
- The MDS must have the data storage directory (/DSstore for example) of all the DS(s) mounted somewhere on the MDS file system (not within the exported subtree) using NFSv4.1/4.2 mounts, but without the "pnfs" option. For example, if there are four DS servers called nfsv4-data0, nfsv4-data1, nfsv4-data2 and nfsv4-data3, where each has a /DSstore storage file directory exported as above, the /etc/fstab lines might look like:
nfsv4-data0:/ /data0 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data1:/ /data1 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data2:/ /data2 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data3:/ /data3 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
Note that, unlike a normal NFSv4 mount, the "soft,retrans=2" options can be used, so that the MDS will detect failures of a DS. These two options are only useful for pNFS servers that implement mirroring.
- Once the above DS server mounts are done on the MDS, do:
# nfsstat -m
on the MDS and make sure the rsize and wsize are configured to the same size as:
# sysctl vfs.nfsd.srvmaxio
is set to on the MDS and all DSs. For the pNFS server to work correctly, these must all be the same. On a default FreeBSD-13 or FreeBSD-14 system, vfs.nfsd.srvmaxio will be 131072, but the rsize and wsize on the MDS will be 65536. To fix this on the MDS, add an entry like:
vfs.maxbcachebuf=131072
to the /boot/loader.conf file. Reboot and check that "nfsstat -m" shows the mounts of the DSs with rsize=131072,wsize=131072. If the MDS generates messages w.r.t. increasing the value of kern.ipc.maxsockbuf, you should add a line to /etc/sysctl.conf on the MDS to do so.
- Then, the "-p" and optionally the "-m" options are added to the command line options for the nfsd. The "-p" option indicates that it is a pNFS service and lists the DSs. The "-m" option enables mirroring and defines how many DSs will store the data file for a file on the MDS. Assuming there are four DSs mounted as above and they will store files for all MDS file systems with a mirroring level of 2:
In /etc/rc.conf I have:
rpcbind_enable="YES"
mountd_enable="YES"
nfs_server_enable="YES"
nfsv4_server_enable="YES"
nfs_server_flags="-u -t -n 32 -m 2 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3"
- You also need to make sure that mountd_flags has the "-S" option.
In /etc/sysctl.conf I have:
vfs.nfsd.enable_stringtouid=1
vfs.nfs.enable_uidtostring=1
Alternately, you can run the nfsuserd by adding this line to your /etc/rc.conf instead of the above lines in /etc/sysctl.conf:
nfsuserd_enable="YES"
In /etc/fstab I have:
nfsv4-data0:/ /data0 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data1:/ /data1 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data2:/ /data2 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data3:/ /data3 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
(Note that these mounts do not use the "pnfs" option and there is no need to have the nfscbd running on the MDS, since no callbacks from the DSs to the MDS are done. Also note that the paths in the "-p" argument are the "mounted-on" paths from the above mounts and not the directories on the DSs.)
If the MDS exports two file systems to clients, called /export1 and /export2, and the DSs are to be assigned to specific MDS file systems, the nfs_server_flags line might be:
nfs_server_flags="-u -t -n 32 -m 2 -p nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export1,nfsv4-data2:/data2#/export2,nfsv4-data3:/data3#/export2"
so that files on /export1 are stored on nfsv4-data0 and nfsv4-data1, whereas files on /export2 will be stored on nfsv4-data2 and nfsv4-data3.
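One hedged way to confirm that the per-file-system assignment is behaving as expected: create a file on each export via a client mount, then run pnfsdsfile(8) (described in more detail below) on the MDS copies. A sketch, where the file names are illustrations only:

# pnfsdsfile /export1/somefile
(should list nfsv4-data0 and/or nfsv4-data1)
# pnfsdsfile /export2/otherfile
(should list nfsv4-data2 and/or nfsv4-data3)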
A few notes:
- When shutting down the MDS, the nfsd threads must be killed before the data store mounts can be dismounted. For example:
# /etc/rc.d/nfsd stop
# umount /data0
# umount /data1
# umount /data2
# umount /data3
(This seems to work ok when you do a "reboot" from multiuser mode, but I'm not sure if the scripts guarantee this?) This should be done before the DS machines are shut down.
- If DSs have failed, it may not be possible to stop the kernel nfsd threads. (They will still be seen on a "ps axHl" command.) If this happens, it may be necessary for the system administrator to use the pnfsdskill(8) command with the "-f" option to disable the DSs. Once that happens, the nfsd threads should terminate, and that should allow the DSs to be umount(8)ed with the "-N" option.
- If you are making any use of NFSv4 ACLs, these must be enabled on all the exported file systems (both MDS and DS). For UFS, this means that all must be mounted with the "nfsv4acls" option.
- I believe there are some differences w.r.t. NFSv4 ACL semantics between UFS and ZFS, so I would avoid mixing the two file system types if NFSv4 ACLs are being used.
- For reasonable performance, this tunable should be increased on the MDS by putting this line in /boot/loader.conf:
vfs.nfsd.fhhashsize="1000"
For a production server with a reasonable amount of memory, you might want to increase this to 10000.

If the above has all worked, an NFSv4.1/4.2 mount with pNFS enabled should work, with the I/O being done directly to the DSs. ("nfsstat -E -s" on the MDS should show few Read or Write operations happening, since they are being done directly on the DSs.) For a non-mirrored configuration, a fairly recent FreeBSD system (FreeBSD-11 or FreeBSD-12) should be sufficient for the clients. For a mirrored configuration, the FreeBSD clients will need to be FreeBSD-12 or newer.

For the FreeBSD client:
In the /etc/rc.conf file:
nfscbd_enable="YES"
- Then you should be able to do the mount:
# mount -t nfs -o nfsv4,minorversion=2,pnfs nfsv4-mds:/export /mnt
(minorversion=1 for FreeBSD-12.)
(Assuming the MDS is called nfsv4-mds and "/export" is the exported file system on the MDS.)
- If this works, you can put an entry in your /etc/fstab like:
nfsv4-mds:/export /mnt nfs rw,nfsv4,minorversion=2,pnfs 0 0

If you are using a recent Linux client, the mount command looks about the same:
# mount -t nfs -o nfsvers=4 nfsv4-mds:/export /mnt
If you are using a mirrored pNFS configuration and you want pNFS to work, you will probably want a Linux 4.17-rc2 or later kernel. Kernels prior to 4.12 only handle Flexible File Layouts for NFSv3 DS servers; as such, they will fall back to doing all I/O through the MDS. A 4.12 kernel works, but I saw Linux client crashes. I don't see crashes with a 4.17-rc2 kernel, but I haven't tried Linux kernels in between these two versions.
Also, the Flexible File Layout driver in the Linux client in a 4.17-rc2 kernel does not handle a "tightly coupled" server correctly and uses the synthetic user/group in the AUTH_SYS credentials instead of the ones for the user doing the I/O. There are two ways to deal with this:
1 - Run a Linux system with a patched Flexible File Layout driver. I do not know the exact Linux kernel version that acquired the fix, but recent (maybe all) 5.n kernels are fixed.
OR
2 - Set the sysctl vfs.nfsd.flexlinuxhack=1, so that the layouts will be issued with a synthetic user/group of "0". This works around the problem, so long as you export the file systems on the DSs to the clients with "-maproot=root". (A sketch of this workaround follows.)
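A minimal sketch of option 2, reusing the /DSstore example from the DS setup above. On the MDS, in /etc/sysctl.conf:
vfs.nfsd.flexlinuxhack=1
and on each DS, the client export line in /etc/exports gains "-maproot=root":
/DSstore -sec=sys -maproot=root -network 192.168.1.0 -mask 255.255.255.0
(The subnet is just the example one used earlier; mountd must be reloaded after editing /etc/exports.)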
If this works, you should be ready for testing. You can monitor how it is working in a few ways.
- You can enable logging in the server by:
# sysctl vfs.nfsd.debuglevel=4
or on the client by:
# sysctl vfs.nfs.debuglevel=4
- You can capture packets and look at them in wireshark. To do this on the server:
# tcpdump -s 0 -w run.pcap host <NFS-client-host>
Then look at run.pcap in wireshark, which understands NFSv4.
- You can get some basic information from nfsstat. On the server:
# nfsstat -E -s
On the client:
# nfsstat -E -c

The command pnfsdsfile(8) can be used on files in the exported file system on the MDS to find out where the data storage for a file resides and to fix up the pnfsd.dsfile extended attribute, if needed.
- The MDS file should look normal, except for being empty. If this file happens to be abc.c and the pNFS file service is mirrored, the command:
# pnfsdsfile abc.c
abc.c:	nfsv4-data2	ds5/207508569ff983350c000000a9730200eec58e800000000000000000
	nfsv4-data3	ds5/207508569ff983350c000000a9730200eec58e800000000000000000
shows that the data files for abc.c are on nfsv4-data2 and nfsv4-data3 in subdirectory "ds5" with the file name "2075...".
- The DS file will have a 56-byte hexadecimal name and should have the size, ownership, mode and ACL of the MDS file. The contents of this file are the file's data.
- If the pnfsd.dsattr extended attribute somehow gets corrupted, you can remove it on the MDS file and the MDS server will recreate it once the file attributes are changed in any way. (It caches attributes for the data file.) For example, for a file called abc.c, done on the exported file system on the MDS:
# rmextattr system pnfsd.dsattr abc.c
Then it will be recreated when the file is accessed via a client NFS mount.

Backing up the pNFS store:
The easy way to back this up is to archive from an NFS client mount of it, since the files look "normal", with data as well as attributes. If, for example, you did this with "tar", you could recover the archive anywhere; it does not need to be recovered onto a pNFS store. If the MDS tree is being archived on the MDS itself, the system extended attributes must be saved/restored. (I'll admit I don't know how to do this at this time.)
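A minimal sketch of the client-side backup, assuming the pNFS service is mounted on a client at /mnt and the archive goes to /backup (both paths are illustrations only):

# tar -czf /backup/pnfs-store.tgz -C /mnt .

Because this reads through the NFS mount, the data comes from the DSs and the archive contains ordinary files that can be extracted anywhere.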
Handling of failed mirrored DSs:
When a mirrored DS fails, it can be disabled in one of three ways:
1 - The MDS detects a problem when trying to do proxy operations on the DS. This is why the DS servers are mounted on the MDS with the "soft,retrans=2" options. This can take a couple of minutes after the DS failure or network partitioning occurs.
2 - A pNFS client can report an I/O error on the DS to the MDS in the arguments for a LayoutReturn operation.
3 - The system administrator can perform the pnfsdskill(8) command on the MDS to disable it. If the system administrator does a pnfsdskill(8) and it fails with ENXIO (Device not configured), that normally means the DS was already disabled via #1 or #2. Since doing this is harmless, once a system administrator knows that there is a problem with a mirrored DS, doing the command is recommended.
As such, once a system administrator knows that a mirrored DS has malfunctioned or has been network partitioned, they should do the following as root/su on the MDS:
# pnfsdskill <mounted-on-path>
(If this fails with ENXIO (Device not configured), it normally isn't a problem and simply indicates that the DS has already been disabled.)
# umount -N <mounted-on-path>
Note that the <mounted-on-path> must be the exact mounted-on path string used when the DS was mounted on the MDS.
For example, if the DS was mounted on the MDS with:
# mount -t nfs -o nfsv4,minorversion=2,soft,retrans=2 nfsv4-data3:/ /data3
the above commands would be:
# pnfsdskill /data3
# umount -N /data3

Once the mirrored DS has been disabled, the pNFS service should continue to function, but file updates will only happen on the DS(s) that have not been disabled. Assuming two way mirroring, that implies the one DS of the pair stored in pnfsd.dsfile for the file on the MDS.

The next step is to clear the IP address in the pnfsd.dsfile extended attribute on all files on the MDS for the failed DS. This is done so that these stale data files won't be used when the repaired DS is brought back online. The command that clears the IP address is pnfsdsfile(8) with the "-r" option. For example:
# pnfsdsfile yyy.c
yyy.c:	nfsv4-data2.home.rick	ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
	nfsv4-data3.home.rick	ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
shows that this file has data files stored on nfsv4-data2 and nfsv4-data3. After nfsv4-data3 has been disabled as above, only nfsv4-data2 will be used. However, if nfsv4-data3 was brought back online without fixing this, the client(s) and MDS could access an out-of-date data file on nfsv4-data3. The command:
# pnfsdsfile -r nfsv4-data3 yyy.c
yyy.c:	nfsv4-data2.home.rick	ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
	0.0.0.0	ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
replaces nfsv4-data3 with the IPv4 address 0.0.0.0, so that nfsv4-data3 will not get used.
Normally this will be done within a find(1) command for all regular files in the exported directory tree, and it must be done on the MDS. When used with find(1), you will probably also want the "-q" option so that it won't spit out the results for every file. If the disabled/recovered DS is nfsv4-data3, the commands done on the MDS would be:
# cd <top-level-exported-directory>
# find . -type f -exec pnfsdsfile -q -r nfsv4-data3 {} \;
There is a problem with the above command if a file found by find(1) is renamed or unlinked before the pnfsdsfile(8) command is done on it. This should normally generate an error message. A simple unlink is harmless, but a link/unlink or rename might result in the file not having been processed under its new name. To check that all files have their IP addresses set to 0.0.0.0, these commands can be used (assuming the "sh" shell):
# cd <top-level-exported-directory>
# find . -type f -exec pnfsdsfile {} \; | sed "/nfsv4-data3/!d"
Any line(s) printed require the pnfsdsfile(8) command with "-r" to be done again. (In theory, a file could be renamed during the first command such that it gets missed, and then renamed again during the second command such that it gets missed again; however, I think this is highly unlikely. You can run the second command repeatedly if you feel it necessary. The only way to absolutely guarantee success is to shut down the pNFS service during the recovery, but since the recovery may take minutes to hours, that probably isn't feasible.)
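A variant of the same check that simply counts how many files still refer to the failed DS (a sketch, again assuming the "sh" shell and the nfsv4-data3 example):

# find . -type f -exec pnfsdsfile {} \; | grep -c nfsv4-data3

A count of 0 means the tree is clean.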
Once this is done, the replaced/repaired DS can be brought back online. It should have empty dsNN directories under the exported storage directory, just like it did when first set up. Mount it on the MDS exactly as you did before disabling it. For the nfsv4-data3 example, the command would be:
# mount -t nfs -o nfsv4,minorversion=2,soft,retrans=2 nfsv4-data3:/ /data3
Then restart the nfsd to re-enable the DS:
# /etc/rc.d/nfsd restart
Now, new files can be stored on nfsv4-data3, but files with the IP address zeroed out will not yet use the repaired DS (nfsv4-data3). The next step is to go through the exported file tree and, for each of the files with an IPv4 address of 0.0.0.0, copy the file data to the repaired DS and re-enable use of this mirror for it.
The command for copying the file data for one MDS file is pnfsdscopymr(8), and it will also normally be used in a find(1). This will take a while, since the kernel function performing the copy has to do several steps, as follows:
- The MDS file's vnode is locked, blocking LayoutGet operations.
- Issuing of read/write layouts for the file is disabled via the nfsdontlist, so that they will remain disabled after the MDS file's vnode is unlocked.
- The nfsrv_recalllist is set up so that recall of read/write layouts can be done.
- The MDS file's vnode is unlocked, so that the client(s) can perform proxied writes, LayoutCommits and LayoutReturns for the file when completing the LayoutReturn requested by the LayoutRecall callback.
- A LayoutRecall callback is issued for all read/write layouts, and the kernel waits for them to be returned. (If the LayoutRecall callback replies NFSERR_NOMATCHLAYOUT, they are gone and no LayoutReturn is needed.)
- The MDS file's vnode is exclusively locked. This ensures that no proxied writes are in progress or can occur during the DS file copy. It also blocks Setattr operations.
- The file is created on the recovered mirror.
- The file is copied from the operational DS.
- Any ACL is copied from the MDS file to the new DS file.
- The modify time of the new DS file is set to that of the MDS file.
- The extended attribute for the MDS file is updated.
- Issuing of read/write layouts is re-enabled by deleting the nfsdontlist entry.
- The MDS file's vnode is unlocked, allowing operations to continue normally, since the file is now on the mirror again.
This is done for every regular file in the directory tree. For the example case, the commands on the MDS would be:
# cd <top-level-exported-directory>
# find . -type f -exec pnfsdscopymr -r /data3 {} \;
When this completes, the recovery should be complete, or at least nearly so. As noted above, if a link/unlink or rename occurs on a file name while the above find(1) is in progress, the file may not get copied. To check for any file(s) not yet copied, the commands are:
# cd <top-level-exported-directory>
# find . -type f -exec pnfsdsfile {} \; | sed "/0\.0\.0\.0/!d"
(The above command is looking for any file that still has 0.0.0.0 as a DS IP address.) If this command prints out any file name(s), these files must have the pnfsdscopymr(8) command done on them again to complete the recovery:
# pnfsdscopymr -r /data3 <file>
If there are any errors printed out, these files need the command redone on them. If repeated attempts fail with:
pnfsdscopymr: Copymr failed for file <file>: Device not configured
it may mean that there is a Read/Write layout for this file that has not been returned. All that can be done to fix this is to restart the nfsd, once you are convinced that the file is no longer being written by any client.
All of these commands are designed to be done while the pNFS service is running. The pnfsdscopymr(8) command can be safely re-run on files, as it recognizes cases where the file does not need to be copied.
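Since pnfsdscopymr(8) is safe to re-run, one hedged sketch for the retry pass is to redo the copy only for files that still carry a 0.0.0.0 entry (assuming the "sh" shell and the /data3 example):

# find . -type f -exec sh -c 'pnfsdsfile "$1" | grep -q "0\.0\.0\.0" && pnfsdscopymr -r /data3 "$1"' sh {} \;

This just combines the check and the copy from the steps above into one pass.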
Switching from a non-mirrored to a mirrored DS configuration:
Once the nfsd is restarted on the MDS with the "-m 2" option, mirroring of newly created files will be done. To create mirrors for old files, the following commands can be used on the MDS:
# cd <top-level-exported-directory>
# find . -type f -exec pnfsdscopymr {} \;
This will mirror each unmirrored file on one of the other DSs. If you wish to mirror file(s) to a specific DS, you can use pnfsdsfile(8) with the "-m" option to add "0.0.0.0" entries and then use pnfsdscopymr(8) with the "-r" option, as above for the recovery of a repaired DS.

Migrating a data file to a different DS:
Let's assume the system administrator wishes to move the data file for "xxx.c" from nfsv4-data2 to nfsv4-data3 in our example. The command on the MDS is:
# pnfsdscopymr -m /data2 /data3 xxx.c
pnfsdscopymr(8) just does some sanity checking and one system call to move the data file. A load or storage balancer could easily do the same system call. I don't plan on implementing such a tool, but hopefully others will someday.

Please let me know how any testing goes, rick