/*
Software RAID (GRAID/GEOM_RAID) module design.

Software RAID functionality will be implemented in the form of a "RAID" GEOM
class.  The class will include a common framework and incorporate two sets of
pluggable modules.  One set will implement handling of the different metadata
formats (e.g., Intel or DDF), while the other will implement the different
data transformation algorithms (e.g., RAID0, RAID1, RAID0+1, etc.).
Pluggable modules will register themselves via an API that allows them to be
loaded and unloaded in the kernel and lets the base GEOM class avoid specific
knowledge of all the possible classes.  The GEOM tasting process will drive
the creation of the volumes, as well as the state of each volume (e.g., all
disks present vs. missing members).

Depending on the specific setup, several instances of the RAID class can
exist.  Separate instances will be used for different metadata formats, or in
cases when the metadata allows independent disk groups to be identified
reliably.  For example, Intel metadata allows defining several RAID volumes
on the same group of disks, but provides a unique ID separating these volumes
from other groups.  DDF metadata, on the other hand, does not demand strong
isolation of disk groups.  Here are some examples:

    RAID0       RAID0   RAID1             RAID0+1         RAID3
      |           |       |                  |              |
+---prov1---+ +-prov2---prov3-+ +----------prov4----------prov5-------+
|node1      | |     node2     | |node3                                 |
|  volume1  | |volume1 volume2| |       --volume1--      volume2      |
|   /   \   | |  |   \ /   |  | |       /  |   |  \      /  |  \      |
| sd1   sd2 | |sd1 sd2 sd3 sd4| |  sd1   sd2   sd3   sd4   sd6   sd7  |
|  |     |  | |  \ /     \ /  | |   |     |     |     |     |     |   |
+cons1-cons2+ +-cons3---cons4-+ +-cons3-cons4-cons3-cons4-cons5-cons6-+
   |     |        |       |        |     |     |     |     |     |
 disk1 disk2    disk3   disk4    disk5 disk6 disk7 disk8 disk9 disk10

Node1 implements a RAID0 volume on two disks.  Node2 implements two volumes
(RAID0 and RAID1) sharing parts of the same disks.  Node3 implements two
volumes (RAID0+1 and RAID3) sharing one of the disks.

In this design, stacking of transformation algorithms isn't directly
supported.  This would add considerable complexity to the code, and is rarely
used in the real world apart from RAID0+1.  RAID0+1 can be implemented with
an indirect approach where both the striping and the mirroring
transformations are applied in the RAID0+1 transformation module.  According
to conversations with Scott Long, this is desirable anyway because Intel soft
RAID views RAID0+1 as one volume, not a nested set of volumes, and volume
management will be simplified if all the disks are known to one transform.

Each node includes a request/event queue, a worker thread and a set of locks
to serialize access to the common data structures.  Iteration through the
list of nodes (for example, during new disk tasting) is protected by the GEOM
topology lock.

A node's life cycle starts when a new device taste is initiated by GEOM.
During GEOM tasting, the common code iterates through the list of registered
metadata modules, in order of precedence, and calls each module's md_taste()
method until one returns a success status.  Tasting happens with the GEOM
topology lock held.  The md_taste() method checks various vendor-specific
criteria, then reads and checks the metadata.  If it succeeds, it traverses
the list of existing RAID nodes, trying to find one with the same metadata
format and other relevant format-specific factors.  If no matching node is
found, one is created.  After this, the topology lock gets dropped and the
node lock is obtained.  md_taste() then checks the list of volumes in the
node and creates missing ones from the metadata.
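To illustrate the intended flow, here is a sketch (not final code) of how the
shared taste routine could walk the registered metadata classes in priority
order.  The g_raid_md_classes list head layout matches the declarations
below; the G_RAID_MD_TASTE() KOBJ wrapper and the M_RAID malloc type are
assumptions of this example, and matching of existing nodes and error
handling are omitted:

    static struct g_geom *
    g_raid_taste(struct g_class *mp, struct g_provider *pp, int flags)
    {
            struct g_raid_md_class *class;
            struct g_raid_md_object *obj;
            struct g_geom *geom;
            int error;

            g_topology_assert();
            TAILQ_FOREACH(class, &g_raid_md_classes, mdc_list) {
                    // Instantiate a metadata object of this class and let
                    // it examine the provider; the first class that
                    // recognizes its on-disk format wins.
                    obj = (struct g_raid_md_object *)kobj_create(
                        (kobj_class_t)class, M_RAID, M_WAITOK);
                    error = G_RAID_MD_TASTE(obj, mp, pp, &geom);
                    if (error == 0)
                            return (geom);
                    kobj_delete((kobj_t)obj, M_RAID);
            }
            return (NULL);
    }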
The new disk's subdisks get connected to the RAID volume, and an event is
sent to notify the transformation code about the new device.

After the volume data is filled in from the metadata, the shared code
searches for and initializes the proper transformation algorithm.  It
iterates through the list of registered transformation modules, calling the
tr_taste() method of each one until a match is found.  The tr_taste() method
analyzes the vendor-independent volume data provided by the metadata module
(RAID type/subtype, number of disks, etc.).  If it is able to handle the
volume, it initializes its state and returns a success status.

When a physical event happens (such as a subdisk joining or leaving a
volume), the transformation module receives a notification via a call to the
tr_event() method.  That method is called from the worker thread with the
node lock held.  The transformation module does the appropriate housekeeping,
updates its internal state and, if appropriate, writes new metadata.

During device tasting, md_taste() may detect that the volume configuration
read from an earlier attached drive is older than that of the new one,
indicating that the drive was offline during a critical configuration update
and so may be faulty.  In such a case, all such drives will be marked
OFFLINE, their subdisks dropped from the volume and the new drive connected.
The transformation layer will receive the respective notifications.

When created, a volume stays in the STARTING state and has no GEOM provider.
During volume discovery and initialization, the transformation module shifts
the volume to other states.  After all expected disks are connected, the
volume shifts to the OPTIMAL state.  If a disk fails to attach within the
specified timeout, but the disks present are enough for full data access, the
volume is shifted to the DEGRADED state.  When the volume is shifted to the
DEGRADED or the OPTIMAL state, the GEOM provider for the volume gets created
by the shared code, allowing normal volume operation.  If the disks present
are insufficient for operation, or a volume loses a disk and the remaining
disks are insufficient for the volume, the volume shifts to the BROKEN state
and, if one exists, the provider gets destroyed.

During normal operation, data requests from the registered provider get
queued to the worker thread queue.  The worker thread fetches them one by one
and calls the transformation module's tr_iostart() method.  This method does
the necessary data transformations and generates requests to the underlying
disks by calling the subdisk_iostart() function, which handles disk
partitioning (subdisks, if present) and forwards the request to the
respective GEOM disk consumer.  After the I/Os queued to the disk have
completed, a completion event is queued to the worker thread.  The worker
thread identifies the subdisk the request belongs to and forwards it to the
appropriate transformation module via a tr_iodone() call.  The transformation
module undoes any transformation, checks the original request and, when it is
complete, returns it to GEOM via the iodone() call.

As the status of volumes or subdisks changes (a disk failed, the array became
dirty, synchronization started, etc.), the on-disk metadata may need to be
updated.  The transformation module updates the volume or subdisk state and
calls the md_write() method, flushing the metadata to all the affected disks.
The new metadata is constructed by merging the metadata read by md_taste()
with the present volume/disk/subdisk statuses.

All duties related to array consistency (e.g., synchronization, bad sector
recovery, checking, etc.) are implemented by the transformation module.
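To illustrate the normal I/O path described above, here is a sketch of its
front half only (queueing and dispatch); the queue handling is simplified and
the G_RAID_TR_IOSTART() wrapper name and the use of the provider's private
field are assumptions of this example:

    static void
    g_raid_start(struct bio *bp)
    {
            struct g_raid_softc *sc;

            sc = bp->bio_to->geom->softc;
            mtx_lock(&sc->sc_queue_mtx);
            bioq_insert_tail(&sc->sc_queue, bp);
            mtx_unlock(&sc->sc_queue_mtx);
            wakeup(sc);                     // Kick the worker thread.
    }

    // Called from the worker thread with the node lock (sc_lock) held.
    static void
    g_raid_dispatch(struct g_raid_softc *sc, struct bio *bp)
    {
            struct g_raid_volume *vol;

            // Assumes the provider's private field points at its volume.
            vol = bp->bio_to->private;
            G_RAID_TR_IOSTART(vol->v_tr, bp);
    }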
These consistency operations are triggered by different events, such as
subdisk state changes, I/O errors or external commands.  During these times,
the transformation code may want exclusive access to some subset of the
volume.  For that, the shared code implements a mechanism for deferring
accesses to such portions of the volume to a separate queue.  The
transformation module calls the lock() function, specifying the start offset
and length of the portion of the volume to be blocked.  If there are no
active write requests in the range, the call returns a success status.  If
there are requests in the range, the call returns an error, and a callback is
called once the range becomes clear.  Any write request affecting a locked
range is delayed in the v_locked queue until the unlock() function is called
for the range and there are no other related locks.  (A sketch of this
mechanism follows the function lists below.)

metadata methods:

md_taste():
 - checks external things, such as the controller PCI ID;
 - reads metadata from the specified provider and checks it;
 - iterates through configured nodes, looking for a matching type and some
   additional parameters, such as the array ID;
 - if no matching node is found, creates one;
 - creates volumes and disks inside the node.

md_event():
 - receives disk events (e.g., a disappeared disk);
 - updates subdisk statuses and sends events to them.

md_write():
 - writes metadata to the specified disk, mixing original and new data.

md_free():
 - frees memory allocated for storing metadata.

transformer methods:

tr_taste():
 - checks the ability of the current transformer to handle the volume;
 - allocates/initializes internal structures;
 - updates volume/subdisk statuses.

tr_event():
 - handles subdisk status change events (attach/detach/...);
 - updates volume/subdisk statuses;
 - starts/stops synchronization.

tr_start():
 - starts volume operation if it hasn't started automatically yet.

tr_stop():
 - stops volume operation (rebuild, etc.).

tr_iostart():
 - called on the way down for the volume;
 - manages the forward transformation and generates requests to disks.

tr_iodone():
 - called on the way up for a subdisk;
 - manages the backward transformation and reports completion status when
   ready;
 - for requests initiated by the transformation layer (synchronization), may
   generate new requests;
 - depending on the result statuses, may generate volume and subdisk status
   changes.

tr_locked():
 - callback method for lock(); reports that there are no active write
   requests in the requested range.

tr_free():
 - frees internal structures.

body functions:

create_node():
 - creates a new GEOM node;
 - launches the worker thread.
 Called by md_taste() with the GEOM topology lock held.

create_volume():
 - creates a new volume.
 Called by md_taste() with the node lock held.

create_disk():
 - creates a new disk.
 Called by md_taste() with the node lock held.

destroy_node():
 - checks that the node has no volumes and disks; if it has, destroys them;
 - stops the worker thread;
 - destroys the GEOM node.

destroy_volume():
 - tears down and destroys a volume.

destroy_disk():
 - destroys a disk.

subdisk_iostart():
 - called on the way down for a subdisk (usually by tr_iostart()).

iodone():
 - called on the way up for the volume (usually by tr_iodone()).

lock():
 - locks the specified range from writing.  Any new write request in the
   range will be delayed.  If such a request is already in progress, the
   range is still blocked, but the function returns a specific status.  When
   the write completes, tr_locked() gets called.

unlock():
 - unblocks the specified range, resuming delayed requests.
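Here is the sketch of the range-locking mechanism referred to above.  It only
illustrates the intended use of struct g_raid_lock and the v_inflight/v_locks
fields declared below; g_raid_bio_overlaps() is a hypothetical helper
comparing [offset, offset + length) ranges, and the M_RAID malloc type is an
assumption of this example:

    static int
    g_raid_lock_range(struct g_raid_volume *vol, off_t off, off_t len,
        void *argp)
    {
            struct g_raid_lock *lp;
            struct bio *bp;

            lp = malloc(sizeof(*lp), M_RAID, M_WAITOK | M_ZERO);
            lp->l_offset = off;
            lp->l_length = len;
            lp->l_callback_arg = argp;
            LIST_INSERT_HEAD(&vol->v_locks, lp, l_next);

            // If a write already in flight overlaps the range, report
            // EBUSY now; tr_locked() fires later, once the last such
            // write completes and the range has drained.
            TAILQ_FOREACH(bp, &vol->v_inflight.queue, bio_queue) {
                    if (bp->bio_cmd == BIO_WRITE &&
                        g_raid_bio_overlaps(bp, off, len))
                            return (EBUSY);
            }
            return (0);     // Range is clear, lock granted immediately.
    }

New writes that fall into a locked range are placed on the v_locked queue and
re-issued when unlock() removes the last lock covering them.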
Mirror
------

All write requests are executed on all subdisks of the volume that are in the
ACTIVE and SYNCHRONIZING states.  The operation completes when all subdisks
report completion.  A success status is returned if at least one subdisk
returned success.  Subdisks reporting errors are either dropped from the
array or marked DIRTY to avoid reading from them unless necessary.

Read requests are executed by the subdisk with the lowest average load that
is not marked DIRTY.  If a read request fails, it is retried on the other
subdisks until it succeeds.  Subdisk areas with errors are added to a list so
that an explicit synchronization process can be run on them, allowing the
disk to rewrite or reallocate the sectors.  If a disk fails to synchronize,
it will be dropped from the array or marked DIRTY.

Besides the synchronization of read-error areas, full background disk
synchronization will be implemented.  It will be used in cases of disk
replacement or, optionally, when the array is desynchronized due to an
unclean shutdown.  If synchronization completes successfully, the subdisk is
shifted to the ACTIVE state and the DIRTY flag is removed.

Tunables:

kern.geom.raid.auto_insert = {0, 1}
    When true, a disk inserted on the same PCI controller that has a degraded
    mirror is automatically added to that mirror.  When false, the disk is
    ignored.  If the disk is too small, it won't be inserted into the mirror.

kern.geom.raid.auto_sync = {0, 1}
    When true, subdisks found to be dirty automatically start the background
    synchronization process.

kern.geom.raid.read_master = {0, 1}
    When true, all reads are forced to go to the master drive.  When false,
    reads are multiplexed between the drives of the mirror.
    [[ when we come up dirty, we default to this until we can figure out if
    we need to synchronize or not ]]

kern.geom.raid.write_on_read_fail = {0, 1}
    When false, a read failure will force the drive out of the mirror.  When
    true, the other member of the mirror will be read to get the data, and
    the mirror transform will try to write the data back to force a
    bad-sector reallocation and 'self heal' the mirror before going to the
    extreme step of breaking the mirror.
*/

/*
 * KOBJ parent class of metadata processing modules.
 */
struct g_raid_md_class {
	KOBJ_CLASS_FIELDS;
	int			 mdc_priority;
	TAILQ_ENTRY(g_raid_md_class) mdc_list;
};

/*
TAILQ_HEAD(, g_raid_md_class) g_raid_md_classes =
    TAILQ_HEAD_INITIALIZER(g_raid_md_classes);
*/

/*
 * KOBJ instance of metadata processing module.
 */
struct g_raid_md_object {
	KOBJ_FIELDS;
	struct g_raid_md_class	*mdo_class;
	struct g_raid_softc	*mdo_softc;	/* Back-pointer to softc. */
};

/*
 * KOBJ parent class of data transformation modules.
 */
struct g_raid_tr_class {
	KOBJ_CLASS_FIELDS;
	int			 trc_priority;
	TAILQ_ENTRY(g_raid_tr_class) trc_list;
};

/*
TAILQ_HEAD(, g_raid_tr_class) g_raid_tr_classes =
    TAILQ_HEAD_INITIALIZER(g_raid_tr_classes);
*/

/*
 * KOBJ instance of data transformation module.
 */
struct g_raid_tr_object {
	KOBJ_FIELDS;
	struct g_raid_tr_class	*tro_class;
	struct g_raid_volume	*tro_volume;	/* Back-pointer to volume. */
};

struct g_raid_lock {
	off_t			 l_offset;
	off_t			 l_length;
	void			*l_callback_arg;
	LIST_ENTRY(g_raid_lock)	 l_next;
};

#define	G_RAID_EVENT_WAIT	0x01
#define	G_RAID_EVENT_VOLUME	0x02
#define	G_RAID_EVENT_SUBDISK	0x04
#define	G_RAID_EVENT_DISK	0x08
#define	G_RAID_EVENT_DONE	0x10
struct g_raid_event {
	void			*e_tgt;
	int			 e_event;
	int			 e_flags;
	int			 e_error;
	TAILQ_ENTRY(g_raid_event) e_next;
};

#define	G_RAID_DISK_S_NONE		0x00
#define	G_RAID_DISK_S_ACTIVE		0x01
#define	G_RAID_DISK_S_SPARE		0x02
#define	G_RAID_DISK_S_OFFLINE		0x03

#define	G_RAID_DISK_E_DISCONNECTED	0x01
struct g_raid_disk {
	struct g_raid_softc	*d_softc;	/* Back-pointer to softc. */
	struct g_consumer	*d_consumer;	/* GEOM disk consumer. */
	void			*d_md_data;	/* Disk's metadata storage. */
	u_int			 d_state;	/* Disk state. */
	uint64_t		 d_flags;	/* Additional flags. */
	u_int			 d_load;	/* Disk average load. */
	off_t			 d_last_offset;	/* Last head offset. */
	LIST_HEAD(, g_raid_subdisk) d_subdisks;	/* List of subdisks. */
	LIST_ENTRY(g_raid_disk)	 d_next;	/* Next disk in the node. */
};

#define	G_RAID_SUBDISK_S_NONE		0x00
#define	G_RAID_SUBDISK_S_NEW		0x01
#define	G_RAID_SUBDISK_S_ACTIVE		0x02
#define	G_RAID_SUBDISK_S_STALE		0x03
#define	G_RAID_SUBDISK_S_SYNCHRONIZING	0x04
#define	G_RAID_SUBDISK_S_DISCONNECTED	0x05
#define	G_RAID_SUBDISK_S_DESTROY	0x06

#define	G_RAID_SUBDISK_E_NEW		0x01
#define	G_RAID_SUBDISK_E_DISCONNECTED	0x02

struct g_raid_subdisk {
	struct g_raid_softc	*sd_softc;	/* Back-pointer to softc. */
	struct g_raid_disk	*sd_disk;	/* Where this subdisk lives. */
	struct g_raid_volume	*sd_volume;	/* Volume this subdisk is a
						   part of. */
	off_t			 sd_offset;	/* Offset on the disk. */
	off_t			 sd_size;	/* Size on the disk. */
	u_int			 sd_pos;	/* Position in volume. */
	u_int			 sd_state;	/* Subdisk state. */
	LIST_ENTRY(g_raid_subdisk) sd_next;	/* Next subdisk on disk. */
};

#define	G_RAID_MAX_SUBDISKS	16
#define	G_RAID_MAX_VOLUMENAME	16

#define	G_RAID_VOLUME_S_STARTING	0x00
#define	G_RAID_VOLUME_S_BROKEN		0x01
#define	G_RAID_VOLUME_S_DEGRADED	0x02
#define	G_RAID_VOLUME_S_SUBOPTIMAL	0x03
#define	G_RAID_VOLUME_S_OPTIMAL		0x04
#define	G_RAID_VOLUME_S_UNSUPPORTED	0x05
#define	G_RAID_VOLUME_S_STOPPED		0x06
#define	G_RAID_VOLUME_S_ALIVE(s)			\
	((s) == G_RAID_VOLUME_S_DEGRADED ||		\
	 (s) == G_RAID_VOLUME_S_SUBOPTIMAL ||		\
	 (s) == G_RAID_VOLUME_S_OPTIMAL)

#define	G_RAID_VOLUME_E_DOWN		0x00
#define	G_RAID_VOLUME_E_UP		0x01
#define	G_RAID_VOLUME_E_START		0x10

#define	G_RAID_VOLUME_RL_RAID0		0x00
#define	G_RAID_VOLUME_RL_RAID1		0x01
#define	G_RAID_VOLUME_RL_RAID3		0x03
#define	G_RAID_VOLUME_RL_RAID4		0x04
#define	G_RAID_VOLUME_RL_RAID5		0x05
#define	G_RAID_VOLUME_RL_RAID6		0x06
#define	G_RAID_VOLUME_RL_RAID10		0x0a
#define	G_RAID_VOLUME_RL_RAID1E		0x11
#define	G_RAID_VOLUME_RL_SINGLE		0x0f
#define	G_RAID_VOLUME_RL_CONCAT		0x1f
#define	G_RAID_VOLUME_RL_RAID5E		0x15
#define	G_RAID_VOLUME_RL_RAID5EE	0x25
#define	G_RAID_VOLUME_RL_UNKNOWN	0xff

#define	G_RAID_VOLUME_RLQ_NONE		0x00
#define	G_RAID_VOLUME_RLQ_UNKNOWN	0xff

struct g_raid_volume {
	struct g_raid_softc	*v_softc;	/* Back-pointer to softc. */
	struct g_provider	*v_provider;	/* GEOM provider. */
	struct g_raid_subdisk	 v_subdisks[G_RAID_MAX_SUBDISKS];
						/* Subdisks of this volume. */
	void			*v_md_data;	/* Volume's metadata storage. */
	struct g_raid_tr_object	*v_tr;		/* Transformation object. */
	char			 v_name[G_RAID_MAX_VOLUMENAME];
						/* Volume name. */
	u_int			 v_state;	/* Volume state. */
	u_int			 v_raid_level;	/* Array RAID level. */
	u_int			 v_raid_level_qualifier; /* RAID level det. */
	u_int			 v_disks_count;	/* Number of disks in array. */
	u_int			 v_strip_size;	/* Array strip size. */
	u_int			 v_sectorsize;	/* Volume sector size. */
	off_t			 v_mediasize;	/* Volume media size. */
	struct bio_queue_head	 v_inflight;	/* In-flight write requests. */
	struct bio_queue_head	 v_locked;	/* Blocked I/O requests. */
	LIST_HEAD(, g_raid_lock) v_locks;	/* List of locked regions. */
	int			 v_idle;	/* DIRTY flags removed. */
	time_t			 v_last_write;	/* Time of the last write. */
	u_int			 v_writes;	/* Number of active writes. */
	struct root_hold_token	*v_rootmount;	/* Root mount delay token. */
	struct callout		 v_start_co;	/* STARTING state timer. */
	int			 v_starting;	/* STARTING state timer armed. */
	int			 v_stopping;	/* Volume is stopping. */
	int			 v_provider_open; /* Number of opens. */
	LIST_ENTRY(g_raid_volume) v_next;	/* List of volumes entry. */
};
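/*
 * Illustration only: how a RAID1 transformation module could use the
 * structures above to pick the subdisk for a read request, following the
 * "Mirror" section of the design notes (least loaded usable subdisk).
 * The function name is invented for this sketch, and a real implementation
 * would also skip subdisks marked DIRTY.
 *
 *	static struct g_raid_subdisk *
 *	g_raid_tr_raid1_select_read_disk(struct g_raid_volume *vol)
 *	{
 *		struct g_raid_subdisk *sd, *best;
 *		u_int i;
 *
 *		best = NULL;
 *		for (i = 0; i < vol->v_disks_count; i++) {
 *			sd = &vol->v_subdisks[i];
 *			// Only ACTIVE subdisks are safe to read from.
 *			if (sd->sd_state != G_RAID_SUBDISK_S_ACTIVE)
 *				continue;
 *			if (best == NULL ||
 *			    sd->sd_disk->d_load < best->sd_disk->d_load)
 *				best = sd;
 *		}
 *		return (best);
 *	}
 */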
struct g_raid_softc {
	struct g_raid_md_object	*sc_md;		/* Metadata object. */
	struct g_geom		*sc_geom;	/* GEOM class instance. */
	uint64_t		 sc_flags;	/* Additional flags. */
	LIST_HEAD(, g_raid_volume) sc_volumes;	/* List of volumes. */
	LIST_HEAD(, g_raid_disk) sc_disks;	/* List of disks. */
	struct sx		 sc_lock;	/* Main node lock. */
	struct proc		*sc_worker;	/* Worker process. */
	struct mtx		 sc_queue_mtx;	/* Worker queues lock. */
	TAILQ_HEAD(, g_raid_event) sc_events;	/* Worker events queue. */
	struct bio_queue_head	 sc_queue;	/* Worker I/O queue. */
	int			 sc_stopping;	/* Node is stopping. */
};
#define	sc_name	sc_geom->name
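/*
 * Registration sketch.  A metadata module (a hypothetical Intel one here)
 * is expected to provide a kobj_method_t table for the md_*() methods
 * listed in the design notes above, plus a g_raid_md_class that gets linked
 * into g_raid_md_classes at module load time, ordered by mdc_priority.
 * The method descriptors would come from a KOBJ interface definition; their
 * names and the initializer layout below are assumptions of this sketch,
 * not a fixed API.
 *
 *	static kobj_method_t g_raid_md_intel_methods[] = {
 *		KOBJMETHOD(g_raid_md_taste,	g_raid_md_taste_intel),
 *		KOBJMETHOD(g_raid_md_event,	g_raid_md_event_intel),
 *		KOBJMETHOD(g_raid_md_write,	g_raid_md_write_intel),
 *		KOBJMETHOD(g_raid_md_free,	g_raid_md_free_intel),
 *		{ 0, 0 }
 *	};
 *
 *	static struct g_raid_md_class g_raid_md_intel_class = {
 *		"Intel",			// KOBJ_CLASS_FIELDS: name
 *		g_raid_md_intel_methods,	// KOBJ_CLASS_FIELDS: methods
 *		sizeof(struct g_raid_md_object),// KOBJ_CLASS_FIELDS: size
 *		.mdc_priority = 100
 *	};
 */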