/*
Software RAID (GRAID/GEOM_RAID) module design.

Software RAID functionality will be implemented in the form of a "RAID" GEOM
class.  The class will include a common framework and incorporate two sets of
pluggable modules.  One set will implement handling of the different metadata
formats (e.g., Intel or DDF), while the other will implement the different
data transformation algorithms (e.g., RAID0, RAID1, RAID0+1, etc.).
Pluggable modules will register themselves via an API that allows them to be
loaded and unloaded in the kernel and lets the base GEOM class avoid specific
knowledge of all the possible classes.  The GEOM tasting process will drive
the creation of the volumes, as well as the state of each volume (e.g., all
disks present vs. missing members).

Depending on the specific setup, several instances of the RAID class can
exist.  Separate instances will be used for different metadata formats, or in
cases when the metadata allows independent disk groups to be identified
reliably.  For example, Intel metadata allows defining several RAID volumes
on the same group of disks, but provides a unique ID separating these volumes
from other groups.  DDF metadata, on the other hand, does not demand strong
isolation of disk groups.  Here are some examples:

    RAID0       RAID0   RAID1             RAID0+1         RAID3
      |           |       |                  |              |
+---prov1---+ +-prov2---prov3-+ +----------prov4----------prov5-------+
|node1      | |     node2     | |node3                                 |
|  volume1  | |volume1 volume2| |       --volume1--      volume2      |
|   /   \   | |  |   \ /   |  | |       /  |   |  \      /  |  \      |
| sd1   sd2 | |sd1 sd2 sd3 sd4| |  sd1   sd2   sd3   sd4   sd6   sd7  |
|  |     |  | |  \ /     \ /  | |   |     |     |     |     |     |   |
+cons1-cons2+ +-cons3---cons4-+ +-cons3-cons4-cons3-cons4-cons5-cons6-+
   |     |        |       |        |     |     |     |     |     |
 disk1 disk2    disk3   disk4    disk5 disk6 disk7 disk8 disk9 disk10

Node1 implements a RAID0 volume on two disks.  Node2 implements two volumes
(RAID0 and RAID1) sharing parts of the same disks.  Node3 implements two
volumes (RAID0+1 and RAID3) sharing one of the disks.

In this design, stacking of transformation algorithms isn't directly
supported.  This would add considerable complexity to the code, and is rarely
used in the real world apart from RAID0+1.  RAID0+1 can be implemented with
an indirect approach where both the striping and the mirroring
transformations are applied in the RAID0+1 transformation module.  According
to conversations with Scott Long, this is desirable anyway because Intel soft
RAID views RAID0+1 as one volume, not a nested set of volumes, and volume
management will be simplified if all the disks are known to one transform.

Each node includes a request/event queue, a worker thread and a set of locks
to serialize access to the common data structures.  Iteration through the
list of nodes (for example, during new disk tasting) is protected by the GEOM
topology lock.

A node's life cycle starts when a new device taste is initiated by GEOM.
During GEOM tasting, the common code iterates through the list of registered
metadata modules, in order of precedence, and calls each module's md_taste()
method until one returns a success status.  Tasting happens with the GEOM
topology lock held.  The md_taste() method checks various vendor-specific
criteria, then reads and checks the metadata.  If it succeeds, it traverses
the list of existing RAID nodes, trying to find one with the same metadata
format and other relevant format-specific factors.  If no matching node is
found, one is created.  After this, the topology lock gets dropped and the
node lock is obtained.  md_taste() then checks the list of volumes in the
node and creates missing ones from the metadata.
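To illustrate the intended flow, here is a sketch (not final code) of how the
shared taste routine could walk the registered metadata classes in priority
order.  The g_raid_md_classes list head layout matches the declarations
below; the G_RAID_MD_TASTE() KOBJ wrapper and the M_RAID malloc type are
assumptions of this example, and matching of existing nodes and error
handling are omitted:

    static struct g_geom *
    g_raid_taste(struct g_class *mp, struct g_provider *pp, int flags)
    {
            struct g_raid_md_class *class;
            struct g_raid_md_object *obj;
            struct g_geom *geom;
            int error;

            g_topology_assert();
            TAILQ_FOREACH(class, &g_raid_md_classes, mdc_list) {
                    // Instantiate a metadata object of this class and let
                    // it examine the provider; the first class that
                    // recognizes its on-disk format wins.
                    obj = (struct g_raid_md_object *)kobj_create(
                        (kobj_class_t)class, M_RAID, M_WAITOK);
                    error = G_RAID_MD_TASTE(obj, mp, pp, &geom);
                    if (error == 0)
                            return (geom);
                    kobj_delete((kobj_t)obj, M_RAID);
            }
            return (NULL);
    }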
The new disk's subdisks get connected to the RAID volume, and an event is
sent to notify the transformation code about the new device.

After the volume data is filled in from the metadata, the shared code
searches for and initializes the proper transformation algorithm.  It
iterates through the list of registered transformation modules, calling the
tr_taste() method of each one until a match is found.  The tr_taste() method
analyzes the vendor-independent volume data provided by the metadata module
(RAID type/subtype, number of disks, etc.).  If it is able to handle the
volume, it initializes its state and returns a success status.

When a physical event happens (such as a subdisk joining or leaving a
volume), the transformation module receives a notification via a call to the
tr_event() method.  That method is called from the worker thread with the
node lock held.  The transformation module does the appropriate housekeeping,
updates its internal state and, if appropriate, writes new metadata.

During device tasting, md_taste() may detect that the volume configuration
read from an earlier attached drive is older than that of the new one,
indicating that the drive was offline during a critical configuration update
and so may be faulty.  In such a case, all such drives will be marked
OFFLINE, their subdisks dropped from the volume and the new drive connected.
The transformation layer will receive the respective notifications.

When created, a volume stays in the STARTING state and has no GEOM provider.
During volume discovery and initialization, the transformation module shifts
the volume to other states.  After all expected disks are connected, the
volume shifts to the OPTIMAL state.  If a disk fails to attach within the
specified timeout, but the disks present are enough for full data access, the
volume is shifted to the DEGRADED state.  When the volume is shifted to the
DEGRADED or the OPTIMAL state, the GEOM provider for the volume gets created
by the shared code, allowing normal volume operation.  If the disks present
are insufficient for operation, or a volume loses a disk and the remaining
disks are insufficient for the volume, the volume shifts to the BROKEN state
and, if one exists, the provider gets destroyed.

During normal operation, data requests from the registered provider get
queued to the worker thread queue.  The worker thread fetches them one by one
and calls the transformation module's tr_iostart() method.  This method does
the necessary data transformations and generates requests to the underlying
disks by calling the subdisk_iostart() function, which handles disk
partitioning (subdisks, if present) and forwards the request to the
respective GEOM disk consumer.  After the I/Os queued to the disk have
completed, a completion event is queued to the worker thread.  The worker
thread identifies the subdisk the request belongs to and forwards it to the
appropriate transformation module via a tr_iodone() call.  The transformation
module undoes any transformation, checks the original request and, when it is
complete, returns it to GEOM via the iodone() call.

As the status of volumes or subdisks changes (a disk failed, the array became
dirty, synchronization started, etc.), the on-disk metadata may need to be
updated.  The transformation module updates the volume or subdisk state and
calls the md_write() method, flushing the metadata to all the affected disks.
The new metadata is constructed by merging the metadata read by md_taste()
with the present volume/disk/subdisk statuses.

All duties related to array consistency (e.g., synchronization, bad sector
recovery, checking, etc.) are implemented by the transformation module.
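To illustrate the normal I/O path described above, here is a sketch of its
front half only (queueing and dispatch); the queue handling is simplified and
the G_RAID_TR_IOSTART() wrapper name and the use of the provider's private
field are assumptions of this example:

    static void
    g_raid_start(struct bio *bp)
    {
            struct g_raid_softc *sc;

            sc = bp->bio_to->geom->softc;
            mtx_lock(&sc->sc_queue_mtx);
            bioq_insert_tail(&sc->sc_queue, bp);
            mtx_unlock(&sc->sc_queue_mtx);
            wakeup(sc);                     // Kick the worker thread.
    }

    // Called from the worker thread with the node lock (sc_lock) held.
    static void
    g_raid_dispatch(struct g_raid_softc *sc, struct bio *bp)
    {
            struct g_raid_volume *vol;

            // Assumes the provider's private field points at its volume.
            vol = bp->bio_to->private;
            G_RAID_TR_IOSTART(vol->v_tr, bp);
    }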
These consistency operations are triggered by different events, such as
subdisk state changes, I/O errors or external commands.  During these times,
the transformation code may want exclusive access to some subset of the
volume.  For that, the shared code implements a mechanism for deferring
accesses to such portions of the volume to a separate queue.  The
transformation module calls the lock() function, specifying the start offset
and length of the portion of the volume to be blocked.  If there are no
active write requests in the range, the call returns a success status.  If
there are requests in the range, the call returns an error, and a callback is
called once the range becomes clear.  Any write request affecting a locked
range is delayed in the v_locked queue until the unlock() function is called
for the range and there are no other related locks.  (A sketch of this
mechanism follows the function lists below.)

metadata methods:

md_taste():
 - checks external things, such as the controller PCI ID;
 - reads metadata from the specified provider and checks it;
 - iterates through configured nodes, looking for a matching type and some
   additional parameters, such as the array ID;
 - if no matching node is found, creates one;
 - creates volumes and disks inside the node.

md_event():
 - receives disk events (e.g., a disappeared disk);
 - updates subdisk statuses and sends events to them.

md_write():
 - writes metadata to the specified disk, mixing original and new data.

md_free():
 - frees memory allocated for storing metadata.

transformer methods:

tr_taste():
 - checks the ability of the current transformer to handle the volume;
 - allocates/initializes internal structures;
 - updates volume/subdisk statuses.

tr_event():
 - handles subdisk status change events (attach/detach/...);
 - updates volume/subdisk statuses;
 - starts/stops synchronization.

tr_start():
 - starts volume operation if it hasn't started automatically yet.

tr_stop():
 - stops volume operation (rebuild, etc.).

tr_iostart():
 - called on the way down for the volume;
 - manages the forward transformation and generates requests to disks.

tr_iodone():
 - called on the way up for a subdisk;
 - manages the backward transformation and reports completion status when
   ready;
 - for requests initiated by the transformation layer (synchronization), may
   generate new requests;
 - depending on the result statuses, may generate volume and subdisk status
   changes.

tr_locked():
 - callback method for lock(); reports that there are no active write
   requests in the requested range.

tr_free():
 - frees internal structures.

body functions:

create_node():
 - creates a new GEOM node;
 - launches the worker thread.
 Called by md_taste() with the GEOM topology lock held.

create_volume():
 - creates a new volume.
 Called by md_taste() with the node lock held.

create_disk():
 - creates a new disk.
 Called by md_taste() with the node lock held.

destroy_node():
 - checks that the node has no volumes and disks; if it has, destroys them;
 - stops the worker thread;
 - destroys the GEOM node.

destroy_volume():
 - tears down and destroys a volume.

destroy_disk():
 - destroys a disk.

subdisk_iostart():
 - called on the way down for a subdisk (usually by tr_iostart()).

iodone():
 - called on the way up for the volume (usually by tr_iodone()).

lock():
 - locks the specified range from writing.  Any new write request in the
   range will be delayed.  If such a request is already in progress, the
   range is still blocked, but the function returns a specific status.  When
   the write completes, tr_locked() gets called.

unlock():
 - unblocks the specified range, resuming delayed requests.
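Here is the sketch of the range-locking mechanism referred to above.  It only
illustrates the intended use of struct g_raid_lock and the v_inflight/v_locks
fields declared below; g_raid_bio_overlaps() is a hypothetical helper
comparing [offset, offset + length) ranges, and the M_RAID malloc type is an
assumption of this example:

    static int
    g_raid_lock_range(struct g_raid_volume *vol, off_t off, off_t len,
        void *argp)
    {
            struct g_raid_lock *lp;
            struct bio *bp;

            lp = malloc(sizeof(*lp), M_RAID, M_WAITOK | M_ZERO);
            lp->l_offset = off;
            lp->l_length = len;
            lp->l_callback_arg = argp;
            LIST_INSERT_HEAD(&vol->v_locks, lp, l_next);

            // If a write already in flight overlaps the range, report
            // EBUSY now; tr_locked() fires later, once the last such
            // write completes and the range has drained.
            TAILQ_FOREACH(bp, &vol->v_inflight.queue, bio_queue) {
                    if (bp->bio_cmd == BIO_WRITE &&
                        g_raid_bio_overlaps(bp, off, len))
                            return (EBUSY);
            }
            return (0);     // Range is clear, lock granted immediately.
    }

New writes that fall into a locked range are placed on the v_locked queue and
re-issued when unlock() removes the last lock covering them.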
Mirror
------

All write requests are executed on all subdisks of the volume that are in the
ACTIVE and SYNCHRONIZING states.  The operation completes when all subdisks
report completion.  A success status is returned if at least one subdisk
returned success.  Subdisks reporting errors are either dropped from the
array or marked DIRTY to avoid reading from them unless necessary.

Read requests are executed by the subdisk with the lowest average load that
is not marked DIRTY.  If a read request fails, it is retried on the other
subdisks until it succeeds.  Subdisk areas with errors are added to a list so
that an explicit synchronization process can be run on them, allowing the
disk to rewrite or reallocate the sectors.  If a disk fails to synchronize,
it will be dropped from the array or marked DIRTY.

Besides the synchronization of read-error areas, full background disk
synchronization will be implemented.  It will be used in cases of disk
replacement or, optionally, when the array is desynchronized due to an
unclean shutdown.  If synchronization completes successfully, the subdisk is
shifted to the ACTIVE state and the DIRTY flag is removed.

Tunables:

kern.geom.raid.auto_insert = {0, 1}
    When true, a disk inserted on the same PCI controller that has a degraded
    mirror is automatically added to that mirror.  When false, the disk is
    ignored.  If the disk is too small, it won't be inserted into the mirror.

kern.geom.raid.auto_sync = {0, 1}
    When true, subdisks found to be dirty automatically start the background
    synchronization process.

kern.geom.raid.read_master = {0, 1}
    When true, all reads are forced to go to the master drive.  When false,
    reads are multiplexed between the drives of the mirror.
    [[ when we come up dirty, we default to this until we can figure out if
    we need to synchronize or not ]]

kern.geom.raid.write_on_read_fail = {0, 1}
    When false, a read failure will force the drive out of the mirror.  When
    true, the other member of the mirror will be read to get the data, and
    the mirror transform will try to write the data back to force a
    bad-sector reallocation and 'self heal' the mirror before going to the
    extreme step of breaking the mirror.
*/

/*
 * KOBJ parent class of metadata processing modules.
 */
struct g_raid_md_class {
	KOBJ_CLASS_FIELDS;
	int			 mdc_priority;
	TAILQ_ENTRY(g_raid_md_class) mdc_list;
};

/*
TAILQ_HEAD(, g_raid_md_class) g_raid_md_classes =
    TAILQ_HEAD_INITIALIZER(g_raid_md_classes);
*/

/*
 * KOBJ instance of metadata processing module.
 */
struct g_raid_md_object {
	KOBJ_FIELDS;
	struct g_raid_md_class	*mdo_class;
	struct g_raid_softc	*mdo_softc;	/* Back-pointer to softc. */
};

/*
 * KOBJ parent class of data transformation modules.
 */
struct g_raid_tr_class {
	KOBJ_CLASS_FIELDS;
	int			 trc_priority;
	TAILQ_ENTRY(g_raid_tr_class) trc_list;
};

/*
TAILQ_HEAD(, g_raid_tr_class) g_raid_tr_classes =
    TAILQ_HEAD_INITIALIZER(g_raid_tr_classes);
*/

/*
 * KOBJ instance of data transformation module.
 */
struct g_raid_tr_object {
	KOBJ_FIELDS;
	struct g_raid_tr_class	*tro_class;
	struct g_raid_volume	*tro_volume;	/* Back-pointer to volume. */
};

struct g_raid_lock {
	off_t			 l_offset;
	off_t			 l_length;
	void			*l_callback_arg;
	LIST_ENTRY(g_raid_lock)	 l_next;
};

#define	G_RAID_EVENT_WAIT	0x01
#define	G_RAID_EVENT_VOLUME	0x02
#define	G_RAID_EVENT_SUBDISK	0x04
#define	G_RAID_EVENT_DISK	0x08
#define	G_RAID_EVENT_DONE	0x10
struct g_raid_event {
	void			*e_tgt;
	int			 e_event;
	int			 e_flags;
	int			 e_error;
	TAILQ_ENTRY(g_raid_event) e_next;
};

#define	G_RAID_DISK_S_NONE		0x00
#define	G_RAID_DISK_S_ACTIVE		0x01
#define	G_RAID_DISK_S_SPARE		0x02
#define	G_RAID_DISK_S_OFFLINE		0x03

#define	G_RAID_DISK_E_DISCONNECTED	0x01
struct g_raid_disk {
	struct g_raid_softc	*d_softc;	/* Back-pointer to softc. */
	struct g_consumer	*d_consumer;	/* GEOM disk consumer. */
	void			*d_md_data;	/* Disk's metadata storage. */
	u_int			 d_state;	/* Disk state. */
	uint64_t		 d_flags;	/* Additional flags. */
	u_int			 d_load;	/* Disk average load. */
	off_t			 d_last_offset;	/* Last head offset. */
	LIST_HEAD(, g_raid_subdisk) d_subdisks;	/* List of subdisks. */
	LIST_ENTRY(g_raid_disk)	 d_next;	/* Next disk in the node. */
};

#define	G_RAID_SUBDISK_S_NONE		0x00
#define	G_RAID_SUBDISK_S_NEW		0x01
#define	G_RAID_SUBDISK_S_ACTIVE		0x02
#define	G_RAID_SUBDISK_S_STALE		0x03
#define	G_RAID_SUBDISK_S_SYNCHRONIZING	0x04
#define	G_RAID_SUBDISK_S_DISCONNECTED	0x05
#define	G_RAID_SUBDISK_S_DESTROY	0x06

#define	G_RAID_SUBDISK_E_NEW		0x01
#define	G_RAID_SUBDISK_E_DISCONNECTED	0x02

struct g_raid_subdisk {
	struct g_raid_softc	*sd_softc;	/* Back-pointer to softc. */
	struct g_raid_disk	*sd_disk;	/* Where this subdisk lives. */
	struct g_raid_volume	*sd_volume;	/* Volume this subdisk is a
						   part of. */
	off_t			 sd_offset;	/* Offset on the disk. */
	off_t			 sd_size;	/* Size on the disk. */
	u_int			 sd_pos;	/* Position in volume. */
	u_int			 sd_state;	/* Subdisk state. */
	LIST_ENTRY(g_raid_subdisk) sd_next;	/* Next subdisk on disk. */
};

#define	G_RAID_MAX_SUBDISKS	16
#define	G_RAID_MAX_VOLUMENAME	16

#define	G_RAID_VOLUME_S_STARTING	0x00
#define	G_RAID_VOLUME_S_BROKEN		0x01
#define	G_RAID_VOLUME_S_DEGRADED	0x02
#define	G_RAID_VOLUME_S_SUBOPTIMAL	0x03
#define	G_RAID_VOLUME_S_OPTIMAL		0x04
#define	G_RAID_VOLUME_S_UNSUPPORTED	0x05
#define	G_RAID_VOLUME_S_STOPPED		0x06
#define	G_RAID_VOLUME_S_ALIVE(s)			\
	((s) == G_RAID_VOLUME_S_DEGRADED ||		\
	 (s) == G_RAID_VOLUME_S_SUBOPTIMAL ||		\
	 (s) == G_RAID_VOLUME_S_OPTIMAL)

#define	G_RAID_VOLUME_E_DOWN		0x00
#define	G_RAID_VOLUME_E_UP		0x01
#define	G_RAID_VOLUME_E_START		0x10

#define	G_RAID_VOLUME_RL_RAID0		0x00
#define	G_RAID_VOLUME_RL_RAID1		0x01
#define	G_RAID_VOLUME_RL_RAID3		0x03
#define	G_RAID_VOLUME_RL_RAID4		0x04
#define	G_RAID_VOLUME_RL_RAID5		0x05
#define	G_RAID_VOLUME_RL_RAID6		0x06
#define	G_RAID_VOLUME_RL_RAID10		0x0a
#define	G_RAID_VOLUME_RL_RAID1E		0x11
#define	G_RAID_VOLUME_RL_SINGLE		0x0f
#define	G_RAID_VOLUME_RL_CONCAT		0x1f
#define	G_RAID_VOLUME_RL_RAID5E		0x15
#define	G_RAID_VOLUME_RL_RAID5EE	0x25
#define	G_RAID_VOLUME_RL_UNKNOWN	0xff

#define	G_RAID_VOLUME_RLQ_NONE		0x00
#define	G_RAID_VOLUME_RLQ_UNKNOWN	0xff

struct g_raid_volume {
	struct g_raid_softc	*v_softc;	/* Back-pointer to softc. */
	struct g_provider	*v_provider;	/* GEOM provider. */
	struct g_raid_subdisk	 v_subdisks[G_RAID_MAX_SUBDISKS];
						/* Subdisks of this volume. */
	void			*v_md_data;	/* Volume's metadata storage. */
	struct g_raid_tr_object	*v_tr;		/* Transformation object. */
	char			 v_name[G_RAID_MAX_VOLUMENAME];
						/* Volume name. */
	u_int			 v_state;	/* Volume state. */
	u_int			 v_raid_level;	/* Array RAID level. */
	u_int			 v_raid_level_qualifier; /* RAID level det. */
	u_int			 v_disks_count;	/* Number of disks in array. */
	u_int			 v_strip_size;	/* Array strip size. */
	u_int			 v_sectorsize;	/* Volume sector size. */
	off_t			 v_mediasize;	/* Volume media size. */
	struct bio_queue_head	 v_inflight;	/* In-flight write requests. */
	struct bio_queue_head	 v_locked;	/* Blocked I/O requests. */
	LIST_HEAD(, g_raid_lock) v_locks;	/* List of locked regions. */
	int			 v_idle;	/* DIRTY flags removed. */
	time_t			 v_last_write;	/* Time of the last write. */
	u_int			 v_writes;	/* Number of active writes. */
	struct root_hold_token	*v_rootmount;	/* Root mount delay token. */
	struct callout		 v_start_co;	/* STARTING state timer. */
	int			 v_starting;	/* STARTING state timer armed. */
	int			 v_stopping;	/* Volume is stopping. */
	int			 v_provider_open; /* Number of opens. */
	LIST_ENTRY(g_raid_volume) v_next;	/* List of volumes entry. */
};
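/*
 * Illustration only: how a RAID1 transformation module could use the
 * structures above to pick the subdisk for a read request, following the
 * "Mirror" section of the design notes (least loaded usable subdisk).
 * The function name is invented for this sketch, and a real implementation
 * would also skip subdisks marked DIRTY.
 *
 *	static struct g_raid_subdisk *
 *	g_raid_tr_raid1_select_read_disk(struct g_raid_volume *vol)
 *	{
 *		struct g_raid_subdisk *sd, *best;
 *		u_int i;
 *
 *		best = NULL;
 *		for (i = 0; i < vol->v_disks_count; i++) {
 *			sd = &vol->v_subdisks[i];
 *			// Only ACTIVE subdisks are safe to read from.
 *			if (sd->sd_state != G_RAID_SUBDISK_S_ACTIVE)
 *				continue;
 *			if (best == NULL ||
 *			    sd->sd_disk->d_load < best->sd_disk->d_load)
 *				best = sd;
 *		}
 *		return (best);
 *	}
 */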
struct g_raid_softc {
	struct g_raid_md_object	*sc_md;		/* Metadata object. */
	struct g_geom		*sc_geom;	/* GEOM class instance. */
	uint64_t		 sc_flags;	/* Additional flags. */
	LIST_HEAD(, g_raid_volume) sc_volumes;	/* List of volumes. */
	LIST_HEAD(, g_raid_disk) sc_disks;	/* List of disks. */
	struct sx		 sc_lock;	/* Main node lock. */
	struct proc		*sc_worker;	/* Worker process. */
	struct mtx		 sc_queue_mtx;	/* Worker queues lock. */
	TAILQ_HEAD(, g_raid_event) sc_events;	/* Worker events queue. */
	struct bio_queue_head	 sc_queue;	/* Worker I/O queue. */
	int			 sc_stopping;	/* Node is stopping. */
};
#define	sc_name	sc_geom->name
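/*
 * Registration sketch.  A metadata module (a hypothetical Intel one here)
 * is expected to provide a kobj_method_t table for the md_*() methods
 * listed in the design notes above, plus a g_raid_md_class that gets linked
 * into g_raid_md_classes at module load time, ordered by mdc_priority.
 * The method descriptors would come from a KOBJ interface definition; their
 * names and the initializer layout below are assumptions of this sketch,
 * not a fixed API.
 *
 *	static kobj_method_t g_raid_md_intel_methods[] = {
 *		KOBJMETHOD(g_raid_md_taste,	g_raid_md_taste_intel),
 *		KOBJMETHOD(g_raid_md_event,	g_raid_md_event_intel),
 *		KOBJMETHOD(g_raid_md_write,	g_raid_md_write_intel),
 *		KOBJMETHOD(g_raid_md_free,	g_raid_md_free_intel),
 *		{ 0, 0 }
 *	};
 *
 *	static struct g_raid_md_class g_raid_md_intel_class = {
 *		"Intel",			// KOBJ_CLASS_FIELDS: name
 *		g_raid_md_intel_methods,	// KOBJ_CLASS_FIELDS: methods
 *		sizeof(struct g_raid_md_object),// KOBJ_CLASS_FIELDS: size
 *		.mdc_priority = 100
 *	};
 */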