Fault Management Logic for ZED

The integration of Fault Management Daemon (FMD) logic from illumos is being deployed in three phases. This logic is encapsulated in several software modules inside ZED.

ZED+FM Phase 1

All of the Phase 1 work is in the current master branch. Phase 1 work includes:

ZED+FM Phase 2 (WIP)

The Phase 2 work primarily entails the Diagnosis Engine and the Retire Agent modules. It also includes infrastructure to support a crude FMD environment to host these modules. For additional information, see the FMD Components in ZED and Implementation Notes sections below.

ZED+FM Phase 3

Future work will add functionality and will likely include:

ZFS Fault Management Overview

The primary purpose of ZFS fault management is automated diagnosis
and isolation of VDEV faults. A fault is something we can associate with
an impact (e.g. loss of data redundancy) and a corrective action (e.g.
offline or replace a disk). A typical ZFS fault management stack is
composed of error detectors (e.g.
After detecting a software error, the ZFS kernel module sends error events to the ZED user daemon, which in turn routes the events to its internal FMA modules based on their event subscriptions. Likewise, if a disk is added or changed in the system, the disk monitor sends disk events, which are consumed by a response agent.

FMD Components in ZED

There are three FMD modules (aka agents) that are now built into ZED.

To begin with, a Diagnosis Engine consumes per-vdev I/O and checksum ereports and feeds them into a Soft Error Rate Discrimination (SERD) algorithm, which generates a corresponding fault diagnosis when the tracked VDEV encounters N events within a given time window T. The initial N and T values for the SERD algorithm are estimates inherited from illumos (10 errors in 10 minutes).

In turn, a Retire Agent responds to diagnosed faults by isolating the faulty VDEV. It notifies the ZFS kernel module of the new VDEV state (degraded or faulted). The Retire Agent is also responsible for managing hot spares across all pools. When it encounters a device fault or a device removal, it replaces the device with an appropriate spare if one is available.

Finally, a Disk Add Agent responds to events from a
libudev disk monitor ( Note that the auto-replace feature (aka hot plug) is opt-in
and you must set the pool's

Implementation Notes
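
As an illustration of the SERD behavior described under FMD Components in ZED, here is a minimal Python sketch. It is not ZED's actual implementation (ZED is written in C). The class name `SerdEngine`, the `record` method, the sliding-window bookkeeping, and the choice to clear the window after firing are all assumptions made for this example. Only the defaults of 10 events in 10 minutes come from the text above.

```python
from collections import deque


class SerdEngine:
    """Hypothetical per-vdev SERD engine: fires on N events within T seconds."""

    def __init__(self, n=10, t_seconds=600.0):
        self.n = n              # event threshold (N)
        self.t = t_seconds      # time window in seconds (T)
        self.events = deque()   # timestamps of events still inside the window

    def record(self, timestamp):
        """Record one error event; return True if the engine fires."""
        self.events.append(timestamp)
        # Discard events that have fallen out of the trailing window.
        while timestamp - self.events[0] > self.t:
            self.events.popleft()
        if len(self.events) >= self.n:
            # Assumption: start over after firing so that a later diagnosis
            # needs N fresh events.
            self.events.clear()
            return True
        return False


if __name__ == "__main__":
    burst = SerdEngine()
    # Ten errors one second apart: the tenth one crosses the threshold.
    print([burst.record(float(i)) for i in range(10)])

    trickle = SerdEngine()
    # One error every two minutes: at most six fit in any ten-minute window.
    print(any(trickle.record(120.0 * i) for i in range(20)))
```

In this sketch a burst of errors produces a diagnosis, while the same number of errors spread over a longer period does not, which is the distinction the N and T values are meant to capture.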