Commit message (Collapse)AuthorAgeFilesLines
* Remove znode move functionalityBrian Behlendorf2011-02-101-184/+0
| | | | | | | | Unlike Solaris the Linux implementation embeds the inode in the znode, and has no use for a vnode. So while it's true that fragmentation of the znode cache may occur it should not be worse than any of the other Linux FS inode caches. Until proven that this is a problem it's just added complexity we don't need.
* Conserve stack in zfs_mkdir()Brian Behlendorf2011-02-101-1/+3
| | | | | Move the sa_attrs array from the stack to the heap to minimize stack space usage.
* Conserve stack in zfs_sa_upgrade()Brian Behlendorf2011-02-101-4/+8
| | | | | | As always under Linux stack space is at a premium. Relocate two 20 element sa_bulk_attr_t arrays in zfs_sa_upgrade() from the stack to the heap.
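A minimal user-space sketch of the stack-to-heap pattern used by the two commits above. The `bulk_attr_t` type and the `upgrade_sketch()` function are hypothetical stand-ins, not the real `sa_bulk_attr_t` or `zfs_sa_upgrade()`; in-kernel code would use `kmem_alloc()`/`kmem_free()` rather than `calloc()`/`free()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for sa_bulk_attr_t; the real type differs. */
typedef struct bulk_attr {
	uint64_t ba_id;
	void *ba_data;
	size_t ba_len;
} bulk_attr_t;

#define	BULK_ATTR_COUNT	20

/*
 * Fill and consume a 20-element attribute array that lives on the heap
 * instead of the stack. Returns 0 on success, -1 if allocation fails.
 */
int
upgrade_sketch(uint64_t *sum_out)
{
	bulk_attr_t *bulk;
	uint64_t sum = 0;
	int i;

	assert(sum_out != NULL);

	/* Previously: bulk_attr_t bulk[BULK_ATTR_COUNT]; (on the stack) */
	bulk = calloc(BULK_ATTR_COUNT, sizeof (bulk_attr_t));
	if (bulk == NULL)
		return (-1);

	for (i = 0; i < BULK_ATTR_COUNT; i++) {
		bulk[i].ba_id = (uint64_t)i;
		sum += bulk[i].ba_id;
	}

	/* Every exit path after the allocation must free the array. */
	free(bulk);
	*sum_out = sum;
	return (0);
}
```

The cost is an allocation that can fail and must be freed on every return path, which is presumably why the diffs grow by a few lines.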
* Export required vfs/vn symbolsBrian Behlendorf2011-02-106-29/+153
|
* Add HAVE_SCANSTAMPBrian Behlendorf2011-02-101-2/+6
| | | | | This functionality is not supported under Linux, perhaps it will be some day if it's decided it's useful.
* Add initial rw_uio functions to the dmuBrian Behlendorf2011-02-042-5/+117
| | | | | | | These functions were dropped originally because I felt they would need to be rewritten anyway to avoid using uios. However, this patch re-adds them with the idea that they can just be reworked and the uio bits dropped.
* Remove HAVE_ZPL from commands and librariesBrian Behlendorf2011-02-047-139/+0
| | | | | Thanks to the previous few commits we can now build all of the user space commands and libraries with support for the zpl.
* Documentation updatesBrian Behlendorf2011-02-046-17/+17
| | | | | Minor Linux specific documentation updates to the comments and man pages.
* Minimal libshare infrastructureBrian Behlendorf2011-02-0449-234/+24
| | | | | | | | | | | | | | | | | | ZFS even under Solaris does not strictly require libshare to be available. The current implementation attempts to dlopen() the library to access the needed symbols. If this fails libshare support is simply disabled. This means that on Linux we only need the most minimal libshare implementation. In fact just enough to prevent the build from failing. Longer term we can decide if we want to implement a libshare library like Solaris. At best this would be an abstraction layer between ZFS and NFS/SMB. Alternately, we can drop libshare entirely and directly integrate ZFS with Linux's NFS/SMB. Finally the bare bones user-libshare.m4 test was dropped. If we do decide to implement libshare at some point it will surely be as part of this package so the check is not needed.
* Add 'zfs mount' supportBrian Behlendorf2011-02-045-132/+169
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | By design the zfs utility is supposed to handle mounting and unmounting a zfs filesystem. We could allow zfs to do this directly. There are system calls available to mount/umount a filesystem. And there are library calls available to manipulate /etc/mtab. But there are a couple very good reasons not to take this approach... for now. Instead of directly calling the system and library calls to (u)mount the filesystem we fork and exec a (u)mount process. The principal reason for this is to delegate the responsibility for locking and updating /etc/mtab to (u)mount(8). This ensures maximum portability and ensures the right locking scheme for your version of (u)mount will be used. If we didn't do this we would have to resort to an autoconf test to determine what locking mechanism is used. The downside to using mount(8) instead of mount(2) is that we lose the exact errno which was returned by the kernel. The return code from mount(8) provides some insight into what went wrong but it is not quite as good. For the moment this is translated as a best guess into an errno for the higher layers of zfs. In the long term a shared library called libmount is under development which provides a common API to address the locking and errno issues. Once the standard mount utility has been updated to use this library we can then leverage it. Until then this is the only safe solution. http://www.kernel.org/pub/linux/utils/util-linux/libmount-docs/index.html
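A rough sketch of the fork/exec approach described above, followed by a best-guess translation of the mount(8) exit status into an errno. The exit status bits are the ones documented in the mount(8) man page; the function names and the particular errno chosen for each bit are assumptions made for illustration, not the mapping used by the zfs utility.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Exit status bits documented in mount(8). */
#define	MOUNT_USAGE	0x01	/* incorrect invocation or permissions */
#define	MOUNT_SYSERR	0x02	/* system error (out of memory, cannot fork) */
#define	MOUNT_SOFTWARE	0x04	/* internal mount bug */
#define	MOUNT_USER	0x08	/* user interrupt */
#define	MOUNT_FILEIO	0x10	/* problems writing or locking /etc/mtab */
#define	MOUNT_FAIL	0x20	/* mount failure */

/*
 * Fork and exec a helper, returning its exit status or -1 if it could
 * not be run or did not exit normally.
 */
int
run_process_sketch(const char *path, char *const argv[])
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return (-1);
	if (pid == 0) {
		execv(path, argv);
		_exit(127);	/* only reached if execv() failed */
	}
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return (-1);
	}
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}

/* Hypothetical best-guess mapping; the exact kernel errno is lost. */
int
mount_status_to_errno(int status)
{
	if (status == 0)
		return (0);
	if (status & MOUNT_USAGE)
		return (EINVAL);
	if (status & MOUNT_SYSERR)
		return (ENOMEM);
	if (status & MOUNT_USER)
		return (EINTR);
	if (status & MOUNT_FILEIO)
		return (EIO);
	if (status & MOUNT_FAIL)
		return (ENXIO);
	return (EFAULT);	/* MOUNT_SOFTWARE or anything unrecognized */
}
```

A caller would pass something like `/bin/mount` with its argument vector to `run_process_sketch()` and feed the result to `mount_status_to_errno()`. Because several kernel errors collapse into a single status bit, the translation can only ever be approximate.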
* Open up libzfs_run_process/libzfs_load_moduleBrian Behlendorf2011-01-282-2/+9
| | | | | | | Recently helper functions were added to libzfs_util to load a kernel module or execute a process. Initially this functionality was limited to libzfs but it has become clear there will be other consumers. This change opens up the interface so it may be used where appropriate.
* Disable umount.zfs helperBrian Behlendorf2011-01-281-37/+103
| | | | | | | | | | | | | | | | | For the moment, the only advantage in registering a umount helper would be to automatically unshare a zfs filesystem. Since under Linux this would be unexpected (but nice) behavior there is no harm in disabling it. This is desirable because the 'zfs unmount' path invokes the system umount. This is done to ensure correct mtab locking but has the side effect that the umount.zfs helper would be called if it exists. By default this helper calls back in to zfs to do the unmount on Solaris which we don't want under Linux. Once libmount is available and we have a safe way to correctly lock and update the /etc/mtab file we can reconsider the need for a umount helper. Using libmount is the preferred solution.
* Enable mount.zfs helperBrian Behlendorf2011-01-281-32/+188
| | | | | | | | | | | | | | | | | | While not strictly required to mount a zfs filesystem, using a mount helper has certain advantages. First, we need it if we want to honor the mount behavior as found on Solaris. As part of the mount we need to validate that the dataset has the legacy mount property set if we are using 'mount' instead of 'zfs mount'. Secondly, by using a mount helper we can automatically load the zpl kernel module. This way you can just issue a 'mount' or 'zfs mount' and it will just work. Finally, it gives us a common hook in user space to add any zfs specific mount options we might want. At the moment we don't have any but now the infrastructure is at least in place.
* Autoconf selinux supportBrian Behlendorf2011-01-2849-16/+678
| | | | | | | | | | | | | | | | | If libselinux is detected on your system at configure time link against it. This allows us to use a library call to detect if selinux is enabled and if it is to pass the mount option: "context=\"system_u:object_r:file_t:s0" For now this is required because none of the existing selinux policies are aware of the zfs filesystem type. Because of this they do not properly enable xattr based labeling even though zfs supports all of the required hooks. Until distros add zfs as a known xattr friendly fs type we must use mntpoint labeling. Alternately, end users could modify their existing selinux policy with a little guidance.
* Fix ZVOL rename minor devicesBrian Behlendorf2011-01-071-3/+9
| | | | | | | During a rename we need to be careful to destroy and create a new minor for the ZVOL _only_ if the rename succeeded. The previous code would destroy your minor device unconditionally, and it would also fail to create the new minor device on success.
* Fix minor compiler warningsBrian Behlendorf2011-01-068-65/+68
| | | | | | | These compiler warnings were introduced when code which was previously #ifdef'ed out by HAVE_ZPL was re-added for use by the posix layer. All of the following changes should be obviously correct and will cause no semantic changes.
* Add missing mkdirp prototypeBrian Behlendorf2010-12-141-0/+34
| | | | | | For a while now mkdirp has been built as part of libspl however the prototype was never added to libgen.h. This went unnoticed until enabling the mount support which uses mkdirp().
* Use cv_timedwait_interruptible in arcBrian Behlendorf2010-12-142-3/+4
| | | | | | | | | | | The issue is that cv_timedwait() sleeps uninterruptibly to block signals and avoid waking up early. Under Linux this counts against the load average keeping it artificially high. This change allows the arc to sleep interruptibly which means it may be woken up early due to a signal. Normally this means some extra care must be taken to handle a potential signal. But for the arc's usage of cv_timedwait() there is no harm in waking up before the timeout expires so no extra handling is required.
* Fix block device-related issues in zdb.Ricardo M. Correia2010-12-145-23/+58
| | | | | | | | | | | Specifically, this fixes the two following errors in zdb when a pool is composed of block devices: 1) 'Value too large for defined data type' when running 'zdb <dataset>'. 2) 'character device required' when running 'zdb -l <block-device>'. Signed-off-by: Ricardo M. Correia <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Enable rrwlock.c compilationBrian Behlendorf2010-12-071-3/+0
| | | | | | With the addition of the thread specific data interfaces to the SPL it is safe to enable compilation of the re-entrant reader/writer locks.
* Refresh autogen.sh productsBrian Behlendorf2010-12-072-6/+6
| | | | | | | | | Refresh the autogen.sh products based on the versions which are installed by default in the GA RHEL6.0 release. autoconf (GNU Autoconf) 2.63 automake (GNU automake) 1.11.1 ltmain.sh (GNU libtool) 2.2.6b
* Remove partition from vdev name in zfault.shNed Bass2010-11-291-1/+1
| | | | | | | | | As of the 0.5.2 tag, names of whole-disk vdevs must be specified to the command line tools without partition identifiers. This commit fixes a 'zpool online' command in zfault.sh that incorrectly includes the partition in the vdev name, causing test 9 to fail. Signed-off-by: Brian Behlendorf <[email protected]>
* Skip /dev/hpet during 'zpool import' (tag: zfs-0.5.2)Brian Behlendorf2010-11-121-1/+2
| | | | | | | | | | | | | | | | | | | If libblkid does not contain ZFS support, then 'zpool import' will scan all block devices in /dev/ to determine which ones are components of a ZFS filesystem. It does this by opening all the devices and stat'ing them to determine which ones are block devices. If the device turns out not to be a block device it is skipped. Usually, this whole process is pretty harmless (although slow). But there are certain devices in /dev/ which must be handled in a very specific way or your system may crash. For example, if /dev/watchdog is simply opened the watchdog timer will be started and your system will panic when the timer expires. It turns out that /dev/hpet causes similar problems although only when accessed under a virtual machine. For some reason accessing /dev/hpet causes qemu to crash. To address this issue this commit adds /dev/hpet to the device blacklist, it will be skipped solely based on its name.
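A minimal sketch of a name-based device blacklist of the kind described above. The table and the `dev_is_blacklisted()` function are hypothetical; matching by prefix (so that a name such as `watchdog0` is also skipped) is an assumption, not something stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entries under /dev/ that must never be opened during a scan. */
static const char *const dev_blacklist[] = { "watchdog", "hpet", NULL };

/*
 * Return 1 if the /dev/ entry should be skipped based on its name
 * alone, before any open() or stat() is attempted, otherwise 0.
 */
int
dev_is_blacklisted(const char *name)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; dev_blacklist[i] != NULL; i++) {
		/* Prefix match: an assumption made for this sketch. */
		if (strncmp(name, dev_blacklist[i],
		    strlen(dev_blacklist[i])) == 0)
			return (1);
	}
	return (0);
}
```

The important property is that the check runs on the directory entry name, so the dangerous device is never opened at all.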
* Add '-ts' options to zconfig.sh/zfault.sh usageBrian Behlendorf2010-11-112-2/+4
| | | | | | | When adding this functionality originally the options to only run specific tests (-t), or conversely skip specific tests (-s) were omitted from the usage page. This commit adds the missing documentation.
* Remove spl/zfs modules as part of cleanupBrian Behlendorf2010-11-114-0/+4
| | | | | | | | | | The idea behind the '-c' flag is to clean up everything from a previous test run which might cause the test script to fail. This should also include removing the previously loaded module. This makes it a little easier to run 'zconfig.sh -c', however remember this is a test script and it will take all of your other zpools offline for the purposes of the test. This notion has also been extended to the default 'make check' behavior.
* Unconditionally load core kernel modulesBrian Behlendorf2010-11-112-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Loading and unloading the zlib modules as part of the zfs.sh script has proven a little problematic for a few reasons. * First, your kernel may not need to load either zlib_inflate or zlib_deflate. This functionality may be built directly in to your kernel. It depends entirely on what your distribution decided was the right thing to do. * Second, even if you do manage to load the correct modules you may not be able to unload them. There may be other consumers of the modules with a reference preventing the unload. To avoid both of these issues the test scripts have been updated to attempt to unconditionally load all modules listed in KERNEL_MODULES. If the module is successfully loaded you must have needed it. If the module can't be loaded that almost certainly means either it is built in to your kernel or is already being used by another consumer. In both cases this is not an issue and we can move on to the spl/zfs modules. Finally, by removing these kernel modules from the MODULES list we ensure they are never unloaded during 'zfs.sh -u'. This avoids the issue of the script failing because there is another consumer using the module we were not aware of. In other words the script restricts unloading modules to only the spl/zfs modules. Closes #78
* Fix for access beyond end of device errorNed Bass2010-11-105-11/+13
| | | | | | | | | | | | | | | | | | | | | | This commit fixes a sign extension bug affecting l2arc devices. Extremely large offsets may be passed down to the low level block device driver on reads, generating errors similar to attempt to access beyond end of device sdbi1: rw=14, want=36028797014862705, limit=125026959 The unwanted sign extension occurs because the function arc_read_nolock() stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel. This offset is then passed to zio_read_phys() as a uint64_t argument, causing sign extension for values of 0x80000000 or greater. To avoid this, we store the offset in a uint64_t. This change also changes a few daddr_t struct members to uint64_t in the libspl headers to avoid similar bugs cropping up in the future. We also add an ASSERT to __vdev_disk_physio() to check for invalid offsets. Closes #66 Signed-off-by: Brian Behlendorf <[email protected]>
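The sign extension described above can be reproduced in a few lines of user-space C. The `sketch_daddr_t` typedef and both function names are hypothetical; they only model an offset passing through a 32-bit signed type on its way to a `uint64_t` argument. The narrowing conversion is implementation-defined in C, and the sketch assumes the usual two's complement truncation performed by gcc and clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit signed daddr_t. */
typedef int32_t sketch_daddr_t;

/*
 * Buggy path: the offset is stored in a 32-bit signed type and then
 * widened to uint64_t, which sign-extends values >= 0x80000000.
 */
uint64_t
offset_via_daddr(uint64_t offset)
{
	sketch_daddr_t addr = (sketch_daddr_t)offset;

	return ((uint64_t)addr);
}

/* Fixed path: the offset stays in a uint64_t the whole way down. */
uint64_t
offset_via_uint64(uint64_t offset)
{
	uint64_t addr = offset;

	assert(addr == offset);
	return (addr);
}
```

With this sketch `offset_via_daddr(0x80000000)` yields `0xFFFFFFFF80000000`, a byte offset just below 2^64. Divided into 512-byte sectors that is a little under 2^55 (about 3.6e16), the same magnitude as the `want=` value in the error message.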
* Linux 2.6.36 compat, use fops->unlocked_ioctl()Brian Behlendorf2010-11-101-6/+5
| | | | | | | | | As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has been removed and thus fops->ioctl() has also been removed. The replacement hook is fops->unlocked_ioctl() which has existed in the kernel since 2.6.12. Since the ZFS code only contains support back to 2.6.18 vintage kernels, I'm not adding an autoconf check for this and simply moving everything to use fops->unlocked_ioctl().
* Linux 2.6.36 compat, blk_* macros removedBrian Behlendorf2010-11-101-0/+10
| | | | | | | Most of the blk_* macros were removed in 2.6.36. Ostensibly this was done to improve readability and allow easier grepping. However, from a portability stand point the macros are helpful. Therefore the needed macros are redefined here if they are missing from the kernel.
* Linux 2.6.36 compat, synchronous bio flagBrian Behlendorf2010-11-106-13/+330
| The name of the flag used to mark a bio as synchronous has changed again in the 2.6.36 kernel due to the unification of the BIO_RW_* and REQ_* flags. The new flag is called REQ_SYNC.
|
| To simplify checking this flag I have introduced the vdev_disk_dio_is_sync() helper function. Based on the results of several new autoconf tests it uses the correct mask to check for a synchronous bio.
|
| Preferred interface for flagging a synchronous bio:
|   2.6.12-2.6.29: BIO_RW_SYNC
|   2.6.30-2.6.35: BIO_RW_SYNCIO
|   2.6.36-2.6.xx: REQ_SYNC
* Linux 2.6.36 compat, use REQ_FAILFAST_MASKBrian Behlendorf2010-11-105-13/+329
| As of linux-2.6.36 the BIO_RW_FAILFAST and REQ_FAILFAST flags have been unified under the REQ_* names. These flags always had to be kept in-sync so this is a nice step forward, unfortunately it means we need to be careful to only use the new unified flags when the BIO_RW_* flags are not defined. Additional autoconf checks were added for this and if it is ever unclear which method to use no flags are set. This is safe but may result in longer delays before a disk is failed.
|
| Preferred interface for setting FAILFAST on a bio:
|   2.6.12-2.6.27: BIO_RW_FAILFAST
|   2.6.28-2.6.35: BIO_RW_FAILFAST_{DEV|TRANSPORT|DRIVER}
|   2.6.36-2.6.xx: REQ_FAILFAST_{DEV|TRANSPORT|DRIVER}
* Remove inconsistent use of EOPNOTSUPPNed Bass2010-11-101-1/+1
| | | | | | | | | | | Commit 3ee56c292bbcd7e6b26e3c2ad8f0e50eee236bcc changed an ENOTSUP return value in one location to EOPNOTSUPP to fix user programs seeing an invalid ioctl() error code. However, use of ENOTSUP is widespread in the zfs module. Instead of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be consistent with user space. The changed return value in the above commit is therefore no longer needed, so this commit reverses it to maintain consistency. Signed-off-by: Brian Behlendorf <[email protected]>
* Add lustre zpios-test workloadBrian Behlendorf2010-11-083-20/+88
| | | | | | | | The lustre zpios-test simulates a reasonable lustre workload. It will create 128 threads, the same as a Lustre OSS, and then 4096 individual objects. Each object is 16MiB in size and will be written/read in 1MiB chunks from a random thread. This is fundamentally how we expect Lustre to behave for large IO intensive workloads.
* Prep for 0.5.2 tagBrian Behlendorf2010-11-081-1/+1
| | | | Update META file to prep for 0.5.2 tag.
* Replace custom zpool configs with generic configsBrian Behlendorf2010-11-0817-276/+261
| To streamline testing I have in the past added several custom configs to the zpool-config directory. This change reverts those custom configs and replaces them with three generic configs which can do the same thing. The generic config behavior can be set by setting various environment variables when calling either the zpool-create.sh or zpios.sh scripts.
|
| For example if you wanted to create and test a single 4-disk Raid-Z2 configuration using disks [A-D]1 with dedicated ZIL and L2ARC devices you could run the following.
|
| $ ZIL="log A2" L2ARC="cache B2" RANKS=1 CHANNELS=4 LEVEL=2 \
|   zpool-create.sh -c zpool-raidz
|
| $ zpool status tank
|   pool: tank
|  state: ONLINE
|   scan: none requested
| config:
|
|       NAME        STATE     READ WRITE CKSUM
|       tank        ONLINE       0     0     0
|         raidz2-0  ONLINE       0     0     0
|           A1      ONLINE       0     0     0
|           B1      ONLINE       0     0     0
|           C1      ONLINE       0     0     0
|           D1      ONLINE       0     0     0
|       logs
|         A2        ONLINE       0     0     0
|       cache
|         B2        ONLINE       0     0     0
|
| errors: No known data errors
* Make rollbacks fail gracefullyNed Bass2010-11-082-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Support for rolling back datasets requires a functional ZPL, which we currently do not have. The zfs command does not check for ZPL support before attempting a rollback, and in preparation for rolling back a zvol it removes the minor node of the device. To prevent the zvol device node from disappearing after a failed rollback operation, this change wraps the zfs_do_rollback() function in an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL. This is consistent with the behavior of other ZPL dependent commands such as mount. The original error message observed with this bug was rather confusing: internal error: Unknown error 524 Aborted This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but Linux actually has no such error code. It should instead return EOPNOTSUPP, as that is how ENOTSUP is defined in user space. With that we would have gotten the somewhat more helpful message cannot rollback 'tank/fish': unsupported version This is rather a moot point with the above changes since we will no longer make that ioctl call without a ZPL. But, this change updates the error code just in case. Signed-off-by: Brian Behlendorf <[email protected]>
* Increase zio write interrupt thread count.Brian Behlendorf2010-11-081-1/+1
| | | | | | | Increasing the default zio_wr_int thread count from 8 to 16 improves write performance by 13% on large systems. More testing needs to be done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum of 8 threads.
* Shorten zio_* thread namesBrian Behlendorf2010-11-082-3/+2
| | | | | | Linux kernel thread names are expected to be short. This change shortens the zio thread names to 10 characters leaving a few characters to append the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.
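A small sketch of why the short names matter. Linux stores a task name in a 16 byte buffer (TASK_COMM_LEN, including the terminating NUL), so the base name plus "/<cpuid>" has to fit in 15 characters. The `thread_name_sketch()` function and the `COMM_LEN` macro are hypothetical helpers for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define	COMM_LEN	16	/* same size as Linux TASK_COMM_LEN */

/*
 * Build "<name>/<cpu>" in buf. Returns 0 on success or -1 if the
 * result would have been truncated to fit in len bytes.
 */
int
thread_name_sketch(char *buf, size_t len, const char *name, unsigned cpu)
{
	int n;

	assert(buf != NULL && name != NULL);

	n = snprintf(buf, len, "%s/%u", name, cpu);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

With a `COMM_LEN` sized buffer, `thread_name_sketch(buf, COMM_LEN, "z_wr_iss", 0)` produces `z_wr_iss/0`, while a longer base name such as `zio_write_issue` no longer fits once a two digit cpu id is appended.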
* Fix panic mounting unformatted zvolNed Bass2010-10-291-0/+4
| | | | | | | | | | On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL file pointer if the user tries to mount a zvol without a filesystem on it. This change adds checks to prevent a null pointer dereference. Closes #73. Signed-off-by: Brian Behlendorf <[email protected]>
* Call modprobe with absolute pathNed Bass2010-10-221-1/+1
| | | | | | | | | Some sudo configurations may not include /sbin in the PATH. libzfs_load_module() currently does not call modprobe with an absolute path, so it may fail under such configurations if called under sudo. This change adds the absolute path to modprobe so we no longer rely on how PATH is set. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix intermittent 'zpool add' failuresNed Bass2010-10-221-15/+27
| | | | | | | | | | | | | | | Creating whole-disk vdevs can intermittently fail if a udev-managed symlink to the disk partition is already in place. To avoid this, we now remove any such symlink before partitioning the disk. This makes zpool_label_disk_wait() truly wait for the new link to show up instead of returning if it finds an old link still in place. Otherwise there is a window between when udev deletes and recreates the link during which access attempts will fail with ENOENT. Also, clean up a comment about waiting for udev to create symlinks. It no longer needs to describe the special cases for the link names, since that is now handled in a separate helper function. Signed-off-by: Brian Behlendorf <[email protected]>
* Add zconfig test for adding and removing vdevsNed Bass2010-10-221-0/+101
| | | | | | | | | | | | | This test performs a sanity check of the zpool add and remove commands. It tests adding and removing both a cache disk and a log disk to and from a zpool. Usage of both a shorthand device path and a full path is covered. The test uses a scsi_debug device as the disk to be added and removed. This is done so that zpool will see it as a whole disk and partition it, which it does not currently do for loopback devices. We want to verify that the manipulation done to whole-disk paths to hide the partition information does not break the add/remove interface. Signed-off-by: Brian Behlendorf <[email protected]>
* Remove solaris-specific code from make_leaf_vdev()Ned Bass2010-10-221-37/+0
| | | | | | | Portability between Solaris and Linux isn't really an issue for us anymore, and removing sections like this one helps simplify the code. Signed-off-by: Brian Behlendorf <[email protected]>
* Support shorthand names with zpool removeNed Bass2010-10-221-64/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | zpool status displays abbreviated vdev names without leading path components and, in the case of whole disks, without partition information. Also, the zpool subcommands 'create' and 'add' support using shorthand device names without qualified paths. Prior to this change, however, removing a device generally required specifying its name as it is stored in the vdev label. So while zpool status might list a cache disk with a name like A16, removing it would require a full path such as /dev/disk/zpool/A16-part1, which is non-intuitive. This change adds support for shorthand device names with the remove subcommand so one can simply type, for example, zpool remove tank A16 A consequence of this change is that including the partition information when removing a whole-disk vdev now results in an error. While this is arguably the correct behavior, it is a departure from how zpool previously worked in this project. This change removes the only reference to ctd_check_path(), so that function is also removed to avoid compiler warnings. Signed-off-by: Brian Behlendorf <[email protected]>
* Add helper functions for manipulating device namesNed Bass2010-10-223-23/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This change adds two helper functions for working with vdev names and paths. zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path, /dev/disk/by-uuid, /dev/disk/zpool. This was previously done only in the function is_shorthand_path(), but we need a general helper function to implement shorthand names for additional zpool subcommands like remove. is_shorthand_path() is accordingly updated to call the helper function. There is a minor change in the way zfs_resolve_shortname() tests if a file exists. is_shorthand_path() effectively used open() and stat64() to test for file existence, since its scope includes testing if a device is a whole disk and collecting file status information. zfs_resolve_shortname(), on the other hand, only uses access() to test for existence and leaves it to the caller to perform any additional file operations. This seemed like the most general and lightweight approach, and still preserves the semantics of is_shorthand_path(). zfs_append_partition() appends a partition suffix to a device path. This should be used to generate the name of a whole disk as it is stored in the vdev label. The user-visible names of whole disks do not contain the partition information, while the name in the vdev label does. The code was lifted from the function make_disks(), which now just calls the helper function. Again, having a helper function to do this supports general handling of shorthand names in the user interface. Signed-off-by: Brian Behlendorf <[email protected]>
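A simplified sketch of the two helpers described above. The `_sketch` function names are hypothetical, the search directories are the ones listed in the commit message, and the suffix rule ("-part1" for paths under /dev/disk/, a bare "1" otherwise) is an assumption inferred from the A16-part1 example elsewhere in this log rather than the real zfs_append_partition() logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static const char *const search_dirs[] = {
	"/dev", "/dev/disk/by-id", "/dev/disk/by-label", "/dev/disk/by-path",
	"/dev/disk/by-uuid", "/dev/disk/zpool", NULL
};

/*
 * Resolve a shorthand name to the first existing absolute path.
 * Only access() is used; the caller does any open()/stat() it needs.
 * Returns 0 with the result in path, or ENOENT if nothing matched.
 */
int
resolve_shortname_sketch(const char *name, char *path, size_t len)
{
	size_t i;

	assert(name != NULL && path != NULL);

	for (i = 0; search_dirs[i] != NULL; i++) {
		int n = snprintf(path, len, "%s/%s", search_dirs[i], name);

		if (n < 0 || (size_t)n >= len)
			continue;	/* candidate does not fit, try next */
		if (access(path, F_OK) == 0)
			return (0);
	}
	return (ENOENT);
}

/*
 * Append a partition suffix in place. Returns 0 on success or
 * ENAMETOOLONG if the suffix does not fit in the buffer.
 */
int
append_partition_sketch(char *path, size_t len)
{
	const char *suffix;

	assert(path != NULL);

	/* Assumed rule: udev-style links get "-part1", others get "1". */
	suffix = (strncmp(path, "/dev/disk/", 10) == 0) ? "-part1" : "1";
	if (strlen(path) + strlen(suffix) >= len)
		return (ENAMETOOLONG);
	strcat(path, suffix);
	return (0);
}
```

Keeping the resolver down to an existence check is what lets several subcommands share it: each caller decides for itself whether it also needs to open the device or collect its status.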
* Add zfault zpool configurations and testsBrian Behlendorf2010-10-1222-66/+2069
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eleven new zpool configurations were added to allow testing of various failure cases. The first 5 zpool configurations leverage the 'faulty' md device type which allows us to simulate IO errors at the block layer. The last 6 zpool configurations leverage the scsi_debug module provided by modern kernels. This device allows you to create virtual scsi devices which are backed by a ram disk. With this setup we can verify the full IO stack by injecting faults at the lowest layer. Both methods of fault injection are important to verifying the IO stack. The zfs code itself also provides a mechanism for error injection via the zinject command line tool. While we should also take advantage of this approach to validate the code it does not address any of the Linux integration issues which are the most concerning. For the moment we're trusting that the upstream Solaris guys are running zinject and would have caught internal zfs logic errors. Currently, there are 6 r/w test cases layered on top of the 'faulty' md devices. They include 3 write tests for soft/transient errors, hard/permanent errors, and all writes error to the device. There are 3 matching read tests for soft/transient errors, hard/permanent errors, and fixable read error with a write. Although for this last case zfs doesn't do anything special. The seventh test case verifies zfs detects and corrects checksum errors. In this case one of the drives is extensively damaged by dd'ing over large sections of it. We then ensure zfs logs the issue and correctly rebuilds the damage. The next test cases use the scsi_debug configuration to inject errors at the bottom of the scsi stack. This ensures we find any flaws in the scsi midlayer or our usage of it. Plus it stresses the device specific retry, timeout, and error handling outside of zfs's control.
The eighth test case is to verify that the system correctly handles an intermittent device timeout. Here the scsi_debug device drops 1 in N requests resulting in a retry at the block level. The ZFS code does specify the FAILFAST option but it turns out that for this case the Linux IO stack will still retry the command. The FAILFAST logic located in scsi_noretry_cmd() does not seem to apply to the simple timeout case. It appears to be more targeted to specific device or transport errors from the lower layers. The ninth test case handles a persistent failure in which the device is removed from the system by Linux. The test verifies that the failure is detected, the device is made unavailable, and then can be successfully re-added when brought back online. Additionally, it ensures that errors and events are logged to the correct places and that no data corruption has occurred due to the failure.
* Fix missing 'zpool events'Brian Behlendorf2010-10-123-7/+21
| | | | | | | | | | | | | | | | | | | | It turns out that 'zpool events' over 1024 bytes in size were being silently dropped. This was discovered while writing the zfault.sh tests to validate common failure modes. This could occur because the zfs interface for passing an arbitrary size nvlist_t over an ioctl() is to provide a buffer for the packed nvlist which is usually big enough. In this case 1024 bytes is the default. If the kernel determines the buffer is too small it returns ENOMEM and the minimum required size of the nvlist_t. This was working properly but in the case of 'zpool events' the event stream was advanced despite the error. Thus the retry with the bigger buffer would succeed but it would skip over the previous event. The fix is to pass this size to zfs_zevent_next() and determine before removing the event from the list if it will fit. This was preferable to checking after the event was returned because this avoids the need to rewind the stream.
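The fix described above amounts to checking the buffer size before consuming the event. A minimal user-space model follows; the types and `event_next_sketch()` are hypothetical and are not the real zfs_zevent_next() interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef struct event {
	const char *ev_data;	/* packed payload */
	size_t ev_size;		/* payload size in bytes */
} event_t;

typedef struct event_stream {
	const event_t *es_events;
	size_t es_count;
	size_t es_next;		/* index of the next undelivered event */
} event_stream_t;

/*
 * Copy the next event into buf. On entry *size is the buffer size, on
 * return it is the size of the event. Returns ENOENT when the stream
 * is exhausted and ENOMEM when the buffer is too small. In the ENOMEM
 * case the stream is NOT advanced, so a retry sees the same event.
 */
int
event_next_sketch(event_stream_t *es, char *buf, size_t *size)
{
	const event_t *ev;

	assert(es != NULL && buf != NULL && size != NULL);

	if (es->es_next >= es->es_count)
		return (ENOENT);

	ev = &es->es_events[es->es_next];
	if (ev->ev_size > *size) {
		*size = ev->ev_size;	/* report the required size */
		return (ENOMEM);
	}

	memcpy(buf, ev->ev_data, ev->ev_size);
	*size = ev->ev_size;
	es->es_next++;			/* consume only after a good copy */
	return (0);
}
```

A caller that gets ENOMEM can grow its buffer to the reported size and call again. Since the cursor only moves after a successful copy, nothing is skipped and no rewind logic is needed.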
* Initial zio delay timingBrian Behlendorf2010-10-125-3/+34
| | | | | | | | | | | | | | | | | | | | | | | While there is no right maximum timeout for a disk IO we can start laying the ground work to measure how long they do take in practice. This change simply measures the IO time and if it exceeds 30s an event is posted for 'zpool events'. This value was carefully selected because for sd devices it implies that at least one timeout (SD_TIMEOUT) has occurred. Unfortunately, even with FAILFAST set we may retry a request and not get an error. This behavior is strongly dependent on the device driver and how it is hooked in to the scsi error handling stack. However by setting the limit at 30s we can log the event even if no error was returned. Slightly longer term we can start recording these delays perhaps as a simple power-of-two histogram. This histogram can then be reported as part of the 'zpool status' command when given a command line option. None of this code changes the internal behavior of ZFS. Currently it is simply for reporting excessively long delays.
* Add FAILFAST supportBrian Behlendorf2010-10-1249-10/+241
| | | | | | | | | | | | | | | | | | | | ZFS works best when it is notified as soon as possible when a device failure occurs. This allows it to immediately start any recovery actions which may be needed. In theory Linux supports a flag which can be set on bio's called FAILFAST which provides this quick notification by disabling the retry logic in the lower scsi layers. That's the theory at least. In practice it turns out that while the flag exists you oddly have to set it with the BIO_RW_AHEAD flag. And even when it's set you may get retries if the low level driver decides that's the right behavior, or if you don't get the right error codes reported to the scsi midlayer. Unfortunately, without additional kernel patches there's not much which can be done to improve this. Basically, this just means that it may take 2-3 minutes before ZFS is notified properly that a device has failed. This can be improved and I suspect I'll be submitting patches upstream to handle this.
* Fix 'zpool events' formatting for awkBrian Behlendorf2010-10-121-4/+9
| | | | | | | | | | To make the 'zpool events' output simple to parse with awk the extra newline after embedded nvlists has been dropped. This allows the entire event to be parsed as a single whitespace separated record. The -H option has been added to operate in scripted mode. For the 'zpool events' command this means don't print the header. The usage of -H is consistent with scripted mode for other zpool commands.