summaryrefslogtreecommitdiffstats
path: root/scripts/zfault.sh
Commit message (Collapse)AuthorAgeFilesLines
* Disable 90-zfs.rules for test suitezfs-0.6.0-rc6Brian Behlendorf2011-10-111-0/+3
| | | | | | | | | | | | When running the zconfig.sh, zpios-sanity.sh, and zfault.sh from the installed packages the 90-zfs.rules can cause failures. These will occur because the test suite assumes it has full control over loading/unloading the module stack. If the stack gets asynchronously loaded by the udev rule the test suite will treat it as a failure. Resolve the issue by disabling the offending rule during the tests and enabling it on exit. Signed-off-by: Brian Behlendorf <[email protected]>
* Remove partition from vdev name in zfault.shNed Bass2010-11-291-1/+1
| | | | | | | | | As of the 0.5.2 tag, names of whole-disk vdevs must be specified to the command line tools without partition identifiers. This commit fixes a 'zpool online' command in zfault.sh that incorrectly includes he partition in the vdev name, causing test 9 to fail. Signed-off-by: Brian Behlendorf <[email protected]>
* Add '-ts' options to zconfig.sh/zfault.sh usageBrian Behlendorf2010-11-111-1/+1
| | | | | | | When adding this functionality originally the options to only run specific tests (-t), or conversely skip specific tests (-s) were omitted from the usage page. This commit adds the missing documentation.
* Remove spl/zfs modules as part of cleanupBrian Behlendorf2010-11-111-0/+1
| | | | | | | | | | The idea behind the '-c' flag is to cleanup everything from a previous test run which might cause the test script to fail. This should also include removing the previously loaded module. This makes it a little easier to run 'zconfig.sh -c', however remember this is a test script and it will take all of your other zpools offline for the purposes of the test. This notion has also been extended to the default 'make check' behavior.
* Add zfault zpool configurations and testsBrian Behlendorf2010-10-121-0/+951
Eleven new zpool configurations were added to allow testing of various failure cases. The first 5 zpool configurations leverage the 'faulty' md device type which allow us to simuluate IO errors at the block layer. The last 6 zpool configurations leverage the scsi_debug module provided by modern kernels. This device allows you to create virtual scsi devices which are backed by a ram disk. With this setup we can verify the full IO stack by injecting faults at the lowest layer. Both methods of fault injection are important to verifying the IO stack. The zfs code itself also provides a mechanism for error injection via the zinject command line tool. While we should also take advantage of this appraoch to validate the code it does not address any of the Linux integration issues which are the most concerning. For the moment we're trusting that the upstream Solaris guys are running zinject and would have caught internal zfs logic errors. Currently, there are 6 r/w test cases layered on top of the 'faulty' md devices. They include 3 writes tests for soft/transient errors, hard/permenant errors, and all writes error to the device. There are 3 matching read tests for soft/transient errors, hard/permenant errors, and fixable read error with a write. Although for this last case zfs doesn't do anything special. The seventh test case verifies zfs detects and corrects checksum errors. In this case one of the drives is extensively damaged and by dd'ing over large sections of it. We then ensure zfs logs the issue and correctly rebuilds the damage. The next test cases use the scsi_debug configuration to injects error at the bottom of the scsi stack. This ensures we find any flaws in the scsi midlayer or our usage of it. Plus it stresses the device specific retry, timeout, and error handling outside of zfs's control. The eighth test case is to verify that the system correctly handles an intermittent device timeout. Here the scsi_debug device drops 1 in N requests resulting in a retry either at the block level. The ZFS code does specify the FAILFAST option but it turns out that for this case the Linux IO stack with still retry the command. The FAILFAST logic located in scsi_noretry_cmd() does no seem to apply to the simply timeout case. It appears to be more targeted to specific device or transport errors from the lower layers. The ninth test case handles a persistent failure in which the device is removed from the system by Linux. The test verifies that the failure is detected, the device is made unavailable, and then can be successfully re-add when brought back online. Additionally, it ensures that errors and events are logged to the correct places and the no data corruption has occured due to the failure.