* Fix for re-reading /etc/mtab. (John M. Layman, 2014-04-04; 3 files, -5/+22)

  This is a continuation of fb5c53ea65b75c67c23f90ebbbb1134a5bb6c140: when
  /etc/mtab is updated on Linux it's done atomically with rename(2). A new
  mtab is written, the existing mtab is unlinked, and the new mtab is
  renamed to /etc/mtab. This means that we must close the old file and open
  the new file to get the updated contents. Using rewind(3) will just move
  the file pointer back to the start of the file; freopen(3) will close and
  reopen the file. In this commit, a few more rewind(3) calls were replaced
  with freopen(3) to allow updated mtab entries to be picked up immediately.

  Signed-off-by: John M. Layman <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>

  Closes #2215
  Issue #1611
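  For illustration, a minimal sketch (hypothetical code, not the actual
  change) of why freopen(3) sees the replacement file while rewind(3)
  does not:

    #include <stdio.h>

    int
    main(void)
    {
        FILE *fp = fopen("/etc/mtab", "r");

        if (fp == NULL)
            return (1);
        /* ... read mount entries ... */

        /*
         * rewind(fp) would only seek within the old, now-unlinked
         * inode.  freopen() closes the stale handle and reopens the
         * path, picking up the file installed by rename(2).
         */
        fp = freopen("/etc/mtab", "r", fp);
        if (fp == NULL)
            return (1);

        fclose(fp);
        return (0);
    }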
* Fix locking order in zfs_zget() (Richard Yao, 2014-04-04; 1 file, -1/+1)

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
* Fix deadlock in zfs_zget() (Richard Yao, 2014-04-04; 1 file, -1/+21)

  zfsonlinux/zfs#180 occurred because of a race between inode eviction and
  zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
  to the VFS to learn whether an inode is being evicted. If it was being
  evicted the operation was retried after dropping and reacquiring the
  relevant resources. Unfortunately, this introduced another deadlock:

    INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
          Tainted: P           O 3.13.6 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
    Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
     ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
     ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
     ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
    Call Trace:
     [<ffffffff813dd1d4>] schedule+0x24/0x70
     [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
     [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
     [<ffffffff811608c5>] ilookup+0x65/0xd0
     [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
     [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
     [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
     [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
     [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
     [<ffffffff811071ec>] do_writepages+0x1c/0x40
     [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
     [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
     [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
     [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
     [<ffffffff8106072c>] process_one_work+0x16c/0x490
     [<ffffffff810613f3>] worker_thread+0x113/0x390
     [<ffffffff81066edf>] kthread+0xdf/0x100

  This patch implements the original fix in a slightly different manner in
  order to avoid both deadlocks. Instead of relying on a call to ilookup(),
  which can block in __wait_on_freeing_inode(), the return value from
  igrab() is used. This gives us the information that ilookup() provided
  without the risk of a deadlock.

  Alternately, this race could be closed by registering an
  sops->drop_inode() callback. The callback would need to detect the active
  SA hold, thereby informing the VFS that this inode should not be evicted.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #180
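  For illustration, a hedged sketch (not the literal zfs_zget() diff) of
  the igrab() technique described above:

    #include <linux/fs.h>

    /*
     * ilookup() can sleep in __wait_on_freeing_inode(), which is what
     * deadlocked against writeback.  igrab() instead returns NULL when
     * the inode is being freed, answering "is this inode mid-eviction?"
     * without blocking.
     */
    static int
    inode_is_evicting(struct inode *ip)
    {
        if (igrab(ip) == NULL)
            return (1);   /* caller drops its locks and retries */

        iput(ip);         /* drop the temporary reference */
        return (0);
    }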
* Revert "Fixed a use-after-free bug in zfs_zget()."Brian Behlendorf2014-04-031-23/+1
| | | | | | This reverts commit 36df284366caa77cb40083d2e6bcce02274e2f05. Signed-off-by: Brian Behlendorf <[email protected]>
* Merge branch 'zed-initial' (Chris Dunlap, 2014-04-02; 55 files, -35/+4121)

  zed monitors ZFS events. When a zevent is posted, zed will run any
  scripts that have been enabled for the corresponding zevent class.
  Multiple scripts may be invoked for a given zevent. The zevent nvpairs
  are passed to the scripts as environment variables. Refer to the zed(8)
  manpage for details.

  Events are processed synchronously by the single thread, and there is no
  maximum timeout for script execution. Consequently, a misbehaving script
  can delay (or forever block) the processing of subsequent zevents. Plans
  are to address this in future commits.

  An EID (Event IDentifier) has been added to each event to uniquely
  identify it throughout the lifetime of the loaded ZFS kernel module; it
  is a monotonically increasing integer that resets to 1 each time the
  module is loaded.

  Initial scripts have been developed to log zevents to syslog,
  automatically rebuild to a hot spare device, and send email in response
  to checksum / data / io / resilver.finish / scrub.finish zevents. To
  enable email notifications, configure ZED_EMAIL in zed.rc (which is
  serving as a config file of sorts until a proper configuration file is
  implemented). To enable hot sparing, uncomment ZED_SPARE_ON_IO_ERRORS
  and ZED_SPARE_ON_CHECKSUM_ERRORS in zed.rc; note that the autoexpand
  property is not yet supported.

  zed is a work-in-progress.

  Signed-off-by: Chris Dunlap <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #2
| * Replace check for _POSIX_MEMLOCK w/ HAVE_MLOCKALL (Chris Dunlap, 2014-04-02; 2 files, -5/+7)

    zed supports a '-M' cmdline opt to lock all pages in memory via
    mlockall(). The _POSIX_MEMLOCK define is checked to determine whether
    this function is supported. The current test assumes mlockall() is
    supported if _POSIX_MEMLOCK is non-zero. However, this test is
    insufficient according to mlock(2) and sysconf(3). If _POSIX_MEMLOCK
    is -1, mlockall() is not supported; but if _POSIX_MEMLOCK is 0,
    availability must be checked at runtime.

    This commit adds an autoconf check for mlockall() to user.m4. The zed
    code block for mlockall() is now guarded with a test for
    HAVE_MLOCKALL. If defined, mlockall() will be called and its runtime
    availability checked via its return value.

    Signed-off-by: Chris Dunlap <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #2
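    A minimal sketch of the guarded call (assuming the HAVE_MLOCKALL
    macro produced by the new autoconf test; not the literal zed code):

      #include <stdio.h>
      #include <sys/mman.h>

      static void
      lock_all_pages(void)
      {
      #ifdef HAVE_MLOCKALL
          /* Supported at build time; may still fail at runtime. */
          if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
              perror("mlockall");
      #else
          fprintf(stderr, "mlockall() not supported on this platform\n");
      #endif
      }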
| * Add automatic hot spare functionality (Brian Behlendorf, 2014-04-02; 7 files, -9/+190)

    When a vdev starts getting I/O or checksum errors it is now possible
    to automatically rebuild to a hot spare device.

    To cleanly support this functionality in a shell script some
    additional information was added to all zevent ereports which include
    a vdev. This covers both io and checksum zevents but may be used by
    other scripts. In the Illumos FMA solution the same information is
    required, but it is retrieved through the libzfs library interface.
    Specifically, the following members were added:

      vdev_spare_paths  - List of vdev paths for all hot spares.
      vdev_spare_guids  - List of vdev guids for all hot spares.
      vdev_read_errors  - Read errors for the problematic vdev.
      vdev_write_errors - Write errors for the problematic vdev.
      vdev_cksum_errors - Checksum errors for the problematic vdev.

    By default the required hot spare scripts are installed but this
    functionality is disabled. To enable hot sparing, uncomment the
    ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS lines in the
    /etc/zfs/zed.d/zed.rc configuration file.

    These scripts do not add support for the autoexpand property. At a
    minimum this requires adding a new udev rule to detect when a new
    device is added to the system. It also requires that the autoexpand
    policy be ported from Illumos, see:

      https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c

    Support for detecting the correct name of a vdev when it's not a
    whole disk was added by Turbo Fredriksson.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Chris Dunlap <[email protected]>
    Signed-off-by: Turbo Fredriksson <[email protected]>
    Issue #2
| * Add missing DATA_TYPE_STRING_ARRAY output (Brian Behlendorf, 2014-04-02; 1 file, -1/+12)

    This functionality has always been missing, but until now there were
    no zevents which included an array of strings, so it wasn't missed.
    However, that's now changed, so to ensure this information is output
    correctly by 'zpool events -v' the DATA_TYPE_STRING_ARRAY case has
    been implemented.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Chris Dunlap <[email protected]>
    Issue #2
| * Make command line guid parsing more tolerant (Brian Behlendorf, 2014-04-02; 2 files, -2/+2)

    Several of the zfs utilities allow you to pass a vdev's guid rather
    than the device name. However, the utilities are not consistent in
    how they parse that guid. For example, 'zinject' expects the guid to
    be passed as a hex value while 'zpool replace' wants it as a decimal.
    The user is forced to just know what format to use.

    This patch improves things by making the parsing more tolerant. When
    strtol(3) is called using 0 for the base, rather than say 10 or 16,
    it will accept hex, decimal, or octal input based on the prefix. From
    the man page:

      If base is zero or 16, the string may then include a "0x" prefix,
      and the number will be read in base 16; otherwise, a zero base is
      taken as 10 (decimal) unless the next character is '0', in which
      case it is taken as 8 (octal).

    NOTE: There may be additional conversions not caught by this patch.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Chris Dunlap <[email protected]>
    Issue #2
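    A hedged sketch (hypothetical helper, not the utilities' actual
    code) of base-0 parsing, which accepts "0x..." hex, "0..." octal,
    and plain decimal guids:

      #include <errno.h>
      #include <stdint.h>
      #include <stdlib.h>

      static int
      parse_guid(const char *arg, uint64_t *guidp)
      {
          char *end = NULL;

          errno = 0;
          *guidp = strtoull(arg, &end, 0);  /* base 0: inferred from prefix */
          if (errno != 0 || end == arg || *end != '\0')
              return (-1);                  /* not a valid guid */
          return (0);
      }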
| * Add systemd unit file for zed (Chris Dunlap, 2014-04-02; 3 files, -2/+20)

    This commit adds a systemd unit file for zed.service and integrates
    it into the zfs.target from commit 881f45c.

    Signed-off-by: Chris Dunlap <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #2108
    Issue #2
| * Initial implementation of zed (ZFS Event Daemon) (Chris Dunlap, 2014-04-02; 34 files, -1/+3715)

    zed monitors ZFS events. When a zevent is posted, zed will run any
    scripts that have been enabled for the corresponding zevent class.
    Multiple scripts may be invoked for a given zevent. The zevent
    nvpairs are passed to the scripts as environment variables.

    Events are processed synchronously by the single thread, and there is
    no maximum timeout for script execution. Consequently, a misbehaving
    script can delay (or forever block) the processing of subsequent
    zevents. Plans are to address this in future commits.

    Initial scripts have been developed to log events to syslog and send
    email in response to checksum/data/io errors and
    resilver.finish/scrub.finish events. By default, email will only be
    sent if the ZED_EMAIL variable is configured in zed.rc (which is
    serving as a config file of sorts until a proper configuration file
    is implemented).

    Signed-off-by: Chris Dunlap <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #2
| * Replace zpool_events_next() "block" parm w/ "flags" (Chris Dunlap, 2014-03-31; 4 files, -6/+10)

    zpool_events_next() can be called in blocking mode by specifying a
    non-zero value for the "block" parameter. However, the design of the
    ZFS Event Daemon (zed) requires additional functionality from
    zpool_events_next(). Instead of adding additional arguments to the
    function, it makes more sense to use flags that can be bitwise-or'd
    together.

    This commit replaces the zpool_events_next() int "block" parameter
    with an unsigned bitwise "flags" parameter. It also defines
    ZEVENT_NONE to specify the default behavior. Since non-blocking mode
    can be specified with the existing ZEVENT_NONBLOCK flag, the default
    behavior becomes blocking mode. This, in effect, inverts the previous
    use of the "block" parameter. Existing callers of zpool_events_next()
    have been modified to check for the ZEVENT_NONBLOCK flag.

    Signed-off-by: Chris Dunlap <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #2
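    A hedged sketch of a caller after this change (hdl, zevent_fd, and
    want_poll are assumed to be in scope; the exact prototype is in
    libzfs.h, and the argument order below is inferred from the commit
    text):

      nvlist_t *nvl;
      int dropped;
      unsigned flags = want_poll ? ZEVENT_NONBLOCK : ZEVENT_NONE;

      /* ZEVENT_NONE (the default) now means "block for the next zevent". */
      while (zpool_events_next(hdl, &nvl, &dropped, flags, zevent_fd) == 0 &&
          nvl != NULL) {
              /* ... process the zevent nvpairs ... */
              nvlist_free(nvl);
      }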
| * Add defs for makefile installation dir vars (Chris Dunlap, 2014-03-31; 3 files, -0/+11)

    Add macro definitions to AM_CPPFLAGS to propagate makefile
    installation directory variables for libexecdir, runstatedir,
    sbindir, and sysconfdir.

    https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Installation-Directory-Variables.html

      A corollary is that you should not use these variables except in
      makefiles. For instance, instead of trying to evaluate datadir in
      configure and hard-coding it in makefiles using e.g.,
      'AC_DEFINE_UNQUOTED([DATADIR], ["$datadir"], [Data directory.])',
      you should add -DDATADIR='$(datadir)' to your makefile's definition
      of CPPFLAGS (AM_CPPFLAGS if you are also using Automake).

    The runstatedir directory is for "installing data files which the
    programs modify while they run, that pertain to one specific machine,
    and which need not persist longer than the execution of the program".

    https://www.gnu.org/prep/standards/html_node/Directory-Variables.html

    It will be defined by autoconf 2.70 or later, and default to
    "$(localstatedir)/run".

    http://git.savannah.gnu.org/gitweb/?p=autoconf.git;a=commit;h=a197431414088a417b407b9b20583b2e8f7363bd

    Signed-off-by: Chris Dunlap <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #2
| * Clarify zpool_events_next() comment (Brian Behlendorf, 2014-03-31; 3 files, -16/+16)

    Due to the very poorly chosen argument name 'cleanup_fd' it was
    completely unclear that this file descriptor is used to track the
    current cursor location. When the file descriptor is created by
    opening ZFS_DEV a private cursor is created in the kernel for the
    returned file descriptor. Subsequent calls to zpool_events_next() and
    zpool_events_seek() then require the file descriptor as an argument
    to reposition the cursor. When the file descriptor is closed the
    kernel state tracking the cursor is destroyed.

    This patch contains no functional change; it just changes a few
    variable names and clarifies the documentation.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Chris Dunlap <[email protected]>
    Issue #2
| * Add zpool_events_seek() functionality (Brian Behlendorf, 2014-03-31; 7 files, -1/+128)

    The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
    to seek around the zevent file descriptor by EID. When a specific EID
    is passed and it exists, the cursor will be positioned there. If the
    EID is no longer cached by the kernel, ENOENT is returned. The caller
    may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek to those
    respective locations.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Chris Dunlap <[email protected]>
    Issue #2
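    A hedged sketch of resuming an event stream by EID with the new call
    (prototype assumed from the commit text; hdl, zevent_fd, and
    last_eid are assumed to be in scope):

      /*
       * Try to reposition the cursor at the last EID this daemon
       * processed; if the kernel no longer caches that EID (ENOENT),
       * fall back to replaying from the start of the event stream.
       */
      if (zpool_events_seek(hdl, last_eid, zevent_fd) != 0)
          (void) zpool_events_seek(hdl, ZEVENT_SEEK_START, zevent_fd);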
| * Add a unique "eid" value to all zevents (Brian Behlendorf, 2014-03-31; 3 files, -0/+18)

    Tagging each zevent with a unique monotonically increasing EID (Event
    IDentifier) provides the required infrastructure for a user space
    daemon to reliably process zevents. By writing the EID to persistent
    storage the daemon can safely resume where it left off in the event
    stream when it's restarted.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Chris Dunlap <[email protected]>
    Issue #2
* Remount datasets for "zfs inherit". (Gunnar Beutner, 2014-03-24; 1 file, -0/+9)

  Changing properties with "zfs inherit" should cause the datasets to be
  remounted. This ensures that the modified property values will be
  propagated into the filesystem namespace where they can be enforced.
  This change is modeled after an identical fix made to zfs_prop_set().

  Signed-off-by: Gunnar Beutner <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2201
* Linux 3.13 compat: Handle __must_check bdi_setup_and_register (Richard Yao, 2014-03-24; 1 file, -2/+3)

  torvalds/linux@8077c0d983ab276ec5f2700df56a64d671781905 added a
  __must_check annotation to bdi_setup_and_register(), which caused our
  autotools check to break. zfsonlinux/zfs@729210564a5325e190fc4fba22bf17bacf957ace
  was intended to correct that, but it depended on -Wno-unused-result,
  which is unrecognized in older GCC versions. That commit has been
  reverted in favor of a solution that does not require
  -Wno-unused-result.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2102
  Closes #2135
* Revert "Properly ignore bdi_setup_and_register return value"Richard Yao2014-03-241-4/+1
| | | | | | | | | | Older GCC versions do not obey -Wno-unused-result. This reverts commit 729210564a5325e190fc4fba22bf17bacf957ace in favor of a solution that does not require -Wno-unused-result. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1906
* Illumos #4089 NULL pointer dereference in arc_read() (Boris Protopopov, 2014-03-24; 1 file, -9/+11)

  4089 NULL pointer dereference in arc_read()
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Saso Kiselkov <[email protected]>
  Reviewed by: Garrett D'Amore <[email protected]>
  Approved by: Dan McDonald <[email protected]>

  References:
    https://www.illumos.org/issues/4089
    illumos/illumos-gate@57815f6b95a743697e148327725b7f568e75e6ea

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #2171
  Issue #2165
  Closes #2198
* Illumos #3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc (Yuri Pankov, 2014-03-24; 1 file, -7/+5)

  3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc
  Reviewed by: Richard Lowe <[email protected]>
  Reviewed by: Eric Schrock <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Approved by: Richard Lowe <[email protected]>

  References:
    https://www.illumos.org/issues/3120
    illumos/illumos-gate@f4c46b1eda9212fd32ba197043d52239ef5c0a7f

  Ported-by: Brian Behlendorf <[email protected]>
  Closes #2152
* config: compile test rather than run test (Chunwei Chen, 2014-03-20; 3 files, -3/+3)

  When testing compiler flags, we only need to do a compile test.
  Otherwise, configure will fail with "configure: error: cannot run test
  program while cross compiling" when cross compiling.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2191
* Implement -t option to zpool import for temporary pool names (Richard Yao, 2014-03-20; 4 files, -5/+45)

  Originally, users had to handle spa namespace collisions by either
  exporting the already imported pool or by specifying a new name for the
  pool with a conflicting name. In the case of root pools from virtual
  guests, neither approach to collision resolution is reasonable. This is
  addressed by extending the new name syntax with a -t option to specify
  that the new name is temporary. When specified, this sets an internal
  flag that is passed into the kernel to tell it that all label updates
  should refer to the name used in the original label. Consequently, the
  original pool name will be retained on export.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2189
* Assert alignment in umem_alloc_aligned (Richard Yao, 2014-03-20; 1 file, -0/+2)

  Valgrind suggests that the address we are returning is not properly
  aligned, so let's add an assertion.

    ==87740== Address 0x1012a22a is 554 bytes inside a block of size 4,096 alloc'd
    ==87740==    at 0x4C2BBA0: memalign (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==87740==    by 0x4C2BCC7: posix_memalign (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==87740==    by 0x52FA845: zio_buf_alloc (umem.h:101)
    ==87740==    by 0x52F6226: zil_alloc_lwb (zil.c:463)
    ==87740==    by 0x52F8559: zil_commit (zil.c:566)
    ==87740==    by 0x40611D: ztest_freeze (ztest.c:5909)
    ==87740==    by 0x4066A7: ztest_init (ztest.c:6048)
    ==87740==    by 0x407AF4: main (ztest.c:6226)

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2174
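  A minimal sketch (not the actual umem change) of the style of
  assertion added, checking the returned pointer against the requested
  boundary:

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    static void *
    alloc_aligned(size_t align, size_t size)
    {
        void *ptr = NULL;

        if (posix_memalign(&ptr, align, size) != 0)
            return (NULL);

        /* align must be a power of two for this bitmask test. */
        assert(((uintptr_t)ptr & (align - 1)) == 0);
        return (ptr);
    }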
* sighandler() should take 2 arguments (Isaac Huang, 2014-03-20; 1 file, -6/+2)

  Stopping arcstat.py with ^C always ends up with the error:

    TypeError: sighandler() takes no arguments (2 given)

  Since no special signal handling was done in sighandler(), it's simpler
  to just set the SIGINT handler to SIG_DFL, which terminates the script.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Isaac Huang <[email protected]>
  Closes #2179
* Revert "sighandler() should take 2 arguments"Brian Behlendorf2014-03-201-1/+1
| | | | | | | | This reverts commit 0bb89b6c594259829556f6dea5a89e722f214fd3 in favor of a cleaner implementation. Signed-off-by: Brian Behlendorf <[email protected]> Issue #2182
* Fix regression introduced in port of Illumos #3744 (Andrey Vesnovaty, 2014-03-20; 1 file, -2/+4)

  Remove the redundant call to zfs_unmount_snap() which was being done
  after the char array was freed. This fixes an upstream regression that
  was introduced in commit zfsonlinux/zfs@d09f25dc66774959499a89bf3680d09c6e541ce8,
  which ported the Illumos 3744 changes.

  Signed-off-by: Andrey Vesnovaty <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #2156
* Remove unused option -r from 'zpool import' (Richard Yao, 2014-03-12; 1 file, -1/+1)

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2178
* Switch ztest mmap(2) ASSERTs to VERIFYs (Richard Yao, 2014-03-12; 1 file, -3/+3)

  This is just a small bit of cleanup to ensure ztest fails early on
  systems where mmap(2) is not functioning. For the automated testing
  which is the primary consumer of ztest, there is no functional change
  because debugging is always enabled.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2177
* Free props in ztest_init() (Richard Yao, 2014-03-12; 1 file, -0/+1)

  Valgrind complained about this and it's absolutely right: the props
  nvlist was not being freed in ztest_init().

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #2174
* sighandler() should take 2 arguments (Isaac Huang, 2014-03-12; 1 file, -1/+1)

  Stopping arcstat.py with ^C always ends up with the error:

    TypeError: sighandler() takes no arguments (2 given)

  This patch corrects the error by updating the signal handler to take
  the correct number of arguments.

  Signed-off-by: Isaac Huang <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2182
* Make arcstat.py default to one header per screen (Isaac Huang, 2014-03-12; 1 file, -0/+13)

  Today arcstat.py prints one header every hdr_intr (20 by default)
  lines. It would be more consistent with other utilities like vmstat if
  hdr_intr defaulted to the terminal window size, i.e. one header per
  screenful of output.

  Signed-off-by: Isaac Huang <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2183
* Fix NAME section of manpages zhack and fsck.zfs. (Turbo Fredriksson, 2014-03-10; 2 files, -2/+3)

  In Debian GNU/Linux a program called 'lintian' is used to make sure
  that packages conform to the Debian GNU/Linux packaging guidelines.
  This fixes the problems reported as:

    W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man1/zhack.1.gz
    W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man8/fsck.zfs.8.gz

  Not something that ZoL needs to adhere to, but every other man page
  formats its NAME section this way, so why not these two as well?

  Signed-off-by: Turbo Fredriksson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2161
* Illumos #4088 use after free in arc_release() (Boris Protopopov, 2014-03-10; 1 file, -1/+1)

  4088 use after free in arc_release()
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Garrett D'Amore <[email protected]>
  Reviewed by: Saso Kiselkov <[email protected]>
  Approved by: Dan McDonald <[email protected]>

  References:
    https://www.illumos.org/issues/4088
    illumos/illumos-gate@ccc22e130479b5bd7c0002267fee1e0602d3f772

  From the illumos issue: a race-induced use after free occurs in
  arc_release() where the ARC header is used outside the critical
  section protected by the hash_lock.

  Ported by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #2162
* Use KM_PUSHPAGE in spa_add() for spa_label_features. (Tim Chase, 2014-03-10; 1 file, -1/+1)

  The spa_label_features nvlist is used in the sync context during pool
  version upgrade.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2168
* Export symbols dsl_sync_task{_nowait} (Brian Behlendorf, 2014-03-07; 1 file, -0/+2)

  These are needed by consumers (i.e. Lustre) who wish to perform a
  callback in the syncing context. Both a blocking and non-blocking
  version are available to callers.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Improve reporting of tx assignment wait times (Ned Bass, 2014-03-04; 1 file, -5/+5)

  Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
  dmu_tx_wait() themselves before retrying if the assignment fails. The
  wait times for such callers are not accounted for in the dmu_tx_assign
  kstat histogram, because the histogram only records time spent in
  dmu_tx_assign(). This change moves the histogram update to
  dmu_tx_wait() to properly account for all time spent there.

  One downside of this approach is that it is possible to call
  dmu_tx_wait() multiple times before successfully assigning a
  transaction, in which case the cumulative wait time would not be
  recorded. However, this case should not often arise in practice,
  because most callers currently use one of these forms:

    dmu_tx_assign(tx, TXG_WAIT);
    dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);

  The first form should make just one call to dmu_tx_delay() inside of
  dmu_tx_assign(). The second form retries with TXG_WAITED if the first
  assignment fails and incurs a delay, in which case no further waiting
  is performed. Therefore transaction delays normally occur in one call
  to dmu_tx_wait(), so the histogram should be fairly accurate.

  Another possible downside of this approach is that the histogram will
  no longer record overhead outside of dmu_tx_wait() such as in
  dmu_tx_try_assign(). While I'm not aware of any reason for concern on
  this point, it is conceivable that lock contention, long list
  traversal, etc. could cause assignment delays that would not be
  reflected in the histogram. Therefore the histogram should strictly be
  used for visibility into the normal delay mechanisms and not as a
  profiling tool for code performance.

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1915
* replace nreserved with ndirty in txgs kstat (Ned Bass, 2014-03-04; 3 files, -7/+9)

  The nreserved column in the txgs kstat file always contains 0 following
  the write throttle restructuring of commit
  e8b96c6007bf97cdf34869c1ffbd0ce753873a3d.

  Prior to that commit, the nreserved column showed the number of bytes
  temporarily reserved in the pool by a transaction group at sync time.
  The new write throttle did away with temporary reservations and uses
  the amount of dirty data instead. To approximate the old output of the
  txgs kstat, the number of dirty bytes per-txg was passed in as the
  nreserved value to spa_txg_history_set_io(). This approach did not work
  as intended, because the per-txg dirty value is decremented as data is
  written out to disk, so it is zero by the time we call
  spa_txg_history_set_io(). To fix this, save the number of dirty bytes
  before calling spa_sync(), and pass this value in to
  spa_txg_history_set_io().

  Also, since the name "nreserved" is now a misnomer, the column heading
  is now labeled "ndirty".

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #1696
* dmu_tx kstat cleanup (Ned Bass, 2014-03-04; 3 files, -4/+2)

  A few counters in the dmu_tx kstats are obsolete or no longer bumped
  properly.

  - The sync task restructuring commit
    13fe019870c8779bf2f5b3ff731b512cf89133ef removed the code that bumped
    dmu_tx_quota. The counter is now bumped in two cases, instead of just
    the one case as before (after the result of the
    dsl_dataset_check_quota call). The second case is where we check the
    requested reservation against the actual pool size, as this is an
    implicit quota of sorts.

  - The write throttle restructuring commit
    e8b96c6007bf97cdf34869c1ffbd0ce753873a3d makes dmu_tx_how and
    dmu_tx_inflight obsolete, so they are removed.

  Signed-off-by: Kohsuke Kawaguchi <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1914
* Invalidate Linux buffer cache on vdevs upon each flush (Richard Yao, 2014-03-04; 1 file, -0/+1)

  Userland tools such as blkid, grub2-probe and zdb will go through the
  buffer cache. However, ZFS uses submit_bio() to bypass the buffer cache
  when performing IO operations on vdevs for efficiency purposes. This
  permits the on-disk state and buffer cache to fall out of
  synchronization. That causes seemingly random failures when tools
  reading stale metadata from the buffer cache try to access references
  to data that is no longer there.

  A particularly bad failure this causes involves grub2-probe, which is
  used by grub2-mkconfig. Ordinarily, a rootfs might be called
  rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
  grub2-mkconfig will generate a configuration file containing
  /ROOT/gentoo, which omits the pool name and causes a boot failure.

  This is avoidable by calling invalidate_bdev() on each flush, which is
  a simple way to ensure that all non-dirty pages are wiped. Since
  userland tools rarely access vdevs directly, this should be a fancy
  noop >99.999% of the time and have little impact on IO. We could have
  tried a finer grained approach for the rare instances in which the
  vdevs are accessed frequently by userland. However, that would require
  consideration of corner cases and it is not worth the effort.

  Memory-wise, it would have been better to use a Linux kernel API hook
  to disable the buffer cache on such devices, but it provides us no way
  of doing that, so we opt for this approach instead. We should revisit
  that idea in the future when higher priority issues have been tackled.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2150
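  A hedged sketch (not the literal vdev change) of the idea: after a
  flush completes, drop the block device's clean page-cache pages so
  userland tools reading through the buffer cache see the on-disk state:

    #include <linux/blkdev.h>
    #include <linux/fs.h>

    static void
    vdev_flush_done(struct block_device *bdev)
    {
        /*
         * Discards clean buffer-cache pages for bdev; since ZFS
         * writes via submit_bio(), a vdev should have no dirty
         * pages here, making this nearly free in the common case.
         */
        invalidate_bdev(bdev);
    }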
* Inform OpenRC that ZFS uses mtab (Richard Yao, 2014-03-04; 1 file, -0/+1)

  p_l in #zfsonlinux reported that he had issues mounting filesystems
  that were resolved by adding rc_need="mtab" to /etc/init.d/zfs. Closer
  inspection revealed that we do have a race, but it is not clear how
  this race caused mounting to fail. What is clear is that this race
  should be fixed, so let's add the proper `use mtab` line to handle it.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2148
* Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel (Alexander Stetsenko, 2014-03-04; 1 file, -1/+2)

  4574 get_clones_stat does not call zap_count in non-debug kernel
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Marcel Telka <[email protected]>
  Approved by: Gordon Ross <[email protected]>

  References:
    https://www.illumos.org/issues/4574
    illumos/illumos-gate@03d1795fa6f720eafbee821ad37f4343c391cfe4

  Ported-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2147
* Fix zap_lookup() in feature_is_supported(). (Tim Chase, 2014-03-04; 1 file, -1/+1)

  The length (number of integers) argument passed to zap_lookup was
  wrong; likely as a result of performing stack-reduction on the
  function.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2141
* Include l2asize in arcstat (cburroughs, 2014-03-04; 1 file, -0/+2)

  For consistency with upstream, pull in the l2asize update after
  reworking it from Perl to Python.

  References:
    https://github.com/mharsch/arcstat/pull/11
    https://github.com/mharsch/arcstat/pull/12

  Signed-off-by: cburroughs <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2122
* Remove recursion from dsl_dir_willuse_space() (Andrew Barnes, 2014-03-04; 1 file, -11/+17)

  Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
  Issues with stack overflow were observed in zfs recv of zvols; the
  likelihood of an overflow is proportional to the depth of the dataset,
  as dsl_dir_willuse_space() recurses to parent datasets.

  Signed-off-by: Andrew Barnes <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2069
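  A simplified sketch (hypothetical types, not the actual dsl code) of
  the transformation, replacing parent-chain recursion with a loop:

    struct dd {
        struct dd *dd_parent;
        long       dd_space;
    };

    /* Before: one stack frame per level of dataset nesting. */
    static void
    willuse_recursive(struct dd *dd, long space)
    {
        dd->dd_space += space;
        if (dd->dd_parent != NULL)
            willuse_recursive(dd->dd_parent, space);
    }

    /* After: constant stack usage regardless of dataset depth. */
    static void
    willuse_iterative(struct dd *dd, long space)
    {
        for (; dd != NULL; dd = dd->dd_parent)
            dd->dd_space += space;
    }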
* Merge branch 'arc-changes' (Brian Behlendorf, 2014-02-21; 4 files, -83/+165)

  This stack of patches has been empirically shown to drastically improve
  the hit rate of the ARC for certain workloads. As a result, fewer reads
  to disk are required, which is generally a good thing and can
  drastically improve performance if the workload is disk limited.

  For the impatient, I'll summarize the results of the tests performed:

    * Test 1 - Creating many empty directories. This test saw 99.9% fewer
      reads and 12.8% more inodes created when running *with* these
      changes.

    * Test 2 - Creating many empty files. This test saw 4% fewer reads
      and 0% more inodes created when running *with* these changes.

    * Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer reads
      and 4.9% more inodes created when running *with* these changes.

    * Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer
      reads and 0% more inodes created (but took 6.9% fewer seconds to
      complete) when running *with* these changes.

    * Test 5 - Rsync'ing a dataset with many empty directories. This test
      saw 36.2% fewer reads and 66.2% more inodes created when running
      *with* these changes.

    * Test 6 - Rsync'ing a dataset with many empty files. This test saw
      30.9% fewer reads and 0% more inodes created (but took 24.3% fewer
      seconds to complete) when running *with* these changes.

    * Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw
      30.8% fewer reads and 173.3% more inodes created when running
      *with* these changes.

  For the patient, what follows is a more detailed description of the
  tests performed and the results gathered. All the tests were run using
  identical machines, each with a pool consisting of 5 mirror pairs with
  2TB 7200 RPM disks. Each test was run twice, once *without* this set of
  patches and again *with* this set of patches to highlight the
  performance changes introduced.

  The first four workloads tested were:

  ** NOTE: None of these tests were run to completion. They ran for a set
     amount of time and then were terminated or hit ENOSPC.

  1. Creating many empty directories:
       * fdtree -d 10 -l 8 -s 0 -f 0 -C
         -> 111,111,111 Directories
         ->           0 Files
         ->           0 KiB File Data

  2. Creating many empty files:
       * fdtree -d 10 -l 5 -s 0 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         ->             0 KiB File Data

  3. Creating many 4 KiB files:
       * fdtree -d 10 -l 5 -s 1 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         -> 4,444,440,000 KiB File Data

  4. Creating many 4096 KiB files:
       * fdtree -d 10 -l 5 -s 1024 -f 10000 -C
         ->           111,111 Directories
         ->     1,111,110,000 Files
         -> 4,551,106,560,000 KiB File Data

  Results for these first four tests are below:

                  | Time (s) | inodes   | reads  | writes    |
    --------------+----------+----------+--------+-----------+
    Test 1 Before | 65069    | 37845363 | 831975 | 3214646   |
    Test 1 After  | 65069    | 42703608 | 778    | 3327674   |
    --------------+----------+----------+--------+-----------+
    Test 2 Before | 65073    | 54257583 | 208647 | 2413056   |
    Test 2 After  | 65069    | 54255782 | 200038 | 2533759   |
    --------------+----------+----------+--------+-----------+
    Test 3 Before | 65068    | 49857744 | 487130 | 5533348   |
    Test 3 After  | 65071    | 52294311 | 16078  | 5648354   |
    --------------+----------+----------+--------+-----------+
    Test 4 Before | 34854    | 2448329  | 385870 | 162116572 |
    Test 4 After  | 32419    | 2448329  | 2339   | 162175706 |
    --------------+----------+----------+--------+-----------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

  As you can see from the table above, running with this patch stack
  *significantly* reduced the number of reads performed in 3 out of the 4
  tests (due to an improved ARC hit rate).

  In addition to the tests described above, which specifically targeted
  creates only, three other workloads were tested. These additional tests
  were targeting rsync performance against the datasets created in the
  previous tests. A brief description of the workloads and results for
  these tests are below:

  ** NOTE: Aside from (6), these tests didn't run to completion. They ran
     for a set amount of time and then were terminated.

  5. Rsync the dataset created in Test 1 to a new dataset:
       * rsync -a /tank/test-1 /tank/test-5

  6. Rsync the dataset created in Test 2 to a new dataset:
       * rsync -a /tank/test-2 /tank/test-6

  7. Rsync the dataset created in Test 3 to a new dataset:
       * rsync -a /tank/test-3 /tank/test-7

  Results for Test 5, 6, and 7 are below:

                  | Time (s) | inodes   | reads    | writes  |
    --------------+----------+----------+----------+---------+
    Test 5 Before | 93041    | 17921014 | 47632823 | 4094848 |
    Test 5 After  | 93029    | 29785847 | 30376206 | 4484459 |
    --------------+----------+----------+----------+---------+
    Test 6 Before | 15290    | 54264474 | 6018331  | 733087  |
    Test 6 After  | 11573    | 54260826 | 4155661  | 617285  |
    --------------+----------+----------+----------+---------+
    Test 7 Before | 93057    | 10093749 | 41561635 | 3659098 |
    Test 7 After  | 93045    | 27587043 | 28773151 | 5612234 |
    --------------+----------+----------+----------+---------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

  Signed-off-by: Prakash Surya <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2110
| * Set "arc_meta_limit" to 3/4 arc_c_max by defaultPrakash Surya2014-02-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately, this change is an cheap attempt to work around a pathological workload for the ARC. A "real" solution still needs to be fleshed out, so this patch is intended to alleviate the situation in the meantime. Let me try and describe the problem.. Data buffers residing in the dbuf hash table (dbuf cache) will keep a hold on their respective dnode, this dnode will in turn keep a hold on its backing dbuf (the physical block of the dnode object backing it). Since the dnode has a hold on its backing dbuf, the arc buffer for this dbuf is unevictable. What this essentially boils down to, "data" buffers have the potential to pin "metadata" in the arc (as a result of these dnode object buffers being unevictable). This scenario becomes a real problem when the workload consists of many small files (e.g. creating millions of 4K files). With this workload, the arc's "arc_meta_used" space get filled up with buffers for any resident directories as well as buffers for the objset's dnode object. Once the "arc_meta_limit" is reached, the directory buffers will be evicted and only the unevictable dnode object buffers will reside. If the workload is simply creating new small files, these dnode object buffers will never even be needed again, whereas the directory buffers will be used constantly until the creates move to a new directory. If "arc_c" and "arc_meta_limit" are sized appropriately, this situation wont occur. This is because as the data buffers accumulate, "arc_size" will eventually approach "arc_c" (before "arc_meta_used" reaches "arc_meta_limit"); at that point the data buffers will be evicted, which releases the hold on the dnode, which releases the hold on the dnode object's dbuf, which allows that buffer to be evicted from the arc in preference to more "useful" metadata. So, to side step the issue, we simply need to ensure "arc_size" reaches "arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to pick a proper limit, we have to do some math. To make things a little easier to follow, it is assumed that there will only be a single data buffer per file (which is probably always the case for "small" files anyways). Based on the current internals of the arc, if N files residing in the dbuf cache all pin a single dnode buffer (i.e. their dnodes all share the same physical dnode object block), then the following amount of "arc_meta_used" space will be consumed: - 16K for the dnode object's block - [ 16384 bytes] - N * sizeof(dnode_t) -------------- [ N * 928 bytes] - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes] - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes] - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes] To simplify, these N files will pin the following amount of "arc_meta_used" space as unevictable: Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280) Pinned "arc_meta_used" bytes = 17000 + N * 1544 This pinned space is regardless of the size of the files, and is only dependent on the number of pinned dnodes sharing a physical block (i.e. N). 
For example, 32 512b files sharing a single dnode object block would consume the same "arc_meta_used" space as 32 4K files sharing a single dnode object block. Now, given a files size of S, we can determine the total amount of space that will be consumed in the arc: Total = 17000 + N * 1544 + S * N ^^^^^^^^^^^^^^^^ ^^^^^ metadata data So, given these formulas, we can generate a table which states the ratio of pinned metadata to total arc (meta + data) using different values of N (number of pinned dnodes per pinned physical dnode block) and S (size of the file). File Sizes (S) | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | ---+----------+----------+----------+----------+----------+----------+ 1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 | 2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 | N 4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 | 8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 | 16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 | 32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 | As you can see, if we wanted to support the absolute worst case of 1 dnode per physical dnode block and 512b files, we would have to set the "arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At that point, it essentially defeats the purpose of having an "arc_meta_limit" at all. This patch changes the default value of "arc_meta_limit" to be 75% of "arc_c_max", which should be good enough for "most" workloads (I think). Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
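    A small check (illustrative only) that reproduces one entry of the
    table above from the commit's formulas, for N = 1 and S = 512:

      #include <stdio.h>

      int
      main(void)
      {
          int N = 1, S = 512;
          double meta = 17000.0 + N * 1544.0;  /* 16384 + N*928 + (N+1)*616 */
          double total = meta + (double)S * N;

          printf("%f\n", meta / total);        /* prints 0.973132 */
          return (0);
      }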
| * Split "data_size" into "meta" and "data"Prakash Surya2014-02-212-14/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the "data_size" field in the arcstats kstat contained the amount of cached "metadata" and "data" in the ARC. The problem is this then made it difficult to extract out just the "metadata" size, or just the "data" size. To make it easier to distinguish the two values, "data_size" has been modified to count only buffers of type ARC_BUFC_DATA, and "meta_size" was added to count only buffers of type ARC_BUFC_METADATA. If one wants the old "data_size" value, simply sum the new "data_size" and "meta_size" values. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
| * Prioritize "metadata" in arc_get_data_bufPrakash Surya2014-02-211-22/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the arc is at it's size limit and a new buffer is added, data will be evicted (or recycled) from the arc to make room for this new buffer. As far as I can tell, this is to try and keep the arc from over stepping it's bounds (i.e. keep it below the size limitation placed on it). This makes sense conceptually, but there appears to be a subtle flaw in its current implementation, resulting in metadata buffers being throttled. When it evicts from the arc's lists, it also passes in a "type" so as to remove a buffer of the same type that it is adding. The problem with this is that once the size limit is hit, the ratio of "metadata" to "data" contained in the arc essentially becomes fixed. For example, consider the following scenario: * the size of the arc is capped at 10G * the meta_limit is capped at 4G * 9G of the arc contains "data" * 1G of the arc contains "metadata" Now, every time a new "metadata" buffer is created and added to the arc, an older "metadata" buffer(s) will be removed from the arc; preserving the 9G "data" to 1G "metadata" ratio that was in-place when the size limit was reached. This occurs even though the amount of "metadata" is far below the "metadata" limit. This can result in the arc behaving pathologically for certain workloads. To fix this, the arc_get_data_buf function was modified to evict "data" from the arc even when adding a "metadata" buffer; unless it's at the "metadata" limit. In addition, arc_evict now more closely resembles arc_evict_ghost; such that when evicting "data" from the arc, it may make a second pass over the arc lists and evict "metadata" if it cannot meet the eviction size the first time around. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
| * Remove "arc_meta_used" from arc_adjust calculationPrakash Surya2014-02-211-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Using "arc_meta_used" to determine if the arc's mru list is over it's target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110