aboutsummaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Implement -t option to zpool create for temporary pool namesRichard Yao2014-09-302-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | Creating virtual machines that have their rootfs on ZFS on hosts that have their rootfs on ZFS causes SPA namespace collisions when the standard name rpool is used. The solution is either to give each guest pool a name unique to the host, which is not always desireable, or boot a VM environment containing an ISO image to install it, which is cumbersome. 26b42f3f9d03f85cc7966dc2fe4dfe9216601b0e introduced `zpool import -t ...` to simplify situations where a host must access a guest's pool when there is a SPA namespace conflict. We build upon that to introduce `zpool import -t tname ...`. That allows us to create a pool whose in-core name is tname, but whose on-disk name is the normal name specified. This simplifies the creation of machine images that use a rootfs on ZFS. That benefits not only real world deployments, but also ZFSOnLinux development by decreasing the time needed to perform rootfs on ZFS experiments. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2417
* Perform whole-page page truncation for hole-punching under a range lockTim Chase2014-09-291-22/+8
| | | | | | | | | | | | | | | | | As an attempt to perform the page truncation more optimally, the hole-punching support added in 223df0161fad50f53a8fa5ffeea8cc4f8137d522 truncated performed the operation in two steps: first, sub-page "stubs" were zeroed under the range lock in zfs_free_range() using the new zfs_zero_partial_page() function and then the whole pages were truncated within zfs_freesp(). This left a window of opportunity during which the full pages could be touched. This patch closes the window by moving the whole-page truncation into zfs_free_range() under the range lock. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2733
* Illumos 5138 - add tunable for maximum number of blocks freed in one txgMax Grossman2014-09-231-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Mattew Ahrens <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5138 https://github.com/illumos/illumos-gate/commit/af3465d Porting notes: Because support for exposing a uint64_t parameter wasn't added until v3.17-rc1 the zfs_free_max_blocks variable has been declared as a unsigned long. This is already far larger than required and it allows us to avoid additional autoconf compatibility code. The default value has been set to 100,000 on Linux instead of ULONG_MAX which is used on Illumos. This was done to limit the number of outstanding IOs in the system when snapshots are destroyed. This helps ensure individual TXG sync times are kept reasonable and memory isn't wasted managing a huge backlog of outstanding IOs. Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2675 Closes #2581
* Illumos 4753 - increase number of outstanding async writes when sync task is ↵Alex Reece2014-09-233-5/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | waiting Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/4753 https://github.com/illumos/illumos-gate/commit/73527f4 Comments by Matt Ahrens from the issue tracker: When a sync task is waiting for a txg to complete, we should hurry it along by increasing the number of outstanding async writes (i.e. make vdev_queue_max_async_writes() return a larger number). Initially we might just have a tunable for "minimum async writes while a synctask is waiting" and set it to 3. Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2716
* Illumos 5139 - SEEK_HOLE failed to report a hole at end of fileMatthew Ahrens2014-09-232-9/+21
| | | | | | | | | | | | | | | | | | | | 5139 SEEK_HOLE failed to report a hole at end of file Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: Peng Dai <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5139 https://github.com/illumos/illumos-gate/commit/0fbc0cd Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2714
* Fix function call with uninitialized value in vdev_inuseRichard Yao2014-09-231-1/+2
| | | | | | | | | | | | | | LLVM's static analyzer reported that we could pass an uninitialized pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct. An attempt to repurpose a spare or L2ARC drive from an exported pool will cause the pool_guid passed to spa_by_guid() to be unintialized information from the stack. This will cause non-deterministic behavior. Since there is no reason why we cannot repurpose such disks, we modify vdev_inuse() to avoid calling spa_by_guid() when they are detected. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2330
* Illumos 5161 - add tunable for number of metaslabs per vdevMatthew Ahrens2014-09-231-2/+13
| | | | | | | | | | | | | | | | | | | 5161 add tunable for number of metaslabs per vdev Reviewed by: Alex Reece <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/5161 https://github.com/illumos/illumos-gate/commit/bf3e216 Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2698
* Illumos 5177 - remove dead code from dsl_scan.cMatthew Ahrens2014-09-221-46/+36
| | | | | | | | | | | | | | | | | | | | | | 5177 remove dead code from dsl_scan.c Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Richard Lowe <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5177 https://github.com/illumos/illumos-gate/commit/5f37736 Porting notes: The local variable 'buf' was removed from dsl_scan_visitbp(). This wasn't part of the original patch but it should have been. Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2712
* Illumos 5174 - add sdt probe for blocked read in dbuf_read()Adam Leventhal2014-09-221-0/+2
| | | | | | | | | | | | | | | | | | | | 5174 add sdt probe for blocked read in dbuf_read() Reviewed by: Basil Crow <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Steven Hartland <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Boris Protopopov <[email protected]> Reviewed by: Steven Hartland <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5174 https://github.com/illumos/illumos-gate/commit/f6164ad Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2710
* Illumos 5140 - message about "%recv could not be opened" is printed when ↵Matthew Ahrens2014-09-181-1/+9
| | | | | | | | | | | | | | | | | | booting after crash Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/projects/illumos-gate//issues/5140 https://github.com/illumos/illumos-gate/commit/2243853 Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2676
* Fix z_teardown_inactive_lock deadlockBrian Behlendorf2014-09-111-1/+1
| | | | | | | | | | | | | | | When rolling back a mounted filesystem zfs_suspend() is called which acquires the z_teardown_inactive_lock. This lock can not be dropped until the filesystem has been rolled back and resumed in zfs_resume_fs(). Therefore, we must not call iput() under this lock because it may result in the inode->evict() handler being called which also takes this lock. Instead use zfs_iput_async() to ensure dropping the last reference is deferred and runs in a safe context. Signed-off-by: Brian Behlendorf <[email protected]> Closes #2670
* Implement fallocate FALLOC_FL_PUNCH_HOLETim Chase2014-09-083-21/+143
| | | | | | | | | | | | | | | | | | | | | | | | Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of fallocate(2). Mimic the behavior of other native file systems such as ext4 in cases where the file might be extended. If the offset is beyond the end of the file, return success without changing the file. If the extent of the punched hole would extend the file, only the existing tail of the file is punched. Add the zfs_zero_partial_page() function, modeled after update_page(), to handle zeroing partial pages in a hole-punching operation. It must be used under a range lock for the requested region in order that the ARC and page cache stay in sync. Move the existing page cache truncation via truncate_setsize() into zfs_freesp() for better source structure compatibility with upstream code. Add page cache truncation to zfs_freesp() and zfs_free_range() to handle hole punching. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #2619
* Illumos 5117 - spacemap reallocation can cause corruptionGeorge Wilson2014-09-081-5/+5
| | | | | | | | | | | | | | | | 5117 space map reallocation can cause corruption Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/projects/illumos-gate/issues/5117 https://github.com/illumos/illumos-gate/commit/e503a68 Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2662
* Add object type checking to zap_lockdir()Brian Behlendorf2014-09-081-7/+4
| | | | | | | | | | | | | | | | | | | | | | If a non-ZAP object is passed to zap_lockdir() it will be treated as a valid ZAP object. This can result in zap_lockdir() attempting to read what it believes are leaf blocks from invalid disk locations. The SCSI layer will eventually generate errors for these bogus IOs but the caller will hang in zap_get_leaf_byblk(). The good news is that is a situation which can not occur unless the pool has been damaged. The bad news is that there are reports from both FreeBSD and Solaris of damaged pools. Specifically, there are normal files in the filesystem which reference another normal file as their parent. Since pools like this are known to exist the zap_lockdir() function has been updated to verify the type of the object. If a non-ZAP object has been passed it EINVAL will be returned immediately. Signed-off-by: Brian Behlendorf <[email protected]> Issue #2597 Issue #2602
* Linux AIO SupportRichard Yao2014-09-053-37/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfsd uses do_readv_writev() to implement fops->read and fops->write. do_readv_writev() will attempt to read/write using fops->aio_read and fops->aio_write, but it will fallback to fops->read and fops->write when AIO is not available. However, the fallback will perform a call for each individual data page. Since our default recordsize is 128KB, sequential operations on NFS will generate 32 DMU transactions where only 1 transaction was needed. That was unnecessary overhead and we implement fops->aio_read and fops->aio_write to eliminate it. ZFS originated in OpenSolaris, where the AIO API is entirely implemented in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux filesystems therefore must implement their own AIO logic and nearly all of them implement fops->aio_write synchronously. Consequently, they do not implement aio_fsync(). However, since the ZPL works by mapping Linux's VFS calls to the functions implementing Illumos' VFS operations, we instead implement AIO in the kernel by mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement fops->aio_fsync. One might be inclined to make our fops->aio_write implementation synchronous to make software that expects this behavior safe. However, there are several reasons not to do this: 1. Other platforms do not implement aio_write() synchronously and since the majority of userland software using AIO should be cross platform, expectations of synchronous behavior should not be a problem. 2. We would hurt the performance of programs that use POSIX interfaces properly while simultaneously encouraging the creation of more non-compliant software. 3. The broader community concluded that userland software should be patched to properly use POSIX interfaces instead of implementing hacks in filesystems to cater to broken software. This concept is best described as the O_PONIES debate. 4. Making an asynchronous write synchronous is non sequitur. Any software dependent on synchronous aio_write behavior will suffer data loss on ZFSOnLinux in a kernel panic / system failure of at most zfs_txg_timeout seconds, which by default is 5 seconds. This seems like a reasonable consequence of using non-compliant software. It should be noted that this is also a problem in the kernel itself where nfsd does not pass O_SYNC on files opened with it and instead relies on a open()/write()/close() to enforce synchronous behavior when the flush is only guarenteed on last close. Exporting any filesystem that does not implement AIO via NFS risks data loss in the event of a kernel panic / system failure when something else is also accessing the file. Exporting any file system that implements AIO the way this patch does bears similar risk. However, it seems reasonable to forgo crippling our AIO implementation in favor of developing patches to fix this problem in Linux's nfsd for the reasons stated earlier. In the interim, the risk will remain. Failing to implement AIO will not change the problem that nfsd created, so there is no reason for nfsd's mistake to block our implementation of AIO. It also should be noted that `aio_cancel()` will always return `AIO_NOTCANCELED` under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()` to set a callback function for cancelling work sent to taskqs, but the simpler approach is allowed by the specification: ``` Which operations are cancelable is implementation-defined. ``` http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html The only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio use it according to a recursive grep of my system's `/usr/src/debug`. That suggests that `aio_cancel()` users are rare. Implementing aio_cancel() is left to a future date when it is clear that there are consumers that benefit from its implementation to justify the work. Lastly, it is important to know that handling of the iovec updates differs between Illumos and Linux in the implementation of read/write. On Linux, it is the VFS' responsibility whle on Illumos, it is the filesystem's responsibility. We take the intermediate solution of copying the iovec so that the ZFS code can update it like on Solaris while leaving the originals alone. This imposes some overhead. We could always revisit this should profiling show that the allocations are a problem. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #223 Closes #2373
* Illumos 5049 - panic when removing log deviceAlex Reece2014-09-051-1/+2
| | | | | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Mattew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Rich Lowe <[email protected]> References: https://www.illumos.org/issues/5049 https://github.com/illumos/illumos-gate/commit/2986efa Ported-by: Brian Behlendorf <[email protected]> Closes #2636
* Fix invalid locking order in rename operationStanislav Seletskiy2014-09-041-17/+20
| | | | | | | | | | | | | This commit should prevent a deadlock on dp_config_rwlock when running `zfs rename` by ensuring zvol_rename_minors() is not called under this lock. Signed-off-by: Stanislav Seletskiy <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2652. Closes #2525.
* Change the default 'zfs_dedup_prefetch' value to '0'Alexey Smirnoff2014-09-041-1/+1
| | | | | | | | | This gives a huge performance improvement in operations with deduped datasets especially when the bottleneck is the amount of ram available for zfs. Signed-off-by: Brian Behlendorf <[email protected]> Closes #2639
* Improve handling of filesystem versionsDan Swartzendruber2014-09-031-6/+1
| | | | | | | | | | | | | Change mount code to diagnose filesystem versions that are not supported by the current implementation. Change upgrade code to do likewise and refuse to upgrade a pool if any filesystems on it are a version which is not supported by the current implementation. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Dan Swartzendruber <[email protected]> Closes: #2616
* Illumos 4970-4974 - extreme rewind enhancementsMatthew Ahrens2014-08-261-13/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4970 need controls on i/o issued by zpool import -XF 4971 zpool import -T should accept hex values 4972 zpool import -T implies extreme rewind, and thus a scrub 4973 spa_load_retry retries the same txg 4974 spa_load_verify() reads all data twice Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/4970 https://www.illumos.org/issues/4971 https://www.illumos.org/issues/4972 https://www.illumos.org/issues/4973 https://www.illumos.org/issues/4974 https://github.com/illumos/illumos-gate/commit/e42d205 Notes: This set of patches adds a set of tunable parameters for the "extreme rewind" mode of pool import which allows control over the traversal performed during such an import. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2598
* Illumos 5034 - ARC's buf_hash_table is too smallMatthew Ahrens2014-08-261-3/+10
| | | | | | | | | | | | | | | | | 5034 ARC's buf_hash_table is too small Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Gordon Ross <[email protected]> References: https://www.illumos.org/issues/5034 https://github.com/illumos/illumos-gate/commit/63e911b Ported-by: Brian Behlendorf <[email protected]> Closes #2615
* Fixed memory leaks in zevent handlingIsaac Huang2014-08-202-18/+44
| | | | | | | | | | Some nvlist_t could be leaked in error handling paths. Also make sure cb argument to zfs_zevent_post() cannnot be NULL. Signed-off-by: Isaac Huang <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2158
* Illumos 4631 - zvol_get_stats triggering too many readsMatthew Ahrens2014-08-202-77/+63
| | | | | | | | | | | | | | | | | | 4631 zvol_get_stats triggering too many reads Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4631 https://github.com/illumos/illumos-gate/commit/bbfa8ea Ported-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2612 Closes #2480
* Don't upgrade a metaslab when the pool is not writableTim Chase2014-08-181-5/+8
| | | | | | | | | | | | | | Illumos 4982 added code to metaslab_fragmentation() to proactively update space maps when the spacemap_histogram feature is enabled. This should only happen when the pool is writeable. References: https://www.illumos.org/issues/4982 https://github.com/illumos/illumos-gate/commit/2e4c998 Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2595
* Illumos 4976-4984 - metaslab improvementsGeorge Wilson2014-08-187-156/+541
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4976 zfs should only avoid writing to a failing non-redundant top-level vdev 4978 ztest fails in get_metaslab_refcount() 4979 extend free space histogram to device and pool 4980 metaslabs should have a fragmentation metric 4981 remove fragmented ops vector from block allocator 4982 space_map object should proactively upgrade when feature is enabled 4983 need to collect metaslab information via mdb 4984 device selection should use fragmentation metric Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/4976 https://www.illumos.org/issues/4978 https://www.illumos.org/issues/4979 https://www.illumos.org/issues/4980 https://www.illumos.org/issues/4981 https://www.illumos.org/issues/4982 https://www.illumos.org/issues/4983 https://www.illumos.org/issues/4984 https://github.com/illumos/illumos-gate/commit/2e4c998 Notes: The "zdb -M" option has been re-tasked to display the new metaslab fragmentation metric and the new "zdb -I" option is used to control the maximum number of in-flight I/Os. The new fragmentation metric is derived from the space map histogram which has been rolled up to the vdev and pool level and is presented to the user via "zpool list". Add a number of module parameters related to the new metaslab weighting logic. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2595
* Create an 'overlay' propertyTurbo Fredriksson2014-08-151-0/+2
| | | | | | | | | | | | | | | Add a new 'overlay' property (default 'off') that controls whether the filesystem should be mounted even if the mountpoint is busy or if it should fail with a 'mountpoint not empty'. Doing overlay mounts is the default mount behavior on Linux, but not in ZFS. It have been decided that following the ZFS behavior should be the default, but this overlay allows for site administrator to override this decision on a per-dataset basis. Signed-off-by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes: #2503
* Revert "Revert "Revert "Fix unlink/xattr deadlock"""Brian Behlendorf2014-08-111-85/+57
| | | | | | | | | | | | | | | | | | This reverts commit 7973e46 which brings the basic flow of the code back in line with the other ZFS implementations. This was possible due to the following related changes. e89260a Directory xattr znodes hold a reference on their parent 6f9548c Fix deadlock in zfs_zget() 0a50679 Add zfs_iput_async() interface 4dd1893 Avoid 128K kmem allocations in mzap_upgrade() Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #457 Closes #2058 Closes #2128 Closes #2240
* Add zfs_iput_async() interfaceBrian Behlendorf2014-08-113-11/+16
| | | | | | | | | | | Handle all iputs in zfs_purgedir() and zfs_inode_destroy() asynchronously to prevent deadlocks. When the iputs are allowed to run synchronously in the destroy call path deadlocks between xattr directory inodes and their parent file inodes are possible. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #457
* Avoid 128K kmem allocations in mzap_upgrade()Brian Behlendorf2014-08-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | As originally implemented the mzap_upgrade() function will perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc(). These large allocations can potentially block indefinitely if contiguous memory is not available. Since this allocation is done under the zap->zap_rwlock it can appear as if there is a deadlock in zap_lockdir(). This is shown below. The optimal fix for this would be to rework mzap_upgrade() such that no large allocations are required. This could be done but it would result in us diverging further from the other implementations. Therefore I've opted against doing this unless it becomes absolutely necessary. Instead mzap_upgrade() has been updated to use zio_buf_alloc() which can reliably provide buffers of up to SPA_MAXBLOCKSIZE. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Close #2580
* Avoid dynamic allocation of 'search zio'Brian Behlendorf2014-08-111-6/+4
| | | | | | | | | | | | | | | | As part of commit e8b96c6 the search zio used by the vdev_queue_io_to_issue() function was moved to the heap to minimize stack usage. Functionally this is fine, but to maximize performance it's best to minimize the number of dynamic allocations. To avoid this allocation temporary space for the search zio has been reserved in the vdev_queue structure. All access must be serialized through the vq_lock. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #2572
* Use KM_PUSHPAGE in dsl_dataset_rollback_check()Brian Behlendorf2014-08-061-2/+2
| | | | | | | | | | | The dsl_dataset_rollback_check() function is executed in the txg_sync context. To prevent a potential deadlock due to direct memory reclaim it must use KM_PUSHPAGE. This was introduced by the recent 'zfs bookmark' features, commit da53684. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Eric Dillmann <[email protected]> Closes #2569
* Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_tMatthew Ahrens2014-08-0616-82/+83
| | | | | | | | | | | | | | | | | | | | | | | | 4914 zfs on-disk bookmark structure should be named *_phys_t Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/4914 https://github.com/illumos/illumos-gate/commit/7802d7b Porting notes: There were a number of zfsonlinux-specific uses of zbookmark_t which needed to be updated. This should reduce the likelihood of further problems like issue #2094 from occurring. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2558
* Illumos 4881 - zfs send performance regression with embedded dataMatthew Ahrens2014-08-061-15/+21
| | | | | | | | | | | | | | | | | 4881 zfs send performance degradation when embedded block pointers are encountered Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4881 https://github.com/illumos/illumos-gate/commit/06315b7 Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2547
* Illumos 4897 - Space accounting mismatch in L2ARC/zpoolSaso Kiselkov2014-08-061-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | 4897 Space accounting mismatch in L2ARC/zpool Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Boris Protopopov <[email protected]> Approved by: Dan McDonald <[email protected]> From the illumos issue tracker: L2ARC vdev space usage statistics are calculated as the delta between the maximum and minimum vdev offset ever written to by the L2ARC fill thread, but do not inform the user of how much space in between these two offsets is actually taken up by cached buffers. This fix changes that so that vdev space usage stats on L2ARC devices accurately track the volume of buffers stored on them, allowing users to see the exact L2ARC usage in "zpool iostat -v". References: https://www.illumos.org/issues/4897 https://github.com/illumos/illumos-gate/commit/3038a2b Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2555
* Illumos 4390 - I/O errors can corrupt space map when deleting fs/volMatthew Ahrens2014-08-048-155/+294
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4390 i/o errors when deleting filesystem/zvol can lead to space map corruption Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4390 https://github.com/illumos/illumos-gate/commit/7fd05ac Porting notes: Previous stack-reduction efforts in traverse_visitb() caused a fair number of un-mergable pieces of code. This patch should reduce its stack footprint a bit more. The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated using kmem_zalloc() for the purpose of stack reduction. The new global zfs_free_leak_on_eio has been defined as an integer rather than a boolean_t as was the case with the related zfs_recover global. Also, zfs_free_leak_on_eio's definition has been inserted into zfs_debug.c for consistency with the existing definition of zfs_recover. Illumos placed it in spa_misc.c. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2545
* Illumos 4757, 4913Matthew Ahrens2014-08-0124-167/+670
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4757 https://www.illumos.org/issues/4913 https://github.com/illumos/illumos-gate/commit/5d7b4d4 Porting notes: For compatibility with the fastpath code the zio_done() function needed to be updated. Because embedded-data block pointers do not require DVAs to be allocated the associated vdevs will not be marked and therefore should not be unmarked. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2544
* Illumos 3835 zfs need not store 2 copies of all metadataMatthew Ahrens2014-07-313-9/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Richard Lowe <[email protected]> Description from Matt Ahrens's bug report at Delphix: Add a new zfs property, "redundant_metadata" which can have values "all" or "most". The default will be "all", which is the current behavior. Setting to "most" will cause us to only store 1 copy of level-1 indirect blocks of user data files. Additional notes: The new man page section for this property states "The exact behavior of which metadata blocks are stored redundantly may change in future releases." and: "When set to most, ZFS stores an extra copy of most types of metadata. This can improve performance of random writes, because less metadata must be written." The current implementation is as described above in Matt's blog. It is controlled by a new global integer "zfs_redundant_metadata_most_ditto_level", currently initialized to 2. When "redundant_metadata" is set to "most", only indirect blocks of the specified level and higher will have additional ditto blocks created. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2542
* Illumos 4754, 4755George Wilson2014-07-302-50/+6
| | | | | | | | | | | | | | | | | | | 4754 io issued to near-full luns even after setting noalloc threshold 4755 mg_alloc_failures is no longer needed Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4754 https://www.illumos.org/issues/4755 https://github.com/illumos/illumos-gate/commit/b6240e8 Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2533
* Illumos #4374Matthew Ahrens2014-07-3013-160/+116
| | | | | | | | | | | | | | | | | | | 4374 dn_free_ranges should use range_tree_t Reviewed by: George Wilson <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: Christopher Siden <[email protected] Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4374 https://github.com/illumos/illumos-gate/commit/bf16b11 Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2531
* Illumos 4368, 4369.Matthew Ahrens2014-07-2913-115/+845
| | | | | | | | | | | | | | | | | 4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/4369 https://www.illumos.org/issues/4368 https://github.com/illumos/illumos-gate/commit/78f1710 Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2530
* Illumos 4370, 4371Max Grossman2014-07-2821-290/+481
| | | | | | | | | | | | | | | | | | | | 4370 avoid transmitting holes during zfs send 4371 DMU code clean up Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Approved by: Garrett D'Amore <[email protected]>a References: https://www.illumos.org/issues/4370 https://www.illumos.org/issues/4371 https://github.com/illumos/illumos-gate/commit/43466aa Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2529
* Illumos 4171, 4172Matthew Ahrens2014-07-2516-222/+254
| | | | | | | | | | | | | | | | | | | | 4171 clean up spa_feature_*() interfaces 4172 implement extensible_dataset feature for use by other zpool features Reviewed by: Max Grossman <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Approved by: Garrett D'Amore <[email protected]>a References: https://www.illumos.org/issues/4171 https://www.illumos.org/issues/4172 https://github.com/illumos/illumos-gate/commit/2acef22 Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2528
* zfs_trunc() should use dmu_tx_assign(tx, TXG_WAIT)Brian Behlendorf2014-07-221-7/+1
| | | | | | | | | | | | As part of the write throttle & i/o schedule performance work the zfs_trunc() function should have been updated to use TXG_WAIT. Using TXG_WAIT ensures that the tx will be part of the next txg. If TXG_NOWAIT is used and retried for ERESTART errors then the tx can suffer from starvation. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #2488
* Illumos #4756 Fix metaslab_group_preload deadlockGeorge Wilson2014-07-221-3/+22
| | | | | | | | | | | | | | | | | | | | | | | 4756 metaslab_group_preload() could deadlock Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Garrett D'Amore <[email protected]> The metaslab_group_preload() function grabs the mg_lock and then later tries to grab the metaslab lock. This lock ordering may lead to a deadlock since other consumers of the mg_lock will grab the metaslab lock first. References: https://www.illumos.org/issues/4756 https://github.com/illumos/illumos-gate/commit/30beaff Ported-by: Prakash Surya <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2488
* Illumos #4730 destroy metaslab group taskqGeorge Wilson2014-07-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | 4730 metaslab group taskq should be destroyed in metaslab_group_destroy() Reviewed by: Alex Reece <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Rich Lowe <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4730 https://github.com/illumos/illumos-gate/commit/be08211 Porting notes: Under ZFSonlinux, one of the effects of not destroying the taskq is that zdb would never exit (due to the SPL taskq implementation). Ported-by: Tim Chase <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2488
* Illumos #4101, #4102, #4103, #4105, #4106George Wilson2014-07-2212-1164/+1916
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Garrett D'Amore <[email protected]> Prior to this patch, space_maps were preferred solely based on the amount of free space left in each. Unfortunately, this heuristic didn't contain any information about the make-up of that free space, which meant we could keep preferring and loading a highly fragmented space map that wouldn't actually have enough contiguous space to satisfy the allocation; then unloading that space_map and repeating the process. This change modifies the space_map's to store additional information about the contiguous space in the space_map, so that we can use this information to make a better decision about which space_map to load. This requires reallocating all space_map objects to increase their bonus buffer size sizes enough to fit the new metadata. The above feature can be enabled via a new feature flag introduced by this change: com.delphix:spacemap_histogram In addition to the above, this patch allows the space_map block size to be increase. Currently the block size is set to be 4K in size, which has certain implications including the following: * 4K sector devices will not see any compression benefit * large space_maps require more metadata on-disk * large space_maps require more time to load (typically random reads) Now the space_map block size can adjust as needed up to the maximum size set via the space_map_max_blksz variable. A bug was fixed which resulted in potentially leaking an object when removing a mirrored log device. The previous logic for vdev_remove() did not deal with removing top-level vdevs that are interior vdevs (i.e. mirror) correctly. The problem would occur when removing a mirrored log device, and result in the DTL space map object being leaked; because top-level vdevs don't have DTL space map objects associated with them. References: https://www.illumos.org/issues/4101 https://www.illumos.org/issues/4102 https://www.illumos.org/issues/4103 https://www.illumos.org/issues/4105 https://www.illumos.org/issues/4106 https://github.com/illumos/illumos-gate/commit/0713e23 Porting notes: A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also, the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary. Ported-by: Tim Chase <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2488
* Move metaslab_group_alloc_update() callPrakash Surya2014-07-221-2/+2
| | | | | | | | | | | | This changes moves the called to metaslab_group_alloc_update() to the metaslab_sync_reassess() function. The original placement of the call within metaslab_sync_done() appears to have been a simple mistake, introduced by ac72fac3eaa569902cad88053167f7d74e7fe7e4. This aligns us more closely to the upstream illumos code base. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Fix zil_commit() NULL dereferenceBrian Behlendorf2014-07-174-6/+23
| | | | | | | | | | | | Update the current code to ensure inodes are never dirtied if they are part of a read-only file system or snapshot. If they do somehow get dirtied an attempt will make made to write them to disk. In the case of snapshots, which don't have a ZIL, this will result in a NULL dereference in zil_commit(). Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2405
* Illumos 4168, 4169, 4170: ztest, zdb and zhack fixesGeorge Wilson2014-07-172-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4168 ztest assertion failure in dbuf_undirty 4169 verbatim import causes zdb to segfault 4170 zhack leaves pool in ACTIVE state Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4168 https://www.illumos.org/issues/4169 https://www.illumos.org/issues/4170 https://github.com/illumos/illumos-gate/commit/7fdd916 Porting notes: Of particular interest when troubleshooting corrupted pools, the commonly-used "zdb -e" operation may perform verbatim imports and furthermore, it will soon have direct support for verbatim imports via a new "-V" option. The 4169 fix eliminates a common segfault case in which spa_history_log_version() tries to access an un-opened dsl_pool_t. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2451 Closes #2283 Closes #2467
* Convert zfs_mg_noalloc_threshold to a module parameter and documentTim Chase2014-07-161-0/+4
| | | | | | | | | | The parameter was added as illumos issue 4081 which was committed to zfsonlinux in ac72fac3eaa569902cad88053167f7d74e7fe7e4. This patch documents the parameter and allows for it to be set as a module parameter. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2483