summaryrefslogtreecommitdiffstats
path: root/module/zfs
Commit message (Collapse)AuthorAgeFilesLines
* Illumos #3618 ::zio dcmd does not show timestamp dataMatthew Ahrens2013-08-122-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3618 ::zio dcmd does not show timestamp data Reviewed by: Adam Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Dan McDonald <[email protected]> References: http://www.illumos.org/issues/3618 illumos/illumos-gate@c55e05cb35da47582b7afd38734d2f0d9c6deb40 Notes on porting to ZFS on Linux: The original changeset mostly deals with mdb ::zio dcmd. However, in order to provide the requested functionality it modifies vdev and zio structures to keep the timing data in nanoseconds instead of ticks. It is these changes that are ported over in the commit in hand. One visible change of this commit is that the default value of 'zfs_vdev_time_shift' tunable is changed: zfs_vdev_time_shift = 6 to zfs_vdev_time_shift = 29 The original value of 6 was inherited from OpenSolaris and was subotimal - since it shifted the raw tick value - it didn't compensate for different tick frequencies on Linux and OpenSolaris. The former has HZ=1000, while the latter HZ=100. (Which itself led to other interesting performance anomalies under non-trivial load. The deadline scheduler delays the IO according to its priority - the lower priority the further the deadline is set. The delay is measured in units of "shifted ticks". Since the HZ value was 10 times higher, the delay units were 10 times shorter. Thus really low priority IO like resilver (delay is 10 units) and scrub (delay is 20 units) were scheduled much sooner than intended. The overall effect is that resilver and scrub IO consumed more bandwidth at the expense of the other IO.) Now that the bookkeeping is done is nanoseconds the shift behaves correctly for any tick frequency (HZ). Ported-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1643
* Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKSRichard Yao2013-08-093-7/+7
| | | | | | | | | | | | | | | | When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are replaced by kuid_t/kgid_t, which are structures instead of integral types. This causes any code that uses an integral type to fail to build. The User Namespace functionality introduced in Linux 3.8 requires CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any kernel that supported it. We resolve this by converting between the new kuid_t/kgid_t structures and the original uid_t/gid_t types. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1589
* Evict meta data from ghost lists + l2arc headersBrian Behlendorf2013-08-091-1/+17
| | | | | | | | | | | | | | | | | When the meta limit is exceeded the ARC evicts some meta data buffers from the mfu+mru lists. Unfortunately, for meta data heavy workloads it's possible for these buffers to accumulate on the ghost lists if arc_c doesn't exceed arc_size. To handle this case arc_adjust_meta() has been entended to explicitly evict meta data buffers from the ghost lists in proportion to what was evicted from the mfu+mru lists. If this is insufficient we request that the VFS release some inodes and dentries. This will result in the release of some dnodes which are counted as 'other' metadata. Signed-off-by: Brian Behlendorf <[email protected]>
* Allow arc_evict_ghost() to only evict meta dataBrian Behlendorf2013-08-091-9/+13
| | | | | | | | | | | | | | | | | | | | | | | The default behavior of arc_evict_ghost() is to start by evicting data buffers. Then only if the requested number of bytes to evict cannot be satisfied by data buffers move on to meta data buffers. This is ideal for honoring arc_c since it's preferable to keep the meta data cached. However, if we're trying to free memory from the arc to honor the meta limit it's a problem because we will need to discard all the data to get to the meta data. To avoid this issue the arc_evict_ghost() is now passed a fourth argumented describing which buffer type to start with. The arc_evict() function already behaves exactly like this for a same reason so this is consistent with the existing code. All existing callers have been updated to pass ARC_BUFC_DATA so this patch introduces no functional change. New callers may pass ARC_BUFC_METADATA to skip immediately to evicting meta data leaving the normal data untouched. Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #3137 L2ARC compressionSaso Kiselkov2013-08-084-76/+411
| | | | | | | | | | | | | | | | | | | | | | | 3137 L2ARC compression Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> References: illumos/illumos-gate@aad02571bc59671aa3103bb070ae365f531b0b62 https://www.illumos.org/issues/3137 http://wiki.illumos.org/display/illumos/L2ARC+Compression Notes for Linux port: A l2arc_nocompress module option was added to prevent the compression of l2arc buffers regardless of how a dataset's compression property is set. This allows the legacy behavior to be preserved. Ported by: James H <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1379
* Return -1 from arc_shrinker_func()Richard Yao2013-08-081-3/+1
| | | | | | | | | | | This is analogous to SPL commit zfsonlinux/spl@b9b3715. While we don't have clear evidence of systems getting caught here indefinately like in the SPL this ensures that it will never happen. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1579
* Return correct type and offset from zfs_readdirRichard Yao2013-08-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | zfs_readdir() is used by getdents(), which provides a list of all files in directory, their types and an offset that be used by llseek() to seek to the next directory entry. On Solaris, the first two directory entries "." and ".." respectively have offsets 1 and 2 on ZFS while the other files have rather large numbers. Currently, ZFSOnLinux is giving "." offset 0 and all other entries large numbers. The first entry's next entry offset points to itself, which causes software that uses llseek() in conjunction with getdents() for filesystem navigation to enter an infinite loop. The offsets used for each directory entry are filesystem specific on all platforms, so we can fix this by adopting the Solaris behavior. Also, we currently report each directory entry as having type 0 (???). This is not wrong, but we can do better. getdents() on Solaris does not appear to provide this information, but it does on Linux and Mac OS X do. ZFS provides easy access to type information in zfs_readdir(), so this patch provides this as well. Reported-by: Andrey <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1624
* Illumos #3639 zpool.cache should skip over readonly poolsGeorge Wilson2013-08-071-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 3639 zpool.cache should skip over readonly pools Reviewed by: Eric Schrock <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Basil Crow <[email protected]> Approved by: Gordon Ross <[email protected]> References: illumos/illumos-gate@fb02ae025247e3b662600e5a9c1b4c33ecab7d72 https://www.illumos.org/issues/3639 Normally we don't list pools that are imported read-only in the cache file, however you can accidentally get one into the cache file by importing and exporting a read-write pool while a read-only pool is imported: $ zpool import -o readonly test1 $ zpool import test2 $ zpool export test2 $ zdb -C This is a problem because if the machine reboots we import all pools in the cache file as read-write. Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Write dirty inodes on closeBrian Behlendorf2013-08-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the property atime=on is set operations which only access and inode do cause an atime update. However, it turns out that dirty inodes with updated atimes are only written to disk when the inodes get evicted from the cache. Somewhat surprisingly the source suggests that this isn't a ZoL specific issue. This behavior may in part explain why zfs's reclaim logic has been observed to be slow. When reclaiming inodes its likely that they have a dirty atime which will force a write to disk. Obviously we don't want to force a write to disk for every atime update, these needs to be batched. The right way to do this is to fully implement the .dirty_inode and .write_inode callbacks. However, to do that right requires proper unification of some fields in the znode/inode. Then we could just mark the inode dirty and leave it to the VFS to call .write_inode periodically. Until that work gets done we have to settle for some middle ground. The simplest and safest thing we can do for now is to write the dirty inode on last close. This should prevent the majority of inodes in the cache from having dirty atimes and not drastically increase the number of writes. Some rudimentally testing to show how long it takes to drop 500,000 inodes from the cache shows promising results. This is as expected because we're no longer do lots of IO as part of the eviction, it was done earlier during the close. w/out patch: ~30s to drop 500,000 inodes with drop_caches. with patch: ~3s to drop 500,000 inodes with drop_caches. Signed-off-by: Brian Behlendorf <[email protected]>
* Export additional dmu symbolsBrian Behlendorf2013-08-011-0/+6
| | | | | | | | The dmu_prefetch, dmu_free_long_range, dmu_free_object, dmu_prealloc, dmu_write_policy, and dmu_sync symbols have been exported so they may be used by other modules. Signed-off-by: Brian Behlendorf <[email protected]>
* dmu_tx: Fix possible NULL pointer dereferenceNathaniel Clark2013-08-011-2/+5
| | | | | | | | | | dmu_tx_hold_object_impl can return NULL on error. Check for this condition prior to dereferencing pointer. This can only occur if the passed object was invalid or unallocated. Signed-off-by: Nathaniel Clark <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1610
* Remove b_thawed from arc_buf_hdr_tRichard Yao2013-08-011-11/+0
| | | | | | | | The code involving b_thawed appears to be dead, so lets discard it. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1614
* Remove arc_data_buf_alloc()/arc_data_buf_free()Richard Yao2013-08-011-17/+0
| | | | | | | | | | These functions are used in neither Illumos nor ZFSOnLinux. They appear to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove them. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1614
* Remove zio_alloc_arenaRichard Yao2013-08-011-6/+0
| | | | | | | | | | We declare zio_alloc_arena using extern, but it does not appear to exist anywhere in the code. This permits undefined behavior, so lets remove it. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1614
* Make arc+l2arc module options writableBrian Behlendorf2013-07-301-49/+60
| | | | | | | The l2arc module options can be made safely writable. This allows the options to be changed without unloading/loading the modules. Signed-off-by: Brian Behlendorf <[email protected]>
* Change l2arc_norw default to zeroBrian Behlendorf2013-07-291-1/+1
| | | | | | | | | | These days modern SSDs can efficiently service concurrent reads and writes. When this flag was added that wasn't really the case for a variety of SSD controllers. But now we can set the default value to take advantage of this parallelism and only disable this as needed for specific troublesome hardware. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix inaccurate arcstat_l2_hdr_size calculationsYing Zhu2013-07-291-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on the comments in arc.c we know that buffers can exist both in arc and l2arc, under this circumstance both arc_buf_hdr_t and l2arc_buf_hdr_t will be allocated. However the current logic only cares for memory that l2arc_buf_hdr takes up when the buffer's state transfers from or to arc_l2c_only. This will cause obvious deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr) is larger than ZOL's. We can implement the calcuation in the following simple way: 1. When allocate a l2arc_buf_hdr_t we add its memory consumption instantly and subtract it when we free or evict the l2arc buf. 2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if the buffer only stays in l2arc we should also add the memory its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to arcstat_l2_hdr_size since we already concerned with L2HDR_SIZE in step 1 and the same for transfering arc bufs from l2arc only state. The testbox has 2 4-core Intel Xeon CPUs(2.13GHz), with 16GB memory and tests were set upped in the following way: 1. Fdisked a SATA disk into two partitions, one partition for zpool storage and the other one was used as the cache device. 2. Generated some files occupying 14GB altogether in the zpool prepared in step 1 using iozone. 3. Read them all using md5sum and watched the l2arc related statistics in /proc/spl/kstat/zfs/arcstats. After the reading ended the l2_hdr_size and l2_size were shown like this: l2_size 4 4403780608 l2_hdr_size 4 0 which was weird. 4. After applying this patch and reran step 1-3, the results were as following: l2_size 4 4306443264 l2_hdr_size 4 535600 these numbers made sense, on 64-bit systems the sizeof(l2arc_buf_hdr_t) is 16 bytes. Assue all blocks cached by l2arc are 128KB, so 535600/16*128*1024=4387635200, since not all blocks are equal-sized, the theoretical result will be a little bigger, as we can see. Since I'm familiar with systemtap instrumentation tool I used it to examine what had happened. The script looked like this: probe module("zfs").function("arc_chage_state") { if ($new_state == $arc_l2_only) printf("change arc buf to arc_l2_only\n") } It will print out some information each time we call funciton arc_chage_state if the argument new_state is arc_l2_only. I gathered the trace logs and found that none of the arc bufs ran into arc state arc_l2_only when the tests was running, this was the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into arc_l2_only when the pool or the filesystem was offlined. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Fix arc_adapt() spinning in iterate_supers_type()Brian Behlendorf2013-07-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | The iterate_supers_type() function which was introduced in the 3.0 kernel was supposed to provide a safe way to call an arbitrary function on all super blocks of a specific type. Unfortunately, because a list_head was used a bug was introduced which made it possible for iterate_supers_type() to get stuck spinning on a super block which was just deactivated. This can occur because when the list head is removed from the fs_supers list it is reinitialized to point to itself. If the iterate_supers_type() function happened to be processing the removed list_head it will get stuck spinning on that list_head. The bug was fixed in the 3.3 kernel by converting the list_head to an hlist_node. However, to resolve the issue for existing 3.0 - 3.2 kernels we detect when a list_head is used. Then to prevent the spinning from occurring the .next pointer is set to the fs_supers list_head which ensures the iterate_supers_type() function will always terminate. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1045 Closes #861 Closes #790
* Fix read-only pool hang on unmountBrian Behlendorf2013-07-171-1/+5
| | | | | | | | | | | During mount a filesystem dataset would have the MS_RDONLY bit incorrectly cleared even if the entire pool was read-only. There is existing to code to handle this case but it was being run before the property callbacks were registered. To resolve the issue we move this read-only code after the callback registration. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1338
* Fix zfsctl_expire_snapshot() deadlockBrian Behlendorf2013-07-121-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible for an automounted snapshot which is expiring to deadlock with a manual unmount of the snapshot. This can occur because taskq_cancel_id() will block if the task is currently executing until it completes. But it will never complete because zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which zfsctl_expire_snapshot() must acquire. ---------------------- z_unmount/0:2153 --------------------- mutex_lock <blocking on zsb->z_ctldir_lock> zfsctl_unmount_snapshot zfsctl_expire_snapshot taskq_thread ------------------------- zfs:10690 ------------------------- taskq_wait_id <waiting for z_unmount to exit> taskq_cancel_id __zfsctl_unmount_snapshot zfsctl_unmount_snapshot <takes zsb->z_ctldir_lock> zfs_unmount_snap zfs_ioc_destroy_snaps_nvl zfsdev_ioctl do_vfs_ioctl We resolve the deadlock by dropping the zsb->z_ctldir_lock before calling __zfsctl_unmount_snapshot(). The lock is only there to prevent concurrent modification to the zsb->z_ctldir_snaps AVL tree. Moreover, we're careful to remove the zfs_snapentry_t from the AVL tree before dropping the lock which ensures no other tasks can find it. On failure it's added back to the tree. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlap <[email protected]> Closes #1527
* Improve N-way mirror performanceBrian Behlendorf2013-07-111-3/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The read bandwidth of an N-way mirror can by increased by 50%, and the IOPs by 10%, by more carefully selecting the preferred leaf vdev. The existing algorthm selects a perferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the drives are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. Utilization can be improved by preferentially selecting the leaf vdev with the least pending IO. This prevents leaf vdevs from being starved and compensates for performance differences between disks in the mirror. Faster vdevs will be sent more work and the mirror performance will not be limitted by the slowest drive. In the common case where all the pending queues are full and there is no single least busy leaf vdev a batching stratagy is employed. Of the N least busy vdevs one is selected with equal probability to be the preferred vdev for T microseconds. Compared to randomly selecting a vdev to break the tie batching the requests greatly improves the odds of merging the requests in the Linux elevator. The testing results show a significant performance improvement for all four workloads tested. The workloads were generated using the fio benchmark and are as follows. 1) 1MB sequential reads from 16 threads to 16 files (MB/s). 2) 4KB sequential reads from 16 threads to 16 files (MB/s). 3) 1MB random reads from 16 threads to 16 files (IOP/s). 4) 4KB random reads from 16 threads to 16 files (IOP/s). | Pristine | With 1461 | | Sequential Random | Sequential Random | | 1MB 4KB 1MB 4KB | 1MB 4KB 1MB 4KB | | MB/s MB/s IO/s IO/s | MB/s MB/s IO/s IO/s | ---------------+-----------------------+------------------------+ 2 Striped | 226 243 11 304 | 222 255 11 299 | 2 2-Way Mirror | 302 324 16 534 | 433 448 23 571 | 2 3-Way Mirror | 429 458 24 714 | 648 648 41 808 | 2 4-Way Mirror | 562 601 36 849 | 816 828 82 926 | Signed-off-by: Brian Behlendorf <[email protected]> Closes #1461
* Add new kstat for monitoring time in dmu_tx_assignPrakash Surya2013-07-112-0/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change adds a new kstat to gain some visibility into the amount of time spent in each call to dmu_tx_assign. A histogram is exported via a new dmu_tx_assign_histogram-$POOLNAME file. The information contained in this histogram is the frequency dmu_tx_assign took to complete given an interval range. For example, given the below histogram file: $ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank 12 1 0x01 32 1536 19792068076691 20516481514522 name type data 1 us 4 859 2 us 4 252 4 us 4 171 8 us 4 2 16 us 4 0 32 us 4 2 64 us 4 0 128 us 4 0 256 us 4 0 512 us 4 0 1024 us 4 0 2048 us 4 0 4096 us 4 0 8192 us 4 0 16384 us 4 0 32768 us 4 1 65536 us 4 1 131072 us 4 1 262144 us 4 4 524288 us 4 0 1048576 us 4 0 2097152 us 4 0 4194304 us 4 0 8388608 us 4 0 16777216 us 4 0 33554432 us 4 0 67108864 us 4 0 134217728 us 4 0 268435456 us 4 0 536870912 us 4 0 1073741824 us 4 0 2147483648 us 4 0 one can see most calls to dmu_tx_assign completed in 32us or less, but a few outliers did not. Specifically, 4 of the calls took between 262144us and 131072us. This information is difficult, if not impossible, to gather without this change. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1584
* Log pool suspension warnings to the consoleBrian Behlendorf2013-07-101-0/+3
| | | | | | | | | In the event that a pool gets suspended log this information to the console. This is critical information and we want to make sure it gets logged. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1555
* Use GFP_NOIO in vdev_disk_io_flush()Brian Behlendorf2013-07-101-1/+1
| | | | | | | | | To avoid a potential deadlock when using a zvol as a swap device prevent vdev_disk_io_flush() from performing IO during the bio_alloc(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1508
* Improve code in arc_buf_remove_refYing Zhu2013-07-091-1/+2
| | | | | | | | | | When we remove references of arc bufs in the arc_anon state we needn't take its header's hash_lock, so postpone it to where we really need it to avoid unnecessary invocations of function buf_hash. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1557
* Update zio.cShen Yan2013-07-091-1/+1
| | | | | | | The cv_wait_io is used to account io time instead of cv_wait. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1566
* Add zfs_autoimport_disable tunableBrian Behlendorf2013-07-091-0/+8
| | | | | | | | | | There are times when it is desirable for zfs to not automatically populate the spa namespace at module load time using the pools in the /etc/zfs/zpool.cache file. The zfs_autoimport_disable module option has been added to control this behavior. Signed-off-by: Brian Behlendorf <[email protected]> Issue #330
* 3.10 API change: block_device_operations->release() returns voidChris Dunlop2013-07-081-0/+6
| | | | | | | | | | Linux kernel commit torvalds/linux@db2a144 changed the return type of block_device_operations->release() to void. Detect the expected prototype and defined our callout accordingly. Signed-off-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1494
* Open pools asynchronously after module loadBrian Behlendorf2013-07-031-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the side effects of calling zvol_create_minors() in zvol_init() is that all pools listed in the cache file will be opened. Depending on the state and contents of your pool this operation can take a considerable length of time. Doing this at load time is undesirable because the kernel is holding a global module lock. This prevents other modules from loading and can serialize an otherwise parallel boot process. Doing this after module inititialization also reduces the chances of accidentally introducing a race during module init. To ensure that /dev/zvol/<pool>/<dataset> devices are still automatically created after the module load completes a udev rules has been added. When udev notices that the /dev/zfs device has been create the 'zpool list' command will be run. This then will cause all the pools listed in the zpool.cache file to be opened. Because this process in now driven asynchronously by udev there is the risk of problems in downstream distributions. Signed-off-by: Brian Behlendorf <[email protected]> Issue #756 Issue #1020 Issue #1234
* Cleanup zvol initialization codeRichard Yao2013-07-031-10/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following error will occur on some (possibly all) kernels because blk_init_queue() will try to take the spinlock before we initialize it. BUG: spinlock bad magic on CPU#0, zpool/4054 lock: 0xffff88021a73de60, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 Pid: 4054, comm: zpool Not tainted 3.9.3 #11 Call Trace: [<ffffffff81478ef8>] spin_dump+0x8c/0x91 [<ffffffff81478f1e>] spin_bug+0x21/0x26 [<ffffffff812da097>] do_raw_spin_lock+0x127/0x130 [<ffffffff8147d851>] _raw_spin_lock_irq+0x21/0x30 [<ffffffff812c2c1e>] cfq_init_queue+0x1fe/0x350 [<ffffffff812aacb8>] elevator_init+0x78/0x140 [<ffffffff812b2677>] blk_init_allocated_queue+0x87/0xb0 [<ffffffff812b26d5>] blk_init_queue_node+0x35/0x70 [<ffffffff812b271e>] blk_init_queue+0xe/0x10 [<ffffffff8125211b>] __zvol_create_minor+0x24b/0x620 [<ffffffff81253264>] zvol_create_minors_cb+0x24/0x30 [<ffffffff811bd9ca>] dmu_objset_find_spa+0xea/0x510 [<ffffffff811bda71>] dmu_objset_find_spa+0x191/0x510 [<ffffffff81253ea2>] zvol_create_minors+0x92/0x180 [<ffffffff811f8d80>] spa_open_common+0x250/0x380 [<ffffffff811f8ece>] spa_open+0xe/0x10 [<ffffffff8122817e>] pool_status_check.part.22+0x1e/0x80 [<ffffffff81228a55>] zfsdev_ioctl+0x155/0x190 [<ffffffff8116a695>] do_vfs_ioctl+0x325/0x5a0 [<ffffffff8116a950>] sys_ioctl+0x40/0x80 [<ffffffff814812c9>] ? do_page_fault+0x9/0x10 [<ffffffff81483929>] system_call_fastpath+0x16/0x1b zd0: unknown partition table We fix this by calling spin_lock_init before blk_init_queue. The manner in which zvol_init() initializes structures is suspectible to a race between initialization and a probe on a zvol. We reorganize zvol_init() to prevent that. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Call zvol_create_minors() in spa_open_common() when initializing poolPawel Jakub Dawidek2013-07-032-3/+13
| | | | | | | | | | | | | | | There is an extremely odd bug that causes zvols to fail to appear on some systems, but not others. Recently, I was able to consistently reproduce this issue over a period of 1 month. The issue disappeared after I applied this change from FreeBSD. This is from FreeBSD's pool version 28 import, which occurred in revision 219089. Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #441 Issue #599
* Illumos #3498 panic in arc_read()George Wilson2013-07-0212-177/+70
| | | | | | | | | | | | | | 3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt) Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@1b912ec7100c10e7243bf0879af0fe580e08c73d https://www.illumos.org/issues/3498 Ported-by: Brian Behlendorf <[email protected]> Closes #1249
* Illumos #3122 zfs destroy filesystem should prefetch blocksMatthew Ahrens2013-07-022-15/+85
| | | | | | | | | | | | | | | 3122 zfs destroy filesystem should prefetch blocks Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: illumos/illumos-gate@b4709335aa83dcbfd0dba33c9be21fcabebd28e4 https://www.illumos.org/issues/3122 Ported-by: Brian Behlendorf <[email protected]> Closes #1565
* Add zfs_sync_pass_* tunable parametersCyril Plisko2013-07-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | Commit 55d85d5a8c45c4559a4a0e675c37b0c3afb19c2f (backport of the upstream changes) replaced three hardcoded constants: #define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass */ #define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */ #define SYNC_PASS_REWRITE 1 /* rewrite new bps after this pass */ with a tunable parameters: int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */ int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */ int zfs_sync_pass_rewrite = 2; /* rewrite new bps starting in this pass */ This commit makes these tunables available as module parameters in Linux. They should only be used for performance analysis because changing them can result in subtle and pathological performance problems. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1562
* Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()Li Dongyang2013-07-022-11/+50
| | | | | | | | | | | | | | | | | | | | | | | | The approach taken was the rework zfs_holey() as little as possible and then just wrap the code as needed to ensure correct locking and error handling. Tested with xfstests 285 and 286. All tests pass except for 7-9 of 285 which try to reserve blocks first via fallocate(2) and fail because fallocate(2) is not yet supported. Note that the filp->f_lock spinlock did not exist prior to Linux 2.6.30, but we avoid the need for autotools check by virtue of the fact that SEEK_DATA/SEEK_HOLE support was not added until Linux 3.1. An autoconf check was added for lseek_execute() which is currently a private function but the expectation is that it will be exported perhaps as early as Linux 3.11. Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1384
* Readd zfs_holey() from OpenSolarisMatthew Ahrens2013-07-021-0/+45
| | | | | | | | | | | This patch restores the zfs_holey() function from OpenSolaris. This was removed by commit 3558fd7 because it wasn't clear we had a use for it in ZoL. However, this functionality is a prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #1384
* kmem_zalloc(..., KM_SLEEP) will never failshenyan12013-07-012-5/+1
| | | | | | | | | By definitition these allocations will never fail. For consistency with the rest of the code remove this dead error handling code. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1558
* Fix zfs_sb_teardown/zfs_resume_fs NULL dereferenceTim Chase2013-07-011-5/+8
| | | | | | | | | | | | | | | Fix a pair of conditions in which a concurrent umount can cause NULL pointer dereferences: * zfs_sb_teardown - prevent a NULL dereference by not calling dmu_objset_pool with a null z_os. * zfs_resume_fs - don't try to unmount with a null z_os. This change makes the ZoL code more consistent with both Illumos and FreeBSD. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1543
* Fix module probe failure on 32-bit systemsYing Zhu2013-06-271-2/+2
| | | | | | | | | | | | | | | | | Previous commit 7ef5e54e2e28884a04dc800657967b891239e933 caused module probe failure on 32-bit systems, dmesg showed Unknown symbol __moddi3 This was caused by the modulo operation 'gethrtime() % tqs->stqs_count' in the committed code. Instead of implementing __moddi3 for all 32-bit systems, Behlendorf advised we can just cast the return value of gethrtime() into a uint64_t, since gethrtime does not return negative value on all circumstances we need not care about the potential overflow. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1551
* Return -EOPNOTSUPP for ZFS_IOC_{GET|SET}FLAGSBrian Behlendorf2013-06-261-0/+29
| | | | | | | | | | Until these hooks are fully implemented return the expected -EOPNOTSUPP error to indicate they are not functional. This allows test suites such as xfstests to cleanly skip testing this functionality until it's implemented. Signed-off-by: Brian Behlendorf <[email protected]> Issue #229
* Illumos #3805 arc shouldn't cache freed blocksMatthew Ahrens2013-06-202-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3805 arc shouldn't cache freed blocks Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Will Andrews <[email protected]> Approved by: Dan McDonald <[email protected]> References: illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2 https://www.illumos.org/issues/3805 ZFS should proactively evict freed blocks from the cache. On dcenter, we saw that we were caching ~256GB of metadata, while the pool only had <4GB of metadata on disk. We were wasting about half the system's RAM (252GB) on blocks that have been freed. Even though these freed blocks will never be used again, and thus will eventually be evicted, this causes us to use memory inefficiently for 2 reasons: 1. A block that is freed has no chance of being accessed again, but will be kept in memory preferentially to a block that was accessed before it (and is thus older) but has not been freed and thus has at least some chance of being accessed again. 2. We partition the ARC into several buckets: user data that has been accessed only once (MRU) metadata that has been accessed only once (MRU) user data that has been accessed more than once (MFU) metadata that has been accessed more than once (MFU) The user data vs metadata split is somewhat arbitrary, and the primary control on how much memory is used to cache data vs metadata is to simply try to keep the proportion the same as it has been in the past (each bucket "evicts against" itself). The secondary control is to evict data before evicting metadata. Because of this bucketing, we may end up with one bucket mostly containing freed blocks that are very old, while another bucket has more recently accessed, still-allocated blocks. Data in the useful bucket (with still-allocated blocks) may be evicted in preference to data in the useless bucket (with old, freed blocks). On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU data bucket was 27GB and the MRU metadata bucket was 256GB. However, the vast majority of data in the MRU metadata bucket (256GB) was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB) was constantly evicting useful blocks that will be soon needed. The problem of cache segmentation is a larger problem that needs more investigation. However, if we stop caching freed blocks, it should reduce the impact of this more fundamental issue. Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1503
* Illumos #3552, #3564George Wilson2013-06-193-94/+290
| | | | | | | | | | | | | | | | | | 3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread 3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing) Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@16a4a8074274d2d7cc408589cf6359f4a378c861 https://www.illumos.org/issues/3552 https://www.illumos.org/issues/3564 Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1513
* Illumos #3006Madhav Suresh2013-06-1928-115/+124
| | | | | | | | | | | | | | | | | | | | 3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument is zero Reviewed by Matt Ahrens <[email protected]> Reviewed by George Wilson <[email protected]> Approved by Eric Schrock <[email protected]> References: illumos/illumos-gate@fb09f5aad449c97fe309678f3f604982b563a96f https://illumos.org/issues/3006 Requires: zfsonlinux/spl@1c6d149feb4033e4a56fb987004edc5d45288bcb Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1509
* Only check directory xattr on ENOENTBrian Behlendorf2013-05-101-1/+1
| | | | | | | | | | | | | | | When SA xattrs are enabled only fallback to checking the directory xattrs when the name is not found as a SA xattr. Otherwise, the SA error which should be returned to the caller is overwritten by the directory xattr errors. Positive return values indicating success will also be immediately returned. In the case of #1437 the ERANGE error was being correctly returned by zpl_xattr_get_sa() only to be overridden with ENOENT which was returned by the subsequent unnessisary call to zpl_xattr_get_dir(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1437
* zfs_scrub_limit tunable is not used anywhereCyril Plisko2013-05-061-6/+0
| | | | | | | | | | | | As a part of scrub/resilver tuning zfs_scrub_limit fell out of use, but the definition of the variable remained in place. Moreover various guides still (misleadingly) mention it as a way to influence resilver/scrub behavior. This commit removes its finally. Signed-off-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1444
* Fix incorrect assertions in ddt_phys_decref and ddt_sync_entryYing Zhu2013-05-061-2/+1
| | | | | | | | | | | The assertions in ddt_phys_decref and ddt_sync_entry cast ddp->ddp_refcnt from uint64_t to int64_t, with a reference count bigger than 2^63, e.g. the reference count of zero blocks commonly available in spare files, we may mistakenly hit these assertations, so drop the type conversions here. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1436
* Use taskq for dump_bytes()Brian Behlendorf2013-05-062-6/+59
| | | | | | | | | | | | | | | | | | | The vn_rdwr() function performs I/O by calling the vfs_write() or vfs_read() functions. These functions reside just below the system call layer and the expectation is they have almost the entire 8k of stack space to work with. In fact, certain layered configurations such as ext+lvm+md+multipath require the majority of this stack to avoid stack overflows. To avoid this posibility the vn_rdwr() call in dump_bytes() has been moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be performed with the majority of the stack space available. This ends up being very similiar to as if the I/O were issued via sys_write() or sys_read(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1399 Closes #1423
* Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contentionAdam Leventhal2013-05-063-75/+158
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Gordon Ross <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@ec94d32 https://illumos.org/issues/3581 Notes for Linux port: Earlier commit 08d08eb reduced contention on this taskq lock by simply reducing the number of z_fr_iss threads from 100 to one-per-CPU. We also optimized the taskq implementation in zfsonlinux/spl@3c6ed54. These changes significantly improved unlink performance to acceptable levels. This patch further reduces time spent spinning on this lock by randomly dispatching the work items over multiple independent task queues. The Illumos ZFS developers stated that this lock contention only arose after "3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()" was landed. It's not clear if 3329 affects the Linux port or not. I didn't see spa_free_sync_cb() show up in oprofile sessions while unlinking large files, but I may just not have used the right test case. I tested unlinking a 1 TB of data with and without the patch and didn't observe a meaningful difference in elapsed time. However, oprofile showed that the percent time spent in taskq_thread() was reduced from about 16% to about 5%. Aside from a possible slight performance benefit this may be worth landing if only for the sake of maintaining consistency with upstream. Ported-by: Ned Bass <[email protected]> Closes #1327
* Illumos #3329, #3330, #3331, #3335George Wilson2013-05-065-14/+59
| | | | | | | | | | | | | | | | | | | | | | | | 3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb() 3330 space_seg_t should have its own kmem_cache 3331 deferred frees should happen after sync_pass 1 3335 make SYNC_PASS_* constants tunable Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@01f55e48fb4d524eaf70687728aa51b7762e2e97 https://www.illumos.org/issues/3329 https://www.illumos.org/issues/3330 https://www.illumos.org/issues/3331 https://www.illumos.org/issues/3335 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #3306, #3321George Wilson2013-05-033-20/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3306 zdb should be able to issue reads in parallel 3321 'zpool reopen' command should be documented in the man page and help Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: illumos/illumos-gate@31d7e8fa33fae995f558673adb22641b5aa8b6e1 https://www.illumos.org/issues/3306 https://www.illumos.org/issues/3321 The vdev_file.c implementation in this patch diverges significantly from the upstream version. For consistenty with the vdev_disk.c code the upstream version leverages the Illumos bio interfaces. This makes sense for Illumos but not for ZoL for two reasons. 1) The vdev_disk.c code in ZoL has been rewritten to use the Linux block device interfaces which differ significantly from those in Illumos. Therefore, updating the vdev_file.c to use the Illumos interfaces doesn't get you consistency with vdev_disk.c. 2) Using the upstream patch as is would requiring implementing compatibility code for those Solaris block device interfaces in user and kernel space. That additional complexity could lead to confusion and doesn't buy us anything. For these reasons I've opted to simply move the existing vn_rdwr() as is in to the taskq function. This has the advantage of being low risk and easy to understand. Moving the vn_rdwr() function in to its own taskq thread also neatly avoids the possibility of a stack overflow. Finally, because of the additional work which is being handled by the free taskq the number of threads has been increased. The thread count under Illumos defaults to 100 but was decreased to 2 in commit 08d08e due to contention. We increase it to 8 until the contention can be address by porting Illumos #3581. Ported-by: Brian Behlendorf <[email protected]> Closes #1354