path: root/module
Commit message | Author | Age | Files | Lines
* Add separate aggregation limit for non-rotating media | Alexander Motin | 2019-03-13 | 1 | -1/+10

Before the sequential scrub patches ZFS never aggregated I/Os above 128KB. Sequential scrub bumped that to 1MB, supposedly to reduce the number of head seeks for spinning disks. But for SSDs it makes little to no sense, especially on FreeBSD, where due to the MAXPHYS limitation the device will likely still see a bunch of 128KB I/Os instead of one large one. Having a stricter aggregation limit for SSDs avoids allocating a large memory buffer and copying to/from it, which is a serious problem when throughput reaches gigabytes per second.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #8494
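A rough sketch of the selection logic described above; the names and default values here are illustrative placeholders, not the actual module parameters added by this commit:

    #include <stdint.h>

    /* Illustrative defaults only; the real tunables and names may differ. */
    static uint64_t aggregation_limit_rotating = 1024 * 1024;  /* 1 MB */
    static uint64_t aggregation_limit_nonrot = 128 * 1024;     /* 128 KB */

    /* Pick the aggregation cap for a vdev based on its rotational flag. */
    static uint64_t
    aggregation_limit(int vdev_nonrot)
    {
            return (vdev_nonrot ? aggregation_limit_nonrot :
                aggregation_limit_rotating);
    }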
* OpenZFS 9914 - NV_UNIQUE_NAME_TYPE broken after 9580 | Andrew Stormont | 2019-03-13 | 1 | -1/+2

Authored by: Andrew Stormont <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Reviewed by: Andy Fiddaman <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9914
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b8a5bee18
Closes #8496
* Detect and prevent mixed raw and non-raw sends | Tom Caputi | 2019-03-13 | 9 | -44/+257

Currently, there is an issue in the raw receive code where raw receives are allowed to happen on top of previously non-raw received datasets. This is a problem because the source-side dataset doesn't know how the blocks on the destination were encrypted. As a result, any MAC in the objset's checksum-of-MACs tree that is a parent of both blocks encrypted on the source and blocks encrypted by the destination will be incorrect. This will result in authentication errors when we decrypt the dataset.

This patch fixes this issue by adding a new check to the raw receive code. The code now maintains an "IVset guid", which acts as an identifier for the set of IVs used to encrypt a given snapshot. When a snapshot is raw received, the destination snapshot will take this value from the DRR_BEGIN payload. Non-raw receives and normal "zfs snap" operations will cause ZFS to generate a new IVset guid. When a raw incremental stream is received, ZFS will check that the "from" IVset guid in the stream matches that of the "from" destination snapshot. If they do not match, the code will error out the receive, preventing the problem.

This patch requires an on-disk format change to add the IVset guids to snapshots and bookmarks. As a result, this patch has errata handling and a tunable to help affected users resolve the issue with as little interruption as possible.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8308
* Add bookmark v2 on-disk feature | Tom Caputi | 2019-03-13 | 2 | -3/+20

This patch adds the bookmark v2 feature to the on-disk format. This feature will be needed for the upcoming redacted sends and for an upcoming fix for raw receives. The feature is not currently used by any code and thus this change is a no-op, aside from the fact that the user can now enable the feature.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Issue #8308
* Fix handling of maxblkid for raw sends | Tom Caputi | 2019-03-13 | 6 | -27/+81

Currently, the receive code can create an unreadable dataset from a correct raw send stream. This is because it is currently impossible to set maxblkid to a lower value without freeing the associated object. This means truncating files on the send side to a non-0 size could result in corruption. This patch solves this issue by adding a new 'force' flag to dnode_new_blkid() which will allow the raw receive code to force the DMU to accept the provided maxblkid even if it is a lower value than the existing one.

For testing purposes the send_encrypted_files.ksh test has been extended to include a variety of truncated files and multiple snapshots. It also now leverages the xattrtest command to help ensure raw receives correctly handle xattrs.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8168
Closes #8487
* Fix most zfs_arc_* mod params not actually being modifiable at runtime | Justin Gottula | 2019-03-12 | 1 | -0/+8

Most of the zfs_arc_* module parameters do not have their values used by the ARC code directly. Instead, there is a function, arc_tuning_update, which is called during module initialization and periodically thereafter, whose job is to fetch the module parameter values, clamp/limit them appropriately, and then assign those values to a separate set of internal variables that are actually referenced by the ARC code.

Commit 3ec34e55 featured an overhaul of arc_reclaim_thread, which is the former location where the post-init-time calls to arc_tuning_update would occur. The rework split the work previously done by the arc_reclaim_thread into a pair of replacement threads; and unfortunately, the call to arc_tuning_update fell through the cracks and was lost in the reorganization.

This meant that changing almost any ARC-related zfs module parameter via /sys/module/zfs/parameters/ would result in the module parameter value itself appearing to change; however the modification would not actually propagate to the ARC code and have any real effect.

This commit reinstates the post-init-time call to arc_tuning_update. It is now called during arc_adjust_cb_check; this should be equivalent to its former call location in arc_reclaim_thread.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Justin Gottula <[email protected]>
Closes #8405
Closes #8463
* Avoid retrieving unused snapshot props | Alek P | 2019-03-12 | 1 | -18/+60

This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #8077
* Fix vdev_initialize_restart / removal race | Brian Behlendorf | 2019-03-12 | 1 | -2/+4

Resolve a vdev_initialize crash uncovered by ztest. Similar to when starting a new initialization, verify that a removal is not in progress. Additionally, do not restart when the thread already exists. This check is now congruent with the POOL_INITIALIZE_DO handling in spa_vdev_initialize_impl().

Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8477
* MMP writes rotate over leaves | Olaf Faaland | 2019-03-12 | 3 | -66/+61

Instead of choosing a leaf vdev quasi-randomly, by starting at the root vdev and randomly choosing children, rotate over leaves to issue MMP writes. This fixes an issue in a pool whose top-level vdevs have different numbers of leaves. The issue is that the frequency at which individual leaves are chosen for MMP writes is based not on the total number of leaves but based on how many siblings the leaves have.

For example, in a pool like this:

               root-vdev
        +---------+---------------------------+
      vdev1                                 vdev2
        |                                     |
        |              +-------+------+------+------+
      disk1          disk2   disk3  disk4  disk5  disk6

vdev1 and vdev2 will each be chosen 50% of the time. Every time vdev1 is chosen, disk1 will be chosen. However, every time vdev2 is chosen, disk2 is chosen 20% of the time. As a result, disk1 will be sent 5x as many MMP writes as disk2. This may create wear issues in the case of SSDs. It also reduces the effectiveness of MMP as it depends on the writes being evenly distributed for the case where some devices fail or are partitioned.

The new code maintains a list of leaf vdevs in the pool. MMP records the last leaf used for an MMP write in mmp->mmp_last_leaf. To choose the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list, continuing from the head if the tail is reached. It stops when a suitable leaf is found or all leaves have been examined.

Added a test to verify MMP write distribution is even.

Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Kash Pande <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: loli10K <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #7953
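A user-space sketch of the rotation described above; the list handling and names are simplified stand-ins, not the in-kernel MMP code:

    #include <stddef.h>

    /* Simplified stand-in for a leaf vdev on a circular list. */
    struct leaf {
            struct leaf *next;      /* next leaf, wrapping back to the head */
            int writable;           /* 0 if this leaf cannot take an MMP write */
    };

    /*
     * Start at the leaf after the one used for the previous MMP write and
     * walk the ring until a writable leaf is found or every leaf has been
     * examined.
     */
    static struct leaf *
    next_mmp_leaf(struct leaf *last_leaf, size_t nleaves)
    {
            struct leaf *l = last_leaf->next;

            for (size_t i = 0; i < nleaves; i++, l = l->next) {
                    if (l->writable)
                            return (l);     /* becomes the new last leaf */
            }
            return (NULL);                  /* no suitable leaf this round */
    }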
* zfs does not honor NFS sync write semantics | George Wilson | 2019-03-11 | 1 | -2/+30

The Linux kernel's nfsd implementation uses RWF_SYNC to determine if the write is synchronous or not. This flag is used to set the kernel's I/O control block flags. Unfortunately, ZFS was not updated to inspect these flags, so NFS sync writes were not being honored. This change maps the IOCB_* flags to the ZFS equivalent.

Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Wilson <[email protected]>
Closes #8474
Closes #8452
Closes #8486
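A minimal sketch of the kind of flag translation described above; the IOCB_* values mirror the Linux kiocb flags of that era, but the ZFS-side flag name and the helper are illustrative placeholders:

    /* Mirrors of the Linux kiocb flag bits (illustrative). */
    #define IOCB_DSYNC      (1 << 4)
    #define IOCB_SYNC       (1 << 5)

    #define ZFS_IO_SYNC     0x1     /* hypothetical "write synchronously" flag */

    static int
    iocb_flags_to_zfs(int ki_flags)
    {
            int ioflag = 0;

            /* nfsd's RWF_SYNC writes arrive with these bits set on the iocb. */
            if (ki_flags & (IOCB_DSYNC | IOCB_SYNC))
                    ioflag |= ZFS_IO_SYNC;

            return (ioflag);
    }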
* Fix lockdep between ds_lock and dd_lock in dsl_dataset_namelen() | mzhivich | 2019-03-11 | 1 | -1/+1

Booting a debug kernel found an inconsistent lock dependency between a dataset's ds_lock and its directory's dd_lock.

[ 32.215336] ======================================================
[ 32.221859] WARNING: possible circular locking dependency detected
[ 32.221861] 4.14.90+ #8 Tainted: G O
[ 32.221862] ------------------------------------------------------
[ 32.221863] dynamic_kernel_/4667 is trying to acquire lock:
[ 32.221864]  (&ds->ds_lock){+.+.}, at: [<ffffffffc10a4bde>] dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.221941] but task is already holding lock:
[ 32.221941]  (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]
[ 32.221983] which lock already depends on the new lock.
[ 32.221983] the existing dependency chain (in reverse order) is:
[ 32.221984] -> #1 (&dd->dd_lock){+.+.}:
[ 32.221992]        __mutex_lock+0xef/0x14c0
[ 32.222049]        dsl_dir_namelen+0xd4/0x2d0 [zfs]
[ 32.222093]        dsl_dataset_namelen+0x2f1/0x430 [zfs]
[ 32.222142]        verify_dataset_name_len+0xd/0x40 [zfs]
[ 32.222184]        dmu_objset_find_dp_impl+0x5f5/0xef0 [zfs]
[ 32.222226]        dmu_objset_find_dp_cb+0x40/0x60 [zfs]
[ 32.222235]        taskq_thread+0x969/0x1460 [spl]
[ 32.222238]        kthread+0x2fb/0x400
[ 32.222241]        ret_from_fork+0x3a/0x50
[ 32.222241] -> #0 (&ds->ds_lock){+.+.}:
[ 32.222246]        lock_acquire+0x14f/0x390
[ 32.222248]        __mutex_lock+0xef/0x14c0
[ 32.222291]        dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.222355]        dsl_dir_tempreserve_space+0x5d2/0x1290 [zfs]
[ 32.222392]        dmu_tx_assign+0xa61/0xdb0 [zfs]
[ 32.222436]        zfs_create+0x4e6/0x11d0 [zfs]
[ 32.222481]        zpl_create+0x194/0x340 [zfs]
[ 32.222484]        lookup_open+0xa86/0x16f0
[ 32.222486]        path_openat+0xe56/0x2490
[ 32.222488]        do_filp_open+0x17f/0x260
[ 32.222490]        do_sys_open+0x195/0x310
[ 32.222491]        SyS_open+0xbf/0xf0
[ 32.222494]        do_syscall_64+0x191/0x4f0
[ 32.222496]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 32.222497] other info that might help us debug this:
[ 32.222497]  Possible unsafe locking scenario:
[ 32.222498]        CPU0                    CPU1
[ 32.222498]        ----                    ----
[ 32.222499]   lock(&dd->dd_lock);
[ 32.222500]                               lock(&ds->ds_lock);
[ 32.222502]                               lock(&dd->dd_lock);
[ 32.222503]   lock(&ds->ds_lock);
[ 32.222504]  *** DEADLOCK ***
[ 32.222505] 3 locks held by dynamic_kernel_/4667:
[ 32.222506]  #0: (sb_writers#9){.+.+}, at: [<ffffffffaf68933c>] mnt_want_write+0x3c/0xa0
[ 32.222511]  #1: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffaf652cde>] path_openat+0xe2e/0x2490
[ 32.222515]  #2: (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]

The issue is caused by dsl_dataset_namelen() holding ds_lock, followed by acquiring dd_lock on ds->ds_dir in dsl_dir_namelen(). However, ds->ds_dir should not be protected by ds_lock, so releasing it before the call to dsl_dir_namelen() prevents the lockdep issue.

Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Signed-off-by: Michael Zhivich <[email protected]>
Closes #8413
* Linux 5.1 compat: get_ds() removed | Brian Behlendorf | 2019-03-07 | 1 | -3/+3

Commit torvalds/linux@736706bee has removed the get_ds() function as a bit of cleanup. It has been defined as KERNEL_DS on all architectures for all supported kernels. Replace get_ds() with KERNEL_DS as was done in the kernel.

Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8479
* Stack overflow in recursive bpobj_iterate_impl | Paul Zuchowski | 2019-03-06 | 2 | -102/+244

The function bpobj_iterate_impl overflows the stack when bpobjs are deeply nested. Rewrite the function to eliminate the recursion.

Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Zuchowski <[email protected]>
Closes #7674
Closes #7675
Closes #7908
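A generic sketch of the technique (an explicit heap-allocated stack instead of recursion); the node type and visit callback are placeholders, not the actual bpobj structures:

    #include <stdlib.h>

    struct node {
            struct node **children;
            int nchildren;
    };

    static int
    iterate_nonrecursive(struct node *root, int (*visit)(struct node *))
    {
            size_t cap = 64, top = 0;
            struct node **stack = malloc(cap * sizeof (struct node *));
            int err = 0;

            if (stack == NULL)
                    return (-1);
            stack[top++] = root;
            while (top > 0 && err == 0) {
                    struct node *n = stack[--top];

                    err = visit(n);
                    for (int i = 0; i < n->nchildren && err == 0; i++) {
                            if (top == cap) {       /* grow instead of recursing */
                                    struct node **ns;
                                    cap *= 2;
                                    ns = realloc(stack, cap * sizeof (*stack));
                                    if (ns == NULL) {
                                            err = -1;
                                            break;
                                    }
                                    stack = ns;
                            }
                            stack[top++] = n->children[i];
                    }
            }
            free(stack);
            return (err);
    }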
* Fix race in vdev_initialize_thread | Brian Behlendorf | 2019-03-06 | 1 | -0/+7

Before allowing new allocations to the metaslab we need to ensure that any issued initializing writes have been synced. Otherwise, it's possible for metaslab_block_alloc() to allocate a range which is about to be overwritten by an initializing IO.

Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8461
* Fix style of spl_kmem_cache_create() | Matthew Ahrens | 2019-02-28 | 1 | -35/+34

Fix indentation of code in ifdef's. Remove obsolete comment. Make if/else statements more readable by adding braces.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8459
* Do not resume a pool if multihost is enabled | Olaf Faaland | 2019-02-28 | 1 | -0/+7

When multihost is enabled, and a pool is suspended, return EINVAL in response to "zpool clear <pool>". The pool may have been imported on another host while I/O was suspended.

Reviewed-by: loli10K <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #6933
Closes #8460
* abd_alloc should use scatter for >1K allocations | Matthew Ahrens | 2019-02-28 | 1 | -2/+30

abd_alloc() normally does scatter allocations, thus solving the problem that ABD originally set out to: the bulk of ZFS's allocations are single pages, which are faster to allocate and free, and don't suffer from internal fragmentation (and the inability to reclaim memory because some buffers in the slab are still allocated). However, the current code does linear allocations for 4KB and smaller allocations, defeating the purpose of ABD.

Scatter ABD's use at least one page each, so sub-page allocations waste some space when allocated as scatter (e.g. 2KB scatter allocation wastes half of each page). Using linear ABD's for small allocations means that they will be put on slabs which contain many allocations. This can improve memory efficiency, but it also makes it much harder for ARC evictions to actually free pages, because all the buffers on one slab need to be freed in order for the slab (and underlying pages) to be freed. Typically, 512B and 1KB kmem caches have 16 buffers per slab, so it's possible for them to actually waste more memory than scatter (one page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting 15/16th).

Spill blocks are typically 512B and are heavily used on systems running selinux with the default dnode size and the `xattr=sa` property set.

By default we will use linear allocations for 512B and 1KB, and scatter allocations for larger (1.5KB and up).

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: DHE <[email protected]>
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8455
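A minimal sketch of the size cutoff described above; the variable and helper names are illustrative, not the exact tunable this commit introduces:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative cutoff: 512B and 1KB stay linear, 1.5KB and up use scatter. */
    static uint64_t scatter_min_size = 1024 + 1;

    static bool
    use_linear_buffer(uint64_t size)
    {
            /* Small buffers go on kmem slabs; larger ones use per-page scatter. */
            return (size < scatter_min_size);
    }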
* Fix overly broad spa config lock | Brian Behlendorf | 2019-02-27 | 2 | -5/+4

The spa_txg_history_init_io() and spa_txg_history_fini_io() were mistakenly taking SCL_ALL when only SCL_CONFIG is required to access the vdev stats. This could result in a deadlock which was observed when running ztest.

Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8445
* Error path in metaslab_load_impl() forgets to drop ms_sync_lock | Serapheim Dimitropoulos | 2019-02-25 | 1 | -1/+3

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8444
* zvol: allow rename of in use ZVOL dataset | loli10K | 2019-02-22 | 1 | -6/+0

While ZFS allows renaming of in-use ZVOLs at the DSL level without issues, the ZVOL layer does not correctly update the renamed dataset if the device node is open (zv->zv_open_count > 0): trying to access the stale dataset name, for instance during a zfs receive, will cause the following failure:

VERIFY3(zv->zv_objset->os_dsl_dataset->ds_owner == zv) failed ((null) == ffff8800dbb6fc00)
PANIC at zvol.c:1255:zvol_resume()
Showing stack for process 1390
CPU: 0 PID: 1390 Comm: zfs Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 ffffffff8151ea00 ffffffffa0758a80 ffff88028aefba30
 ffffffffa0417219 ffff880037179220 ffffffff00000030 ffff88028aefba40
 ffff88028aefb9e0 2833594649524556 6f5f767a3e2d767a 6f3e2d7465736a62
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? rrw_exit+0xc8/0x2e0 [zfs]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? dmu_objset_from_ds+0x9a/0x250 [zfs]
 [<0>] ? dmu_objset_hold_flags+0x71/0xc0 [zfs]
 [<0>] ? zvol_resume+0x178/0x280 [zfs]
 [<0>] ? zfs_ioc_recv_impl+0x88b/0xf80 [zfs]
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? zfs_ioc_recv+0x1c2/0x2a0 [zfs]
 [<0>] ? dmu_buf_get_user+0x13/0x20 [zfs]
 [<0>] ? __alloc_pages_nodemask+0x166/0xb50
 [<0>] ? zfsdev_ioctl+0x896/0x9c0 [zfs]
 [<0>] ? handle_mm_fault+0x464/0x1140
 [<0>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<0>] ? __do_page_fault+0x177/0x410
 [<0>] ? SyS_ioctl+0x81/0xa0
 [<0>] ? async_page_fault+0x28/0x30
 [<0>] ? system_call_fast_compare_end+0x10/0x15

Reviewed by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6263
Closes #8371
* zpool reports 16E expandsize on disks with oddball number of sectors | loli10K | 2019-02-22 | 2 | -4/+22

The issue is caused by a small discrepancy in how userland creates the partition layout and the kernel estimates available space:

* zpool command: subtract 9M from the usable device size, then align to a 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M EFI "reserved" partition.

* kernel module: subtract 10M from the device size. 10M is the sum of 1M "start" partition alignment + 1M "end" partition alignment + 8M EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the alignment size the zpool command will create a partition layout which reserves less than 1M after the 8M EFI "reserved" partition:

Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

Device       Start      End  Sectors   Size  Type
/dev/sda1     2048  2078719  2076672  1014M  Solaris /usr & Apple ZFS
/dev/sda9  2078720  2095103    16384     8M  Solaris reserved 1

When the kernel module opens the device in vdev_open() its max_asize ends up being slightly smaller than asize: this results in a huge number (16E) reported by metaslab_class_expandable_space(). This change prevents bdev_max_capacity() from returning a size smaller than bdev_capacity().

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed by: Sara Hartse <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #1468
Closes #8391
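A rough sketch of the clamp described above; the function and constant names are illustrative, not the actual bdev_max_capacity() implementation:

    #include <stdint.h>

    #define PARTITION_OVERHEAD  (10ULL * 1024 * 1024)  /* 1M + 1M + 8M */

    /*
     * Estimate the expandable capacity, but never report less than the
     * space that is already usable, so the expandsize subtraction cannot
     * wrap around to a huge unsigned value.
     */
    static uint64_t
    max_capacity(uint64_t device_bytes, uint64_t usable_bytes)
    {
            uint64_t est = 0;

            if (device_bytes > PARTITION_OVERHEAD)
                    est = device_bytes - PARTITION_OVERHEAD;

            return (est > usable_bytes ? est : usable_bytes);
    }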
* Fix dnode_hold_impl() soft lockup | lidongyang | 2019-02-22 | 1 | -56/+52

Soft lockups could happen when multiple threads try to get the zrl on the same dnode handle in order to allocate and initialize the dnode marked as DN_SLOT_ALLOCATED. Don't loop from the beginning when we can't get the zrl, otherwise we would increase the zrl refcount and nobody can actually lock it.

Reviewed by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Li Dongyang <[email protected]>
Closes #8433
* Don't enter zvol's rangelock for read bio with size 0 | Tomohiro Kusumi | 2019-02-20 | 1 | -0/+10

The SCST driver (SCSI target driver implementation) and possibly others may issue read bio's with a length of zero bytes. Although this is unusual, such bio's issued under certain conditions can cause a kernel oops, due to how rangelock is implemented. rangelock_add_reader() is not made to handle overlap of two (or more) ranges from read bio's with the same offset when one of them has a size of 0, even though they conceptually overlap.

Allowing them to enter rangelock results in a kernel oops by dereferencing an invalid pointer, or an assertion failure on AVL tree manipulation with a debug-enabled kernel module. For example, this happens when a read bio whose (offset, size) is (0, 0) enters rangelock followed by another read bio with (0, 4096) while the (0, 0) rangelock is still locked, when there are no pending write bio's. It can also happen in the reverse order, which is (0, N) followed by (0, 0) when (0, N) is still locked. More details are mentioned in #8379 ("Kernel Oops on ->make_request_fn() of ZFS volume", https://github.com/zfsonlinux/zfs/issues/8379).

Prevent this by returning a bio with size 0 as success without entering rangelock. This has been done for write bio's after checking the flusher bio case (though not for the same reason), but not for read bio's.

Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8379
Closes #8401
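A minimal sketch of the early return described above; the request descriptor and function are placeholders, not the actual zvol request path:

    #include <stdint.h>

    /* Placeholder request descriptor; only the fields needed here are shown. */
    struct read_request {
            uint64_t offset;
            uint64_t size;
    };

    static int
    service_read(struct read_request *rq)
    {
            /*
             * A zero-length read conceptually overlaps any range starting at
             * the same offset, which the range lock cannot represent, so
             * complete it successfully without taking the lock.
             */
            if (rq->size == 0)
                    return (0);

            /* ... enter the range lock as a reader and service the read ... */
            return (0);
    }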
* Introduce auxiliary metaslab histograms | Serapheim Dimitropoulos | 2019-02-20 | 2 | -18/+296

This patch introduces 3 new histograms per metaslab. These histograms track segments that have made it to the metaslab's space map histogram (and are part of the spacemap) but have not yet reached the ms_allocatable tree on loaded metaslabs, because these metaslabs are currently syncing and haven't gone through metaslab_sync_done() yet.

The histograms help when we decide whether to load an unloaded metaslab in order to allocate from it. When calculating the weight of an unloaded metaslab traditionally, we look at the highest bucket of its spacemap's histogram. The problem is that we are not guaranteed to be able to allocate that segment when we load the metaslab because it may still be in the freeing, freed, or defer trees. The new histograms are used when we try to calculate an unloaded metaslab's weight to deal with this issue by removing segments that would not be in the allocatable tree at runtime. Note that this method of dealing with this is not completely accurate, as adjacent segments are not always consolidated in the space map histogram of a metaslab.

In addition, and to make things deterministic, we always reset the weight of unloaded metaslabs based on their space map weight (instead of doing that on a need basis). Thus, every time a metaslab is loaded and its weight is reset again (from the weight based on its space map to the one based on its allocatable range tree) we expect (and assert) that this change in weight can only get better if it doesn't stay the same.

Reviewed by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8358
* Prevent user accounting on readonly pool | loli10K | 2019-02-19 | 1 | -2/+4

Trying to mount a dataset from a readonly pool could inadvertently start the user accounting upgrade task, leading to the following failure:

VERIFY3(tx->tx_threads == 2) failed (0 == 2)
PANIC at txg.c:680:txg_wait_synced()
Showing stack for process 2541
CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs]
 [<0>] ? dmu_object_next+0x77/0x130 [zfs]
 [<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs]
 [<0>] ? txg_wait_synced+0x91/0x220 [zfs]
 [<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs]
 [<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs]
 [<0>] ? taskq_thread+0x2cc/0x5d0 [spl]
 [<0>] ? wake_up_state+0x10/0x10
 [<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl]
 [<0>] ? kthread+0xbd/0xe0
 [<0>] ? kthread_create_on_node+0x180/0x180
 [<0>] ? ret_from_fork+0x58/0x90
 [<0>] ? kthread_create_on_node+0x180/0x180

This patch updates both functions responsible for checking if we can perform user accounting to verify the pool is not readonly.

Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8424
* Delay injection can cause indefinitely hung zios | Sara Hartse | 2019-02-15 | 1 | -0/+1

If we hit the (NSEC_TO_TICK(diff) == 0) condition in zio_delay_interrupt, zio_interrupt is never called and the zio does not progress.

Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: sara hartse <[email protected]>
Closes #8404
* zio_deadman_impl() fix and enhancement | Tim Chase | 2019-02-15 | 1 | -9/+14

Add the zio_deadman_log_all tunable to print all zios in zio_deadman_impl(). Also, in all cases, display the depth of the zio relative to the original parent zio. This is meant to be used by developers to gain diagnostic information for hangs which don't involve fully set-up zio trees or are otherwise stuck or hung in an early stage.

Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: loli10K <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #8362
* zfs should optionally send holds | Paul Zuchowski | 2019-02-15 | 2 | -2/+23

Add -h switch to zfs send command to send dataset holds. If holds are present in the stream, zfs receive will create them on the target dataset, unless the zfs receive -h option is used to skip receive of holds.

Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: loli10K <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Zuchowski <[email protected]>
Closes #7513
* Fix obsolete comment on rangelock | Tomohiro Kusumi | 2019-02-14 | 1 | -1/+1

5d43cc9a59 renamed it to rangelock_enter().

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8408
* Freeing throttle should account for holes | Alek P | 2019-02-12 | 1 | -10/+31

The deletion throttle currently does not account for holes in a file. This means that it can activate when it shouldn't. To fix it we switch the throttle to be based on the number of L1 blocks we will have to dirty when freeing.

Reviewed by: Tom Caputi <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #7725
Closes #7888
* port async unlinked drain from illumos-nexenta | Alek P | 2019-02-12 | 5 | -9/+142

This patch is an async implementation of the existing sync zfs_unlinked_drain() function. This function is called at mount time and is responsible for freeing znodes that we didn't get to freeing before. We no longer have to hold up mounting of the dataset until the unlinked list is fully drained, as is done now. Since we can process the unlinked set asynchronously, this results in a better user experience when mounting a dataset with entries in the unlinked set.

Reviewed by: Jorgen Lundman <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #8142
* Get rid of space_map_update() for ms_synced_length | Serapheim Dimitropoulos | 2019-02-12 | 8 | -194/+188

Initially, metaslabs and space maps used to be the same thing in ZFS. Later, we started differentiating them by referring to the space map as the on-disk state of the metaslab, making the metaslab a higher-level concept that is metadata that deals with space accounting. Today we've managed to split that code further, with the space map being its own on-disk data structure used in areas of ZFS besides metaslabs (e.g. the vdev-wide space maps used for the zpool checkpoint or vdev removal features).

This patch refactors the space map code to further split it from the metaslab code. It does so by getting rid of the idea that the space map can have a different in-core and on-disk length (sm_length vs smp_length), which is something that is only used by the metaslab code and that other consumers of space maps just have to deal with. Instead, this patch moves the old in-core length of the metaslab's space map to the metaslab structure itself (see the ms_synced_length field) while making the space map code only care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have to deal with syncing two different lengths for the same structure (e.g. space_map_update() goes away) while metaslab-specific behavior stays within the metaslab code. Specifically, the ms_synced_length field keeps track of the amount of data metaslab_load() can read from the metaslab's space map while working concurrently with metaslab_sync(), which may be appending to that same space map.

As a side note, the patch also adds a few comments around the metaslab code documenting some assumptions and expected behavior.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8328
* ZVOLs should not be allowed to have children | loli10K | 2019-02-08 | 4 | -7/+80

zfs create, receive and rename can bypass this hierarchy rule. Update both the userland tools and the kernel module to prevent this issue, and use pyzfs unit tests to exercise the ioctls directly.

Note: this commit slightly changes the zfs_ioc_create() ABI. This allows differentiating a generic error (EINVAL) from the specific case where we tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: loli10K <[email protected]>
* Pool allocation classes misplacing small file blocks | loli10K | 2019-02-08 | 1 | -1/+1

Due to an off-by-one condition in spa_preferred_class() we are picking the "normal" allocation class instead of the "special" one for file blocks with size equal to the special_small_blocks property value. This change fixes the small code issue and updates the ZFS Test Suite and the zfs(8) man page.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8351
Closes #8361
* Fix ARC stats for embedded blkptrs | Tim Chase | 2019-02-04 | 1 | -15/+32

Re-factor arc_read() to better account for embedded data blkptrs. Previously, reading the payload from an embedded blkptr would cause arcstats such as demand_metadata_misses to be bumped when there was actually no cache "miss" because the data are already available in the blkptr.

The following test procedure was used to demonstrate the problem:

    zpool create tank ...
    zfs create -o compression=lz4 tank/fs
    echo blah > /tank/fs/blah
    stat /tank/fs/blah
    grep 'meta.*mis' /proc/spl/kstat/zfs/arcstats

and repeating the last two steps to watch the metadata miss counter increment. This can also be demonstrated via the zfs_arc_miss DTRACE4 probe in arc_read().

Reviewed-by: loli10K <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #8319
* Simplify log vdev removal code | Serapheim Dimitropoulos | 2019-01-31 | 2 | -54/+14

Get rid of the majority of the metaslab metadata when removing log vdevs in spa_vdev_remove_log() with a call to metaslab_fini(), instead of duplicating a lot of that in vdev_remove_empty_log().

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8347
* vs_alloc can underflow in L2ARC vdevs | Serapheim Dimitropoulos | 2019-01-31 | 2 | -6/+16

The current L2 ARC device code consistently uses psize to increment vs_alloc but varies between psize and lsize when decrementing it. The result of this behavior is that vs_alloc can be decremented more than it is incremented and underflow. This patch changes the code so asize is used everywhere.

In addition, it ensures that vs_alloc gets incremented by the L2 ARC device code as buffers are written and not at the end of the l2arc_write_buffers() routine. The latter (and old) way would temporarily underflow vs_alloc as buffers that were just written would be destroyed while l2arc_write_buffers() was still looping.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8298
* Don't acquire zthr_request_lock in zthr_wakeup | Sara Hartse | 2019-01-30 | 1 | -18/+36

Address a deadlock caused by simultaneous wakeup and cancel on a zthr by removing the hold of zthr_request_lock from zthr_wakeup. This allows zthr_wakeup to not block a thread that is in the process of being cancelled.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Sara Hartse <[email protected]>
Closes #8333
* Linux 5.0 compat: Fix bio_set_dev() | Brian Behlendorf | 2019-01-28 | 1 | -2/+27

The Linux 5.0 kernel updated the bio_set_dev() macro so it calls the GPL-only bio_associate_blkg() symbol, thus inadvertently converting the entire macro to GPL-only. Provide a minimal version which always assigns the request queue's root_blkg to the bio.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8287
* Linux 5.0 compat: Fix SUBDIRs | Tony Hutter | 2019-01-28 | 1 | -3/+3

SUBDIRs has been deprecated for a long time, and was finally removed in the 5.0 kernel. Use "M=" instead.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8257
* Linux 5.0 compat: Convert MS_* macros to SB_* | Tony Hutter | 2019-01-28 | 2 | -12/+14

In the 5.0 kernel, only the mount namespace code should use the MS_* macros. Filesystems should use the SB_* ones.

https://patchwork.kernel.org/patch/10552493/

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8264
* Linux 5.0 compat: Use totalram_pages() | Tony Hutter | 2019-01-28 | 1 | -2/+2

totalram_pages was converted to an atomic variable in 5.0:

https://patchwork.kernel.org/patch/10652795/

Its value should now be read through the totalram_pages() helper function.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8263
* Linux 5.0 compat: access_ok() drops 'type' parameter | Tony Hutter | 2019-01-28 | 1 | -2/+1

access_ok no longer needs a 'type' parameter in the 5.0 kernel.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8261
* Change target size of metaslabs from 256GB to 16GB | Serapheim Dimitropoulos | 2019-01-25 | 1 | -36/+50

= Old behavior

For vdev sizes 100GB to 50TB we keep ~200 metaslabs per vdev and the metaslab size grows from 512MB to 256GB. For vdevs bigger than that we start increasing the number of metaslabs until we hit the 128K limit.

= New behavior

For vdev sizes 100GB to 3TB we keep ~200 metaslabs per vdev and the metaslab size grows from 512MB to 16GB. For vdevs bigger than that we start increasing the number of metaslabs until we hit the 128K limit.

= Reasoning

The old behavior makes metaslabs grow in size when the vdev range is between 3TB (ms_size 16GB) and 32PB (ms_size 256GB). Even though keeping the number of metaslabs is good in terms of potential number of I/Os per TXG, these bigger metaslabs take longer to be loaded and after they are loaded they can take up a lot of memory because of their range trees. This change tries to put a boundary in memory and loading time for the specific range of vdev sizes.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8324
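A simplified sketch of the sizing rule described above; the helper name is illustrative and the 512MB lower bound on metaslab size is omitted for brevity:

    #include <stdint.h>

    /*
     * Aim for ~200 metaslabs per vdev, but cap the size of a single
     * metaslab at 16GB and the number of metaslabs per vdev at 128K.
     */
    static uint64_t
    metaslab_count_for_vdev(uint64_t vdev_bytes)
    {
            const uint64_t target_count = 200;
            const uint64_t max_ms_bytes = 16ULL << 30;      /* 16GB ceiling */
            const uint64_t max_count = 128ULL * 1024;       /* 128K metaslabs */
            uint64_t count = target_count;

            if (vdev_bytes / count > max_ms_bytes)
                    count = vdev_bytes / max_ms_bytes;
            if (count > max_count)
                    count = max_count;

            return (count);
    }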
* Rename range_tree_verify to range_tree_verify_not_present | Serapheim Dimitropoulos | 2019-01-25 | 2 | -11/+11

The range_tree_verify function looks for a segment in a range tree and panics if the segment is present on the tree. This patch gives the function a more descriptive name.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8327
* Use proper tag for spa config refcounts in mmp_write_uberblock() | Tim Chase | 2019-01-25 | 1 | -1/+1

This allows the spa config refcounts to use tracking in debug builds without triggering the "No such hold %p on refcount" panic.

Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #8326
* Fix bad kmem_free() in zvol_rename_minors_impl() | Tom Caputi | 2019-01-23 | 1 | -1/+1

Currently, zvol_rename_minors_impl() calls kmem_asprintf() to allocate and initialize a string. This function is a thin wrapper around the kernel's kvasprintf() and does not call into the SPL's kmem tracking code when it is enabled. However, this function frees the string with the tracked kmem_free() instead of the untracked strfree(), which causes the SPL kmem tracking code to believe that the function is attempting to free memory it never allocated, triggering an ASSERT. This patch simply corrects this issue.

Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8307
* ztest: creates partially initialized root dataset | loli10K | 2019-01-18 | 1 | -8/+10

Since d8fdfc2 was integrated dsl_pool_create() does not call dmu_objset_create_impl() for the root dataset when running in userland (ztest): this creates a pool with a partially initialized root dataset. Trying to import and use this pool results in both zpool and zfs executables dumping core.

Fix this by adopting an alternative change suggested in the OpenZFS 8607 code review.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Original-patch-by: Robert Mustacchi <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8277
* Remove zfs_sync() panicking kernel check | Brian Behlendorf | 2019-01-18 | 1 | -7/+0

This check provides no real additional protection and unnecessarily introduces a dependency on the "oops_in_progress" kernel symbol. Remove the check; if there are special circumstances on other platforms which make this a requirement, it can be reintroduced for all relevant call paths in a more portable, comprehensive manner.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8297
* Factor metaslab_load_wait() in metaslab_load() | Serapheim Dimitropoulos | 2019-01-18 | 2 | -47/+49

Most callers that need to operate on a loaded metaslab always call metaslab_load_wait() before loading the metaslab, just in case someone else is already doing the work. Factoring metaslab_load_wait() within metaslab_load() makes the latter more robust, as callers won't have to do the load-wait check explicitly every time they need to load a metaslab.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8290
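A user-space sketch of the pattern (the load routine itself waits out a concurrent loader, so callers no longer pair an explicit "wait" with every load); the structure and names are generic stand-ins, not the metaslab code:

    #include <pthread.h>
    #include <stdbool.h>

    struct slab {
            pthread_mutex_t lock;
            pthread_cond_t cv;
            bool loading;
            bool loaded;
    };

    /* Placeholder for the expensive part of loading. */
    static int do_load_work(struct slab *s) { (void) s; return (0); }

    static int
    slab_load(struct slab *s)
    {
            int err = 0;

            pthread_mutex_lock(&s->lock);
            while (s->loading)                      /* formerly the caller's job */
                    pthread_cond_wait(&s->cv, &s->lock);
            if (!s->loaded) {
                    s->loading = true;
                    pthread_mutex_unlock(&s->lock);
                    err = do_load_work(s);          /* lock dropped while loading */
                    pthread_mutex_lock(&s->lock);
                    s->loading = false;
                    if (err == 0)
                            s->loaded = true;
                    pthread_cond_broadcast(&s->cv);
            }
            pthread_mutex_unlock(&s->lock);
            return (err);
    }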