Commit message | Author | Age | Files | Lines
* Obsolete earlier packages due to version bump | Brian Behlendorf | 2021-04-07 | 1 | -0/+2
  In order for package managers such as dnf to upgrade cleanly after the package SONAME bump, the obsolete package names must be known. Update the new packages to correctly obsolete the old ones.

  Reviewed-by: Olaf Faaland <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11844
  Closes #11847
* i-t: don't brokenly set the scheduler for root pool vdev's disks | наб | 2021-04-06 | 1 | -24/+0
  This effectively reverts 4fc411f7a3ecee8a70fc8d6c687fae9a1cf20b31 (part of #6807) and f6fbe25664629d1ae6a3b186f14ec69dbe6c6232 (#9042). The code itself and the latter PR cite symmetry with whole-disk-vdev behaviour (presumably because rootfs vdevs are rarely whole disks), but the code is broken for NVMe devices: it strips the controller number instead of the (potential) partition number, turning "nvme0n1p1" into "nvmen1p1", which then fails the sysfs existence check. It could be fixed to handle those (and any others) rather easily by dereferencing /sys/class/block/$devname, but this isn't the place for setting the scheduler, as noted in the commit that removed setting it by default (9e17e6f2541c69a7a5e0ed814a7f5e71cbf8b90a): use a udev rule.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11838
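The NVMe failure mode described in the scheduler commit above can be reproduced with a small C sketch. This is only an illustration of "strip the first run of digits"; the function name `strip_first_digit_run` is hypothetical, and it is not the initramfs shell code that the commit removed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: copy `in` to `out`, dropping the FIRST run of digits.
 * For "sda1" that happens to remove the partition number, but for NVMe
 * names the first digit run is the controller number.
 */
static void
strip_first_digit_run(const char *in, char *out, size_t outlen)
{
	size_t o = 0;
	int stripped = 0;	/* set once the first digit run was skipped */

	assert(outlen > 0);
	for (size_t i = 0; in[i] != '\0' && o + 1 < outlen; ) {
		if (!stripped && isdigit((unsigned char)in[i])) {
			/* skip the whole run of digits, then never again */
			while (isdigit((unsigned char)in[i]))
				i++;
			stripped = 1;
			continue;
		}
		out[o++] = in[i++];
	}
	out[o] = '\0';
}
```

With this sketch, "sda1" becomes "sda" as intended, while "nvme0n1p1" becomes "nvmen1p1", a name that does not exist under sysfs.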
* i-t: fix root=zfs:AUTO | наб | 2021-04-06 | 1 | -4/+9
  IFS= would break loops in import_pool(), which would fault any automatic import.

  Additionally, $ZFS_BOOTFS from cmdline would interfere with find_rootfs().

  If many pools were present, the same thing could happen across multiple find_rootfs() runs, so bail out early and clean up in the error path.

  Suggested-by: @nachtgeist
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11278
  Closes #11838
* zfs get -p only outputs 3 columns if "clones" property is empty | matt-fidd | 2021-04-06 | 3 | -4/+21
  get_clones_string currently returns an empty string for filesystem snapshots which have no clones. This breaks parsable `zfs get` output, as only three columns are output instead of four.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Fiddaman <[email protected]>
  Co-authored-by: matt <[email protected]>
  Closes #11837
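The column-count problem from the `zfs get -p` commit above can be sketched as follows. This is a minimal sketch, not the actual libzfs change: the helper `format_parsable_row`, the tab separator, and the choice of "-" as the placeholder for an empty value are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format one row as NAME, PROPERTY, VALUE, SOURCE
 * separated by tabs. An empty value is replaced by "-" (an assumed
 * placeholder) so that a consumer splitting on whitespace always sees
 * four fields.
 */
static int
format_parsable_row(char *buf, size_t len, const char *name,
    const char *prop, const char *value, const char *source)
{
	const char *v = (value == NULL || value[0] == '\0') ? "-" : value;

	return (snprintf(buf, len, "%s\t%s\t%s\t%s", name, prop, v, source));
}
```

Without the placeholder, a whitespace-splitting parser would read the source column as the value for snapshots that have no clones.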
* kmem_alloc(KM_SLEEP) should use kvmalloc() | Matthew Ahrens | 2021-04-06 | 1 | -0/+14
  `kmem_alloc(size>PAGESIZE, KM_SLEEP)` is backed by `kmalloc()`, which finds contiguous physical memory. If there isn't enough contiguous physical memory available (e.g. due to physical page fragmentation), the OOM killer will be invoked to make more memory available. This is not ideal because processes may be killed when there is still plenty of free memory (it just happens to be in individual pages, not contiguous runs of pages). We have observed this when allocating the ~13KB `zfs_cmd_t`, for example in `zfsdev_ioctl()`.

  This commit changes the behavior of `kmem_alloc(size>PAGESIZE, KM_SLEEP)` when there are insufficient contiguous free pages. In this case we will find individual pages and stitch them together using virtual memory. This is accomplished by using `kvmalloc()`, which implements the described behavior by trying `kmalloc(__GFP_NORETRY)` and falling back on `vmalloc()`.

  The behavior of `kmem_alloc(KM_NOSLEEP)` is not changed; it continues to use `kmalloc(GFP_ATOMIC | __GFP_NORETRY)`. This is because `vmalloc()` may sleep.

  Reviewed-by: Tony Nguyen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11461
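The allocation policy described in the `kvmalloc()` commit above can be modelled in user space. The sketch below only encodes the decision logic; it allocates nothing and calls no kernel API. The names `choose_alloc_path`, `alloc_path_t`, and `SKETCH_PAGESIZE` are hypothetical, and treating single-page requests as always satisfiable is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	SKETCH_PAGESIZE	4096	/* assumed page size for the example */

typedef enum { ALLOC_KMALLOC, ALLOC_VMALLOC, ALLOC_FAIL } alloc_path_t;

/*
 * Decide which backing allocator a request ends up using.
 * may_sleep models KM_SLEEP versus KM_NOSLEEP; contiguous_ok models
 * whether a physically contiguous run of pages is available right now.
 */
static alloc_path_t
choose_alloc_path(size_t size, bool may_sleep, bool contiguous_ok)
{
	/* Simplification: small or satisfiable requests use kmalloc(). */
	if (size <= SKETCH_PAGESIZE || contiguous_ok)
		return (ALLOC_KMALLOC);
	/* No contiguous run: only sleeping callers may fall back. */
	if (may_sleep)
		return (ALLOC_VMALLOC);
	/* Non-sleeping callers cannot use vmalloc(), so they fail. */
	return (ALLOC_FAIL);
}
```

In this model a 13 KB sleeping allocation on a fragmented system takes the `vmalloc()` path instead of triggering the OOM killer, while the same request without permission to sleep simply fails.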
* zpool-features.5: remove "booting not possible with this feature"s | наб | 2021-04-06 | 1 | -4/+0
  The exact limitations on what features are supported when booting vary considerably depending on the environment. In order to minimize confusion, avoid categorical statements which assume GRUB2 is being used. The supported GRUB2 features are covered earlier in this man page for easy reference.

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11842
* man: fix wrong .Xr macros usages | George Melikov | 2021-04-06 | 7 | -10/+10
  In addition, the HTML docs will have working hyperlinks.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: George Melikov <[email protected]>
  Closes #11845
* libzutil: zfs_isnumber(): return false if input empty | наб | 2021-04-06 | 2 | -4/+7
  zpool list, which is the only user, would mistakenly try to parse the empty string as the interval in this case:

    $ zpool list "a"
    cannot open 'a': no such pool
    $ zpool list ""
    interval cannot be zero
    usage: <usage string follows>

  which is now symmetric with zpool get:

    $ zpool list ""
    cannot open '': name must begin with a letter

  Avoid breaking the "interval cannot be zero" string. There simply isn't a need for this, and it's user-facing.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11841
  Closes #11843
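The empty-input check from the `zfs_isnumber()` commit above can be sketched like this. It is a minimal stand-alone version, not the libzutil source: the name `is_number_sketch` is hypothetical, and accepting '.' as part of a number is an assumption about what an interval argument may contain.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/*
 * Return true only for a non-empty string made of digits and dots.
 * The explicit empty-string check is the point of the example: without
 * it the loop body never runs and "" would be reported as a number.
 */
static bool
is_number_sketch(const char *str)
{
	if (*str == '\0')
		return (false);
	for (; *str != '\0'; str++) {
		if (!isdigit((unsigned char)*str) && *str != '.')
			return (false);
	}
	return (true);
}
```

A caller that treats numeric arguments as intervals and everything else as pool names would then route "" to the pool-name path, which yields the "name must begin with a letter" error.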
* ZTS: pool_checkpoint improvements | Brian Behlendorf | 2021-04-03 | 4 | -7/+15
  The pool_checkpoint tests may incorrectly fail because several of them invoke zdb for an imported pool. In this scenario it's not unexpected for zdb to fail if the pool is modified. To resolve this, these zdb checks are now done after the pool has been exported.

  Additionally, the default cleanup functions assumed the pool would be imported when they were run. If this was not the case, they'd exit early and fail to clean up all of the test state, causing subsequent tests to fail. Add a check to only destroy the pool when it is imported.

  Reviewed-by: John Kennedy <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11832
* Fix various typos | Andrea Gelmini | 2021-04-02 | 57 | -75/+75
  Correct an assortment of typos throughout the code base.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11774
* bash_completion.d: always call zfs/zpool binaries directly | наб | 2021-04-02 | 3 | -9/+8
  /dev/zfs is 0:0 666 on most systems, so the [ -w /dev/zfs ] check always succeeds, but if zfs isn't in $PATH (e.g. when completing from "/sbin/zfs list" on a regular account) this can lead to error spew like

    nabijaczleweli@szarotka:~$ /sbin/zfs list bash: zfs: command not found
    @ bash: zfs: command not found

  We only do read-only commands, and quite general ones at that, so there's no need to elevate one way or another.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11828
* Add RELEASES.md file | Brian Behlendorf | 2021-04-02 | 2 | -2/+39
  Document the project's policy regarding publishing and maintaining official OpenZFS releases.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11821
* zed: allow limiting concurrent jobs | наб | 2021-04-02 | 8 | -23/+59
  The 200ms time-out is relatively long, but if we already hit the cap, then we'll likely be able to spawn multiple new jobs when we wake up.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11807
* zed: remove unused zed_file_close_on_exec() | наб | 2021-04-02 | 2 | -30/+0
  The FIXME comment has been there since the initial implementation in 2014, and there are no users.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11807
* zed: use separate reaper thread and collect ZEDLETs asynchronously | наб | 2021-04-02 | 5 | -55/+157
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11807
* zed: set names for all threads | наб | 2021-04-02 | 3 | -0/+3
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11807
* ZTS: inheritance/inherit_001_pos is flaky | Ryan Moeller | 2021-04-02 | 1 | -0/+1
  Add inheritance/inherit_001_pos to the list of tests that may fail on FreeBSD.

  Reviewed-by: John Kennedy <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11830
* Avoid taking global lock to destroy zfsdev state | Ryan Moeller | 2021-04-02 | 2 | -21/+11
  We have exclusive access to our zfsdev state object in this section until it is invalidated by setting zs_minor to -1, so we can destroy the state without taking a lock if we do the invalidation last, after a membar to ensure correct ordering.

  While here, strengthen the assertions that zs_minor is valid when we enter.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11751
* FreeBSD: Fix stable/12 after AT_BENEATH removal | Ryan Moeller | 2021-04-02 | 1 | -3/+1
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11827
* Bump libzfs.so and libzpool.so versions | Brian Behlendorf | 2021-04-01 | 4 | -25/+25
  Bump the library versions as advised by the libtool guidelines.

  https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html

  Two new functions were added but no existing functions were changed, so we increase the version and the age (version:revision:age).

  Added functions (2):
  - boolean_t zpool_is_draid_spare(const char *);
  - zpool_compat_status_t zpool_load_compat(const char *, boolean_t *, char *, char *);

  Additionally bump the libzpool.so version information. This library is for internal use but we still want to update the version to track major changes to the interfaces.

  The libzfsbootenv, libuutil, libnvpair and libzfs_core libraries have not been updated.

  Reviewed-by: Richard Laager <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11817
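The libtool rule applied in the version bump above can be written out as a small function. This is a sketch of the update steps from the libtool manual, which names the triple current:revision:age; the type and function names are hypothetical, and the starting values in the usage note are made up rather than taken from libzfs.

```c
#include <assert.h>

typedef struct {
	unsigned current;	/* newest interface number implemented */
	unsigned revision;	/* implementation revision of `current` */
	unsigned age;		/* how many older interfaces are still supported */
} version_info_t;

typedef enum {
	CHANGE_NONE,		/* nothing changed since the last release */
	CHANGE_SOURCE_ONLY,	/* code changed, interfaces untouched */
	CHANGE_IFACE_ADDED,	/* interfaces added, none removed or changed */
	CHANGE_IFACE_BROKEN	/* interfaces removed or changed */
} change_t;

/* Apply the libtool update steps for one release. */
static version_info_t
bump_version_info(version_info_t v, change_t change)
{
	switch (change) {
	case CHANGE_NONE:
		break;
	case CHANGE_SOURCE_ONLY:
		v.revision++;
		break;
	case CHANGE_IFACE_ADDED:
		v.current++;
		v.revision = 0;
		v.age++;	/* older clients keep working */
		break;
	case CHANGE_IFACE_BROKEN:
		v.current++;
		v.revision = 0;
		v.age = 0;	/* compatibility with older clients is lost */
		break;
	}
	return (v);
}
```

Starting from a hypothetical 4:0:0, adding functions without changing existing ones gives 5:0:1, which matches the "increase the version and the age" wording of the commit.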
* Allow pool names that look like Solaris disk names | Ryan Moeller | 2021-04-01 | 2 | -7/+1
  Nothing bad happens if a prefix of your pool name matches a disk name. This is a bit of a silly restriction at this point.

  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11781
  Closes #11813
* Don't scale zfs_zevent_len_max by CPU count | Ryan Moeller | 2021-04-01 | 2 | -9/+5
  The lower bound for this scaling is too low and the upper bound is too high. Use a fixed default length of 512 instead, which is a reasonable value on any system.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11822
* Atomically check and set dropped zevent count | Ryan Moeller | 2021-04-01 | 1 | -2/+1
  ratelimit_dropped isn't protected by a lock and is expected to be updated atomically.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11822
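The "check and set" in the dropped-count commit above is a read-and-reset that must happen as one step. A minimal user-space sketch with C11 atomics follows; it is not the kernel code, and the name `take_dropped_count` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Return the number of dropped events and reset the counter to zero in
 * a single atomic step. Reading and then storing zero separately could
 * lose increments made by another thread between the two operations.
 */
static uint64_t
take_dropped_count(_Atomic uint64_t *dropped)
{
	return (atomic_exchange(dropped, 0));
}
```

A reporter would call this once per posted event and attach the returned value only when it is non-zero.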
* CI: Increase free space in workflow | Brian Behlendorf | 2021-04-01 | 2 | -0/+12
  Recently we've been running out of free space in the Ubuntu 20.04 environment, resulting in test failures. This appears to be caused by a change in the default available free space and not because of any change in OpenZFS. Try to avoid this failure by applying a suggested workaround which removes some unnecessary files.

  https://github.com/actions/virtual-environments/issues/2840

  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11826
* Fixing m4 iops rename check | Brian Atkinson | 2021-04-01 | 1 | -0/+1
  The configure check for iops->rename wanting flags was missing the AC_MSG_CHECKING(), so it would just print yes without saying what was being checked.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #11825
* fsck.zfs: implement 4/8 exit codes as suggested in manpage | наб | 2021-03-31 | 5 | -20/+64
  Update the fsck.zfs helper to bubble up some already-known-about errors if they are detected in the pool.

    health=degraded => 4/"Filesystem errors left uncorrected"
    health=faulted && dataset in /etc/fstab => 8/"Operational error"
    pool not found => 8/"Operational error"
    everything else => 0

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11806
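The exit-code table from the fsck.zfs commit above maps directly onto a small decision function. The real helper is a shell script; the C function below is only a sketch of the mapping, with hypothetical names, and it compares the health string case-insensitively as an assumption about how the value is spelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <strings.h>

enum {
	FSCK_OK = 0,		/* no errors */
	FSCK_UNCORRECTED = 4,	/* "Filesystem errors left uncorrected" */
	FSCK_OPERATIONAL = 8	/* "Operational error" */
};

/* Map pool state to an fsck exit code, following the commit's table. */
static int
fsck_exit_code(bool pool_found, const char *health, bool in_fstab)
{
	if (!pool_found)
		return (FSCK_OPERATIONAL);
	if (strcasecmp(health, "degraded") == 0)
		return (FSCK_UNCORRECTED);
	if (strcasecmp(health, "faulted") == 0 && in_fstab)
		return (FSCK_OPERATIONAL);
	return (FSCK_OK);
}
```

Note that a faulted pool whose dataset is not listed in /etc/fstab falls through to 0 in this table.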
* Add compatibility file sets (ZoL 0.6.1, 0.6.4, OpenZFS 2.1) | Mike Swanson | 2021-03-31 | 5 | -0/+87
  ZoL 0.6.1 introduced feature flags with the three features that all implementations at the time were guaranteed to have. 0.6.4 introduced a few more, and 0.6.5 added two after that.

  OpenZFS 2.1 added the dRAID feature.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Mike Swanson <[email protected]>
  Closes #11818
* Update META (tag: zfs-2.1.99) | Brian Behlendorf | 2021-03-30 | 1 | -2/+2
  Increase the version to 2.1.99 to indicate the master branch is newer than the 2.1.x release. This ensures packages built from the master branch are considered to be newer than the last release.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Tag 2.1.0-rc1 (tag: zfs-2.1.0-rc1) | Brian Behlendorf | 2021-03-29 | 1 | -1/+1
  New features:
  - Distributed Spare (dRAID) Feature
  - Added "compatibility" property for zpool feature sets
  - Added zpool_influxdb command to collect zpool statistics

  Signed-off-by: Brian Behlendorf <[email protected]>
* zed: reap child after killing on time-out | наб | 2021-03-26 | 2 | -2/+3
  When a child process is killed, waitpid() must be called on the pid to reap the zombie process.

  Update the BUGS section to reflect reality by replacing "zedlets aren't time limited" with "zedlets can be interrupted".

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11769
  Closes #11798
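The kill-then-reap pattern from the zed time-out commit above looks roughly like this in plain POSIX C. It is a self-contained sketch, not zed's code: the function name `kill_and_reap_child` is hypothetical, and the child simply waits to be killed instead of running a ZEDLET.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that blocks forever, kill it, then reap it.
 * Returns the signal that terminated the child, or -1 on error.
 * Without the waitpid() call the killed child would linger as a zombie.
 */
static int
kill_and_reap_child(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return (-1);
	if (pid == 0) {
		for (;;)
			pause();	/* child: wait until killed */
	}
	if (kill(pid, SIGKILL) != 0)
		return (-1);
	if (waitpid(pid, &status, 0) != pid)
		return (-1);
	return (WIFSIGNALED(status) ? WTERMSIG(status) : -1);
}
```

The status collected by `waitpid()` also tells the parent that the child ended because of the signal rather than exiting on its own.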
* Use a helper function to clarify gang block size | Matthew Ahrens | 2021-03-26 | 4 | -11/+30
  For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the gang DVA including its children BP's. The space allocated at each DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.

  This commit makes this relationship more clear by using a helper function, `vdev_gang_header_asize()`, for the space allocated at the gang block's vdev/offset.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11744
* When specifying raidz vdev name, parity count should match | Matthew Ahrens | 2021-03-26 | 1 | -0/+30
  When specifying the name of a RAIDZ vdev on the command line, it can be specified as raidz-<vdevID> or raidzP-<vdevID>, e.g. `zpool clear poolname raidz-0` or `zpool clear poolname raidz2-0`.

  If the parity is specified in the vdev name, it should match the actual parity of that RAIDZ vdev, otherwise the command should fail. This commit makes it so.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Stuart Maybee <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11742
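The matching rule from the raidz vdev name commit above can be sketched as a small parser. This is an illustration, not the libzfs lookup code: the function name `raidz_name_matches` is hypothetical, and the name grammar is assumed to be exactly "raidz", an optional parity number, a dash, and the vdev id.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return true if `name` ("raidz-<id>" or "raidzP-<id>") may refer to
 * the RAIDZ vdev with the given id and actual parity. A parity given in
 * the name must equal the actual parity; an omitted parity matches any.
 */
static bool
raidz_name_matches(const char *name, unsigned long id, unsigned long parity)
{
	const char *p = name;
	char *end;

	if (strncmp(p, "raidz", 5) != 0)
		return (false);
	p += 5;
	if (isdigit((unsigned char)*p)) {
		/* parity was specified in the name, so it must match */
		if (strtoul(p, &end, 10) != parity)
			return (false);
		p = end;
	}
	if (*p != '-')
		return (false);
	p++;
	if (!isdigit((unsigned char)*p))
		return (false);
	unsigned long named_id = strtoul(p, &end, 10);
	return (*end == '\0' && named_id == id);
}
```

For a RAIDZ2 vdev with id 0, both "raidz-0" and "raidz2-0" match, while "raidz1-0" is rejected.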
* Fix error code on __zpl_ioctl_setflags() | Luis Henriques | 2021-03-26 | 1 | -1/+1
  Other (all?) Linux filesystems seem to return -EPERM instead of -EACCES when trying to set FS_APPEND_FL or FS_IMMUTABLE_FL without the CAP_LINUX_IMMUTABLE capability. This was detected by the generic/545 test in the fstests suite.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Luis Henriques <[email protected]>
  Closes #11791
* Support running FreeBSD buildworld on Arm-based macOS hosts | Jessica Clarke | 2021-03-26 | 1 | -0/+11
  Arm-based Macs are like FreeBSD and provide a full 64-bit stat from the start, so have no stat64 variants. Thus, define stat64 and fstat64 as aliases for the normal versions.

  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Jessica Clarke <[email protected]>
  Closes #11771
* Removed duplicated includes | Andrea Gelmini | 2021-03-22 | 15 | -20/+0
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11775
* Fix typo in Python method name | Andrea Gelmini | 2021-03-22 | 1 | -1/+1
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11776
* Split dmu_zfetch() speculation and execution parts | Alexander Motin | 2021-03-19 | 4 | -119/+194
  To make better predictions on parallel workloads, dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads at the same time on completion, which can significantly reorder the requests, making the stream look random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency.

  This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued.

  For many years this was a big problem for storage servers handling deeper request queues from their clients: they had to either serialize consecutive reads to make the ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch, deeper-queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to a FreeBSD target show me much better throughput with an almost 100% prefetcher hit rate, compared to almost zero before.

  While there, I also removed the per-stream zs_lock as useless, completely covered by the parent zf_lock. Also I reused the zs_blocks refcount to track zf_stream linkage of the stream, since I believe the previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy.

  Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list.

  Block data prefetch (speculation and indirect block prefetch are still done since they are cheaper) if all dbufs of the stream are already in DMU cache. The first cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if the same files within DMU cache capacity are read over and over.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Adam Moss <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #11652
* Fix zfs_get_data access to files with wrong generation | Chunwei Chen | 2021-03-19 | 7 | -8/+28
  If TX_WRITE is created on a file, and the file is later deleted and a new directory is created on the same object id, it is possible that when zil_commit happens, zfs_get_data will be called on the new directory. This may result in a panic as it tries to do a range lock.

  This patch fixes this issue by recording the generation number during zfs_log_write, so zfs_get_data can check if the object is valid.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #10593
  Closes #11682
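The generation check described in the zfs_get_data commit above can be modelled with two small structs. Everything here is hypothetical (the types, the function names, and the choice of ENOENT as the error for a stale record); it only illustrates why recording the generation at log time lets the commit path detect a reused object id.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of an on-disk object: its id plus a generation. */
typedef struct {
	uint64_t obj;
	uint64_t gen;
} sketch_object_t;

/* Hypothetical logged write: remembers which object AND generation. */
typedef struct {
	uint64_t obj;
	uint64_t gen;
} sketch_write_rec_t;

/* At log time, capture the generation along with the object id. */
static sketch_write_rec_t
sketch_log_write(const sketch_object_t *o)
{
	sketch_write_rec_t r = { o->obj, o->gen };
	return (r);
}

/*
 * At commit time, refuse to touch an object whose generation differs:
 * the id was reused for something else after the write was logged.
 */
static int
sketch_get_data(const sketch_write_rec_t *r, const sketch_object_t *cur)
{
	if (cur->obj != r->obj || cur->gen != r->gen)
		return (ENOENT);
	return (0);
}
```

If the file is deleted and a directory later reuses the same object id with a new generation, the stale record is skipped instead of being applied to the directory.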
* Fix regression in POSIX mode behavior | Andrew | 2021-03-19 | 5 | -4/+154
  Commit 235a85657 introduced a regression in evaluation of POSIX modes that require group DENY entries in the internal ZFS ACL. An example of such a POSIX mode is 007. When write_implies_delete_child is set, then ACE_WRITE_DATA is added to `wanted_dirperms` prior to calling zfs_zaccess_common(). This occurs in zfs_zaccess_delete().

  Unfortunately, when zfs_zaccess_aces_check hits this particular DENY ACE, zfs_groupmember() is checked to determine whether access should be denied. Since zfs_groupmember() always returns B_TRUE on Linux, this check fails, resulting ultimately in EPERM being returned.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Andrew Walker <[email protected]>
  Closes #11760
* ZTS: New test for kernel panic induced by redacted send | Palash Gandhi | 2021-03-19 | 3 | -2/+47
  This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic; the bug caused a kernel panic. It was fixed in https://github.com/openzfs/zfs/pull/11297.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: John Kennedy <[email protected]>
  Signed-off-by: Palash Gandhi <[email protected]>
  Closes #11764
* Allow setting bootfs property on pools with indirect vdevs | Martin Matuška | 2021-03-19 | 1 | -3/+1
  The FreeBSD boot loader relies on the bootfs property and is capable of booting from removed (indirect) vdevs.

  Reviewed-by: Eric van Gyzen
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Martin Matuska <[email protected]>
  Closes #11763
* Fix typo in zgenhostid.8 | Ryan Moeller | 2021-03-19 | 1 | -2/+2
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11770
* Removing old code for k(un)map_atomic | Brian Atkinson | 2021-03-19 | 3 | -10/+8
  It used to be required to pass an enum km_type to kmap_atomic() and kunmap_atomic(); however, this is no longer necessary and the wrappers zfs_k(un)map_atomic removed these. This is confusing in the ABD code as the struct abd_iter member iter_km no longer exists and the wrapper macros simply compile them out.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Adam Moss <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #11768
* Initialize metaslab range trees in metaslab_init | Serapheim Dimitropoulos | 2021-03-19 | 1 | -94/+55
  = Motivation

  We've noticed several zloop crashes within Delphix generated due to the following sequence of events:

  - A device gets expanded and new metaslabs are allocated for it. These metaslabs go through `metaslab_init()` but haven't gone through `metaslab_sync_done()` yet. This means that the only range tree that's actually set is the `ms_allocatable`. All the others are NULL.
  - A vdev initialization is issued and `vdev_initialize_thread` starts processing one of these new metaslabs of the expanded vdev.
  - As part of `vdev_initialize_calculate_progress()` we call into `metaslab_load()` and `metaslab_load_impl()` which in turn tries to dereference the metaslabs trees that are still NULL and therefore we crash.

  The same failure can come up from the `vdev_trim` code paths.

  = This Patch

  We considered the following solutions to deal with this issue:

  [A] Add logic to `vdev_initialize/trim` to skip those new metaslabs. We decided against this as it would be good to avoid exposing this lower-level detail to higher-level operations.

  [B] Have `metaslab_load_impl()` return early for new metaslabs and thus never touch those range_trees that are NULL at that time. This seemed more of a work-around for the bug and not a clear-cut solution.

  [C] Refactor our logic so all metaslabs have their range_trees created at the time of their creation in `metaslab_init()`.

  In this patch we decided to go with [C] because:

  (1) It doesn't expose more metaslab details to higher level operations such as vdev initialize and trim.

  (2) The current behavior of creating the range trees lazily in `metaslab_sync_done()` is unnecessarily complicated.

  (3) Always initializing the metaslab range_trees makes other parts of the codebase cleaner. For example, we used to use `ms_freed` as the reference value for knowing whether all the range_trees have been initialized. Now we no longer need to do that check in most places (and in the few that we do we use the `ms_new` boolean field now which is more readable).

  = Side Changes

  Probably due to a mismerge we set `ms_loaded` to `B_TRUE` twice in `metaslab_load_impl()`. In this patch we remove the extraneous assignment.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Serapheim Dimitropoulos <[email protected]>
  Closes #11737
* Linux 5.12 update: bio_max_segs() replaces BIO_MAX_PAGES | Coleman Kane | 2021-03-19 | 3 | -0/+30
  The BIO_MAX_PAGES macro is being retired in favor of a bio_max_segs() function that implements the typical MIN(x,y) logic used throughout the kernel for bounding the allocation, and also the new implementation is intended to be signed-safe (which the former was not).

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Coleman Kane <[email protected]>
  Closes #11765
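The MIN(x,y) bounding mentioned in the bio_max_segs() commit above amounts to a clamp on an unsigned argument. The following user-space sketch is not the kernel's definition: the name `bio_max_segs_sketch` and the cap value `SKETCH_BIO_CAP` of 256 are assumptions for illustration.

```c
#include <assert.h>

#define	SKETCH_BIO_CAP	256U	/* assumed cap; the kernel defines the real one */

/*
 * Clamp a requested segment count to the cap. Taking an unsigned
 * argument means a negative count cannot slip through a signed
 * comparison, which is the "signed-safe" property the commit refers to.
 */
static inline unsigned int
bio_max_segs_sketch(unsigned int nr_segs)
{
	return (nr_segs < SKETCH_BIO_CAP ? nr_segs : SKETCH_BIO_CAP);
}
```

Callers that previously wrote the MIN() by hand against the macro can pass the desired count straight to the function.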
* Linux 5.12 compat: idmapped mounts | Coleman Kane | 2021-03-19 | 22 | -123/+557
  In Linux 5.12, the filesystem API was modified to support idmapped mounts by adding a "struct user_namespace *" parameter to a number of functions and VFS handlers. This change adds the needed autoconf macros to detect the new interfaces and updates the code appropriately. This change does not add support for idmapped mounts; instead it preserves the existing behavior by passing the initial user namespace where needed. A subsequent commit will be required to add support for idmapped mounts.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Coleman Kane <[email protected]>
  Closes #11712
* Clean up RAIDZ/DRAID ereport code | Matthew Ahrens | 2021-03-19 | 11 | -448/+44
  The RAIDZ and DRAID code is responsible for reporting checksum errors on their child vdevs. Checksum errors represent events where a disk returned data or parity that should have been correct, but was not. In other words, these are instances of silent data corruption. The checksum errors show up in the vdev stats (and thus `zpool status`'s CKSUM column), and in the event log (`zpool events`).

  Note, this is in contrast with the more common "noisy" errors where a disk goes offline, in which case ZFS knows that the disk is bad and doesn't try to read it, or the device returns an error on the requested read or write operation.

  RAIDZ/DRAID generate checksum errors via three code paths:

  1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are reported on any children whose data was not used during the reconstruction. This is handled in `raidz_reconstruct()`. This is the most common type of RAIDZ/DRAID checksum error.

  2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that means that the data has been lost. The zio fails and an error is returned to the consumer (e.g. the read(2) system call). This would happen if, for example, three different disks in a RAIDZ2 group are silently damaged. Since the damage is silent, it isn't possible to know which three disks are damaged, so a checksum error is reported against every child that returned data or parity for this read. (For DRAID, typically only one "group" of children is involved in each io.) This case is handled in `vdev_raidz_cksum_finish()`. This is the next most common type of RAIDZ/DRAID checksum error.

  3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in case 2), but there happen to be additional copies of this block due to "ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those copies is good, then RAIDZ/DRAID compares each sector of the data or parity that it retrieved with the good data from the other DVA, and if they differ then it reports a checksum error on this child. This differs from case 2 in that the checksum error is reported on only the subset of children that actually have bad data or parity. This case happens very rarely, since normally only metadata has ditto blocks. If the silent damage is extensive, there will be many instances of case 2, and the pool will likely be unrecoverable.

  The code for handling case 3 is considerably more complicated than the other cases, for two reasons:

  1. It needs to run after the main raidz read logic has completed. The data RAIDZ read needs to be preserved until after the alternate DVA has been read, which necessitates refcounts and callbacks managed by the non-raidz-specific zio layer.

  2. It's nontrivial to map the sections of data read by RAIDZ to the correct data. For example, the correct data does not include the parity information, so the parity must be recalculated based on the correct data, and then compared to the parity that was read from the RAIDZ children.

  Due to the complexity of case 3, the rareness of hitting it, and the minimal benefit it provides above case 2, this commit removes the code for case 3. These types of errors will now be handled the same as case 2, i.e. the checksum error will be reported against all children that returned data or parity.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11735
* FreeBSD: make seqc asserts conditional on replay | Mateusz Guzik | 2021-03-17 | 1 | -3/+6
  Avoids tripping on asserts when doing pool recovery.

  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Mateusz Guzik <[email protected]>
  Closes #11739
* Remove unused rr_code | Matthew Ahrens | 2021-03-17 | 2 | -47/+23
  The `rr_code` field in `raidz_row_t` is unused. This commit removes the field, as well as the code that's used to set it.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11736
* FreeBSD: Fix memory leaks in kstats | Ryan Moeller | 2021-03-17 | 1 | -7/+4
  Don't handle (incorrectly) kmem_zalloc() failure. With KM_SLEEP, it will never return NULL.

  Free the data allocated for non-virtual kstats when deleting the object.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11767