aboutsummaryrefslogtreecommitdiffstats
path: root/module/zfs/dsl_dataset.c
Commit message (Collapse)AuthorAgeFilesLines
* head_errlog: fix use-after-freeGeorge Amanakis2024-07-151-2/+5
| | | | | | | | | | | | | In the commit of the head_errlog feature we introduced a bug in dsl_dataset_promote_sync(): we may dereference origin_head and hds, both dereferencing ddpa after calling promote_sync() on ddpa. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #16272 Closes #16273
* Use list_remove_head() where possible.Alexander Motin2023-06-091-2/+1
| | | | | | | | | | | ... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14955
* nvpair: Constify string functionsRichard Yao2023-03-141-6/+6
| | | | | | | | | | | | | | After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This lead to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14612
* Revert zfeature_active() to staticGeorge Amanakis2023-02-281-1/+1
| | | | | | | | | Commit 34ce4c4 made zfeature_active() non-static. This is not required. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14546
* Move dmu_buf_rele() after dsl_dataset_sync_done()George Amanakis2023-02-231-7/+2
| | | | | | | | | | Otherwise the dataset may be freed after the last dmu_buf_rele() leading to a panic. Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14522 Closes #14523
* Partially revert eee9362a7George Amanakis2023-02-211-19/+0
| | | | | | | | | | With commit 34ce4c42f applied, there is no need for eee9362a7. Revert that aside from the test. All tests introduced in those commits pass. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14502
* Fix a race condition in dsl_dataset_sync() when activating featuresGeorge Amanakis2023-02-131-11/+12
| | | | | | | | | | | | | | | | | | | | The zio returned from arc_write() in dmu_objset_sync() uses zio_nowait(). However we may reach the end of dsl_dataset_sync() which checks if we need to activate features in the filesystem without knowing if that zio has even run through the ZIO pipeline yet. In that case we will flag features to be activated in dsl_dataset_block_born() but dsl_dataset_sync() has already completed its run and those features will not actually be activated. Mitigate this by moving the feature activation code in dsl_dataset_sync_done(). Also add new ASSERTs in dsl_scan_visitbp() checking if a block contradicts any filesystem flags. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #13816
* Activate filesystem features only in syncing contextGeorge Amanakis2023-01-111-9/+13
| | | | | | | | | | | When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14304 Closes #14252
* Remove duplicate statically allocated variableRichard Yao2022-12-081-1/+0
| | | | | | | | | | | | | | | dsl_dataset_snapshot_sync_impl() declares `static zil_header_t zero_zil __maybe_unused;`, but this is also declared globally. This wastes memory. CodeQL's cpp/local-variable-hides-global-variable check caught this. Reviewed-by: Damian Szuberski <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14263
* Fix setting the large_block feature after receiving a snapshotGeorge Amanakis2022-11-181-0/+15
| | | | | | | | | | | | | We are not allowed to dirty a filesystem when done receiving a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will not be set on that filesystem since the filesystem is not on dp_dirty_datasets, and a subsequent encrypted raw send will fail. Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature needs to be activated and do so if appropriate. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #13699 Closes #13782
* Cleanup: Address Clang's static analyzer's unused code complaintsRichard Yao2022-10-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These were categorized as the following: * Dead assignment 23 * Dead increment 4 * Dead initialization 6 * Dead nested assignment 18 Most of these are harmless, but since actual issues can hide among them, we correct them. That said, there were a few return values that were being ignored that appeared to merit some correction: * `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from `destroy_batched()`. We handle it by returning -1 if there is an error. * `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from `zfs_for_each()`. We handle it by doing a binary OR of the error value from the subsequent `zfs_for_each()` call to the existing value. This is how errors are mostly handled inside `zfs_for_each()`. The error value here is passed to exit from the zfs command, so doing a binary or on it is better than what we did previously. * `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from `dsl_prop_get_ds()` when the property is not of type string. We return an error when it does. There is a small concern that the `zfs_get_temporary_prop()` call would handle things, but in the case that it does not, we would be pushing an uninitialized numval onto the lua stack. It is expected that `dsl_prop_get_ds()` will succeed anytime that `zfs_get_temporary_prop()` does, so that not giving it a chance to fix things is not a problem. * `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used `nvlist_add_nvlist()` twice in ways in which errors are expected to be impossible, so we switch to `fnvlist_add_nvlist()`. A few notable ones did not merit use of the return value, so we suppressed it with `(void)`: * `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error value from `describe_free()`. A look through the commit history revealed that this was intentional. * `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the returned handle from `arc_hdr_realloc()` because it is already referenced in lists. * `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly saying not to use the error from `vdev_label_init()` because whatever causes the error could be the reason why a detach is being done. Unfortunately, I am not presently able to analyze the kernel modules with Clang's static analyzer, so I could have missed some cases of this. In cases where reports were present in code that is duplicated between Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version too. After this commit is merged, regressions like dee8934 should become extremely obvious with Clang's static analyzer since a regression would appear in the results as the only instance of unused code. That assumes that Coverity does not catch the issue first. My local branch with fixes from all of my outstanding non-draft pull requests shows 118 reports from Clang's static anlayzer after this patch. That is down by 51 from 169. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Cedric Berger <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13986
* Fix potential NULL pointer dereference in dsl_dataset_promote_check()Richard Yao2022-09-301-5/+3
| | | | | | | | | | | | | | | | | | If the `list_head()` returns NULL, we dereference it, right before we check to see if it returned NULL. We have defined two different pointers that both point to the same thing, which are `origin_head` and `origin_ds`. Almost everything uses `origin_ds`, so we switch them to use `origin_ds`. We also promote `origin_ds` to a const pointer so that the compiler verifies that nothing modifies it. Coverity complained about this. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13967
* Fix unsafe string operationsRichard Yao2022-09-271-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`, `nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use of `strlcat()` in `nfs_init_tmpfile()`. Inspired by this, I did an audit of every single usage of `strcpy()` and `strcat()` in the code. If I could not prove that the usage was safe, I changed the code to use either `strlcpy()` or `strlcat()`, depending on which function was originally used. In some cases, `snprintf()` was used to replace multiple uses of `strcat` because it was cleaner. Whenever I changed a function, I preferred to use `sizeof(dst)` when the compiler is able to provide the string size via that. When it could not because the string was passed by a caller, I checked the entire call tree of the function to find out how big the buffer was and hard coded it. Hardcoding is less than ideal, but it is safe unless someone shrinks the buffer sizes being passed. Additionally, Coverity reported three more string related issues: * It caught a case where we do an overlapping memory copy in a call to `snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`. * It caught `sizeof (buf)` being used instead of `buflen` in `zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to `snprintf()`. We change that to pass `buflen`. * It caught a theoretical unterminated string passed to `strcmp()`. This one is likely a false positive, but we have the information needed to do this more safely, so we change this to silence the false positive not just in coverity, but potentially other static analysis tools too. We switch to `strncmp()`. * There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We suppress it by switching to `snprintf()` since other static analysis tools might complain about it too. Interestingly, there is a possible real bug there too, since it assumes that the passed directory path ends with '/'. We add a '/' to fix that potential bug. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13913
* Cleanup: Specify unsignedness on things that should not be signedRichard Yao2022-09-271-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In #13871, zfs_vdev_aggregation_limit_non_rotating and zfs_vdev_aggregation_limit being signed was pointed out as a possible reason not to eliminate an unnecessary MAX(unsigned, 0) since the unsigned value was assigned from them. There is no reason for these module parameters to be signed and upon inspection, it was found that there are a number of other module parameters that are signed, but should not be, so we make them unsigned. Making them unsigned made it clear that some other variables in the code should also be unsigned, so we also make those unsigned. This prevents users from setting negative values that could potentially cause bad behaviors. It also makes the code slightly easier to understand. Mostly module parameters that deal with timeouts, limits, bitshifts and percentages are made unsigned by this. Any that are boolean are left signed, since whether booleans should be considered signed or unsigned does not matter. Making zfs_arc_lotsfree_percent unsigned caused a `zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was removed. Removing the check was also necessary to prevent a compiler error from -Werror=type-limits. Several end of line comments had to be moved to their own lines because replacing int with uint_t caused us to exceed the 80 character limit enforced by cstyle.pl. The following were kept signed because they are passed to taskq_create(), which expects signed values and modifying the OpenSolaris/Illumos DDI is out of scope of this patch: * metaslab_load_pct * zfs_sync_taskq_batch_pct * zfs_zil_clean_taskq_nthr_pct * zfs_zil_clean_taskq_minalloc * zfs_zil_clean_taskq_maxalloc * zfs_arc_prune_task_threads Also, negative values in those parameters was found to be harmless. The following were left signed because either negative values make sense, or more analysis was needed to determine whether negative values should be disallowed: * zfs_metaslab_switch_threshold * zfs_pd_bytes_max * zfs_livelist_min_percent_shared zfs_multihost_history was made static to be consistent with other parameters. A number of module parameters were marked as signed, but in reality referenced unsigned variables. upgrade_errlog_limit is one of the numerous examples. In the case of zfs_vdev_async_read_max_active, it was already uint32_t, but zdb had an extern int declaration for it. Interestingly, the documentation in zfs.4 was right for upgrade_errlog_limit despite the module parameter being wrongly marked, while the documentation for zfs_vdev_async_read_max_active (and friends) was wrong. It was also wrong for zstd_abort_size, which was unsigned, but was documented as signed. Also, the documentation in zfs.4 incorrectly described the following parameters as ulong when they were int: * zfs_arc_meta_adjust_restarts * zfs_override_estimate_recordsize They are now uint_t as of this patch and thus the man page has been updated to describe them as uint. dbuf_state_index was left alone since it does nothing and perhaps should be removed in another patch. If any module parameters were missed, they were not found by `grep -r 'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed, but only because they were in files that had hits. This patch intentionally did not attempt to address whether some of these module parameters should be elevated to 64-bit parameters, because the length of a long on 32-bit is 32-bit. Lastly, it was pointed out during review that uint_t is a better match for these variables than uint32_t because FreeBSD kernel parameter definitions are designed for uint_t, whose bit width can change in future memory models. As a result, we change the existing parameters that are uint32_t to use uint_t. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13875
* Cleanup: Remove ineffective unsigned comparisons against 0Richard Yao2022-09-261-2/+0
| | | | | | | | | | | | | Coverity found a number of places where we either do MAX(unsigned, 0) or do assertions that a unsigned variable is >= 0. These do nothing, so let us drop them all. It also found a spot where we do `if (unsigned >= 0 && ...)`. Let us also drop the unsigned >= 0 check. Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13871
* Fix unchecked return values and unused return valuesRichard Yao2022-09-231-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverity complained about unchecked return values and unused values that turned out to be unused return values. Different approaches were used to handle the different cases of unchecked return values: * cmd/zdb/zdb.c: VERIFY0 was used in one place since the existing code had no error handling. An error message was printed in another to match the rest of the code. * cmd/zed/agents/zfs_retire.c: We dismiss the return value with `(void)` because the value is expected to be potentially unset. * cmd/zpool_influxdb/zpool_influxdb.c: We dismiss the return value with `(void)` because the values are expected to be potentially unset. * cmd/ztest.c: VERIFY0 was used since we want failures if something goes wrong in ztest. * module/zfs/dsl_dir.c: We dismiss the return value with `(void)` because there is no guarantee that the zap entry will always be there. For example, old pools imported readonly would not have it and we do not want to fail here because of that. * module/zfs/zfs_fm.c: `fnvlist_add_*()` was used since the allocations sleep and thus can never fail. * module/zfs/zvol.c: We dismiss the return value with `(void)` because we do not need it. This matches what is already done in the analogous `zfs_replay_write2()`. * tests/zfs-tests/cmd/draid.c: We suppress one return value with `(void)` since the code handles errors already. The other return value is handled by switching to `fnvlist_lookup_uint8_array()`. * tests/zfs-tests/cmd/file/file_fadvise.c: We add error handling. * tests/zfs-tests/cmd/mmap_sync.c: We add error handling for munmap, but ignore failures on remove() with (void) since it is expected to be able to fail. * tests/zfs-tests/cmd/mmapwrite.c: We add error handling. As for unused return values, they were all in places where there was error handling, so logic was added to handle the return values. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13920
* Add assertion to dsl_dataset_set_compression_syncRichard Yao2022-09-141-0/+1
| | | | | | | | | | | | Coverity pointed out that if we somehow receive SPA_FEATURE_NONE, we will use a negative number as an array index. A defensive assertion seems appropriate. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13872
* Add zfs.sync.snapshot_renameAndriy Gapon2022-09-021-10/+2
| | | | | | | | | Only the single snapshot rename is provided. The recursive or more complex rename can be scripted. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #13802
* Revert "Avoid panic with recordsize > 128k, raw sending and no large_blocks"Brian Behlendorf2022-08-251-26/+20
| | | | | | | | | | This reverts commit 80a650b7bb04bce3aef5e4cfd1d966e3599dafd4. This change inadvertently introduced a regression in ztest where one of the new ASSERTs is triggered in dsl_scan_visitbp(). Reviewed-by: George Amanakis <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #12275 Closes #13799
* Prevent zevent list from consuming all of kernel memoryPaul Dagnelie2022-08-221-1/+7
| | | | | | | | | | | | | | | | | | | | There are a couple changes included here. The first is to introduce a cap on the size the ZED will grow the zevent list to. One million entries is more than enough for most use cases, and if you are overflowing that value, the problem needs to be addressed another way. The value is also tunable, for those who want the limit to be higher or lower. The other change is to add a kernel module parameter that allows snapshot creation/deletion to be exempted from the history logging; for most workloads, having these things logged is valuable, but for some workloads it produces large quantities of log spam and isn't especially helpful. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Issue #13374 Closes #13753
* Add snapshots_changed as propertyUmer Saleem2022-08-021-2/+4
| | | | | | | | | | | | | | | | | | | Make dd_snap_cmtime property persistent across mount and unmount operations by storing in ZAP and restore the value from ZAP on hold into dd_snap_cmtime instead of updating it. Expose dd_snap_cmtime as 'snapshots_changed' property that provides a mechanism to quickly determine whether snapshot list for dataset has changed without having to mount a dataset or iterate the snapshot list. It specifies the time at which a snapshot for a dataset was last created or deleted. This allows us to be more efficient how often we query snapshots. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #13635
* Replace dead opensolaris.org license linkTino Reichardt2022-07-111-1/+1
| | | | | | | | | The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13619
* Enable -Wwrite-stringsнаб2022-06-291-21/+22
| | | | | | | | Also, fix leak from ztest_global_vars_to_zdb_args() Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #13348
* Avoid panic with recordsize > 128k, raw sending and no large_blocksGeorge Amanakis2022-06-271-20/+26
| | | | | | | | | | | | | | | | | | | | | The current codebase does not support raw sending buffers with block size > 128kB when large_blocks is not active. This can happen in the codepath dsl_dataset_sync()->dmu_objset_sync()->zio_nowait() which calls back dmu_objset_write_done()->dsl_dataset_block_born(). If dsl_dataset_sync() completes its run before dsl_dataset_block_born() is called, we will end up not activating some of the necessary flags, while having blocks based on those flags written in the filesystem. A subsequent send will then panic. Fix this by directly deciding in dmu_objset_sync() whether these flags need to be activated later by dsl_dataset_sync(). Instead of panicking due to a NULL pointer dereference in dmu_dump_write() in case of a send, print out an error message. Also during scrub verify there are no contradicting filesystem flags. Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #12275 Closes #12438
* Default zfs_max_recordsize to 16MRich Ercolani2022-04-281-12/+13
| | | | | | | | | | | | | | | | | | | | Increase the default allowed maximum recordsize from 1M to 16M. As described in the zfs(4) man page, there are significant costs which need to be considered before using very large blocks. However, there are scenarios where they make good sense and it should no longer be necessary to artificially restrict their use behind a module option. Note that for 32-bit platforms we continue to leave this restriction in place due to the limited virtual address space available (256-512MB). On these systems only a handful of blocks could be cached at any one time severely impacting performance and potentially stability. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12830 Closes #13302
* Improve zpool status output, list all affected datasetsGeorge Amanakis2022-04-251-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, determining which datasets are affected by corruption is a manual process. The primary difficulty in reporting the list of affected snapshots is that since the error was initially found, the snapshot where the error originally occurred in, may have been deleted. To solve this issue, we add the ID of the head dataset of the original snapshot which the error was detected in, to the stored error report. Then any time a filesystem is deleted, the errors associated with it are deleted as well. Any time a clone promote occurs, we modify reports associated with the original head to refer to the new head. The stored error reports are identified by this head ID, the birth time of the block which the error occurred in, as well as some information about the error itself are also stored. Once this information is stored, we can find the set of datasets affected by an error by walking back the list of snapshots in the given head until we find one with the appropriate birth txg, and then traverse through the snapshots of the clone family, terminating a branch if the block was replaced in a given snapshot. Then we report this information back to libzfs, and to the zpool status command, where it is displayed as follows: pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3 08:27:57 2021 config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 sdb ONLINE 0 0 1.58K errors: Permanent errors have been detected in the following files: test@1:/test.0.0 /test/test.0.0 /test/1clone/test.0.0 A new feature flag is introduced to mark the presence of this change, as well as promotion and backwards compatibility logic. This is an updated version of #9175. Rebase required fixing the tests, updating the ABI of libzfs, updating the man pages, fixing bugs, fixing the error returns, and updating the old on-disk error logs to the new format when activating the feature. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Co-authored-by: TulsiJain <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #9175 Closes #12812
* Remove bcopy(), bzero(), bcmp()наб2022-03-151-7/+7
| | | | | | | | | | bcopy() has a confusing argument order and is actually a move, not a copy; they're all deprecated since POSIX.1-2001 and removed in -2008, and we shim them out to mem*() on Linux anyway Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12996
* Silence uninitialized warnings in dsl_dataset.cRich Ercolani2022-02-141-6/+6
| | | | | | | | On newer compilers, dsl_dataset.c now warns (or, on DEBUG, errors) on uninitialized variable usage. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #13083
* Remove unneeded "extern inline" function declarationsTomohiro Kusumi2022-02-081-2/+0
| | | | | | | | | All of these externs are already #included as static inline functions via corresponding headers. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #13073
* Simplify resume token generationRyan Moeller2022-02-011-166/+135
| | | | | | | | | | * Improve naming. * Reduce indentation. * Avoid boilerplate logic duplication. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #12967
* Clean up CSTYLEDsнаб2022-01-261-2/+0
| | | | | | | | | | | | | | | | | | | | 69 CSTYLED BEGINs remain, appx. 30 of which can be removed if cstyle(1) had a useful policy regarding CALL(ARG1, ARG2, ARG3); above 2 lines. As it stands, it spits out *both* sysctl_os.c: 385: continuation line should be indented by 4 spaces sysctl_os.c: 385: indent by spaces instead of tabs which is very cool Another >10 could be fixed by removing "ulong" &al. handling. I don't foresee anyone actually using it intentionally (does it even exist in modern headers? why did it in the first place?). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12993
* module/*.ko: prune .data, global .rodataнаб2022-01-141-3/+3
| | | | | | | | | | | | Evaluated every variable that lives in .data (and globals in .rodata) in the kernel modules, and constified/eliminated/localised them appropriately. This means that all read-only data is now actually read-only data, and, if possible, at file scope. A lot of previously- global-symbols became inlinable (and inlined!) constants. Probably not in a big Wowee Performance Moment, but hey. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12899
* module: zfs: fix unused, remove argsusedнаб2021-12-231-3/+1
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12844
* Introduce dsl_dir_diduse_transfer_space()Alexander Motin2021-07-161-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling them one after another is a waste of time and even more contention. Doing that twice for each rewritten block within dbuf_write_done() via dsl_dataset_block_kill() and dsl_dataset_block_born() created one of the biggest CPU overheads in case of small blocks rewrite. dsl_dir_diduse_transfer_space() combines functionality of these two functions for cases where it is needed, but without double overhead, practically for the cost of dsl_dir_diduse_space() or even cheaper. While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang detects some aliasing there, repeating dd->dd_dbuf->db_data dereference multiple times, increasing dd_lock scope and contention. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Author: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #12300
* Annotated dprintf as printf-likeRich Ercolani2021-06-221-2/+2
| | | | | | | | | | ZFS loves using %llu for uint64_t, but that requires a cast to not be noisy - which is even done in many, though not all, places. Also a couple places used %u for uint64_t, which were promoted to %llu. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12233
* Re-embed multilist_t storageAlexander Motin2021-06-101-2/+1
| | | | | | | | | | | | This commit partially reverts changes to multilists in PR 7968 (multi-threaded spa-sync()) and adds some cache line alignments to separate read-only multilists and heavily modified refcount's to different cache lines. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-by: iXsystems, Inc. Closes #12158
* Revert special case code from pre-hashtable nvlist eraMark Maybee2021-01-271-12/+1
| | | | | | | | | | | | | Before a hash table was added on top of the nvlist code, there were cases where the nvlist allocation was changed from fnvlist_alloc() to nvlist_alloc() to avoid expensive NV_UNIQUE_NAME checks. Now this is no longer necessary. These changes should be reverted to be consistent with other code. There are some cases where this change will also reduce the number of iterations. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mark Maybee <[email protected]> Closes #11464
* Improve zfs receive performance with lightweight writeMatthew Ahrens2020-12-111-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The performance of `zfs receive` can be bottlenecked on the CPU consumed by the `receive_writer` thread, especially when receiving streams with small compressed block sizes. Much of the CPU is spent creating and destroying dbuf's and arc buf's, one for each `WRITE` record in the send stream. This commit introduces the concept of "lightweight writes", which allows `zfs receive` to write to the DMU by providing an ABD, and instantiating only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this "dirty leaf block" are not instantiated. Because there is no dbuf with the dirty data, this mechanism doesn't support reading from "lightweight-dirty" blocks (they would see the on-disk state rather than the dirty data). Since the dedup-receive code has been removed, `zfs receive` is write-only, so this works fine. Because there are no arc bufs for the received data, the received data is no longer cached in the ARC. Testing a receive of a stream with average compressed block size of 4KB, this commit improves performance by 50%, while also reducing CPU usage by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer() and dbuf_evict() is now 1/7th (14%) of what it was. Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35% New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0% The code is also restructured in a few ways: Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies some existing code that no longer needs `DB_DNODE_ENTER()` and related routines. The new field is needed by the lightweight-type dirty record. To ensure that the `dr_dnode` field remains valid until the dirty record is freed, we have to ensure that the `dnode_move()` doesn't relocate the dnode_t. To do this we keep a hold on the dnode until it's zio's have completed. This is already done by the user-accounting code (`userquota_updates_task()`), this commit extends that so that it always keeps the dnode hold until zio completion (see `dnode_rele_task()`). `dn_dirty_txg` was previously zeroed when the dnode was synced. This was not necessary, since its meaning can be "when was this dnode last dirtied". This change simplifies the new `dnode_rele_task()` code. Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11105
* Add zstd support to zfsMichael Niewöhner2020-08-201-1/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR adds two new compression types, based on ZStandard: - zstd: A basic ZStandard compression algorithm Available compression. Levels for zstd are zstd-1 through zstd-19, where the compression increases with every level, but speed decreases. - zstd-fast: A faster version of the ZStandard compression algorithm zstd-fast is basically a "negative" level of zstd. The compression decreases with every level, but speed increases. Available compression levels for zstd-fast: - zstd-fast-1 through zstd-fast-10 - zstd-fast-20 through zstd-fast-100 (in increments of 10) - zstd-fast-500 and zstd-fast-1000 For more information check the man page. Implementation details: Rather than treat each level of zstd as a different algorithm (as was done historically with gzip), the block pointer `enum zio_compress` value is simply zstd for all levels, including zstd-fast, since they all use the same decompression function. The compress= property (a 64bit unsigned integer) uses the lower 7 bits to store the compression algorithm (matching the number of bits used in a block pointer, as the 8th bit was borrowed for embedded block pointers). The upper bits are used to store the compression level. It is necessary to be able to determine what compression level was used when later reading a block back, so the concept used in LZ4, where the first 32bits of the on-disk value are the size of the compressed data (since the allocation is rounded up to the nearest ashift), was extended, and we store the version of ZSTD and the level as well as the compressed size. This value is returned when decompressing a block, so that if the block needs to be recompressed (L2ARC, nop-write, etc), that the same parameters will be used to result in the matching checksum. All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`, `zio_prop_t`, etc.) uses the separated _compress and _complevel variables. Only the properties ZAP contains the combined/bit-shifted value. The combined value is split when the compression_changed_cb() callback is called, and sets both objset members (os_compress and os_complevel). The userspace tools all use the combined/bit-shifted value. Additional notes: zdb can now also decode the ZSTD compression header (flag -Z) and inspect the size, version and compression level saved in that header. For each record, if it is ZSTD compressed, the parameters of the decoded compression header get printed. ZSTD is included with all current tests and new tests are added as-needed. Per-dataset feature flags now get activated when the property is set. If a compression algorithm requires a feature flag, zfs activates the feature when the property is set, rather than waiting for the first block to be born. This is currently only used by zstd but can be extended as needed. Portions-Sponsored-By: The FreeBSD Foundation Co-authored-by: Allan Jude <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Co-authored-by: Sebastian Gottschall <[email protected]> Co-authored-by: Kjeld Schouten-Lebbing <[email protected]> Co-authored-by: Michael Niewöhner <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Sebastian Gottschall <[email protected]> Signed-off-by: Kjeld Schouten-Lebbing <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes #6247 Closes #9024 Closes #10277 Closes #10278
* FreeBSD: fix the build with Clang 11Ryan Moeller2020-08-171-2/+2
| | | | | | | | | | | | * Cast void * to uintptr_t before casting to boolean_t. * Avoid clashing definition of __asm when not on Linux to prevent duplicate __volatile__. This was already done in some places but not all. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #10723
* zfs promote does not delete livelist of originMatthew Ahrens2020-07-311-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a clone is promoted, its livelist is no longer accurate, so it is discarded. If the clone's origin is also a clone (i.e. we are promoting a clone of a clone), then the origin's livelist is also no longer accurate, so it should be discarded, but the code doesn't actually do that. Consider a pool with: * Filesystem A * Clone B, a clone of A * Clone C, a clone of B If we promote C, it discards C's livelist. It should discard B's livelist, but that is not happening. The impact is that when B is destroyed, we use the livelist to find the blocks to free, but the livelist is no longer correct so we end up freeing blocks that are still in use by C. The incorrectly-freed blocks can be reallocated causing checksum errors. And when C is destroyed it can double-free the incorrectly-freed blocks. The problem is that we remove the livelist of `origin_ds->ds_dir`, but the origin snapshot has already been moved to the promoted dsl_dir. So this is actually trying to remove the livelist of the promoted dsl_dir, which was already removed. As explained in a comment in the beginning of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the origin's dsl_dir. Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed by: Sara Hartse <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #10652
* filesystem_limit/snapshot_limit is incorrectly enforced against rootMatthew Ahrens2020-07-111-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The filesystem_limit and snapshot_limit properties limit the number of filesystems or snapshots that can be created below this dataset. According to the manpage, "The limit is not enforced if the user is allowed to change the limit." Two types of users are allowed to change the limit: 1. Those that have been delegated the `filesystem_limit` or `snapshot_limit` permission, e.g. with `zfs allow USER filesystem_limit DATASET`. This works properly. 2. A user with elevated system privileges (e.g. root). This does not work - the root user will incorrectly get an error when trying to create a snapshot/filesystem, if it exceeds the `_limit` property. The problem is that `priv_policy_ns()` does not work if the `cred_t` is not that of the current process. This happens when `dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a sync task's check func) to determine the permissions of the corresponding user process. This commit fixes the issue by passing the `task_struct` (typedef'ed as a `proc_t`) to syncing context, and then using `has_capability()` to determine if that process is privileged. Note that we still need to pass the `cred_t` to syncing context so that we can check if the user was delegated this permission with `zfs allow`. This problem only impacts Linux. Wrappers are added to FreeBSD but it continues to use `priv_check_cred()`, which works on arbitrary `cred_t`. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #8226 Closes #10545
* Replace sprintf()->snprintf() and strcpy()->strlcpy()Jorgen Lundman2020-06-071-5/+9
| | | | | | | | | | | | | | | The strcpy() and sprintf() functions are deprecated on some platforms. Care is needed to ensure correct size is used. If some platforms miss snprintf, we can add a #define to sprintf, likewise strlcpy(). The biggest change is adding a size parameter to zfs_id_to_fuidstr(). The various *_impl_get() functions are only used on linux and have not yet been updated. Reviewed by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #10400
* Add 'zfs wait' commandPaul Dagnelie2020-04-011-10/+18
| | | | | | | | | | | | | | | | | | | | | | Add a mechanism to wait for delete queue to drain. When doing redacted send/recv, many workflows involve deleting files that contain sensitive data. Because of the way zfs handles file deletions, snapshots taken quickly after a rm operation can sometimes still contain the file in question, especially if the file is very large. This can result in issues for redacted send/recv users who expect the deleted files to be redacted in the send streams, and not appear in their clones. This change duplicates much of the zpool wait related logic into a zfs wait command, which can be used to wait until the internal deleteq has been drained. Additional wait activities may be added in the future. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9707
* Refactor dnode dirty context from dbuf_dirtyMatthew Macy2020-02-261-2/+2
| | | | | | | | | | | * Add dedicated donde_set_dirtyctx routine. * Add empty dirty record on destroy assertion. * Make much more extensive use of the SET_ERROR macro. Reviewed-by: Will Andrews <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9924
* async zvol minor node creation interferes with receiveMatthew Ahrens2020-02-031-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we finish a zfs receive, dmu_recv_end_sync() calls zvol_create_minors(async=TRUE). This kicks off some other threads that create the minor device nodes (in /dev/zvol/poolname/...). These async threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which both call dmu_objset_own(), which puts a "long hold" on the dataset. Since the zvol minor node creation is asynchronous, this can happen after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have completed. After the first receive ioctl has completed, userland may attempt to do another receive into the same dataset (e.g. the next incremental stream). This second receive and the asynchronous minor node creation can interfere with one another in several different ways, because they both require exclusive access to the dataset: 1. When the second receive is finishing up, dmu_recv_end_check() does dsl_dataset_handoff_check(), which can fail with EBUSY if the async minor node creation already has a "long hold" on this dataset. This causes the 2nd receive to fail. 2. The async udev rule can fail if zvol_id and/or systemd-udevd try to open the device while the the second receive's async attempt at minor node creation owns the dataset (via zvol_prefetch_minors_impl). This causes the minor node (/dev/zd*) to exist, but the udev-generated /dev/zvol/... to not exist. 3. The async minor node creation can silently fail with EBUSY if the first receive's zvol_create_minor() trys to own the dataset while the second receive's zvol_prefetch_minors_impl already owns the dataset. To address these problems, this change synchronously creates the minor node. To avoid the lock ordering problems that the asynchrony was introduced to fix (see #3681), we create the minor nodes from open context, with no locks held, rather than from syncing contex as was originally done. Implementation notes: We generally do not need to traverse children or prefetch anything (e.g. when running the recv, snapshot, create, or clone subcommands of zfs). We only need recursion when importing/opening a pool and when loading encryption keys. The existing recursive, asynchronous, prefetching code is preserved for use in these cases. Channel programs may need to create zvol minor nodes, when creating a snapshot of a zvol with the snapdev property set. We figure out what snapshots are created when running the LUA program in syncing context. In this case we need to remember what snapshots were created, and then try to create their minor nodes from open context, after the LUA code has completed. There are additional zvol use cases that asynchronously own the dataset, which can cause similar problems. E.g. changing the volmode or snapdev properties. These are less problematic because they are not recursive and don't touch datasets that are not involved in the operation, there is still potential for interference with subsequent operations. In the future, these cases should be similarly converted to create the zvol minor node synchronously from open context. The async tasks of removing and renaming minors do not own the objset, so they do not have this problem. However, it may make sense to also convert these operations to happen synchronously from open context, in the future. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-65948 Closes #7863 Closes #9885
* Mark dsl_dataset_deactivate_feature_impl staticMatthew Macy2019-12-091-1/+1
| | | | | | | | | The dsl_dataset_deactivate_feature_impl() function is private and should be marked as such. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9696
* Replace ASSERTV macro with compiler annotationMatthew Macy2019-12-051-5/+5
| | | | | | | | | | | Remove the ASSERTV macro and handle suppressing unused compiler warnings for variables only in ASSERTs using the __attribute__((unused)) compiler annotation. The annotation is understood by both gcc and clang. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9671
* Fix strdup conflict on other platformsMatthew Macy2019-10-101-4/+4
| | | | | | | | | | | | | | | | In the FreeBSD kernel the strdup signature is: ``` char *strdup(const char *__restrict, struct malloc_type *); ``` It's unfortunate that the developers have chosen to change the signature of libc functions - but it's what I have to deal with. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9433
* Enable compiler to typecheck loggingMatthew Macy2019-09-121-2/+2
| | | | | | | | | | Annotate spa logging declarations with printflike Workaround gcc bug (non disable-able warning) by replacing "" with " " Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9316