aboutsummaryrefslogtreecommitdiffstats
path: root/module/zfs/zfs_replay.c
Commit message (Collapse)AuthorAgeFilesLines
* Linux 6.3 compat: idmapped mount API changesyouzhongyang2023-04-101-7/+7
| | | | | | | | | Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap type for mounts (struct mnt_idmap), we need to detect these changes and make zfs work with the new APIs. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14682
* Implementation of block cloning for ZFSPawel Jakub Dawidek2023-03-101-0/+30
| | | | | | | | | | | | | | | Block Cloning allows to manually clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Christian Schwarz <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #13392
* Suppress Clang static analyzer complaint in zfs_replay_create()Richard Yao2023-02-141-3/+3
| | | | | | | | | | | | | | Clang's static analyzer incorrectly complains about an undefined value here when lr->lr_common.lrc_txtype == TX_SYMLINK and txtype == TX_CREATE. This is impossible, because of this line: txtype = (lr->lr_common.lrc_txtype & ~TX_CI((uint64_t)0x1 << 63)); Changing the code to compare against txtype suppresses the report. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14472
* Support idmapped mount in user namespaceyouzhongyang2022-11-081-0/+35
| | | | | | | | | | | | | | | | | | Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping infrastructure to support idmapped mounts of filesystems mounted with an idmapping". Update the OpenZFS accordingly to improve the idmapped mount support. This pull request contains the following changes: - xattr setter functions are fixed to take mnt_ns argument. Without this, cp -p would fail for an idmapped mount in a user namespace. - idmap_util is enhanced/fixed for its use in a user ns context. - One test case added to test idmapped mount in a user ns. Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14097
* zfs_rename: support RENAME_* flagsAleksa Sarai2022-10-281-10/+96
| | | | | | | | | | | | | | | | | | | | | | Implement support for Linux's RENAME_* flags (for renameat2). Aside from being quite useful for userspace (providing race-free ways to exchange paths and implement mv --no-clobber), they are used by overlayfs and are thus required in order to use overlayfs-on-ZFS. In order for us to represent the new renameat2(2) flags in the ZIL, we create two new transaction types for the two flags which need transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT). RENAME_NOREPLACE does not need any ZIL support because we know that if the operation succeeded before creating the ZIL entry, there was no file to be clobbered and thus it can be treated as a regular TX_RENAME. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Snajdr <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Closes #12209 Closes #14070
* Support idmapped mountyouzhongyang2022-10-191-7/+7
| | | | | | | | | | | | Adds support for idmapped mounts. Supported as of Linux 5.12 this functionality allows user and group IDs to be remapped without changing their state on disk. This can be useful for portable home directories and a variety of container related use cases. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #12923 Closes #13671
* Replace dead opensolaris.org license linkTino Reichardt2022-07-111-1/+1
| | | | | | | | | The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13619
* Remove bcopy(), bzero(), bcmp()наб2022-03-151-9/+6
| | | | | | | | | | bcopy() has a confusing argument order and is actually a move, not a copy; they're all deprecated since POSIX.1-2001 and removed in -2008, and we shim them out to mem*() on Linux anyway Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12996
* log xattr=sa create/remove/update to ZILJitendra Patidar2022-02-221-0/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As such, there are no specific synchronous semantics defined for the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is done, if sync=always is set on dataset. This provides sync semantics for xattr=on with sync=always set on dataset. For the xattr=sa implementation, it doesn't log to ZIL, so, even with sync=always, xattrs are not guaranteed to be synced before xattr call returns to caller. So, xattr can be lost if system crash happens, before txg carrying xattr transaction is synced. This change adds xattr=sa logging to ZIL on xattr create/remove/update and xattrs are synced to ZIL (zil_commit() done) for sync=always. This makes xattr=sa behavior similar to xattr=on. Implementation notes: The actual logging is fairly straight-forward and does not warrant additional explanation. However, it has been 14 years since we last added new TX types to the ZIL [1], hence this is the first time we do it after the introduction of zpool features. Therefore, here is an overview of the feature activation and deactivation workflow: 1. The feature must be enabled. Otherwise, we don't log the new record type. This ensures compatibility with older software. 2. The feature is activated per-dataset, since the ZIL is per-dataset. 3. If the feature is enabled and dataset is not for zvol, any append to the ZIL chain will activate the feature for the dataset. Likewise for starting a new ZIL chain. 4. A dataset that doesn't have a ZIL chain has the feature deactivated. We ensure (3) by activating on the first zil_commit() after the feature was enabled. Since activating the features requires waiting for txg sync, the first zil_commit() after enabling the feature will be slower than usual. The downside is that this is really a conservative approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL chain, we pay the penalty for feature activation. The upside is that the user is in control of when we pay the penalty, i.e., upon enabling the feature. We ensure (4) by hooking into zil_sync(), where ZIL destroy actually happens. One more piece on feature activation, since it's spread across multiple functions: zil_commit() zil_process_commit_list() if lwb == NULL // first zil_commit since zil_open zil_create() if no log block pointer in ZIL header: if feature enabled and not active: // CASE 1 enable, COALESCE txg wait with dmu_tx that allocated the log block else // log block was allocated earlier than this zil_open if feature enabled and not active: // CASE 2 enable, EXPLICIT txg wait else // already have an in-DRAM LWB if feature enabled and not active: // this happens when we enable the feature after zil_create // CASE 3 enable, EXPLICIT txg wait [1] https://github.com/illumos/illumos-gate/commit/da6c28aaf62fa55f0fdb8004aa40f88f23bf53f0 Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Christian Schwarz <[email protected]> Reviewed-by: Ahelenia Ziemiańska <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jitendra Patidar <[email protected]> Closes #8768 Closes #9078
* module: mark arguments usedнаб2022-02-181-1/+1
| | | | | | | Reviewed-by: Alejandro Colomar <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #13110
* Rename fallthrough to zfs_fallthroughJorgen Lundman2022-02-151-4/+4
| | | | | | | | Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Damian Szuberski <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #13097
* module/*.ko: prune .data, global .rodataнаб2022-01-141-1/+1
| | | | | | | | | | | | Evaluated every variable that lives in .data (and globals in .rodata) in the kernel modules, and constified/eliminated/localised them appropriately. This means that all read-only data is now actually read-only data, and, if possible, at file scope. A lot of previously- global-symbols became inlinable (and inlined!) constants. Probably not in a big Wowee Performance Moment, but hey. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12899
* Use fallthrough macroBrian Behlendorf2021-09-141-4/+4
| | | | | | | | | | | | | | | As of the Linux 5.9 kernel a fallthrough macro has been added which should be used to anotate all intentional fallthrough paths. Once all of the kernel code paths have been updated to use fallthrough the -Wimplicit-fallthrough option will because the default. To avoid warnings in the OpenZFS code base when this happens apply the fallthrough macro. Additional reading: https://lwn.net/Articles/794944/ Reviewed-by: Tony Nguyen <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #12441
* macOS can also set va_typeJorgen Lundman2021-07-261-1/+1
| | | | | | | Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #12357
* Normalise /*FALLTHR{OUGH,U}*/наб2021-07-261-4/+4
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Issue #12201
* Simplify FreeBSD's locking requirements in zfs_replay.cMatthew Macy2020-01-221-24/+12
| | | | | | | | | | | Now that the FreeBSD zfs_vnops code avoids asserting that a vnode lock is held when z_replay is true we can limit the FreeBSD specific changes to the couple of changes where it is necessary to drop the vnode locks because a function returns with it held. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9865
* Fix zfs_xattr_owner_unlinked on FreeBSD and commentMatthew Macy2019-12-161-0/+10
| | | | | | | | Explain FreeBSD VFS' unfortunate idiosyncratic locking requirements. There is no functional change for other platforms. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9720
* Make zfs_replay.c work on FreeBSDMatthew Macy2019-12-131-11/+23
| | | | | | | | | | | | | FreeBSD's vfs currently doesn't permit file systems to do their own locking. To avoid having to have duplicate zfs functions with and without locking add locking here. With luck these changes can be removed in the future. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9715
* Eliminate Linux specific inode usage from common code Matthew Macy2019-12-111-41/+41
| | | | | | | | | | Change many of the znops routines to take a znode rather than an inode so that zfs_replay code can be largely shared and in the future the much of the znops code may be shared. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9708
* Add zfs_file_* interface, remove vnodesMatthew Macy2019-11-211-3/+1
| | | | | | | | | | | | | | | | | | Provide a common zfs_file_* interface which can be implemented on all platforms to perform normal file access from either the kernel module or the libzpool library. This allows all non-portable vnode_t usage in the common code to be replaced by the new portable zfs_file_t. The associated vnode and kobj compatibility functions, types, and macros have been removed from the SPL. Moving forward, vnodes should only be used in platform specific code when provided by the native operating system. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9556
* Fix zil replay panic when TX_REMOVE followed by TX_CREATEChunwei Chen2019-08-281-4/+4
| | | | | | | | | | | | | | | | | | | | If TX_REMOVE is followed by TX_CREATE on the same object id, we need to make sure the object removal is completely finished before creation. The current implementation relies on dnode_hold_impl with DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work fine before, in current version it does not guarantee the object removal is completed. We fix this by checking if DNODE_MUST_BE_FREE returns successful instead. Also add test and remove dead code in dnode_hold_impl. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #7151 Closes #8910 Closes #9123 Closes #9145
* Use SEEK_{SET,CUR,END} for file seek "whence"Tomohiro Kusumi2019-04-251-1/+1
| | | | | | | | Use either SEEK_* or 0,1,2..., but not both. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8656
* Update build system and packagingBrian Behlendorf2018-05-291-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_*. * Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> TEST_ZIMPORT_SKIP="yes" Closes #7556
* Project Quota on ZFSNasf-Fan2018-02-131-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Project quota is a new ZFS system space/object usage accounting and enforcement mechanism. Similar as user/group quota, project quota is another dimension of system quota. It bases on the new object attribute - project ID. Project ID is a numerical value to indicate to which project an object belongs. An object only can belong to one project though you (the object owner or privileged user) can change the object project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly. The object also can inherit the project ID from its parent when created if the parent has the project inherit flag (that can be set via 'chattr +P' or 'zfs project -s [-p]'). By accounting the spaces/objects belong to the same project, we can know how many spaces/objects used by the project. And if we set the upper limit then we can control the spaces/objects that are consumed by such project. It is useful when multiple groups and users cooperate for the same project, or a user/group needs to participate in multiple projects. Support the following commands and functionalities: zfs set projectquota@project zfs set projectobjquota@project zfs get projectquota@project zfs get projectobjquota@project zfs get projectused@project zfs get projectobjused@project zfs projectspace zfs allow projectquota zfs allow projectobjquota zfs allow projectused zfs allow projectobjused zfs unallow projectquota zfs unallow projectobjquota zfs unallow projectused zfs unallow projectobjused chattr +/-P chattr -p project_id lsattr -p This patch also supports tree quota based on the project quota via "zfs project" commands set as following: zfs project [-d|-r] <file|directory ...> zfs project -C [-k] [-r] <file|directory ...> zfs project -c [-0] [-d|-r] [-p id] <file|directory ...> zfs project [-p id] [-r] [-s] <file|directory ...> For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on the $DIR, then the proejct [obj]quota and [obj]used values for the $DIR's project ID will be shown as the total/free (avail) resource. Keep the same behavior as EXT4/XFS does. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by Ned Bass <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Fan Yong <[email protected]> TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master" Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c Closes #6290
* OpenZFS 8081 - Compiler warnings in zdbBrian Behlendorf2017-10-271-35/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix compiler warnings in zdb. With these changes, FreeBSD can compile zdb with all compiler warnings enabled save -Wunused-parameter. usr/src/cmd/zdb/zdb.c usr/src/cmd/zdb/zdb_il.c usr/src/uts/common/fs/zfs/sys/sa.h usr/src/uts/common/fs/zfs/sys/spa.h Fix numerous warnings, including: * const-correctness * shadowing global definitions * signed vs unsigned comparisons * missing prototypes, or missing static declarations * unused variables and functions * Unreadable array initializations * Missing struct initializers usr/src/cmd/zdb/zdb.h Add a header file to declare common symbols usr/src/lib/libzpool/common/sys/zfs_context.h usr/src/uts/common/fs/zfs/arc.c usr/src/uts/common/fs/zfs/dbuf.c usr/src/uts/common/fs/zfs/spa.c usr/src/uts/common/fs/zfs/txg.c Add a function prototype for zk_thread_create, and ensure that every callback supplied to this function actually matches the prototype. usr/src/cmd/ztest/ztest.c usr/src/uts/common/fs/zfs/sys/zil.h usr/src/uts/common/fs/zfs/zfs_replay.c usr/src/uts/common/fs/zfs/zvol.c Add a function prototype for zil_replay_func_t, and ensure that every function of this type actually matches the prototype. usr/src/uts/common/fs/zfs/sys/refcount.h Change FTAG so it discards any constness of __func__, necessary since existing APIs expect it passed as void *. Porting Notes: - Many of these fixes have already been applied to Linux. For consistency the OpenZFS version of a change was applied if the warning was addressed in an equivalent but different fashion. Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Authored by: Alan Somers <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8081 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843abe1b8a Closes #6787
* Enhance comments for large dnode projectMatthew Ahrens2017-08-291-3/+3
| | | | | | | | | | | | | Fix a few nits in the comments from large dnodes. Also import some of the commit message as a comment in the code, making it more accessible. Reviewed-by: @rottegift Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Matt Ahrens <[email protected]> Closes #6551
* Rename zfs_sb_t -> zfsvfs_tBrian Behlendorf2017-03-101-57/+58
| | | | | | | | | | | The use of zfs_sb_t instead of zfsvfs_t results in unnecessary conflicts with the upstream source. Change all instances of zfs_sb_t to zfsvfs_t including updating the variables names. Whenever possible the code was updated to be consistent with hope it appears in the upstream OpenZFS source. Signed-off-by: Brian Behlendorf <[email protected]>
* OpenZFS 6328 - Fix cstyle errors in zfs codebaseGeorge Melikov2017-01-121-2/+2
| | | | | | | | | | | | | | Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Jorgen Lundman <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: George Melikov <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6328 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9a686fb Closes #5579
* Fix spellingka72017-01-031-1/+1
| | | | | | | | | Reviewed-by: Brian Behlendorf <[email protected] Reviewed-by: Giuseppe Di Natale <[email protected]>> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Haakan T Johansson <[email protected]> Closes #5547 Closes #5543
* Fix coverity defectsluozhengzheng2016-09-211-2/+2
| | | | | | | | | | | | | coverity scan CID:147633,type: sizeof not portable coverity scan CID:147637,type: sizeof not portable coverity scan CID:147638,type: sizeof not portable coverity scan CID:147640,type: sizeof not portable In these particular cases sizeof (XX **) happens to be equal to sizeof (X *), but this is not a portable assumption. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: luozhengzheng <[email protected]> Closes #5144
* Simplify time handling logic in zfs_settattrNikolay Borisov2016-09-131-0/+2
| | | | | | | | | | | | | | | | | Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the linux kernel. In order to achieve this in the case when ZFS' log is being replayed it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1, however in the future rules about precision might change. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Closes #4916
* Implement large_dnode pool featureNed Bass2016-06-241-9/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3542
* Add pn_alloc()/pn_free() functionsBrian Behlendorf2016-04-211-2/+2
| | | | | | | | | | | | | | In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and pn_free() functions must be implemented. The existing illumos implementation were used for this purpose. The `flags` argument which was used in places wrapped by the HAVE_PN_UTILS condition has beed added back to zfs_remove() and zfs_link() functions. This removes a small point of divergence between the ZoL code and upstream. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4522
* Linux AIO SupportRichard Yao2014-09-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfsd uses do_readv_writev() to implement fops->read and fops->write. do_readv_writev() will attempt to read/write using fops->aio_read and fops->aio_write, but it will fallback to fops->read and fops->write when AIO is not available. However, the fallback will perform a call for each individual data page. Since our default recordsize is 128KB, sequential operations on NFS will generate 32 DMU transactions where only 1 transaction was needed. That was unnecessary overhead and we implement fops->aio_read and fops->aio_write to eliminate it. ZFS originated in OpenSolaris, where the AIO API is entirely implemented in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux filesystems therefore must implement their own AIO logic and nearly all of them implement fops->aio_write synchronously. Consequently, they do not implement aio_fsync(). However, since the ZPL works by mapping Linux's VFS calls to the functions implementing Illumos' VFS operations, we instead implement AIO in the kernel by mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement fops->aio_fsync. One might be inclined to make our fops->aio_write implementation synchronous to make software that expects this behavior safe. However, there are several reasons not to do this: 1. Other platforms do not implement aio_write() synchronously and since the majority of userland software using AIO should be cross platform, expectations of synchronous behavior should not be a problem. 2. We would hurt the performance of programs that use POSIX interfaces properly while simultaneously encouraging the creation of more non-compliant software. 3. The broader community concluded that userland software should be patched to properly use POSIX interfaces instead of implementing hacks in filesystems to cater to broken software. This concept is best described as the O_PONIES debate. 4. Making an asynchronous write synchronous is non sequitur. Any software dependent on synchronous aio_write behavior will suffer data loss on ZFSOnLinux in a kernel panic / system failure of at most zfs_txg_timeout seconds, which by default is 5 seconds. This seems like a reasonable consequence of using non-compliant software. It should be noted that this is also a problem in the kernel itself where nfsd does not pass O_SYNC on files opened with it and instead relies on a open()/write()/close() to enforce synchronous behavior when the flush is only guarenteed on last close. Exporting any filesystem that does not implement AIO via NFS risks data loss in the event of a kernel panic / system failure when something else is also accessing the file. Exporting any file system that implements AIO the way this patch does bears similar risk. However, it seems reasonable to forgo crippling our AIO implementation in favor of developing patches to fix this problem in Linux's nfsd for the reasons stated earlier. In the interim, the risk will remain. Failing to implement AIO will not change the problem that nfsd created, so there is no reason for nfsd's mistake to block our implementation of AIO. It also should be noted that `aio_cancel()` will always return `AIO_NOTCANCELED` under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()` to set a callback function for cancelling work sent to taskqs, but the simpler approach is allowed by the specification: ``` Which operations are cancelable is implementation-defined. ``` http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html The only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio use it according to a recursive grep of my system's `/usr/src/debug`. That suggests that `aio_cancel()` users are rare. Implementing aio_cancel() is left to a future date when it is clear that there are consumers that benefit from its implementation to justify the work. Lastly, it is important to know that handling of the iovec updates differs between Illumos and Linux in the implementation of read/write. On Linux, it is the VFS' responsibility whle on Illumos, it is the filesystem's responsibility. We take the intermediate solution of copying the iovec so that the ZFS code can update it like on Solaris while leaving the originals alone. This imposes some overhead. We could always revisit this should profiling show that the allocations are a problem. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #223 Closes #2373
* Illumos #3598Matthew Ahrens2013-10-311-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | 3598 want to dtrace when errors are generated in zfs Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/3598 illumos/illumos-gate@be6fd75a69ae679453d9cda5bff3326111e6d1ca Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1775 Porting notes: 1. include/sys/zfs_context.h has been modified to render some new macros inert until dtrace is available on Linux. 2. Linux-specific changes have been adapted to use SET_ERROR(). 3. I'm NOT happy about this change. It does nothing but ugly up the code under Linux. Unfortunately we need to take it to avoid more merge conflicts in the future. -Brian
* Constify structures containing function pointersRichard Yao2013-03-041-22/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PaX team modified the kernel's modpost to report writeable function pointers as section mismatches because they are potential exploit targets. We could ignore the warnings, but their presence can obscure actual issues. Proper const correctness can also catch programming mistakes. Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2 kernel reports 133 section mismatches prior to this patch. This patch eliminates 130 of them. The quantity of writeable function pointers eliminated by constifying each structure is as follows: vdev_opts_t 52 zil_replay_func_t 24 zio_compress_info_t 24 zio_checksum_info_t 9 space_map_ops_t 7 arc_byteswap_func_t 5 The remaining 3 writeable function pointers cannot be addressed by this patch. 2 of them are in zpl_fs_type. The kernel's sget function requires that this be non-const. The final writeable function pointer is created by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and remove_shrinker() functions also require that this be non-const. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1300
* ZFS replay transaction error 5Cyril Plisko2012-09-171-8/+7
| | | | | | | | | | | | | | | | | | | | When zfs_replay_write() replays TX_WRITE records from ZIL it calls zpl_write_common() to perform the actual write. zpl_write_common() returns the number of bytes written (similar to write() system call) or an (negative) error. However, the code expects the positive return value to be a residual counter. Thus when zpl_write_common() successfully completes it is mistakenly considered to be a partial write and the error code delivered further. At this point the ZIL processing is aborted with famous "ZFS replay transaction error 5" error message given to the message buffer. The fix is to compare the zpl_write_commmon() return value with the buffer size and flag error only when they disagree. Signed-off-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #933
* Use Linux ATTR_ versionsBrian Behlendorf2011-03-031-2/+2
| | | | | | The AT_ versions of these macros are used on Solaris and while they map to their Linux equivilants the code has been updated to use the ATTR_ versions.
* Drop HAVE_XVATTR macrosBrian Behlendorf2011-03-021-67/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I began work on the Posix layer it immediately became clear to me that to integrate cleanly with the Linux VFS certain Solaris specific things would have to go. One of these things was to elimate as many Solaris specific types from the ZPL layer as possible. They would be replaced with their Linux equivalents. This would not only be good for performance, but for the general readability and health of the code. The Solaris and Linux VFS are different beasts and should be treated as such. Most of the code remains common for constructing transactions and such, but there are subtle and important differenced which need to be repsected. This policy went quite for for certain types such as the vnode_t, and it initially seemed to be working out well for the vattr_t. There was a relatively small amount of related xvattr_t code I was forced to comment out with HAVE_XVATTR. But it didn't look that hard to come back soon and replace it all with a native Linux type. However, after going doing this path with xvattr some distance it clear that this code was woven in the ZPL more deeply than I thought. In particular its hooks went very deep in to the ZPL replay code and replacing it would not be as easy as I originally thought. Rather than continue persuing replacing and removing this code I've taken a step back and reevaluted things. This commit reverts many of my previous commits which removed xvattr related code. It restores much of the code to its original upstream state and now relies on improved xvattr_t support in the zfs package itself. The result of this is that much of the code which I had commented out, which accidentally broke things like replay, is now back in place and working. However, there may be a small performance impact for getattr/setattr operations because they now require a translation from native Linux to Solaris types. For now that's a price I'm willing to pay. Once everything is completely functional we can revisting the issue of removing the vattr_t/xvattr_t types. Closes #111
* Wrap with HAVE_XVATTRBrian Behlendorf2011-02-101-151/+177
| | | | | | For the moment exactly how to handle xvattr is not clear. This change largely consists of the code to comment out the offending bits until something reasonable can be done.
* Replace VOP_* calls with direct zfs_* callsBrian Behlendorf2011-02-101-15/+16
| | | | | These generic Solaris wrappers are no longer required. Simply directly call the correct zfs functions for clarity.
* Add linux kernel disk supportBrian Behlendorf2010-08-311-0/+3
| | | | | | Native Linux vdev disk interfaces Signed-off-by: Brian Behlendorf <[email protected]>
* Update to onnv_147Brian Behlendorf2010-08-261-17/+44
| | | | | This is the last official OpenSolaris tag before the public development tree was closed.
* Update core ZFS code from build 121 to build 141.Brian Behlendorf2010-05-281-47/+73
|
* Move the world out of /zfs/ and seperate out module build treeBrian Behlendorf2008-12-111-0/+878