aboutsummaryrefslogtreecommitdiffstats
path: root/module/zfs/zfs_vfsops.c
Commit message (Collapse)AuthorAgeFilesLines
* OpenZFS restructuring - move platform specific sourcesMatthew Macy2019-09-061-2562/+0
| | | | | | | | | | | | Move platform specific Linux source under module/os/linux/ and update the build system accordingly. Additional code restructuring will follow to make the common code fully portable. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Closes #9206
* Fix typos in module/zfs/Andrea Gelmini2019-09-021-2/+2
| | | | | | | | Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #9240
* Fix deadlock in 'zfs rollback'Tom Caputi2019-08-271-1/+15
| | | | | | | | | | | | | | | | | | | | Currently, the 'zfs rollback' code can end up deadlocked due to the way the kernel handles unreferenced inodes on a suspended fs. Essentially, the zfs_resume_fs() code path may cause zfs to spawn new threads as it reinstantiates the suspended fs's zil. When a new thread is spawned, the kernel may attempt to free memory for that thread by freeing some unreferenced inodes. If it happens to select inodes that are a a part of the suspended fs a deadlock will occur because freeing inodes requires holding the fs's z_teardown_inactive_lock which is still held from the suspend. This patch corrects this issue by adding an additional reference to all inodes that are still present when a suspend is initiated. This prevents them from being freed by the kernel for any reason. Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9203
* Make txg_wait_synced conditional in zfsvfs_teardownPaul Zuchowski2019-08-151-1/+10
| | | | | | | | | | | The call to txg_wait_synced in zfsvfs_teardown should be made conditional on the objset having dirty data. This can prevent unnecessary txg_wait_synced during some unmount operations. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #9115
* Implement Redacted Send/ReceivePaul Dagnelie2019-06-191-1/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Prashanth Sreenivasa <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Chris Williamson <[email protected]> Reviewed-by: Pavel Zhakarov <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #7958
* Add missing trailing '\n' in printk() messagesTomohiro Kusumi2019-05-081-1/+1
| | | | | | | | | These messages will want '\n' like any other regular printk() messages. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8726
* Fix `zfs set atime|relatime=off|on` behavior on inherited datasetsTomohiro Kusumi2019-05-071-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `zfs set atime|relatime=off|on` doesn't disable or enable the property on read for datasets whose property was inherited from parent, until a dataset is once unmounted and mounted again. (The properties start to work properly if a dataset is once unmounted and mounted again. The difference comes from regular mount process, e.g. via zpool import, uses mount options based on properties read from ondisk layout for each dataset, whereas `zfs set atime|relatime=off|on` just remounts a specified dataset.) -- # zpool create p1 <device> # zfs create p1/f1 # zfs set atime=off p1 # echo test > /p1/f1/test # sync # zfs list NAME USED AVAIL REFER MOUNTPOINT p1 176K 18.9G 25.5K /p1 p1/f1 26K 18.9G 26K /p1/f1 # zfs get atime NAME PROPERTY VALUE SOURCE p1 atime off local p1/f1 atime off inherited from p1 # stat /p1/f1/test | grep Access | tail -1 Access: 2019-04-26 23:32:33.741205192 +0900 # cat /p1/f1/test test # stat /p1/f1/test | grep Access | tail -1 Access: 2019-04-26 23:32:50.173231861 +0900 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2) -- The problem is that zfsvfs::z_atime which was probably intended to keep incore atime state just gets updated by a callback function of "atime" property change, atime_changed_cb(), and never used for anything else. Since now that all file read and atime update use a common function zpl_iter_read_common() -> file_accessed(), and whether to update atime via ->dirty_inode() is determined by atime_needs_update(), atime_needs_update() needs to return false once atime is turned off. It currently continues to return true on `zfs set atime=off`. Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super block depending on a new atime value, so that atime_needs_update() works as expected after property change. The same problem applies to "relatime" except that a self contained relatime test is needed. This is because relatime_need_update() is based on a mount option flag MNT_RELATIME, which doesn't exist in datasets with inherited "relatime" property via `zfs set relatime=...`, hence it needs its own relatime test zfs_relatime_need_update(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8674 Closes #8675
* Fix memory leaks in zfsvfs_create_impl()Tom Caputi2019-03-141-9/+14
| | | | | | | | | | This patch simply fixes some small memory leaks that can happen during error handling in zfsvfs_create_impl(). If the function fails, it frees all the memory / references it created. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8490
* port async unlinked drain from illumos-nexentaAlek P2019-02-121-5/+31
| | | | | | | | | | | | | | | | | This patch is an async implementation of the existing sync zfs_unlinked_drain() function. This function is called at mount time and is responsible for freeing znodes that we didn't get to freeing before. We don't have to hold mounting of the dataset until the unlinked list is fully drained as is done now. Since we can process the unlinked set asynchronously this results in a better user experience when mounting a dataset with entries in the unlinked set. Reviewed by: Jorgen Lundman <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #8142
* Linux 5.0 compat: Convert MS_* macros to SB_*Tony Hutter2019-01-281-11/+12
| | | | | | | | | | | In the 5.0 kernel, only the mount namespace code should use the MS_* macos. Filesystems should use the SB_* ones. https://patchwork.kernel.org/patch/10552493/ Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #8264
* Remove zfs_sync() panicking kernel checkBrian Behlendorf2019-01-181-7/+0
| | | | | | | | | | | | This check provides no real additional protection and unnecessarily introduces a dependency on the "oops_in_progress" kernel symbol. Remove the check, it there are special circumstances on other platforms which make this a requirement it can be reintroduced for all relevant call paths in a more portable comprehensive manor. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8297
* zfs filesystem skipped by df -hPaul Zuchowski2019-01-131-0/+2
| | | | | | | | | | On full pool when pool root filesystem references very few bytes, the f_blocks returned to statvfs is 0 but should be at least 1. Reviewed by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #8253 Closes #8254
* Fix statfs(2) for 32-bit user spaceBrian Behlendorf2018-09-241-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | When handling a 32-bit statfs() system call the returned fields, although 64-bit in the kernel, must be limited to 32-bits or an EOVERFLOW error will be returned. This is less of an issue for block counts since the default reported block size in 128KiB. But since it is possible to set a smaller block size, these values will be scaled as needed to fit in a 32-bit unsigned long. Unlike most other filesystems the total possible file counts are more likely to overflow because they are calculated based on the available free space in the pool. In order to prevent this the reported value must be capped at 2^32-1. This is only for statfs(2) reporting, there are no changes to the internal ZFS limits. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #7927 Closes #7122 Closes #7937
* Always wait for txg sync when umounting datasetTom Caputi2018-08-271-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | Currently, when unmounting a filesystem, ZFS will only wait for a txg sync if the dataset is dirty and not readonly. However, this can be problematic in cases where a dataset is remounted readonly immediately before being unmounted, which often happens when the system is being shut down. Since encrypted datasets require that all I/O is completed before the dataset is disowned, this issue causes problems when write I/Os leak into the txgs after the dataset is disowned, which can happen when sync=disabled. While looking into fixes for this issue, it was discovered that dsl_dataset_is_dirty() does not return B_TRUE when the dataset has been removed from the txg dirty datasets list, but has not actually been processed yet. Furthermore, the implementation is comletely different from dmu_objset_is_dirty(), adding to the confusion. Rather than relying on this function, this patch forces the umount code path (and the remount readonly code path) to always perform a txg sync on read-write datasets and removes the function altogether. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7753 Closes #7795
* Introduce read/write kstats per datasetSerapheim Dimitropoulos2018-08-201-1/+5
| | | | | | | | | | | | | The following patch introduces a few statistics on reads and writes grouped by dataset. These statistics are implemented as kstats (backed by aggregate sums for performance) and can be retrieved by using the dataset objset ID number. The motivation for this change is to provide some preliminary analytics on dataset usage/performance. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #7705
* OpenZFS 9337 - zfs get all is slow due to uncached metadataMatthew Ahrens2018-07-121-4/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This project's goal is to make read-heavy channel programs and zfs(1m) administrative commands faster by caching all the metadata that they will need in the dbuf layer. This will prevent the data from being evicted, so that any future call to i.e. zfs get all won't have to go to disk (very much). There are two parts: The dbuf_metadata_cache. We identify what to put into the cache based on the object type of each dbuf. Caching objset properties os {version,normalization,utf8only,casesensitivity} in the objset_t. The reason these needed to be cached is that although they are queried frequently, they aren't stored in a dbuf type which we can easily recognize and cache in the dbuf layer; instead, we have to explicitly store them. There's already existing infrastructure for maintaining cached properties in the objset setup code, so I simply used that. Performance Testing: - Disabled kmem_flags - Tuned dbuf_cache_max_bytes very low (128K) - Tuned zfs_arc_max very low (64M) Created test pool with 400 filesystems, and 100 snapshots per filesystem. Later on in testing, added 600 more filesystems (with no snapshots) to make sure scaling didn't look different between snapshots and filesystems. Results: | Test | Time (trunk / diff) | I/Os (trunk / diff) | +------------------------+---------------------+---------------------+ | zpool import | 0:05 / 0:06 | 12.9k / 12.9k | | zfs get all (uncached) | 1:36 / 0:53 | 16.7k / 5.7k | | zfs get all (cached) | 1:36 / 0:51 | 16.0k / 6.0k | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Thomas Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Alek Pinchuk <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> OpenZFS-issue: https://illumos.org/issues/9337 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f Closes #7668
* Update build system and packagingBrian Behlendorf2018-05-291-11/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_*. * Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> TEST_ZIMPORT_SKIP="yes" Closes #7556
* ZIL claiming should not start user accountingTom Caputi2018-02-201-24/+19
| | | | | | | | | | | | | | | | | | | | Currently, ZIL claiming dirties objsets which causes dsl_pool_sync() to attempt to perform user accounting on them. This causes problems for encrypted datasets that were raw received before the system went offline since they cannot perform user accounting until they have their keys loaded. This triggers an ASSERT in zio_encrypt(). Since encryption was added, the code now depends on the fact that data should only be written when objsets are owned. This patch adds a check in dmu_objset_do_userquota_updates() to ensure that useraccounting is only done when the objsets are actually owned for write. As part of this work, the zfsvfs and zvol code was updated so that it no longer lies about owning objsets readonly. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #6916 Closes #7163
* Project Quota on ZFSNasf-Fan2018-02-131-70/+252
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Project quota is a new ZFS system space/object usage accounting and enforcement mechanism. Similar as user/group quota, project quota is another dimension of system quota. It bases on the new object attribute - project ID. Project ID is a numerical value to indicate to which project an object belongs. An object only can belong to one project though you (the object owner or privileged user) can change the object project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly. The object also can inherit the project ID from its parent when created if the parent has the project inherit flag (that can be set via 'chattr +P' or 'zfs project -s [-p]'). By accounting the spaces/objects belong to the same project, we can know how many spaces/objects used by the project. And if we set the upper limit then we can control the spaces/objects that are consumed by such project. It is useful when multiple groups and users cooperate for the same project, or a user/group needs to participate in multiple projects. Support the following commands and functionalities: zfs set projectquota@project zfs set projectobjquota@project zfs get projectquota@project zfs get projectobjquota@project zfs get projectused@project zfs get projectobjused@project zfs projectspace zfs allow projectquota zfs allow projectobjquota zfs allow projectused zfs allow projectobjused zfs unallow projectquota zfs unallow projectobjquota zfs unallow projectused zfs unallow projectobjused chattr +/-P chattr -p project_id lsattr -p This patch also supports tree quota based on the project quota via "zfs project" commands set as following: zfs project [-d|-r] <file|directory ...> zfs project -C [-k] [-r] <file|directory ...> zfs project -c [-0] [-d|-r] [-p id] <file|directory ...> zfs project [-p id] [-r] [-s] <file|directory ...> For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on the $DIR, then the proejct [obj]quota and [obj]used values for the $DIR's project ID will be shown as the total/free (avail) resource. Keep the same behavior as EXT4/XFS does. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by Ned Bass <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Fan Yong <[email protected]> TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master" Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c Closes #6290
* OpenZFS 7431 - ZFS Channel ProgramsChris Williamson2018-02-081-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Chris Williamson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported-by: Don Brady <[email protected]> Ported-by: John Kennedy <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7431 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533 Porting Notes: * The CLI long option arguments for '-t' and '-m' don't parse on linux * Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc * Lua implementation is built as its own module (zlua.ko) * Lua headers consumed directly by zfs code moved to 'include/sys/lua/' * There is no native setjmp/longjump available in stock Linux kernel. Brought over implementations from illumos and FreeBSD * The get_temporary_prop() was adapted due to VFS platform differences * Use of inline functions in lua parser to reduce stack usage per C call * Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
* Encryption Stability and On-Disk Format FixesTom Caputi2018-02-021-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The on-disk format for encrypted datasets protects not only the encrypted and authenticated blocks themselves, but also the order and interpretation of these blocks. In order to make this work while maintaining the ability to do raw sends, the indirect bps maintain a secure checksum of all the MACs in the block below it along with a few other fields that determine how the data is interpreted. Unfortunately, the current on-disk format erroneously includes some fields which are not portable and thus cannot support raw sends. It is not possible to easily work around this issue due to a separate and much smaller bug which causes indirect blocks for encrypted dnodes to not be compressed, which conflicts with the previous bug. In addition, the current code generates incompatible on-disk formats on big endian and little endian systems due to an issue with how block pointers are authenticated. Finally, raw send streams do not currently include dn_maxblkid when sending both the metadnode and normal dnodes which are needed in order to ensure that we are correctly maintaining the portable objset MAC. This patch zero's out the offending fields when computing the bp MAC and ensures that these MACs are always calculated in little endian order (regardless of the host system's byte order). This patch also registers an errata for the old on-disk format, which we detect by adding a "version" field to newly created DSL Crypto Keys. We allow datasets without a version (version 0) to only be mounted for read so that they can easily be migrated. We also now include dn_maxblkid in raw send streams to ensure the MAC can be maintained correctly. This patch also contains minor bug fixes and cleanups. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #6845 Closes #6864 Closes #7052
* Fix 'zfs get {user|group}objused@' functionalityLOLi2017-11-291-1/+1
| | | | | | | | | | | | | | Fix a regression accidentally introduced in 1b81ab4 that prevents 'zfs get {user|group}objused@' from correctly reporting the requested value. Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to verify this functionality. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6908
* Long hold the dataset during upgradeArkadiusz Bubała2017-11-101-1/+6
| | | | | | | | | | | | | | | If the receive or rollback is performed while filesystem is upgrading the objset may be evicted in `dsl_dataset_clone_swap_sync_impl`. This will lead to NULL pointer dereference when upgrade tries to access evicted objset. This commit adds long hold of dataset during whole upgrade process. The receive and rollback will return an EBUSY error until the upgrade is not finished. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Arkadiusz Bubała <[email protected]> Closes #5295 Closes #6837
* Fix remounting snapshots read-writeLOLi2017-08-171-1/+9
| | | | | | | | | | | | It's not enough to preserve/restore MS_RDONLY on the superblock flags to avoid remounting a snapshot read-write: be explicit about our intentions to the VFS layer so the readonly bit is updated correctly in do_remount_sb(). Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6510 Closes #6515
* Native Encryption for ZFS on LinuxTom Caputi2017-08-141-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change incorporates three major pieces: The first change is a keystore that manages wrapping and encryption keys for encrypted datasets. These commands mostly involve manipulating the new DSL Crypto Key ZAP Objects that live in the MOS. Each encrypted dataset has its own DSL Crypto Key that is protected with a user's key. This level of indirection allows users to change their keys without re-encrypting their entire datasets. The change implements the new subcommands "zfs load-key", "zfs unload-key" and "zfs change-key" which allow the user to manage their encryption keys and settings. In addition, several new flags and properties have been added to allow dataset creation and to make mounting and unmounting more convenient. The second piece of this patch provides the ability to encrypt, decyrpt, and authenticate protected datasets. Each object set maintains a Merkel tree of Message Authentication Codes that protect the lower layers, similarly to how checksums are maintained. This part impacts the zio layer, which handles the actual encryption and generation of MACs, as well as the ARC and DMU, which need to be able to handle encrypted buffers and protected data. The last addition is the ability to do raw, encrypted sends and receives. The idea here is to send raw encrypted and compressed data and receive it exactly as is on a backup system. This means that the dataset on the receiving system is protected using the same user key that is in use on the sending side. By doing so, datasets can be efficiently backed up to an untrusted system without fear of data being compromised. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #494 Closes #5769
* OpenZFS 5428 - provide fts(), reallocarray(), and strtonum()Yuri Pankov2017-07-081-1/+1
| | | | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Robert Mustacchi <[email protected]> Approved by: Joshua M. Clulow <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: * All hunks unrelated to ZFS were dropped. OpenZFS-issue: https://www.illumos.org/issues/5428 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130 Closes #6326
* Linux 4.12 compat: fix super_setup_bdi_name() callLOLi2017-05-251-2/+1
| | | | | | | | | | Provide a format parameter to super_setup_bdi_name() so we don't create duplicate names in '/devices/virtual/bdi' sysfs namespace which would prevent us from mounting more than one ZFS filesystem at a time. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6147
* Linux 4.12 compat: super_setup_bdi_name()Brian Behlendorf2017-05-021-5/+6
| | | | | | | | | | All filesystems were converted to dynamically allocated BDIs. The destruction of backing_dev_info structures is handled as part of super block destruction. Refactor the code to abstract away the details of creating and destroying a BDI. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6089
* OpenZFS 6101 - attempt to lzc_create() a filesystem under a volume results ↵Andriy Gapon2017-04-141-1/+3
| | | | | | | | | | | | | | | | | in a panic Authored by: Andriy Gapon <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> When querying ZPL properties verify that the objset is of type DMU_OST_ZFS. OpenZFS-issue: https://www.illumos.org/issues/6101 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce2243a Closes #6015
* OpenZFS 6874 - rollback and receive need to reset ZPL state to what's on diskMatthew Ahrens2017-03-131-106/+116
| | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Ported-by: Brian Behlendorf <[email protected]> When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the ZPL doesn't completely reload the state from the DMU; some values remain cached in the zfsvfs_t. OpenZFS-issue: https://www.illumos.org/issues/6874 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fdcbd0 Closes #5888
* Restructure mount option handlingBrian Behlendorf2017-03-101-66/+251
| | | | | | | | | | | | | | | | | | | | | | Restructure the handling of mount options to be consistent with upstream OpenZFS. This required making the following changes. - The zfs_mntopts_t was renamed vfs_t and adjusted to provide the minimal needed functionality. This includes a pointer back to the associated zfsvfs_t. Plus it made it possible to revert zfs_register_callbacks() and zfsvfs_create() back to their original prototypes. - A zfs_mnt_t structure was added for the sole purpose of providing a structure to pass the osname and raw mount pointer to zfs_domount() without having to copy them. - Mount option parsing was moved down from the zpl_* wrapper functions in to the zfs_* functions. This allowed for the code to be simplied and it's where similar functionality appears on other platforms. Signed-off-by: Brian Behlendorf <[email protected]>
* Rename zfs_* functionsBrian Behlendorf2017-03-101-41/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | Several functions were renamed when ZFS was originally ported to Linux. Revert the code to the original names to minimize the delta with upstream OpenZFS. zfs_sb_teardown -> zfsvfs_teardown zfs_sb_create -> zfsvfs_create zfs_sb_setup -> zfsvfs_setup zfs_sb_free -> zfsvfs_free get_zfs_sb -> getzfsvfs zfs_sb_hold -> zfsvfs_hold zfs_sb_rele -> zfsvfs_rele zfs_sb_prune_aliases -> zfs_prune_aliases (Linux-only) zfs_sb_prune -> zfs_prune (Linux only) Align the zfs_vnops.h and zfs_vfsops.h with upstream as much as possible. Several prototypes were removed and those that remain were reordered. Move the EXPORT_SYMBOL lines to the end of the source files for consistency with the other source files. Signed-off-by: Brian Behlendorf <[email protected]>
* Rename zfs_sb_t -> zfsvfs_tBrian Behlendorf2017-03-101-348/+351
| | | | | | | | | | | The use of zfs_sb_t instead of zfsvfs_t results in unnecessary conflicts with the upstream source. Change all instances of zfs_sb_t to zfsvfs_t including updating the variables names. Whenever possible the code was updated to be consistent with hope it appears in the upstream OpenZFS source. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix nfs snapdir automountChunwei Chen2017-03-081-17/+24
| | | | | | | | | | | | | | | | | | | | | | | | The current implementation for allowing nfs to access snapdir is very buggy. It uses a special fh for snapdirs, such that the next time nfsd does fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd never knows it cross a mountpoint. The problem is that nfsd will not hold a reference on the vfsmount of the snapshot. This cause auto unmounter to unmount the snapshot even though nfs is still holding dentries in it. To fix this, we return the inode for the snapdirs themselves. However, we also trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate and see the mountpoint and do crossmnt. Because nfsd will now be aware that these are different filesystems users must add crossmnt to their export options to access snapshot directories. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #3794 Closes #4716 Closes #5810 Closes #5833
* OpenZFS 7301 - zpool export -f should be able to interrupt file freeingGeorge Melikov2017-01-271-2/+24
| | | | | | | | | | | | | | | Authored by: Alek Pinchuk <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Sanjay Nadkarni <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: John Kennedy <[email protected]> Approved by: Gordon Ross <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: George Melikov <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7301 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb72182 Closes #5680
* OpenZFS 7180 - potential race between zfs_suspend_fs+zfs_resume_fs and ↵George Melikov2017-01-231-7/+6
| | | | | | | | | | | | | | zfs_ioc_rename Authored by: Andriy Gapon <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Approved by: Richard Lowe <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: George Melikov <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7180 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/690041b Closes #5627
* OpenZFS 7181 - race between zfs_mount and zfs_ioc_rollbackBrian Behlendorf2017-01-131-7/+7
| | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Gordon Ross <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7181 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90f2c09 Closes #5585
* Fix zmo leak when zfs_sb_create failsChunwei Chen2016-12-191-10/+9
| | | | | | | | | zfs_sb_create would normally takes ownership of zmo, and it will be freed in zfs_sb_free. However, when zfs_sb_create fails we need to explicit free it. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5490 Closes #5496
* Use cstyle -cpP in `make cstyle` checkBrian Behlendorf2016-12-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes. Non-whitespace changes are as follows: * 8 times ; to { } in for/while loop * fix missing ; in cmd/zed/agents/zfs_diagnosis.c * comment (confim -> confirm) * change endline , to ; in cmd/zpool/zpool_main.c * a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks * /* CSTYLED */ markers * change == 0 to ! * ulong to unsigned long in module/zfs/dsl_scan.c * rearrangement of module_param lines in module/zfs/metaslab.c * add { } block around statement after for_each_online_node Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Håkan Johansson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5465
* Use system_delay_taskq for long delay tasksChunwei Chen2016-12-011-0/+1
| | | | | | | | | Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire. This free system_taskq from the above long delay tasks, and allow us to do taskq_wait_outstanding on system_taskq without being blocked forever, making system_taskq more generic and useful. Signed-off-by: Chunwei Chen <[email protected]>
* Batch free zpl_posix_acl_releaseChunwei Chen2016-11-071-1/+4
| | | | | | | | | | | | | | | | | | | | | | | Currently every calls to zpl_posix_acl_release will schedule a delayed task, and each delayed task will add a timer. This used to be fine except for possibly bad performance impact. However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In this new implementation, the larger the delay, the less accuracy the timer is. So when we have a flood of timer from zpl_posix_acl_release, they will expire at the same time. Couple with the fact that task_expire will do linear search with lock held. This causes an extreme amount of contention inside interrupt and would actually lockup the system. We fix this by doing batch free to prevent a flood of delayed task. Every call to zpl_posix_acl_release will put the posix_acl to be freed on a lockless list. Every batch window, 1 sec, the zpl_posix_acl_free will fire up and free every posix_acl that passed the grace period on the list. This way, we only have one delayed task every second. [1] https://lwn.net/Articles/646950/ Signed-off-by: Chunwei Chen <[email protected]>
* Fix coverity defects: CID 49339, 153393cao2016-10-181-1/+1
| | | | | | | | CID 49339: Type:Buffer not null terminated CID 153393: Type:Buffer not null terminated Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: <cao.xuewen [email protected]> Closes #5296
* Add support for user/group dnode accounting & quotaJinshan Xiong2016-10-071-13/+99
| | | | | | | | | | | | | | | | This patch tracks dnode usage for each user/group in the DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode accounting have the key prefixed with "obj-" followed by the UID/GID in string format (as done for the block accounting). A new SPA feature has been added for dnode accounting as well as a new ZPL version. The SPA feature must be enabled in the pool before upgrading the zfs filesystem. During the zfs version upgrade, a "quotacheck" will be executed by marking all dnode as dirty. ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500 Signed-off-by: Jinshan Xiong <[email protected]> Signed-off-by: Johann Lombardi <[email protected]>
* Remove znode's z_uid/z_gid memberNikolay Borisov2016-07-251-1/+2
| | | | | | | | | | | | | | | Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid|gid)_(read|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227
* Fix NULL pointer in zfs_preumount from 1d9b3bdChunwei Chen2016-07-201-20/+29
| | | | | | | | | | | | | | When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount. In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4867 Issue #4854
* Wait iput_async before evict_inodes to prevent raceChunwei Chen2016-07-191-0/+18
| | | | | | | | | | | | | Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4854
* fh_to_dentry should return ESTALE when generation mismatchChunwei Chen2016-07-121-1/+1
| | | | | | | | | | When generation mismatch, it usually means the file pointed by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828
* Don't allow accessing XATTR via export handleChunwei Chen2016-07-121-0/+8
| | | | | | | | | | Allow accessing XATTR through export handle is a very bad idea. It would allow user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828
* OpenZFS 6314 - buffer overflow in dsl_dataset_nameIgor Kozhukhov2016-06-281-1/+1
| | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6314 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee
* Linux 4.6 compat: Fall back to d_prune_aliases() if necessaryTim Chase2016-06-171-2/+15
| | | | | | | | | | | | | | | As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg. This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliaes() method in case the kernel's per-superblock shrinker is not able to free anything. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes: #4726