aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux
Commit message (Collapse)AuthorAgeFilesLines
...
* Update utsname supportBrian Behlendorf2014-10-172-1/+31
| | | | | | | | | | | | | | | | | | Modify the code to use the utsname() kernel function rather than a global variable. This results is cleaner more portable code because utsname() is already provided by the kernel and can be easily emulated in user space via uname(2). This means that it will behave consistently in both contexts. This is also has the benefit that it allows the removal of a few _KERNEL pre-processor conditions. And it also is a pre-requisite for a proper FUSE port because we need to provide a valid utsname. Finally, it allows us to remove this functionality from the SPL and all the related compatibility code. Signed-off-by: Brian Behlendorf <[email protected]> Issue #2757
* Include sys/taskq.h in linux/vfs_compat.hRichard Yao2014-08-141-0/+2
| | | | | | | | | | | | We should have included sys/taskq.h directly because we use the taskq code here, but we instead had files that included sys/taskq.h also include sys/kmem.h, which happened to include sys/taskq.h. sys/kmem.h no longer does this, so we must define the include as we should have done in the first place. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2411
* Allow building without ACLsChris Wedgwood2014-05-301-15/+15
| | | | | | | | | | Some kernel definitions were buried inside the #if... #endif logic for ACLs. When ACLs are not available these definitions get lost causing the build to fail. Signed-off-by: Chris Wedgwood <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2349
* Implement File Attribute SupportRichard Yao2014-05-011-3/+0
| | | | | | | | | | | | | | | | | | | | | | | We add support for lsattr and chattr to resolve a regression caused by 88c283952f0bfeab54612f9ce666601d83c4244f that broke Python's xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr, which depended on Python's xattr.list(). Only attributes common to both Solaris and Linux are supported. These are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File attributes exclusive to Solaris are present in the ZFS code, but cannot be accessed or modified through this method. That was the case prior to this patch. The resolution of issue zfsonlinux/zfs#229 should implement some method to permit access and modification of Solaris-specific attributes. References: https://bugs.gentoo.org/show_bug.cgi?id=483516 Original-patch-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1691
* Linux 3.14 compat: rq_for_each_segment in dmu_req_copyChunwei Chen2014-04-101-0/+24
| | | | | | | | | | rq_for_each_segment changed from taking bio_vec * to taking bio_vec. We provide rq_for_each_segment4 which takes both. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2124
* Linux 3.14 compat: Immutable biovec changes in vdev_disk.cChunwei Chen2014-04-101-0/+10
| | | | | | | | | | bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter. This patch creates BIO_BI_*(bio) macros to hide the differences. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2124
* Linux 3.14 compat: posix_acl_{create,chmod}Chunwei Chen2014-04-101-3/+8
| | | | | | | | | posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod} Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2124
* cstyle: Resolve C style issuesMichael Kjorling2013-12-184-105/+115
| | | | | | | | | | | | | | | | | | The vast majority of these changes are in Linux specific code. They are the result of not having an automated style checker to validate the code when it was originally written. Others were caused when the common code was slightly adjusted for Linux. This patch contains no functional changes. It only refreshes the code to conform to style guide. Everyone submitting patches for inclusion upstream should now run 'make checkstyle' and resolve any warning prior to opening a pull request. The automated builders have been updated to fail a build if when 'make checkstyle' detects an issue. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1821
* Honor CONFIG_FS_POSIX_ACL kernel optionMassimo Maggi2013-11-051-0/+2
| | | | | | | | | | | | | | | | The required Posix ACL interfaces are only available for kernels with CONFIG_FS_POSIX_ACL defined. Therefore, only enable Posix ACL support for these kernels. All major distribution kernels enable CONFIG_FS_POSIX_ACL by default. If your kernel does not support Posix ACLs the following warning will be printed at ZFS module load time. "ZFS: Posix ACLs disabled by kernel" Signed-off-by: Massimo Maggi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1825
* Posix ACL SupportMassimo Maggi2013-10-292-0/+177
| | | | | | | | | | | | | | | | | | | | | | | | This change adds support for Posix ACLs by storing them as an xattr which is common practice for many Linux file systems. Since the Posix ACL is stored as an xattr it will not overwrite any existing ZFS/NFSv4 ACLs which may have been set. The Posix ACL will also be non-functional on other platforms although it may be visible as an xattr if that platform understands SA based xattrs. By default Posix ACLs are disabled but they may be enabled with the new 'aclmode=noacl|posixacl' property. Set the property to 'posixacl' to enable them. If ZFS/NFSv4 ACL support is ever added an appropriate acltype will be added. This change passes the POSIX Test Suite cleanly with the exception of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too). http://www.tuxera.com/community/posix-test-suite/ Signed-off-by: Massimo Maggi <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #170
* Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()Li Dongyang2013-07-021-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | The approach taken was the rework zfs_holey() as little as possible and then just wrap the code as needed to ensure correct locking and error handling. Tested with xfstests 285 and 286. All tests pass except for 7-9 of 285 which try to reserve blocks first via fallocate(2) and fail because fallocate(2) is not yet supported. Note that the filp->f_lock spinlock did not exist prior to Linux 2.6.30, but we avoid the need for autotools check by virtue of the fact that SEEK_DATA/SEEK_HOLE support was not added until Linux 3.1. An autoconf check was added for lseek_execute() which is currently a private function but the expectation is that it will be exported perhaps as early as Linux 3.11. Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1384
* Return -EOPNOTSUPP for ZFS_IOC_{GET|SET}FLAGSBrian Behlendorf2013-06-261-0/+3
| | | | | | | | | | Until these hooks are fully implemented return the expected -EOPNOTSUPP error to indicate they are not functional. This allows test suites such as xfstests to cleanly skip testing this functionality until it's implemented. Signed-off-by: Brian Behlendorf <[email protected]> Issue #229
* Fix compile warning on 32-bit systemsYing Zhu2013-06-191-1/+1
| | | | | | | | | | | | | The definition of zfs_vdev_holder casts VDEV_HOLDER into a function pointer passing to linux kernel's block layer function blkdev_get_by_path. However current VDEV_HOLDER is defined to be wider than 32 bits and the compiler warns about potential overflows. Instead of specifying different values for 32-bit and 64-bit systems using ifdefs, choose the common factor 32-bit addresses. Redefine VDEV_HOLDER to 0x2401de7("zholder") here. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1520
* Change zfs-kmod-devel install pathBrian Behlendorf2013-03-131-1/+1
| | | | | | | | | | | | | | | Install the common zfs kernel development headers under /usr/src/zfs-<version>/ rather than in a kernel specific directory. The kernel specific build products such as zfs_config.h and Modules.symvers are left installed under /usr/src/zfs-<version>/<kernel>. This was done to be consistent with where dkms expects kernel module source to be packaged. It also allows for a common zfs-kmod-devel package which includes the headers, and per-kernel zfs-kmod-devel-<kernel> packages. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix hot sparesBrian Behlendorf2013-03-011-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue with hot spares in ZoL is because it opens all leaf vdevs exclusively (O_EXCL). On Linux, exclusive opens cause subsequent exclusive opens to fail with EBUSY. This could be resolved by not opening any of the devices exclusively, which is what Illumos does, but the additional protection offered by exclusive opens is desirable. It cleanly prevents you from accidentally adding an in-use non-ZFS device to your pool. To fix this we very slightly relaxed the usage of O_EXCL in the following ways. 1) Functions which open the device but only read had the O_EXCL flag removed and were updated to use O_RDONLY. 2) A common holder was added to the vdev disk code. This allow the ZFS code to internally open the device multiple times but non-ZFS callers may not. 3) An exception was added to make_disks() for hot spare when creating partition tables. For hot spare devices which are already opened exclusively we skip creating the partition table because this must already have been done when the disk was originally added as a hot spare. Additional minor changes include fixing check_in_use() to use a partition instead of a slice suffix. And is_spare() was moved above make_disks() to avoid adding a forward reference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #250
* Linux 2.6.26 compat, lookup_bdev()Brian Behlendorf2013-01-281-0/+9
| | | | | | | | | | | | | | | | | | | It's doubtful many people were impacted by this but commit 6c28567 accidentally broke ZFS builds for 2.6.26 and earlier kernels. This commit depends on the lookup_bdev() function which exists in 2.6.26 but wasn't exported until 2.6.27. The availability of the function isn't critical so a wrapper is introduced which returns ERR_PTR(-ENOTSUP) when the function isn't defined. This will have the effect of causing zvol_is_zvol() to always fail for 2.6.26 kernels. This in turn means vdevs will always get opened concurrently which is good for normal usage. This will only become an issue if your using a zvol as a vdev in another pool. In which case you really should be using a newer kernel anyway. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1205
* Add d_clear_d_op() compatibilityBrian Behlendorf2013-01-231-0/+20
| | | | | | | | | | | | | Added d_clear_d_op() helper function which clears some flags and the registered dentry->d_op table. This is required because d_set_d_op() issues a warning when the dentry operations table is already set. For the .zfs control directory to work properly we must be able to override the default operations table and register custom .d_automount and .d_revalidate callbacks. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #1230
* Fix 'zfs rollback' on mounted file systemsBrian Behlendorf2013-01-171-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rolling back a mounted filesystem with open file handles and cached dentries+inodes never worked properly in ZoL. The major issue was that Linux provides no easy mechanism for modules to invalidate the inode cache for a file system. Because of this it was possible that an inode from the previous filesystem would not get properly dropped from the cache during rolling back. Then a new inode with the same inode number would be create and collide with the existing cached inode. Ideally this would trigger an VERIFY() but in practice the error wasn't handled and it would just NULL reference. Luckily, this issue can be resolved by sprucing up the existing Solaris zfs_rezget() functionality for the Linux VFS. The way it works now is that when a file system is rolled back all the cached inodes will be traversed and refetched from disk. If a version of the cached inode exists on disk the in-core copy will be updated accordingly. If there is no match for that object on disk it will be unhashed from the inode cache and marked as stale. This will effectively make the inode unfindable for lookups allowing the inode number to be immediately recycled. The inode will then only be accessible from the cached dentries. Subsequent dentry lookups which reference a stale inode will result in the dentry being invalidated. Once invalidated the dentry will drop its reference on the inode allowing it to be safely pruned from the cache. Special care is taken for negative dentries since they do not reference any inode. These dentires will be invalidate based on when they were added to the dentry cache. Entries added before the last rollback will be invalidate to prevent them from masking real files in the dataset. Two nice side effects of this fix are: * Removes the dependency on spl_invalidate_inodes(), it can now be safely removed from the SPL when we choose to do so. * zfs_znode_alloc() no longer requires a dentry to be passed. This effectively reverts this portition of the code to its upstream counterpart. The dentry is not instantiated more correctly in the Linux ZPL layer. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #795
* Fix false ENOENT on snapshot control dentriesNed Bass2013-01-161-0/+25
| | | | | | | | | | | | | | | | | | | | | | Lookups in the snapshot control directory for an existing snapshot fail with ENOENT if an earlier lookup failed before the snapshot was created. This is because the earlier lookup causes a negative dentry to be cached which is never invalidated. The bug can be reproduced as follows (the second ls should succeed): $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory $ zfs snap tank@s $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory To remedy this, always invalidate cached dentries in the snapshot control directory. Since these entries never exist on disk there is no significant performance penalty for the extra lookups. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1192
* Improve AF hard disk detectionBrian Behlendorf2012-11-151-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | Use the bdev_physical_block_size() interface to determine the minimize write size which can be issued without incurring a read-modify-write operation. This is used to set the ashift correctly to prevent a performance penalty when using AF hard disks. Unfortunately, this interface isn't entirely reliable because it's not uncommon for disks to misreport this value. For this reason you may still need to manually set your ashift with: zpool create -o ashift=12 ... The solution to this in the upstream Illumos source was to add a white list of known offending drives. Maintaining such a list will be a burden, but it still may be worth doing if we can detect a large number of these drives. This should be considered as future work. Reported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #916
* Linux 3.6 compat, iops->mkdir()Richard Yao2012-10-141-1/+1
| | | | | | | | | | | | Use .mkdir instead of .create in 3.3 compatibility check. Linux 3.6 modifies inode_operations->create's function prototype. This causes an autotools Linux 3.3. compatibility check for a function prototype change in create, mkdir and mknode to fail. Since mkdir and mknode are unchanged, we modify the check to examine it instead. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
* Linux 3.6 compat, sget()Yuxuan Shui2012-10-141-0/+10
| | | | | | | | | | | | | As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the mount flags are now passed to sget() so they can be used when initializing a new superblock. ZFS never uses sget() in this fashion so we can simply pass a zero and add a zpl_sget() compatibility wrapper. Signed-off-by: Yuxuan Shui <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
* Revert "Improve AF hard disk detection"Brian Behlendorf2012-09-111-19/+5
| | | | | | | | | | | | | | | | | | This reverts commit 395350c85d9903beba43bac7ae79092ae25f1526 which accidentally introduced issue #955. Pools using AF drives which were originally created with a sector size of 512 bytes will now be correctly detected to have physical sector size of 4096. This is desirable for a new pool, however for an existing pool abruptly changing the sector size causes problems. For this reason, this change is being reverted until the additional logic can be added to detect the existing pool case. Existing pools must use the ashift size stored in the label regardless of what the disk reports. This is critical for compatibility. Signed-off-by: Brian Behlendorf <[email protected]> Issue #955
* Improve AF hard disk detectionBrian Behlendorf2012-09-041-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | Use the bdev_physical_block_size() interface to determine the minimize write size which can be issued without incurring a read-modify-write operation. This is used to set the ashift correctly to prevent a performance penalty when using AF hard disks. Unfortunately, this interface isn't entirely reliable because it's not uncommon for disks to misreport this value. For this reason you may still need to manually set your ashift with: zpool create -o ashift=12 ... The solution to this in the upstream Illumos source was to add a while list of known offending drives. Maintaining such a list will be a burden, but it still may be worth doing if we can detect a large number of these drives. This should be considered as future work. Reported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #916
* Remove autotools productsBrian Behlendorf2012-08-271-621/+0
| | | | | | | | Remove all of the generated autotools products from the repository and update the .gitignore files accordingly. Signed-off-by: Brian Behlendorf <[email protected]> Closes #718
* Set zvol discard_granularity to the volblocksize.Etienne Dechamps2012-08-072-0/+16
| | | | | | | | | | | | | | | | | | Currently, zvols have a discard granularity set to 0, which suggests to the upper layer that discard requests of arbirarily small size and alignment can be made efficiently. In practice however, ZFS does not handle unaligned discard requests efficiently: indeed, it is unable to free a part of a block. It will write zeros to the specified range instead, which is both useless and inefficient (see dnode_free_range). With this patch, zvol block devices expose volblocksize as their discard granularity, so the upper layer is aware that it's not supposed to send discard requests smaller than volblocksize. Signed-off-by: Brian Behlendorf <[email protected]> Closes #862
* Linux 3.5 compat, end_writeback() changed to clear_inode()Richard Yao2012-07-232-0/+11
| | | | | | | | | | | | | | | | The end_writeback() function was changed by moving the call to inode_sync_wait() earlier in to evict(). This effecitvely changes the ordering of the sync but it does not impact the details of the zfs implementation. However, as part of this change end_writeback() was renamed to clear_inode() to reflect the new semantics. This change does impact us and clear_inode() now maps to end_writeback() for kernels prior to 3.5. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #784
* Linux 3.5 compat, iops->truncate_range() removedRichard Yao2012-07-231-0/+1
| | | | | | | | | The vmtruncate_range() support has been removed from the kernel in favor of using the fallocate method in the file_operations table. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #784
* Linux 3.5 compat, eops->encode_fh() takes inodesRichard Yao2012-07-231-0/+1
| | | | | | | | | | | | | | The export_operations member ->encode_fh() has been updated to take both the child and parent inodes. This interface used to take the child dentry and a bool describing if the parent is needed. NOTE: While updating this code I noticed that we do not currently cleanly handle the case where we're passed a connectable parent. This code should be audited to make sure we're doing the right thing. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #784
* Move partition scanning from userspace to module.Etienne Dechamps2012-07-171-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, zpool online -e (dynamic vdev expansion) doesn't work on whole disks because we're invoking ioctl(BLKRRPART) from userspace while ZFS still has a partition open on the disk, which results in EBUSY. This patch moves the BLKRRPART invocation from the zpool utility to the module. Specifically, this is done just before opening the device in vdev_disk_open() which is called inside vdev_reopen(). This requires jumping through some hoops to get to the disk device from the partition device, and to make sure we can still open the partition after the BLKRRPART call. Note that this new code path is triggered on dynamic vdev expansion only; other actions, like creating a new pool, are unchanged and still call BLKRRPART from userspace. This change also depends on API changes which are available in 2.6.37 and latter kernels. The build system has been updated to detect this, but there is no compatibility mode for older kernels. This means that online expansion will NOT be available in older kernels. However, it will still be possible to expand the vdev offline. Signed-off-by: Brian Behlendorf <[email protected]> Closes #808
* Linux 3.4 compat, d_make_root() replaces d_alloc_root()Richard Yao2012-06-112-0/+5
| | | | | | | | | | | | | | | | | | | | torvalds/linux@adc0e91ab142abe93f5b0d7980ada8a7676231fe introduced introduced d_make_root() as a replacement for d_alloc_root(). Further commits appear to have removed d_alloc_root() from the Linux source tree. This causes the following failure: error: implicit declaration of function 'd_alloc_root' [-Werror=implicit-function-declaration] To correct this we update the code to use the current d_make_root() interface for readability. Then we introduce an autotools check to determine if d_make_root() is available. If it isn't then we define some compatibility logic which used the older d_alloc_root() interface. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #776
* Linux 3.3 compat, iops->create()/mkdir()/mknod()Brian Behlendorf2012-04-302-0/+13
| | | | | | | | | | The mode argument of iops->create()/mkdir()/mknod() was changed from an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf check was added to detect the API change and then correctly set a zpl_umode_t typedef. There is no functional change. Signed-off-by: Brian Behlendorf <[email protected]> Closes #701
* Add --enable-debug-dmu-tx configure optionBrian Behlendorf2012-03-231-0/+1
| | | | | | | | | | | | | | | | | | Allow rigorous (and expensive) tx validation to be enabled/disabled indepentantly from the standard zfs debugging. When enabled these checks ensure that all txs are constructed properly and that a dbuf is never dirtied without taking the correct tx hold. This checking is particularly helpful when adding new dmu consumers like Lustre. However, for established consumers such as the zpl with no known outstanding tx construction problems this is just overhead. --enable-debug-dmu-tx - Enable/disable validation of each tx as --disable-debug-dmu-tx it is constructed. By default validation is disabled due to performance concerns. Signed-off-by: Brian Behlendorf <[email protected]>
* Add .zfs control directoryBrian Behlendorf2012-03-222-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for the .zfs control directory. This was accomplished by leveraging as much of the existing ZFS infrastructure as posible and updating it for Linux as required. The bulk of the core functionality is now all there with the following limitations. *) The .zfs/snapshot directory automount support requires a 2.6.37 or newer kernel. The exception is RHEL6.2 which has backported the d_automount patches. *) Creating/destroying/renaming snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory works as expected. However, this functionality is only available to root until zfs delegations are finished. * mkdir - create a snapshot * rmdir - destroy a snapshot * mv - rename a snapshot The following issues are known defeciences, but we expect them to be addressed by future commits. *) Add automount support for kernels older the 2.6.37. This should be possible using follow_link() which is what Linux did before. *) Accessing the .zfs/snapshot directory via NFS is not yet possible. The majority of the ground work for this is complete. However, finishing this work will require resolving some lingering integration issues with the Linux NFS kernel server. *) The .zfs/shares directory exists but no futher smb functionality has yet been implemented. Contributions-by: Rohan Puri <[email protected]> Contributiobs-by: Andrew Barnes <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #173
* Cleanly support debug packagesBrian Behlendorf2012-02-271-0/+1
| | | | | | | | | | | | | | | | | | | Allow a source rpm to be rebuilt with debugging enabled. This avoids the need to have to manually modify the spec file. By default debugging is still largely disabled. To enable specific debugging features use the following options with rpmbuild. '--with debug' - Enables ASSERTs # For example: $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm Additionally, ZFS_CONFIG has been added to zfs_config.h for packages which build against these headers. This is critical to ensure both zfs and the dependant package are using the same prototype and structure definitions. Signed-off-by: Brian Behlendorf <[email protected]>
* Add support for DISCARD to ZVOLs.Etienne Dechamps2012-02-092-0/+9
| | | | | | | | | | | | | | | | | | | | DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, thus optimizing disk space usage by allowing a ZVOL to shrink instead of just grow. We can't use zfs_space() or zfs_freesp() here, since these functions only work on regular files, not volumes. Fortunately we can use the low-level function dmu_free_long_range() which does exactly what we want. Currently the discard operation is not added to the log. That's not a big deal since losing discard requests cannot result in data corruption. It would however result in disk space usage higher than it should be. Thus adding log support to zvol_discard() is probably a good idea for a future improvement. Signed-off-by: Brian Behlendorf <[email protected]>
* Support the fallocate() file operation.Etienne Dechamps2012-02-091-0/+1
| | | | | | | | | | | | | | Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is supported, since it's the only one that matches the behavior of zfs_space(). This makes it pretty much useless in its current form, but it's a start. To support other flag combinations we would need to modify zfs_space() to make it more flexible, or emulate the desired functionality in zpl_fallocate(). Signed-off-by: Brian Behlendorf <[email protected]> Issue #334
* Improve ZVOL queue behavior.Etienne Dechamps2012-02-072-0/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering. Detailed rationale: - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide. - max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide. - physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue #361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow. - The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix synchronicity for ZVOLs.Etienne Dechamps2012-02-072-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zvol_write() assumes that the write request must be written to stable storage if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does *not* mean what we think it means in the context of the Linux block layer. This is well explained in linux/fs.h: WRITE: A normal async write. Device will be plugged. WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down the hint that someone will be waiting on this IO shortly. WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush. WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on non-volatile media on completion. In other words, SYNC does not *mean* that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs. The request for stable storage is expressed by the FUA flag, so we must commit the ZIL after the write if the FUA flag is set. In addition, we must commit the ZIL before the write if the FLUSH flag is set. Also, we must inform the block layer that we actually support FLUSH and FUA. Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.3 compat, sops->show_options()Brian Behlendorf2012-02-031-0/+1
| | | | | | | | | | The second argument of sops->show_options() was changed from a 'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check to detect the API change and then conditionally define the expected interface. In either case we are only interested in the zfs_sb_t. Signed-off-by: Brian Behlendorf <[email protected]> Closes #549
* Add the release component to headersBrian Behlendorf2012-01-182-2/+2
| | | | | | | | | When the original build system code was added the release component was accidentally omited from the development header install path. This patch adds the missing path component so it's always clear exactly what release your compiling against. Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.1 compat, super_block->s_shrinkBrian Behlendorf2012-01-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux 3.1 kernel has introduced the concept of per-filesystem shrinkers which are directly assoicated with a super block. Prior to this change there was one shared global shrinker. The zfs code relied on being able to call the global shrinker when the arc_meta_limit was exceeded. This would cause the VFS to drop references on a fraction of the dentries in the dcache. The ARC could then safely reclaim the memory used by these entries and honor the arc_meta_limit. Unfortunately, when per-filesystem shrinkers were added the old interfaces were made unavailable. This change adds support to use the new per-filesystem shrinker interface so we can continue to honor the arc_meta_limit. The major benefit of the new interface is that we can now target only the zfs filesystem for dentry and inode pruning. Thus we can minimize any impact on the caching of other filesystems. In the context of making this change several other important issues related to managing the ARC were addressed, they include: * The dnlc_reduce_cache() function which was called by the ARC to drop dentries for the Posix layer was replaced with a generic zfs_prune_t callback. The ZPL layer now registers a callback to drop these dentries removing a layering violation which dates back to the Solaris code. This callback can also be used by other ARC consumers such as Lustre. arc_add_prune_callback() arc_remove_prune_callback() * The arc_reduce_dnlc_percent module option has been changed to arc_meta_prune for clarity. The dnlc functions are specific to Solaris's VFS and have already been largely eliminated already. The replacement tunable now represents the number of bytes the prune callback will request when invoked. * Less aggressively invoke the prune callback. We used to call this whenever we exceeded the arc_meta_limit however that's not strictly correct since it results in over zeleous reclaim of dentries and inodes. It is now only called once the arc_meta_limit is exceeded and every effort has been made to evict other data from the ARC cache. * More promptly manage exceeding the arc_meta_limit. When reading meta data in to the cache if a buffer was unable to be recycled notify the arc_reclaim thread to invoke the required prune. * Added arcstat_prune kstat which is incremented when the ARC is forced to request that a consumer prune its cache. Remember this will only occur when the ARC has no other choice. If it can evict buffers safely without invoking the prune callback it will. * This change is also expected to resolve the unexpect collapses of the ARC cache. This would occur because when exceeded just the arc_meta_limit reclaim presure would be excerted on the arc_c value via arc_shrink(). This effectively shrunk the entire cache when really we just needed to reclaim meta data. Signed-off-by: Brian Behlendorf <[email protected]> Closes #466 Closes #292
* Linux 3.2 compat: set_nlink()Darik Horn2011-12-162-0/+16
| | | | | | | | | | | Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170 Use the new set_nlink() kernel function instead. Signed-off-by: Brian Behlendorf <[email protected]> Closes: #462
* Add make rule for building Arch Linux packagesPrakash Surya2011-12-141-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added the necessary build infrastructure for building packages compatible with the Arch Linux distribution. As such, one can now run: $ ./configure $ make pkg # Alternatively, one can run 'make arch' as well on the Arch Linux machine to create two binary packages compatible with the pacman package manager, one for the zfs userland utilities and another for the zfs kernel modules. The new packages can then be installed by running: # pacman -U $package.pkg.tar.xz In addition, source-only packages suitable for an Arch Linux chroot environment or remote builder can also be build using the 'sarch' make rule. NOTE: Since the source dist tarball is created on the fly from the head of the build tree, it's MD5 hash signature will be continually influx. As a result, the md5sum variable was intentionally omitted from the PKGBUILD files, and the '--skipinteg' makepkg option is used. This may or may not have any serious security implications, as the source tarball is not being downloaded from an outside source. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #491
* Simplify BDI integrationBrian Behlendorf2011-11-082-16/+30
| | | | | | | | | | | | | | Update the code to use the bdi_setup_and_register() helper to simplify the bdi integration code. The updated code now just registers the bdi during mount and destroys it during unmount. The only complication is that for 2.6.32 - 2.6.33 kernels the helper wasn't available so in these cases the zfs code must provide it. Luckily the bdi_setup_and_register() function is trivial. Signed-off-by: Brian Behlendorf <[email protected]> Closes #367
* Autogen refresh for udev changesBrian Behlendorf2011-08-081-0/+3
| | | | | | | | Run autogen.sh using the same autotools versions as upstream: * autoconf-2.63 * automake-1.11.1 * libtool-2.2.6b
* Add backing_device_info per-filesystemBrian Behlendorf2011-08-042-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a long time now the kernel has been moving away from using the pdflush daemon to write 'old' dirty pages to disk. The primary reason for this is because the pdflush daemon is single threaded and can be a limiting factor for performance. Since pdflush sequentially walks the dirty inode list for each super block any delay in processing can slow down dirty page writeback for all filesystems. The replacement for pdflush is called bdi (backing device info). The bdi system involves creating a per-filesystem control structure each with its own private sets of queues to manage writeback. The advantage is greater parallelism which improves performance and prevents a single filesystem from slowing writeback to the others. For a long time both systems co-existed in the kernel so it wasn't strictly required to implement the bdi scheme. However, as of Linux 2.6.36 kernels the pdflush functionality has been retired. Since ZFS already bypasses the page cache for most I/O this is only an issue for mmap(2) writes which must go through the page cache. Even then adding this missing support for newer kernels was overlooked because there are other mechanisms which can trigger writeback. However, there is one critical case where not implementing the bdi functionality can cause problems. If an application handles a page fault it can enter the balance_dirty_pages() callpath. This will result in the application hanging until the number of dirty pages in the system drops below the dirty ratio. Without a registered backing_device_info for the filesystem the dirty pages will not get written out. Thus the application will hang. As mentioned above this was less of an issue with older kernels because pdflush would eventually write out the dirty pages. This change adds a backing_device_info structure to the zfs_sb_t which is already allocated per-super block. It is then registered when the filesystem mounted and unregistered on unmount. It will not be registered for mounted snapshots which are read-only. This change will result in flush-<pool> thread being dynamically created and destroyed per-mounted filesystem for writeback. Signed-off-by: Brian Behlendorf <[email protected]> Closes #174
* Provide a rc.d script for archlinuxzfs-0.6.0-rc5Kyle Fuller2011-07-111-0/+1
| | | | | | | | | | | Unlike most other Linux distributions archlinux installs its init scripts in /etc/rc.d insead of /etc/init.d. This commit provides an archlinux rc.d script for zfs and extends the build infrastructure to ensure it get's installed in the correct place. Signed-off-by: Brian Behlendorf <[email protected]> Closes #322
* Linux compat 2.6.39: mount_nodev()Brian Behlendorf2011-07-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The .get_sb callback has been replaced by a .mount callback in the file_system_type structure. When using the new interface the caller must now use the mount_nodev() helper. Unfortunately, the new interface no longer passes the vfsmount down to the zfs layers. This poses a problem for the existing implementation because we currently save this pointer in the super block for latter use. It provides our only entry point in to the namespace layer for manipulating certain mount options. This needed to be done originally to allow commands like 'zfs set atime=off tank' to work properly. It also allowed me to keep more of the original Solaris code unmodified. Under Solaris there is a 1-to-1 mapping between a mount point and a file system so this is a fairly natural thing to do. However, under Linux they many be multiple entries in the namespace which reference the same filesystem. Thus keeping a back reference from the filesystem to the namespace is complicated. Rather than introduce some ugly hack to get the vfsmount and continue as before. I'm leveraging this API change to update the ZFS code to do things in a more natural way for Linux. This has the upside that is resolves the compatibility issue for the long term and fixes several other minor bugs which have been reported. This commit updates the code to remove this vfsmount back reference entirely. All modifications to filesystem mount options are now passed in to the kernel via a '-o remount'. This is the expected Linux mechanism and allows the namespace to properly handle any options which apply to it before passing them on to the file system itself. Aside from fixing the compatibility issue, removing the vfsmount has had the benefit of simplifying the code. This change which fairly involved has turned out nicely. Closes #246 Closes #217 Closes #187 Closes #248 Closes #231
* Linux compat 2.6.39: security_inode_init_security()Brian Behlendorf2011-07-012-0/+9
| | | | | | | | | | | The security_inode_init_security() function now takes an additional qstr argument which must be passed in from the dentry if available. Passing a NULL is safe when no qstr is available the relevant security checks will just be skipped. Closes #246 Closes #217 Closes #187