Commit message  Author  Age  Files  Lines
* Drain iput taskq outside z_teardown_lock  Brian Behlendorf  2014-01-09  1  -8/+8
  It's unsafe to drain the iput taskq while holding the z_teardown_lock as a writer. This is because when the last reference on an inode is dropped it may still have pages which need to be written to disk. This will be done through zpl_writepages which will acquire the z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're holding the lock as a writer in zfs_sb_teardown the unmount will deadlock.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chris Dunlop <[email protected]>
  Closes #1988
* Force LZ4_FORCE_SW_BITCOUNT for Sparc  Brian Behlendorf  2014-01-09  1  -0/+3
  This change was proposed for Sparc but it's not clear to me why it's required. Proper support exists in the lz4 code to detect the endianness and the required builtins are available for gcc. Still I'm including the patch because it will only impact Sparc and it may resolve a case which hasn't occurred to me.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: marku89 <[email protected]>
  Issue #1700
* Fix zfs_getattr_fast types  Brian Behlendorf  2014-01-09  1  -1/+6
  On Sparc sp->blksize will be a 64-bit value which is then cast incorrectly to a 32-bit value. For big endian systems this results in an incorrect value for sp->blksize. To resolve the problem local variables of the correct size are used and then assigned to sp->blksize.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: marku89 <[email protected]>
  Issue #1700
* Fix nvlist 'Bus Error' for Sparc  Brian Behlendorf  2014-01-09  1  -2/+4
  The mis-aligned memory accesses in nvpair_native_embedded() and nvpair_native_embedded_array() will cause a 'Bus Error' for architectures such as Sparc which are not fully byte addressable. To avoid this issue care is taken to avoid dereferencing the potentially mis-aligned packed nvlist_t.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: marku89 <[email protected]>
  Issue #1700
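One common way to read a value out of a packed buffer without dereferencing a possibly mis-aligned pointer is to copy the bytes into an aligned local with memcpy(). The sketch below shows that general technique only; it is not the code from this commit, and load_u64_unaligned() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a 64-bit value from an address that may not
 * be 8-byte aligned. memcpy() makes no alignment assumption, so it avoids
 * the direct dereference that can fault on strict-alignment CPUs.
 */
static uint64_t
load_u64_unaligned(const void *src)
{
	uint64_t value;

	assert(src != NULL);
	memcpy(&value, src, sizeof (value));
	return (value);
}
```

A caller would pass a pointer into the packed buffer, for example load_u64_unaligned(buf + 1), instead of casting that pointer to uint64_t * and dereferencing it.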
* Use local variable to read zp->z_mode  Brian Behlendorf  2014-01-09  2  -2/+5
  When accessing the zp->z_mode through the SA bulk interface we expect that 64-bits are available to hold the result. However, on 32-bit platforms mode_t will only be 32-bits so we cannot pass it to SA_ADD_BULK_ATTR(). Instead a local uint64_t variable must be used and the result assigned to zp->z_mode.
  This went unnoticed on 32-bit little endian platforms because the bytes happen to end up in the correct 32-bits. But on big endian platforms like Sparc the zp->z_mode will always end up set to zero.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: marku89 <[email protected]>
  Issue #1700
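The pattern described above (read into a 64-bit local, then assign to the narrower field) can be sketched in isolation. Everything named demo_* below is hypothetical: demo_lookup_u64() merely stands in for an interface that always stores eight bytes, and demo_znode_t is not the real znode layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a structure with a 32-bit mode field. */
typedef struct demo_znode {
	uint32_t z_mode;
	uint32_t z_guard;	/* neighbouring field that must stay intact */
} demo_znode_t;

/*
 * Hypothetical stand-in for a bulk attribute lookup: it always stores a
 * full 64-bit value at the destination, whatever the caller's field size.
 */
static void
demo_lookup_u64(void *dst, uint64_t stored)
{
	memcpy(dst, &stored, sizeof (stored));
}

/* Read into a 64-bit local, then narrow explicitly into the 32-bit field. */
static void
demo_read_mode(demo_znode_t *zp, uint64_t stored)
{
	uint64_t mode;

	assert(zp != NULL);
	demo_lookup_u64(&mode, stored);
	zp->z_mode = (uint32_t)mode;
}
```

Passing &zp->z_mode straight to demo_lookup_u64() would instead write eight bytes over a four-byte field, and on a big endian machine the field itself would receive the (zero) high half of the value.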
* Define the needed ISA types for Sparc  Brian Behlendorf  2014-01-09  2  -2/+31
  Add the minimum required ISA types to support the Sparc architecture.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: marku89 <[email protected]>
  Issue #1700
* Add ddt, ddt_entry, and l2arc_hdr caches  John Layman  2014-01-07  4  -16/+42
  Back the allocations for ddt tables+entries and l2arc headers with kmem caches. This will reduce the cost of allocating these commonly used structures and allow for greater visibility of them through the /proc/spl/kmem/slab interface.
  Signed-off-by: John Layman <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1893
* Remove unconditional sharetab update  Brian Behlendorf  2014-01-07  1  -8/+0
  Removes the unconditional sharetab update when running any zfs command. This means the sharetab might become out of date if users are manually adding/removing shares with exportfs. But we shouldn't punish all callers to zfs in order to handle that unlikely case. In the unlikely event we observe issues because of this it can always be added back to just the share/unshare call paths where we need an up to date sharetab.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Turbo Fredriksson <[email protected]>
  Signed-off-by: Chris Dunlop <[email protected]>
  Issue #845
* Enable /etc/mtab cache to improve performance  Brian Behlendorf  2014-01-07  1  -1/+1
  Re-enable the /etc/mtab cache to prevent the zfs command from having to repeatedly open and read from the /etc/mtab file. Instead an AVL tree of the mounted filesystems is created and used to vastly speed up lookups. This means that if non-zfs filesystems are mounted concurrently the 'zfs mount' will not immediately detect them. In practice that will rarely happen and even if it does the absolute worst case would be a failed mount.
  This was originally disabled out of an abundance of paranoia.
  NOTE: There may still be some parts of the code which do not consult the mtab cache. They should be updated to check the mtab cache as they are discovered to be a problem.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Turbo Fredriksson <[email protected]>
  Signed-off-by: Chris Dunlop <[email protected]>
  Issue #845
* Add UNSHARING of filesystems and EXPORTING pools  Turbo Fredriksson  2014-01-07  1  -0/+11
  As a 'stop' action ensure the filesystem is unshared before it is unmounted, just in case. Additionally, export the pool so it may be cleanly imported by a different host.
  Signed-off-by: Turbo Fredriksson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2003
* Fix the creation of ZPOOL_HIST_CMD pool history entries.  Tim Chase  2014-01-07  2  -9/+21
  Move the libzfs_fini() after the zpool_log_history() call so the ZPOOL_HIST_CMD entry can get written.
  Fix the handling of saved_poolname in zfsdev_ioctl() which was broken as part of the stack-reduction work in a16878805388c4d96cb8a294de965071d138a47b.
  Since ZoL destroys the TSD data in which the previously successful ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY ioctl has a very important restriction: it can only successfully write a log entry following a successful ioctl() if no intervening vops have been performed. Some of the zfs subcommands do perform intervening vops and so have to do the logging themselves. At the moment, the "create" and "clone" subcommands have been modified appropriately.
  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1998
* Properly handle updates of variably-sized SA entries.  Tim Chase  2013-12-20  1  -6/+11
  During the update process in sa_modify_attrs(), the sizes of existing variably-sized SA entries are obtained from sa_lengths[]. The case where a variably-sized SA was being replaced neglected to increment the index into sa_lengths[], so subsequent variable-length SAs would be rewritten with the wrong length. This patch adds the missing increment operation so all variably-sized SA entries are stored with their correct lengths.
  Previously, a size-changing update of a variably-sized SA that occurred when there were other variably-sized SAs in the bonus buffer would cause the subsequent SAs to be corrupted. The most common case in which this would occur is when a mode change caused the ZPL_DACL_ACES entry to change size when a ZPL_DXATTR (SA xattr) entry already existed.
  The following sequence would have caused a failure when xattr=sa was in force and would corrupt the bonus buffer:
      open(filename, O_WRONLY | O_CREAT, 0600);
      ...
      lsetxattr(filename, ...);   /* create xattr SA */
      chmod(filename, 0650);      /* enlarges the ACL */
  Signed-off-by: Chris Dunlop <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1978
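The missing increment can be illustrated with a small stand-alone sketch. The types and function name below are hypothetical and far simpler than sa_modify_attrs(); the only point is that the index into the old lengths array has to advance even for the entry that is being replaced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute descriptor: fixed_len == 0 means variable-length. */
typedef struct demo_attr {
	int		id;
	uint16_t	fixed_len;
} demo_attr_t;

/*
 * Compute the new length of every attribute when the variable-length
 * attribute `replace_id` is rewritten with `new_len` bytes. `old_lengths`
 * holds one entry per variable-length attribute, in attribute order.
 */
static void
demo_rebuild_lengths(const demo_attr_t *attrs, size_t count,
    const uint16_t *old_lengths, int replace_id, uint16_t new_len,
    uint16_t *out)
{
	size_t length_idx = 0;

	assert(attrs != NULL && old_lengths != NULL && out != NULL);
	for (size_t i = 0; i < count; i++) {
		if (attrs[i].fixed_len != 0) {
			out[i] = attrs[i].fixed_len;
			continue;
		}
		if (attrs[i].id == replace_id) {
			out[i] = new_len;
			/* Still consume the old entry; skipping this is the bug. */
			length_idx++;
			continue;
		}
		out[i] = old_lengths[length_idx++];
	}
}
```

Without the length_idx++ in the replacement branch, the attribute that follows the replaced one would be given the replaced attribute's old length.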
* Register correct handlers for nvlist_{dup,pack,unpack}  Brian Behlendorf  2013-12-20  1  -38/+19
  This change is related to commit 81eaf15 which ensured the correct allocation handlers were installed for nvlist_alloc(). The nvlist functions nvlist_dup(), nvlist_pack(), and nvlist_unpack() suffer from the same issue and have been updated accordingly.
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #1937
* Add full SELinux support  Matthew Thode  2013-12-19  10  -82/+134
  Four new dataset properties have been added to support SELinux. They are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map directly to the context options described in mount(8). When one of these properties is set to something other than 'none', that string will be passed verbatim as a mount option for the given context when the filesystem is mounted.
  For example, if you wanted the rootcontext for a filesystem to be set to 'system_u:object_r:fs_t' you would set the property as follows:
      $ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media
  This will ensure the filesystem is automatically mounted with that rootcontext. It is equivalent to manually specifying the rootcontext with the -o option like this:
      $ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media
  By default all four contexts are set to 'none'. Further information on SELinux contexts is detailed in the mount(8) and selinux(8) man pages.
  Signed-off-by: Matthew Thode <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #1504
* cstyle: Resolve C style issues  Michael Kjorling  2013-12-18  165  -1945/+2129
  The vast majority of these changes are in Linux specific code. They are the result of not having an automated style checker to validate the code when it was originally written. Others were caused when the common code was slightly adjusted for Linux.
  This patch contains no functional changes. It only refreshes the code to conform to the style guide.
  Everyone submitting patches for inclusion upstream should now run 'make checkstyle' and resolve any warnings prior to opening a pull request. The automated builders have been updated to fail a build when 'make checkstyle' detects an issue.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1821
* cstyle: Allow spaces in all comments  Brian Behlendorf  2013-12-18  1  -2/+2
  Update the cstyle.pl script to allow pictures in all comments not just header comments. Recent changes from Illumos such as d3cc8b1 have relocated various pictures in the standard block comments to make the code more readable.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #1821
* cstyle: Exclude several files from 'make checkstyle'  Brian Behlendorf  2013-12-18  1  -1/+2
  The zfs_config.h header and *.mod.c files are both products of the build process. They must be excluded from the style check because they are not part of the pristine source.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #1821
* Illumos #4208  John Wren Kennedy  2013-12-18  1  -2/+2
  4208 Typo in zfs_main.c: "posxiuser"
  Reviewed by: Sonu Pillai <[email protected]>
  Reviewed by: Will Guyette <[email protected]>
  Reviewed by: Eric Diven <[email protected]>
  Reviewed by: Christopher Siden <[email protected]>
  Approved by: Richard Lowe <[email protected]>
  References:
    https://www.illumos.org/issues/4208
    illumos/illumos-gate@f38cb554a534c6df738be3f4d23327e69888e634
  Ported-by: Brian Behlendorf <[email protected]>
  Closes #1986
* Add zfs_send_corrupt_data module option  Turbo Fredriksson  2013-12-18  2  -0/+16
  Tuning setting to ignore read/checksum errors when sending data.
  Signed-off-by: Turbo Fredriksson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1982
  Issue #1897
* Cause zfs.spec to place dracut files properly  Aaron Fineman  2013-12-18  1  -1/+1
  This is an extension of commit ffb2111. As the fedora conditional has been added, this allows centos/rhel-6 to fall back to the proper directory (/usr/share/dracut).
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1984
* Handle acl flags from util-linux mount command  renelson  2013-12-18  3  -3/+19
  Add acl, noacl and posixacl to option_map, avoiding the ENOENT error case when mount from util-linux-2.24 execs mount.zfs with any of those flags.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: renelson <[email protected]>
  Issue #1968
* Fix grammar in parse_options() error message  renelson  2013-12-17  1  -1/+1
  A minor grammar error was corrected in the parse_options() error handling for the ENOENT case.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: renelson <[email protected]>
  Issue #1968
* Fix z_sync_cnt decrement in zfs_close  Chunwei Chen  2013-12-17  2  -10/+7
  The comment in zfs_close states that "Under Linux the zfs_close() hook is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close is associated with every successful struct file creation/deletion, which should always be balanced.
  Here is an example of what's wrong:
      Process A            Process B
      open(O_SYNC)
      z_sync_cnt = 1
                           open(O_SYNC)
                           z_sync_cnt = 2
      close()
      z_sync_cnt = 0
  So z_sync_cnt is 0 even if B still has the file with O_SYNC.
  Also moves the generic_file_open call before zfs_open to ensure that in the case generic_file_open fails z_sync_cnt is not incremented. This is safe because generic_file_open has no side effects.
  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #1962
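A balanced counter of the kind argued for above can be sketched as follows. The demo_* names and the DEMO_O_SYNC flag are hypothetical stand-ins, not the zfs_open()/zfs_close() code: each close undoes exactly what its matching open did instead of resetting the count.

```c
#include <assert.h>

#define	DEMO_O_SYNC	0x1	/* hypothetical flag, stands in for O_SYNC */

/* Hypothetical per-file state: number of open handles that asked for sync. */
typedef struct demo_file {
	int sync_cnt;
} demo_file_t;

static void
demo_open(demo_file_t *fp, int flags)
{
	if (flags & DEMO_O_SYNC)
		fp->sync_cnt++;
}

/* Undo exactly what the matching open did rather than resetting to zero. */
static void
demo_close(demo_file_t *fp, int flags)
{
	if (flags & DEMO_O_SYNC) {
		assert(fp->sync_cnt > 0);
		fp->sync_cnt--;
	}
}
```

With two synchronous opens followed by one close, the count stays at 1, so the remaining opener is still treated as synchronous.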
* Silence e2fsck warning in zconfig.sh  Brian Behlendorf  2013-12-16  1  -11/+16
  When running zconfig.sh, tests 7 and 8 cause the following warning to be printed to the console. It's caused because we're snapshotting a mounted ext2 filesystem which is not in a 'clean' state. This is to be expected since we have no guarantees about the on-disk consistency of the filesystem.
      EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
  To silence the warning and preserve the intent of these test cases they have been updated to unmount the filesystem prior to snapshotting them. This ensures the ext2 filesystem is in a consistent state when the snapshot is taken.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Closes #1972
* cstyle: zvol.c  Brian Behlendorf  2013-12-16  1  -109/+113
  Update zvol.c to conform to the style guidelines, verified by running cstyle.pl on the source file. This patch contains no functional changes.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Issue #1821
* Update zfs(8) Snapshots section  Brian Behlendorf  2013-12-16  1  -1/+1
  The Snapshots section of the zfs(8) man page is incorrect and should have been updated as part of #1312. Snapshots of volumes can be accessed independently and their visibility is determined by the 'snapdev=hidden|visible' property. This is analogous to the existing 'snapdir=hidden|visible' property.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #1921
* Sync /dev/zfs ioctl ordering  Brian Behlendorf  2013-12-16  2  -4/+25
  In order to minimize any future disruption caused by the addition and removal of /dev/zfs ioctls this patch makes the following changes.
  1) Sync ZoL's ioctl ordering such that it matches Illumos. For historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID ioctls were out of order.
  2) Move Linux and FreeBSD specific ioctls in to their own reserved ranges. This allows us to preserve the existing ordering when new ioctls are added by either Illumos or FreeBSD. When an ioctl is no longer needed it should be retired in place.
  This change alters the ZFS user/kernel ABI so make sure you rebuild both your user and kernel modules. However, it should allow for a much more stable interface going forward.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Closes #1973
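The idea of reserved per-platform ranges can be sketched with an enum. All DEMO_* names and the 0x80/0xC0 offsets below are illustrative assumptions, not values taken from this log: common ioctls grow from the base, while each platform's ioctls start at a fixed offset so additions to the common list never renumber them.

```c
#include <assert.h>

/* Illustrative only: names and offsets are assumptions for this sketch. */
typedef enum demo_ioc {
	DEMO_IOC_FIRST = ('Z' << 8),

	/* Common ioctls, kept in the shared upstream order. */
	DEMO_IOC_POOL_CREATE = DEMO_IOC_FIRST,
	DEMO_IOC_POOL_DESTROY,
	DEMO_IOC_POOL_IMPORT,
	/* New common ioctls are appended here. */

	/* Platform-specific ioctls start at fixed, reserved offsets. */
	DEMO_IOC_PLATFORM_A = DEMO_IOC_FIRST + 0x80,
	DEMO_IOC_PLATFORM_A_EVENTS = DEMO_IOC_PLATFORM_A,
	DEMO_IOC_PLATFORM_B = DEMO_IOC_FIRST + 0xC0,
	DEMO_IOC_PLATFORM_B_EXAMPLE = DEMO_IOC_PLATFORM_B
} demo_ioc_t;

/* Fail the build if the common list ever grows into a reserved range. */
static_assert(DEMO_IOC_POOL_IMPORT < DEMO_IOC_PLATFORM_A,
    "common ioctls overlap the first reserved range");

/* Returns non-zero when `cmd` falls in the common (shared) range. */
static int
demo_ioc_is_common(int cmd)
{
	return (cmd >= DEMO_IOC_FIRST && cmd < DEMO_IOC_PLATFORM_A);
}
```

Retiring an ioctl "in place" in this scheme means leaving its enumerator as an unused placeholder rather than deleting it, so that later values keep their numbers.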
* Remove ZFC_IOC_*_MINOR ioctl()s  Brian Behlendorf  2013-12-16  12  -456/+167
  Early versions of ZFS coordinated the creation and destruction of device minors from userspace. This was inherently racy and in late 2009 these ioctl()s were removed leaving everything up to the kernel. This significantly simplified the code.
  However, we never picked up these changes in ZoL since we'd already significantly adjusted this code for Linux. This patch aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s and moving all the functionality down in to the kernel. Since this cleanup will change the kernel/user ABI it's being done in the same tag as the previous libzfs_core ABI changes. This will minimize, but not eliminate, the disruption to end users.
  Once merged ZoL, Illumos, and FreeBSD will basically be back in sync in regards to handling ZVOLs in the common code. While each platform must have its own custom zvol.c implementation the interfaces provided are consistent.
  NOTES:
  1) This patch introduces one subtle change in behavior which could not be easily avoided. Prior to this change callers of 'zfs create -V ...' were guaranteed that upon exit the /dev/zvol/ block device link would be created or an error returned. That's no longer the case. The utilities will no longer block waiting for the symlink to be created. Callers are now responsible for blocking, this is why a 'udev_wait' call was added to the 'label' function in scripts/common.sh.
  2) The read-only behavior of a ZVOL now solely depends on if the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant policy setting in the gendisk structure was removed. This both simplifies the code and allows us to safely leverage set_disk_ro() to issue a KOBJ_CHANGE uevent. See the comment in the code for further details on this.
  3) Because __zvol_create_minor() and zvol_alloc() may now be called in a sync task they must use KM_PUSHPAGE.
  References:
    illumos/illumos-gate@681d9761e8516a7dc5ab6589e2dfe717777e1123
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #1969
* Illumos #4121 vdev_label_init read only  George Wilson  2013-12-12  1  -1/+1
  4121 vdev_label_init should treat request as succeeded when pool is read only
  Reviewed by: Christopher Siden <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Saso Kiselkov <[email protected]>
  Approved by: Richard Lowe <[email protected]>
  References:
    https://www.illumos.org/issues/4121
    illumos/illumos-gate@973c78e94bf9634782164382c9e291bf81161fa5
  Ported-by: Brian Behlendorf <[email protected]>
  Closes #1863
* Fix atime handling.  Tim Chase  2013-12-12  2  -3/+2
  Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP() followed by zfs_inode_update() to update the atime. However, since atimes are cached in the znode for delayed writing, the zfs_inode_update() function would effectively ignore the cached atime by reading it from the SA.
  This commit moves the updating of the atime in the inode into zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP() macro and eliminates the call to zfs_inode_update() in the atime-modifying vnops.
  It's possible the same thing could have been done directly in zfs_inode_update() but I wasn't sure that it was safe in all cases where it is called.
  The effect is that atime handling is as if "strictatime" were selected, even if the filesystem is mounted with "relatime".
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #1949
* Fix zstream_t incorrect type  Shen Yan  2013-12-10  1  -1/+1
  The DMU zfetch code organizes streams with lists not avl trees. An avl_node_t was mistakenly used for a list_node_t in the zstream_t type. This is incorrect (but harmless) and went unnoticed because:
  1) The list functions explicitly cast the value preventing a warning,
  2) sizeof(avl_node_t) >= sizeof(list_node_t) so no overrun occurs, and
  3) The calculated offset is the same regardless of the type.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1946
* Remove MAX when initializing arc_c_max  david.chen  2013-12-10  1  -1/+1
  The MAX when initializing arc_c_max doesn't make any sense because it hasn't been set anywhere before. Though, arc_c_max should be implicitly set to zero when initializing arc_stats, so the MAX doesn't make any difference. The MAX was mistakenly left in place when the Illumos default values were changed for Linux.
  Signed-off-by: david.chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1941
* Fix multipath bug in vdev_id caused by inconsistent field numbering  Simon Guest  2013-12-10  1  -2/+4
  The bug is caused by multipath output like this:
      35000c50056bd77a7 dm-15 HP,MB3000FCWDH
      size=2.7T features='0' hwhandler='0' wp=rw
      |-+- policy='round-robin 0' prio=0 status=active
      | `- 2:0:16:0 sdq 65:0 active undef running
      `-+- policy='round-robin 0' prio=0 status=enabled
        `- 4:0:52:0 sdfp 130:176 active undef running
  Note that the pipe symbols mean that the field numbering is different between the sdq and sdfp lines. The fix edits out the pipe symbols.
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1692
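The effect of the pipe symbols on field numbering can be shown with a small C sketch. This is an illustration of the idea only, not the vdev_id change itself; demo_field() is a hypothetical helper that treats '|' as a separator so the device name is the third field on both kinds of line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the `want`-th (1-based) field of `line` into `out`, treating '|'
 * like whitespace so tree-drawing characters do not shift the numbering.
 * Returns 0 on success, -1 if the field is missing or does not fit.
 */
static int
demo_field(const char *line, int want, char *out, size_t outlen)
{
	const char *delims = " \t|";
	const char *p = line;

	assert(line != NULL && out != NULL && want > 0);
	for (int field = 1; ; field++) {
		size_t len;

		p += strspn(p, delims);		/* skip separators */
		len = strcspn(p, delims);	/* measure this field */
		if (len == 0)
			return (-1);		/* ran out of fields */
		if (field == want) {
			if (len >= outlen)
				return (-1);
			memcpy(out, p, len);
			out[len] = '\0';
			return (0);
		}
		p += len;
	}
}
```

If only spaces and tabs were used as separators, the leading "|" on the sdq line would count as field 1 and push the device name to field 4.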
* Revert "Use directory xattrs for symlinks"  Ned Bass  2013-12-10  1  -4/+0
  This reverts commit 6a7c0ccca44ad02c476a111d8f7911fc8b12fff7.
  A proper fix for Issue #1648 was landed under Issue #1890, so this is no longer needed.
  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1648
* sa_find_sizes() may compute wrong SA header size  James Pan  2013-12-10  1  -24/+21
  Under the right conditions sa_find_sizes() will compute an incorrect size of the system attribute (SA) header. This causes a failed assertion when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead to corruption of SA data.
  The bug presents itself when there are more than two variable-length SAs of just the right size to fit in the bonus buffer of a dnode. The existing logic fails to account for the SA header space needed to store the sizes of all the variable-length SAs.
  A reproducer was possible on Linux by setting the xattr=sa dataset property and storing xattrs on symbolic links (Issue #1648). Note the corrupt link target name:
      $ zfs set xattr=sa tank/fish
      $ cd /tank/fish
      $ ln -fs 12345678901234567 link
      $ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
      $ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
      $ ls -l link
      lrwxrwxrwx 1 root root 17 Dec 6 15:40 link -> 90123456701234567
  Commit 6a7c0ccca44ad02c476a111d8f7911fc8b12fff7 worked around this bug by forcing xattr's on symlinks to be stored in directory format. This change implements a proper fix, so the workaround can now be reverted.
  The reference link below contains a reproducer for FreeBSD.
  References:
    http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html
  Ported-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1890
* Update init script to allow verbose mounts  Turbo Fredriksson  2013-12-06  1  -1/+6
  Allow verbose mounts to make it easier to monitor progress when mounting a large number of filesystems. This functionality is disabled by default.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1929
* Update init script to allow /dev/disk/by-id import  Turbo Fredriksson  2013-12-06  1  -2/+13
  Many people prefer to use by-id at import time instead of using the cache file. This can be a much better solution than the cache file in some environments so we're adding some infrastructure to allow it. This functionality is disabled by default.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1929
* Fix 'zfs diff' shares error  Brian Behlendorf  2013-12-06  1  -3/+4
  When creating a dataset with ZoL a zsb->z_shares_dir ZAP object will not be created because shares are unimplemented. Instead ZoL just sets zsb->z_shares_dir to zero to indicate there are no shares.
  However, if you import a pool which was created with a different ZFS implementation then the shares ZAP object may exist. Code was added to handle this case but it clearly wasn't sufficiently tested with other ZFS pools.
  There was a bug in the zpl_shares_getattr() function which passed the wrong inode to zfs_getattr_fast() for the case where a shares ZAP object does exist. This causes an EIO to be returned to stat64() which in turn causes 'zfs diff' to fail.
  The fix is to pass the correct inode after a successful zfs_zget(). Additionally, only put away the references if we were able to get one.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Graham Booker <https://github.com/gbooker>
  Signed-off-by: timemaster67 <https://github.com/timemaster67>
  Closes #1426
  Closes #481
* Add module versioning  Brian Behlendorf  2013-12-06  6  -0/+6
  Use the standard Linux MODULE_VERSION macro to expose the installed zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions. This will also automatically add a checksum of the .c files and headers in "srcversion".
  See:
      /sys/module/zavl/version
      /sys/module/zavl/srcversion
      /sys/module/znvpair/version
      /sys/module/znvpair/srcversion
      /sys/module/zunicode/version
      /sys/module/zunicode/srcversion
      /sys/module/zcommon/version
      /sys/module/zcommon/srcversion
      /sys/module/zfs/version
      /sys/module/zfs/srcversion
      /sys/module/zpios/version
      /sys/module/zpios/srcversion
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1923
* Illumos #4045 write throttle & i/o scheduler performance work  Matthew Ahrens  2013-12-06  38  -846/+1943
  4045 zfs write throttle & i/o scheduler performance work
  1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithm (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details.
  2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details.
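A throttle driven by the amount of dirty data can be sketched as a function from dirty bytes to a per-transaction delay. The curve used here (no delay below a threshold, then a delay that grows as dirty data approaches the limit) is an assumption for illustration, since the block comments referred to above are not part of this log; demo_tx_delay_ns() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical delay curve: no delay at or below `min_dirty`, and a delay
 * that grows steeply as `dirty` approaches `max_dirty`, capped at `cap_ns`.
 * Sizes are in bytes and times in nanoseconds; the caller is assumed to
 * pass values small enough that the multiplication cannot overflow.
 */
static uint64_t
demo_tx_delay_ns(uint64_t dirty, uint64_t min_dirty, uint64_t max_dirty,
    uint64_t scale_ns, uint64_t cap_ns)
{
	uint64_t delay;

	assert(min_dirty < max_dirty);
	if (dirty <= min_dirty)
		return (0);
	if (dirty >= max_dirty)
		return (cap_ns);

	delay = scale_ns * (dirty - min_dirty) / (max_dirty - dirty);
	return (delay < cap_ns ? delay : cap_ns);
}
```

Because the delay depends only on the current amount of dirty data, every transaction issued at the same dirty level waits the same short time, rather than all of them stalling until the next txg opens.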
  This diff has several other effects, including:
  * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
  * the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
  * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up).
  --matt
  APPENDIX: problems with the current i/o scheduler
  The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays.
  For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds).
  If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes).
  Notes on porting to ZFS on Linux:
  - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields.
  - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose.
  - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above.
  - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in a condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic.
  - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page.
      spa_asize_inflation
      zfs_deadman_synctime_ms
      zfs_vdev_max_active
      zfs_vdev_async_write_active_min_dirty_percent
      zfs_vdev_async_write_active_max_dirty_percent
      zfs_vdev_async_read_max_active
      zfs_vdev_async_read_min_active
      zfs_vdev_async_write_max_active
      zfs_vdev_async_write_min_active
      zfs_vdev_scrub_max_active
      zfs_vdev_scrub_min_active
      zfs_vdev_sync_read_max_active
      zfs_vdev_sync_read_min_active
      zfs_vdev_sync_write_max_active
      zfs_vdev_sync_write_min_active
      zfs_dirty_data_max_percent
      zfs_delay_min_dirty_percent
      zfs_dirty_data_max_max_percent
      zfs_dirty_data_max
      zfs_dirty_data_max_max
      zfs_dirty_data_sync
      zfs_delay_scale
    The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures.
    The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults.
  - Fixed reversed logic in comment above zfs_delay_scale declaration.
  - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect.
  - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen).
  - The original patch would have regressed the bug fixed in zfsonlinux/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tunable to a value larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate().

  - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack.

  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Adam Leventhal <[email protected]>
  Reviewed by: Christopher Siden <[email protected]>
  Reviewed by: Ned Bass <[email protected]>
  Reviewed by: Brendan Gregg <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>

  References:
    http://www.illumos.org/issues/4045
    illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e

  Ported-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1913
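The 32-bit overflow mentioned in the porting notes above can be illustrated with a small userspace sketch. It is not code from the port: the helper names store_as_ulong32() and percent_of_physmem() are hypothetical, and a 32-bit unsigned long is modeled with uint32_t. It only shows why an initial value of 2^32 is lost in a 32-bit variable while 25% of a 4 GiB machine's RAM still fits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model storing a 64-bit value in a 32-bit unsigned long, as would
 * happen on a 32-bit Linux system. Conversion to an unsigned type is
 * performed modulo 2^32, so high bits are silently discarded.
 */
static uint32_t
store_as_ulong32(uint64_t value)
{
	return ((uint32_t)value);
}

/*
 * Hypothetical helper: a percentage of physical memory in bytes,
 * computed in 64-bit arithmetic. The multiplication cannot overflow
 * for any realistic amount of RAM.
 */
static uint64_t
percent_of_physmem(uint64_t physmem_bytes, unsigned int percent)
{
	assert(percent <= 100);
	return (physmem_bytes * percent / 100);
}
```

With these definitions, store_as_ulong32() applied to 2^32 yields 0, whereas 25% of 4 GiB is 1 GiB and survives the conversion unchanged. A machine with enough RAM can still exceed the 32-bit range, which is the residual overflow the notes acknowledge.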
* Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT)Matthew Ahrens2013-12-063-37/+18
  Fix a lock contention issue by allowing threads not holding ZPL locks to block when waiting to assign a transaction.

  Porting Notes:

  zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This case may be a contention point just like zfs_write(), however it is not safe to block here since it may be called during memory reclaim.

  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Adam Leventhal <[email protected]>
  Reviewed by: Dan McDonald <[email protected]>
  Reviewed by: Boris Protopopov <[email protected]>
  Approved by: Dan McDonald <[email protected]>

  References:
    https://www.illumos.org/issues/4347
    illumos/illumos-gate@e722410c49fe67cbf0f639cbcc288bd6cbcf7dd1

  Ported-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
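The difference between the two assignment modes in the commit above can be sketched with a userspace mock. Everything prefixed mock_ or MOCK_ is a hypothetical stand-in rather than the DMU API; the sketch assumes only that a non-blocking assign reports "try again" while the transaction group is full, and that a blocking assign waits internally.

```c
#include <assert.h>

#define	MOCK_ERESTART	85	/* stand-in for the "retry" error code */

typedef enum { MOCK_TXG_WAIT, MOCK_TXG_NOWAIT } mock_txg_how_t;

typedef struct {
	int busy_rounds;	/* attempts that will find the txg full */
	int waits;		/* how many times a thread had to wait */
} mock_pool_t;

/* Mock assign: fail fast with NOWAIT, block (here: count) with WAIT. */
static int
mock_tx_assign(mock_pool_t *pool, mock_txg_how_t how)
{
	while (pool->busy_rounds > 0) {
		if (how == MOCK_TXG_NOWAIT)
			return (MOCK_ERESTART);
		pool->busy_rounds--;	/* blocking happens inside assign */
		pool->waits++;
	}
	return (0);
}

/* Mock wait, called by the NOWAIT caller between attempts. */
static void
mock_tx_wait(mock_pool_t *pool)
{
	assert(pool->busy_rounds > 0);
	pool->busy_rounds--;
	pool->waits++;
}

/* NOWAIT pattern: the caller owns the retry loop; returns retry count. */
static int
assign_with_retry(mock_pool_t *pool)
{
	int retries = 0;

	while (mock_tx_assign(pool, MOCK_TXG_NOWAIT) == MOCK_ERESTART) {
		/* A real caller would release its own locks before waiting. */
		mock_tx_wait(pool);
		retries++;
	}
	return (retries);
}
```

In this model both paths wait the same number of times. The blocking mode simply moves the loop out of the caller, which is only acceptable when the caller holds no locks and may safely sleep.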
* Properly ignore bdi_setup_and_register return valueRichard Yao2013-12-041-1/+4
  This broke compilation against Linux 3.13 and GCC 4.7.3.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1906
* Remove incorrect ASSERT in zfs_sb_teardown()Brian Behlendorf2013-12-021-3/+1
  As part of zfs_sb_teardown() there is an assertion that all inodes which are part of the zsb->z_all_znodes list have at least one reference on them. This is always true for the standard unmount case but there are two other cases where it is not strictly true.

  * zfs_ioc_rollback() - This is the most common case and it results from the fact that we aren't unmounting the filesystem. During a normal unmount the MS_ACTIVE flag will be cleared on the super block, causing iput_final() to evict the inode when its reference count drops to zero. However, during a rollback MS_ACTIVE remains set since we're rolling back a live filesystem and need to preserve the existing super block. This allows inodes with a zero reference count to stay in the cache, thereby violating the assertion.

  * destroy_inode() / zfs_sb_teardown() - There exists a small race between dropping the last reference on an inode and removing it from the zsb->z_all_znodes list. This is unlikely to occur but could also trigger the assertion, which would be incorrect. The inode may safely have a zero reference count in this case.

  Since a zero reference count on the inode is expected and safe in both of these cases, the simplest thing to do is remove the ASSERT. This code is only enabled for debug builds, so removing it entirely is a very safe change.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chris Dunlop <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #1417
  Closes #1536
* Drive database updateRichard Yao2013-12-021-0/+29
  Added:

    Adata S396 (obtained from drive_id)
    Apple MacBookAir3,1 SSD (obtained from drive_id)
    Apple MacBookPro10,1 SSD (obtained from drive_id)
    Intel 510 (obtained from drive_id)
    Intel 710 (obtained from drive_id)
    Intel DC S3500 (obtained from drive_id)
    Netapp LUN (obtained from illumos user's sd.conf)
    OCZ Agility 3 (obtained from drive_id)
    OCZ Vertex (obtained from drive_id)
    Samsung PM800 (obtained from drive_id)
    Sandisk U100 (obtained from drive_id)
    Sun Comstar (obtained from illumos user's sd.conf)

  Notes:

  1. The entries for the Intel DC S3500 were extrapolated from the 800GB model's entry, which is "ATA INTEL SSDSC2BB80".

  2. The entries for the Intel 710 were extrapolated from the 120GB model's entry, which is "ATA INTEL SSDSA2BZ12".

  3. The entries for the Intel 510 were extrapolated from the 250GB model's entry, which is "ATA INTEL SSDSC2MH25".

  4. The entries for the Apple MacBookPro10,1 SSD were extrapolated from the 512GB model's entry, which is "ATA APPLE SSD SM512E". Google searches suggest that this is a rebadged Samsung 830.

  5. The entries for the Apple MacBookAir3,1 SSD were extrapolated from the 128GB model's entry, which is "ATA APPLE SSD TS128C". Google searches suggest that this is a rebadged Kingston SSDNow V+ 100 (based on Toshiba).

  6. Sun Comstar is an iSCSI Target, so we cannot tell what the correct sector size is through this method. We list it only for reference purposes, but it is commented out. Similarly, it is not clear what the right thing to do for Netapp is, so we comment it out.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1907
* Some nvlist allocations in hold processing need to use KM_PUSHPAGE.Tim Chase2013-12-021-2/+3
  This should hopefully catch the rest of the allocations in the user hold/release processing that were missed by commit 65c67ea86e9f112177f1ad32de8e780f10798a64.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1852
  Closes #1855
* Only commit the ZIL once in zpl_writepages() (msync() case).Etienne Dechamps2013-11-236-39/+95
  Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

  In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance when there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources.

  This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page.

  The solution implemented here is composed of two parts:

  - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register(), which was used previously and only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself.
  - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case, documented in the implementation of write_cache_pages(), in which it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistence guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing.

  Note that this change also fixes a bug related to #907 in which calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1849
  Closes #907
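The batching idea in the msync() commit above, N commits of one page each versus one commit covering N pages, can be modeled with a small userspace sketch. The mock_zil_t type and all function names here are hypothetical and only count operations. They do not reproduce the ZIL callback mechanism or the second SYNC-mode pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter standing in for the intent log. */
typedef struct {
	size_t writes_logged;	/* page writes handed over for logging */
	size_t commits;		/* synchronous commit operations issued */
} mock_zil_t;

static void
mock_log_write(mock_zil_t *zil)
{
	assert(zil != NULL);
	zil->writes_logged++;
}

static void
mock_commit(mock_zil_t *zil)
{
	assert(zil != NULL);
	zil->commits++;
}

/* Previous behaviour: every page is logged and committed on its own. */
static void
writepages_commit_per_page(mock_zil_t *zil, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		mock_log_write(zil);
		mock_commit(zil);
	}
}

/* Batched behaviour: log every page first, then commit a single time. */
static void
writepages_commit_once(mock_zil_t *zil, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		mock_log_write(zil);
	mock_commit(zil);
}
```

For 512 dirty pages the first variant issues 512 commits and the second issues one, while both log the same 512 writes. The number of synchronous commits is the cost the patch targets.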
* Change zfs-dkms requirementTrey Dockendorf2013-11-211-5/+1
  Version 2.2.0.3-20 of dkms in the EPEL/Fedora repositories added the necessary patches to support ZoL. Therefore, the zfs-dkms requirement on dkms is set to match that version or higher. This allows us to drop the custom dkms build in the ZoL EPEL/Fedora repositories.

  References:
    https://bugzilla.redhat.com/show_bug.cgi?id=1023598

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1873
* Illumos #2583Yuri Pankov2013-11-216-64/+69
  2583 Add -p (parsable) option to zfs list

  References:
    https://www.illumos.org/issues/2583
    illumos/illumos-gate@43d68d68c1ce08fb35026bebfb141af422e7082e

  Ported-by: Gregor Kopka <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes: #937
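What a parsable mode changes for a size column can be sketched as follows. This is an illustrative userspace function, not the libzfs implementation: format_size() and its suffix table are assumptions, and the human-readable branch is a simplified two-decimal formatter chosen for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical formatter: with parsable set, print the exact byte count;
 * otherwise scale by powers of 1024 and append a unit suffix.
 */
static void
format_size(uint64_t bytes, int parsable, char *buf, size_t buflen)
{
	static const char suffixes[] = "BKMGTPE";
	double value = (double)bytes;
	size_t idx = 0;

	assert(buf != NULL && buflen > 0);

	if (parsable) {
		(void) snprintf(buf, buflen, "%" PRIu64, bytes);
		return;
	}

	/* Stop at the last suffix; sizeof counts the trailing NUL. */
	while (value >= 1024.0 && idx < sizeof (suffixes) - 2) {
		value /= 1024.0;
		idx++;
	}

	if (idx == 0)
		(void) snprintf(buf, buflen, "%" PRIu64, bytes);
	else
		(void) snprintf(buf, buflen, "%.2f%c", value, suffixes[idx]);
}
```

For a value of 1536 bytes the parsable form is "1536" and the scaled form is "1.50K". Scripts generally want the former because it can be compared and summed without parsing unit suffixes.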
* Add I/O Read/Write AccountingBrian Behlendorf2013-11-212-2/+11
  Because ZFS bypasses the page cache we don't inherit per-task I/O accounting for free. However, the Linux kernel does provide helper functions which allow us to perform our own accounting. These are most commonly used for direct IO, which also bypasses the page cache, but they can be used for the common read/write call paths as well.

  Signed-off-by: Pavel Snajdr <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #313
  Closes #1275
* Document ZFS module parameters.Turbo Fredriksson2013-11-202-1/+1005
  This is a first draft of a zfs-module-parameters(5) man page. I have just extracted each parameter's name and description with modinfo, then checked the source for its type and default value.

  This will need more work, preferably by someone who actually knows these values and what to use them for.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1856