Commit message (Author, Age, Files, Lines)
* Allow GPT+EFI vdevs for root pools (Brian Behlendorf, 2012-11-30, 1 file, -0/+10)
| | | | | | | | | | | | | | | Commit 57a4edd allows the bootfs property to be set on any pool. However, many of the zpool commands still prevent you from using EFI labeled devices for the root pool. For example: # zpool attach rpool /dev/sda /dev/sdb cannot label 'sdb': EFI labeled devices are not supported on root pools. on root devices. For non-Solaris builds such as Linux disable this error. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1077
* Disable page allocation warnings for super block (Brian Behlendorf, 2012-11-30, 1 file, -1/+1)
| | | | | | | | | | Due to the slightly increased size of the ZFS super block caused by 30315d2 there are now allocation warnings. The allocation size is still small (just over 8k) and super blocks are rarely allocated so we suppress the warning. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1101
* Verify --with-linux source directory exists (Brian Behlendorf, 2012-11-29, 1 file, -5/+8)
| | | | | | | | Previously this check was only performed when ./configure was attempting to autodetect your kernel source directory. But we should also handle the case where --with-linux was provided and is obviously wrong. This way we catch the error before invoking make and compiling the source with incorrect autoconf results. Signed-off-by: Brian Behlendorf <[email protected]> Closes zfsonlinux/spl#162
* vdev_id fails to handle complex device topologies (Cyril Plisko, 2012-11-29, 1 file, -4/+4)
| | | | | | | | While expanding positional parameters the shell requires non-single digits to be enclosed in braces. When the SAS topology is non-trivial the number of positional parameters generated internally by the vdev_id script (using set -- ...) easily crosses the single-digit limit and vdev_id fails to generate links. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1119
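The brace requirement described in the vdev_id commit above can be modelled in a few lines. This is a minimal Python sketch of how a POSIX shell reads `$10` versus `${10}`, not the vdev_id script itself; the function name `expand_positional` and the sample parameters are hypothetical.

```python
import re

def expand_positional(template, params):
    """Model of POSIX positional-parameter expansion (illustration only)."""
    def repl(match):
        index = int(match.group(1) or match.group(2))
        # Positional parameters are 1-based; unset ones expand to "".
        return params[index - 1] if 1 <= index <= len(params) else ""
    # ${N} may span several digits; a bare $N consumes exactly one digit.
    return re.sub(r"\$\{(\d+)\}|\$(\d)", repl, template)

params = list("abcdefghijkl")               # twelve parameters: a .. l
print(expand_positional("$10", params))     # first parameter, then a literal 0
print(expand_positional("${10}", params))   # the tenth parameter
```

With more than nine parameters, the unbraced form silently picks up the wrong value, which matches the failure the commit describes for non-trivial SAS topologies.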
* Make vdev_id POSIX sh compatible (Ned Bass, 2012-11-27, 1 file, -16/+23)
| | | | | | | | | | | Full bash may not be available in all environments where udev helpers run, such as in an initial ramdisk. To avoid breakage in this case, remove use of bash-specific features such as variable arrays and the `declare' keyword from the vdev_id script. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #870
* Fix NULL deref when zvol_alloc() fails (Brian Behlendorf, 2012-11-27, 1 file, -1/+1)
| | | | | | | | | | If zvol_alloc() fails zv will be set to NULL and dereferenced in out_dmu_objset_disown. To avoid this entirely the zv->objset line is moved up in to the success block. Original-patch-by: Jorgen Lundman <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1109
* Increase ZFS_OBJ_MTX_SZ to 256 (Brian Behlendorf, 2012-11-27, 1 file, -1/+1)
| | | | | | | | Increasing this limit costs us 6144 bytes of memory per mounted filesystem, but this is a small price to pay for accomplishing the following: * Allows for up to 256-way concurrency when performing lookups which helps performance when there are a large number of processes. * Minimizes the likelihood of encountering the deadlock described in issue #1101. Because vmalloc() won't strictly honor __GFP_FS there is still a very remote chance of a deadlock. See the zfsonlinux/spl@043f9b57 commit. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1101
* Recreate minors when renaming zvols (Brian Behlendorf, 2012-11-19, 1 file, -5/+13)
| | | | | | | | When a zvol with snapshots is renamed the device files under /dev/zvol/ are not renamed. This patch resolves the problem by destroying and recreating the minors with the new name so the links can be recreated by udev. Original-patch-by: Suman Chakravartula <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #408
* mount.zfs: canonicalize mount point for mtab (nordaux, 2012-11-15, 1 file, -2/+9)
| | | | | | | | | | | | | | | | Canonicalize the mount point passed to the mount.zfs helper. This way a clean path is always added to mtab which ensures the umount can properly locate and remove the entry. Test case: $ mkdir /mnt/foo $ mount -t zfs zpool/foo /mnt/../mnt/foo//// $ umount /mnt/foo $ cat /etc/mtab | grep zpool/foo zpool/foo /mnt/../mnt/foo//// zfs rw 0 0 Signed-off-by: Brian Behlendorf <[email protected]> Closes #573
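The test case in the mount.zfs commit above comes down to cleaning the path before it is recorded. A minimal Python sketch, assuming purely lexical cleanup is enough for illustration (the real helper may also resolve symlinks); `canonical_mount_point` is a hypothetical name.

```python
import posixpath

def canonical_mount_point(path):
    """Collapse '..', '.', and repeated or trailing slashes (lexical only)."""
    return posixpath.normpath(path)

# The messy path from the commit's test case becomes the clean one,
# so a later 'umount /mnt/foo' can find the matching mtab entry.
print(canonical_mount_point("/mnt/../mnt/foo////"))
```

`os.path.realpath()` would additionally resolve symbolic links, at the cost of depending on the filesystem state.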
* Merge branch 'ashift' (Brian Behlendorf, 2012-11-15, 8 files, -23/+144)
|\ | | | | | | | | | | | | | | | | | | This branch adds some overdue ashift improvements. * Add '-o ashift' to 'zpool add' and 'zpool attach' * Improve AF hard disk detection * Allow 'zpool import' to handle increases in ashift Signed-off-by: Brian Behlendorf <[email protected]>
| * Add "-o ashift" to zpool add and zpool attach (Cyril Plisko, 2012-11-15, 2 files, -12/+70)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When adding devices to an existing pool "ashift" property is auto-detected. However, if this property was overridden at the pool creation time (i.e. zpool create -o ashift=12 tank ...) this may not be what the user wants. This commit lets the user specify the value of "ashift" property to be used with newly added drives. For example, zpool add -o ashift=12 tank disk1 zpool attach -o ashift=12 tank disk1 disk2 Signed-off-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #566
| * Improve AF hard disk detection (Brian Behlendorf, 2012-11-15, 3 files, -5/+59)
| | | | | | | | Use the bdev_physical_block_size() interface to determine the minimum write size which can be issued without incurring a read-modify-write operation. This is used to set the ashift correctly to prevent a performance penalty when using AF hard disks. Unfortunately, this interface isn't entirely reliable because it's not uncommon for disks to misreport this value. For this reason you may still need to manually set your ashift with: zpool create -o ashift=12 ... The solution to this in the upstream Illumos source was to add a white list of known offending drives. Maintaining such a list will be a burden, but it still may be worth doing if we can detect a large number of these drives. This should be considered as future work. Reported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #916
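The relationship between a reported physical block size and the ashift value in the commit above is a base-2 logarithm. A minimal Python sketch, assuming a 512-byte floor (ashift 9); `ashift_for` and the constant name are illustrative, not the kernel code.

```python
MIN_ASHIFT = 9  # assumption: 512-byte sectors are the smallest supported

def ashift_for(physical_block_size):
    """Return log2 of the block size, never below MIN_ASHIFT."""
    size = physical_block_size
    if size <= 0 or size & (size - 1):
        raise ValueError("block size must be a positive power of two")
    return max(MIN_ASHIFT, size.bit_length() - 1)

print(ashift_for(512))    # legacy 512-byte sectors
print(ashift_for(4096))   # Advanced Format (AF) drive reporting honestly
```

A drive that misreports 512 bytes yields ashift 9 here too, which is why the commit still recommends `-o ashift=12` as a manual override.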
| * Illumos #2671: zpool import should not fail if vdev ashift has increased (George Wilson, 2012-11-15, 4 files, -6/+15)
|/ | | | | | | | Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Gordon Ross <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Richard Lowe <[email protected]> References to Illumos issue: https://www.illumos.org/issues/2671 This patch has been slightly modified from the upstream Illumos version. In the upstream implementation a warning message is logged to the console. To prevent pointless console noise this notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift" event. The event indicates a non-optimal (but entirely safe) ashift value was used to create the pool. Depending on your workload this may impact pool performance. Unfortunately, the only way to correct the issue is to recreate the pool with a new ashift. NOTE: The unrelated fix to the comment in zpool_main.c appears in the upstream commit and was preserved for consistency. Ported-by: Cyril Plisko <[email protected]> Reworked-by: Brian Behlendorf <[email protected]> Closes #955
* zfs-0.6.0-rc12 [tag: zfs-0.6.0-rc12] (Brian Behlendorf, 2012-11-13, 1 file, -1/+1)
|
* Fix hard coded path in 60-vdev.rules.in (Richard Yao, 2012-11-13, 2 files, -3/+3)
| | | | | | | | | | | | | | | The udev data directory was hard coded in 60-vdev.rules.in. That causes a problem when a distribution changes the location of the directory. This was not an issue in the past because virtually all distributions used the same path, but that is beginning to change following a decision by the systemd developers to change the directory location to reflect their take-over of udev maintainership. The testing branch of Gentoo Linux adopted this change, which enabled the hardcoded directory location to trigger a regression. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1085
* Fix "allocating allocated segment" panic (Brian Behlendorf, 2012-11-09, 1 file, -1/+10)
| | | | | | | | Gunnar Beutner did all the hard work on this one by correctly identifying that this issue is a race between dmu_sync() and dbuf_dirty(). Now in all cases the caller is responsible for preventing this race by making sure the zfs_range_lock() is held when dirtying a buffer which may be referenced in a log record. The mmap case which relies on zfs_putpage() was not taking the range lock. This code was accidentally dropped when the function was rewritten for the Linux VFS. This patch adds the required range locking to zfs_putpage(). It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which aren't strictly required due to the VFS holding a reference. However, this makes the code more consistent with the upstream code and there's no harm in being extra careful here. Original-patch-by: Gunnar Beutner <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #541
* Fix zvol+btrfs hang (Brian Behlendorf, 2012-11-09, 1 file, -0/+77)
| | | | | | | | When using a zvol to back a btrfs filesystem the btrfs mount would hang. This was due to the bio completion callback used in btrfs assuming that lower level drivers would never modify the bio->bi_io_vecs after they were submitted via bio_submit(). If they are modified btrfs will miscalculate which pages need to be unlocked resulting in a hang. It's worth mentioning that other file systems such as ext[234] and xfs work fine because they do not make the same assumption in the bio completion callback. The most straightforward way to fix the issue is to present the semantics expected by btrfs. This is done by cloning the bios attached to each request and then using the clones' bvecs to perform the required accounting. The clones are freed after each read/write and the original unmodified bios are linked back in to the request. Signed-off-by: Chris Wedgwood <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #469
* Merge remote branch 'eris/stats' (Brian Behlendorf, 2012-11-06, 7 files, -11/+259)
|\ | | | | | | | | | | | | Bring in support for the new KSTAT_TYPE_TXG type. This allows for additional visibility in to the txg handling. Signed-off-by: Brian Behlendorf <[email protected]>
| * Log I/Os longer than zio_delay_max (30s default) (Brian Behlendorf, 2012-11-02, 3 files, -10/+28)
| | | | | | | | There have been reports of ZFS deadlocking due to what appears to be a lost IO. This patch adds some debugging to determine the exact state of the IO which neither 1) completed, 2) failed, or 3) timed out after zio_delay_max (30) seconds. This information will be logged using the ZFS FMA infrastructure as a 'delay' event and posted to the internal zevent log. By default the last 64 events will be kept in the log but the limit is configurable via the zfs_zevent_len_max module option. To dump the contents of the log use the 'zpool events -v' command and look for the resource.fs.zfs.delay event. It will include various information about the pool, vdev, and zio which may shed some light on the issue. In the context of this change the 120 second kernel blocked thread watchdog has been disabled for synchronous IOs. Signed-off-by: Brian Behlendorf <[email protected]> Issue #930
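The bounded event log described in the commit above behaves like a fixed-size ring: once the limit is reached, the oldest event is dropped. A minimal Python sketch under that reading; the class `ZeventLog`, its method names, and the event dictionary layout are hypothetical, while the two defaults (30 seconds, 64 events) come from the commit message.

```python
from collections import deque

ZIO_DELAY_MAX_SECONDS = 30   # default threshold named in the commit message
ZEVENT_LEN_MAX = 64          # default for the zfs_zevent_len_max option

class ZeventLog:
    """Keeps only the most recent len_max events (model, not the kernel log)."""

    def __init__(self, len_max=ZEVENT_LEN_MAX):
        self.events = deque(maxlen=len_max)

    def note_io(self, zio_id, elapsed_seconds):
        """Post a 'delay' event when an I/O exceeds the threshold."""
        if elapsed_seconds > ZIO_DELAY_MAX_SECONDS:
            self.events.append(
                {"class": "resource.fs.zfs.delay", "zio": zio_id,
                 "elapsed": elapsed_seconds})
            return True
        return False

log = ZeventLog()
for zio_id in range(100):
    log.note_io(zio_id, 31.0)
print(len(log.events), log.events[0]["zio"])
```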
| * Add txgs-<pool> kstat file (Brian Behlendorf, 2012-11-02, 4 files, -1/+231)
|/ Create a kstat file which contains useful statistics about the last N txgs processed. This can be helpful when analyzing pool performance. The new KSTAT_TYPE_TXG type was added for this purpose and it tracks the following statistics per-txg:
| txg - Unique txg number
| state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
| birth - Creation time
| nread - Bytes read
| nwritten - Bytes written
| reads - IOPs read
| writes - IOPs write
| open_time - Length in nanoseconds the txg was open
| quiesce_time - Length in nanoseconds the txg was quiescing
| sync_time - Length in nanoseconds the txg was syncing
| Signed-off-by: Brian Behlendorf <[email protected]>
* Add ddt_object_count() error handling (Brian Behlendorf, 2012-10-29, 4 files, -19/+28)
| | | | | | | | The interface for the ddt_zap_count() function assumes it can never fail. However, internally ddt_zap_count() is implemented with zap_count() which can potentially fail. Now because there was no way to return the error to the caller a VERIFY was used to ensure this case never happens. Unfortunately, it has been observed that pools can be damaged in such a way that zap_count() fails. The result is that the pool can not be imported without hitting the VERIFY and crashing the system. This patch reworks ddt_object_count() so the error can be safely caught and returned to the caller. This allows a pool which has been damaged in this way to be safely rewound for import. Signed-off-by: Brian Behlendorf <[email protected]> Closes #910
* Revert "Don't ashift-align vdev read requests." (Brian Behlendorf, 2012-10-24, 1 file, -9/+3)
| | | | | | | | | This reverts commit a5c20e2a0a9046c06d86615fbf51dc04f12bba14 which accidentally introduced a regression for real 4k sector devices. See issue #1065 for details. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1065
* Remove 'Resized bio's/dio' warning (Brian Behlendorf, 2012-10-22, 1 file, -1/+0)
| | | | | | | | The following warning was originally added to provide visibility in to how often a dio gets heavily fragmented in to over 16 bios. This can happen due to constraints imposed by the block device and may have a negative impact on performance but is otherwise harmless. To prevent needless confusion and worry the message has been removed. kernel: WARNING: Resized bio's/dio to 32 Signed-off-by: Brian Behlendorf <[email protected]>
* Update spare and cache device names on import (Brian Behlendorf, 2012-10-22, 1 file, -6/+7)
| | | | | | | | During 'zpool import' all ZPOOL_CONFIG_PATH names are supposed to be updated by fix_paths(). This was not happening for spare and cache devices because the proper names were getting filtered out of the pool_list_t->names. Interestingly, the names were being filtered because the spare and cache devices do not contain the pool name in their vdev label. The fix is to exclude the device path from the list only if: 1) it has a valid ZPOOL_CONFIG_POOL_NAME key in the label, and 2) that pool name does not match the specified pool name. Since the label is valid and because it does properly store the vdev guid it will be correctly assembled without the pool name. Signed-off-by: Brian Behlendorf <[email protected]> Closes #725
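The two-part exclusion rule in the commit above reduces to a small predicate. A minimal Python sketch, assuming a label can be modelled as a dictionary; the key `"pool_name"` stands in for ZPOOL_CONFIG_POOL_NAME and `keep_device_path` is a hypothetical name.

```python
def keep_device_path(label, wanted_pool):
    """Exclude a path only if its label names a different pool."""
    pool_name = label.get("pool_name")   # stand-in for ZPOOL_CONFIG_POOL_NAME
    has_other_pool = pool_name is not None and pool_name != wanted_pool
    return not has_other_pool

# Spare and cache labels carry no pool name, so they are now kept.
print(keep_device_path({"guid": 1234}, "tank"))
print(keep_device_path({"guid": 5678, "pool_name": "other"}, "tank"))
```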
* Allow 'zpool replace' to use short device names (Brian Behlendorf, 2012-10-22, 5 files, -96/+186)
| | | | | | | | The 'zpool replace' command would fail when given a short name because unlike on other platforms the short name cannot be deterministically expanded to a single path. Multiple path prefixes must be checked and in addition the partition suffix for whole disks is determined by the prefix. To handle this complexity a zfs_strcmp_pathname() function was added which takes either a short or fully qualified device name. Short names will be expanded using the prefixes in the default import search path, or the ZPOOL_IMPORT_PATH environment variable if it's defined. All possible expansions are then compared against the comparison path. Care is taken to strip redundant slashes to ensure legitimate matches are not missed. In the context of this work the existing zfs_resolve_shortname() function was extended to consider the ZPOOL_IMPORT_PATH when set. The zfs_append_partition() interface was also simplified to take only a single buffer. The vast majority of these changes rework existing Linux specific code which was originally written to accommodate udev. However, there is some minimal cleanup which removes Illumos specific code. This was done to improve readability but the basic flow and intent of the upstream code was maintained. These changes are the logical conclusion of the previous work to adjust the 'zpool import' search behavior, see commit 44867b6a. Signed-off-by: Brian Behlendorf <[email protected]> Closes #544 Closes #976
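The short-name comparison described in the commit above can be sketched as "expand with every prefix, normalize slashes, compare". This is a minimal Python model, not zfs_strcmp_pathname() itself: the default prefix list is hypothetical, ZPOOL_IMPORT_PATH is assumed to be colon-separated, and partition-suffix handling is left out.

```python
import re

# Hypothetical default search path; the real list lives in the ZFS source.
DEFAULT_SEARCH_PATH = ["/dev/disk/by-id", "/dev"]

def search_prefixes(environ):
    """Use ZPOOL_IMPORT_PATH when set (assumed colon-separated)."""
    value = environ.get("ZPOOL_IMPORT_PATH")
    return value.split(":") if value else list(DEFAULT_SEARCH_PATH)

def strip_redundant_slashes(path):
    """Collapse runs of '/' and drop a trailing '/', keeping root as '/'."""
    return re.sub(r"/+", "/", path).rstrip("/") or "/"

def pathname_matches(name, cmp_path, environ):
    """True if name (short or fully qualified) can refer to cmp_path."""
    target = strip_redundant_slashes(cmp_path)
    if name.startswith("/"):
        candidates = [name]
    else:
        candidates = [f"{prefix}/{name}" for prefix in search_prefixes(environ)]
    return any(strip_redundant_slashes(c) == target for c in candidates)

print(pathname_matches("sda", "/dev//sda/", {}))
print(pathname_matches("sda", "/custom/sda", {"ZPOOL_IMPORT_PATH": "/custom:/other"}))
```

Passing the environment in as a plain dictionary keeps the sketch deterministic; a caller would normally hand it `os.environ`.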
* Quote snapshot and mountpoint for .zfs automount (Brian Behlendorf, 2012-10-17, 1 file, -2/+2)
| | | | | | | | | | When automounting a snapshot in the .zfs/snapshot directory make sure to quote both the dataset name and the mount point. This ensures that if either component contains spaces, which are allowed, they get handled correctly. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1027
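The quoting concern in the commit above is easy to reproduce: a dataset or mount point containing a space splits into two arguments unless it is quoted. A minimal Python sketch using `shlex.quote`; the command layout and the function name `automount_command` are assumptions for illustration, not the command ZFS actually builds.

```python
import shlex

def automount_command(dataset, mountpoint):
    """Build a shell command line with both arguments safely quoted."""
    return f"mount -t zfs {shlex.quote(dataset)} {shlex.quote(mountpoint)}"

cmd = automount_command("tank/fs@my snap", "/tank/fs/.zfs/snapshot/my snap")
print(cmd)
# Splitting the line the way a shell would gives back exactly five words.
print(shlex.split(cmd))
```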
* Merge branch 'zil-performance' (Brian Behlendorf, 2012-10-17, 14 files, -27/+434)
|\ | | | | | | | This branch brings some ZIL performance optimizations, with significant increases in synchronous write performance for some workloads and pool configurations. See the individual commit messages for details. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1013
| * Use the slog even with logbias=throughput. (Etienne Dechamps, 2012-10-17, 1 file, -7/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the current code, logbias=throughput implies the following: 1) All synchronous writes are logged in indirect mode. 2) The slog is not used. (1) makes sense because it avoids writing the data twice, which is obviously a good thing when the user wants maximum pool throughput. (2), however, is a surprising decision. Considering all writes are indirect, the log record doesn't contain the actual data, only pointers to DMU blocks. As a result, log records written in logbias=throughput mode are quite small, and as such, it doesn't make any sense to write them to the main pool since slogs are usually optimized for small synchronous writes. In fact, the current behavior is actually harmful for performance, because log blocks and data blocks from dmu_sync() seldom have the same allocation size and as a result are usually allocated from different metaslabs. This means that if a spindle has to write both log blocks and DMU blocks (which is likely to happen under heavy load), it will have to seek between the two. Allocating the log blocks from the slog pool instead of the main pool avoids these unnecessary seeks. This commit makes ZFS use the slog on datasets with logbias=throughput. Real-life performance testing shows a 50% synchronous write performance increase with some large commit sizes, and no negative effect in other cases. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013
| * Add FASTWRITE algorithm for synchronous writes. (Etienne Dechamps, 2012-10-17, 10 files, -20/+144)
| | | | | | | | Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks across vdevs is important for performance: indeed, using multiple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites.
Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implicitly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters.
To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that lingering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013
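The vdev selection rule from the FASTWRITE commit above (fewest pending synchronous writes, ties broken by the first vdev at or after the rotor) can be modelled compactly. This is a minimal Python sketch of that rule only, not the metaslab allocator; the class and function names are hypothetical and the rotor is held fixed for clarity.

```python
class Vdev:
    """A top-level vdev with its fastwrite counter (model only)."""
    def __init__(self, name):
        self.name = name
        self.pending = 0   # allocated-but-not-yet-written synchronous writes

def fastwrite_alloc(vdevs, rotor):
    """Pick the vdev with the fewest pending writes, scanning from the rotor."""
    count = len(vdevs)
    scan_order = [(rotor + i) % count for i in range(count)]
    # min() returns the first minimal element, so ties go to the vdev
    # encountered first when starting from the rotor.
    chosen = min(scan_order, key=lambda i: vdevs[i].pending)
    vdevs[chosen].pending += 1   # implicit "mark": the vdev is reserved
    return chosen

def fastwrite_unmark(vdevs, index):
    """Release a reservation once the write has completed."""
    assert vdevs[index].pending > 0, "double unmark"
    vdevs[index].pending -= 1

vdevs = [Vdev("vdev0"), Vdev("vdev1"), Vdev("vdev2")]
picks = [fastwrite_alloc(vdevs, rotor=1) for _ in range(4)]
print(picks)
```

Because a reserved vdev keeps a non-zero counter until it is unmarked, back-to-back allocations spread across all vdevs before any vdev is chosen twice.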
| * Add atomic_sub_* functions to libspl. (Etienne Dechamps, 2012-10-17, 4 files, -0/+284)
|/ | | | | | | | | | Both the SPL and the ZFS libspl export most of the atomic_* functions, except atomic_sub_* functions which are only exported by the SPL, not by libspl. This patch remedies that by implementing atomic_sub_* functions in libspl. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013
* Merge branch 'condvar' (Brian Behlendorf, 2012-10-17, 3 files, -9/+14)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Auditing the code to verify that all instances of cv_signal() and cv_broadcast() are called under the proper associated mutex turned up several races. None of these have been conclusively seen in the wild but the following patch set resolves them. For reference, from the cv_signal(9F) man page: cv_signal() signals the condition and wakes one blocked thread. All blocked threads can be unblocked by calling cv_broadcast(). You must acquire the mutex passed into cv_wait() before calling cv_signal() or cv_broadcast() Signed-off-by: Brian Behlendorf <[email protected]> Closes #1048
| * Condition variable usage, zp->r_{rd,wr}_cv (Brian Behlendorf, 2012-10-15, 1 file, -4/+5)
| | | | | | | | The following incorrect usage of cv_broadcast() was caught by code inspection. The cv_broadcast() function must be called under the associated mutex to prevent racing with cv_wait(). Signed-off-by: Brian Behlendorf <[email protected]>
| * Condition variable usage, zilog->zl_cv_batch (Brian Behlendorf, 2012-10-15, 1 file, -1/+2)
| | | | | | | | The following incorrect usage of cv_signal() and cv_broadcast() was caught by code inspection. The cv_signal() and cv_broadcast() functions must be called under the associated mutex to prevent racing with cv_wait(). Signed-off-by: Brian Behlendorf <[email protected]>
| * Condition variable usage, zevent_cv (Brian Behlendorf, 2012-10-15, 1 file, -4/+7)
|/ | | | | | | The following incorrect usage of cv_broadcast() was caught by code inspection. The cv_broadcast() function must be called under the associated mutex to prevent racing with cv_wait(). Signed-off-by: Brian Behlendorf <[email protected]>
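The rule enforced by the three condvar commits above (change the state and signal while holding the associated mutex) has a direct analogue in Python's `threading.Condition`, which refuses to notify without the lock. A minimal sketch of the pattern, not the ZFS code; the class `Flag` and its methods are hypothetical.

```python
import threading

class Flag:
    """A one-shot flag protected by a mutex and a condition variable."""

    def __init__(self):
        self.lock = threading.Lock()
        self.cv = threading.Condition(self.lock)
        self.ready = False

    def wait_ready(self, timeout=5.0):
        with self.lock:
            # wait_for re-checks the predicate under the lock after each wakeup.
            return self.cv.wait_for(lambda: self.ready, timeout)

    def set_ready(self):
        with self.lock:           # mutex held for both the update and the wakeup
            self.ready = True
            self.cv.notify_all()  # analogue of cv_broadcast()

flag = Flag()
results = []
waiter = threading.Thread(target=lambda: results.append(flag.wait_ready()))
waiter.start()
flag.set_ready()
waiter.join()
print(results)
```

If the update and the wakeup happened outside the lock, a waiter could check the predicate, lose the CPU, miss the broadcast, and then sleep with nobody left to wake it.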
* Do not return /dev/loop-control in unused_loop_device (Andrew Reid, 2012-10-15, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | The function unused_loop_device in /usr/libexec/zfs/common.sh returns /dev/loop-control on the first call. This device is NOT a loop device (https://github.com/torvalds/linux/commit/770fe30) it is a control device. This in turn causes the script zconfig.sh to fail with: zpool-create.sh: Error 1 creating /tmp/zpool-vdev0 -> /dev/loop-control loopback The patch makes the function return /dev/loop[0-9]* which are loop devices. Signed-off-by: Andrew Reid <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #797
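The pattern named in the commit above, /dev/loop[0-9]*, requires a digit right after "loop", which is what rules out /dev/loop-control. A minimal Python check of that glob using `fnmatch`; the function name `is_loop_device` is hypothetical and the real fix lives in a shell script.

```python
from fnmatch import fnmatchcase

def is_loop_device(path):
    """True for /dev/loopN style names, False for /dev/loop-control."""
    return fnmatchcase(path, "/dev/loop[0-9]*")

for path in ["/dev/loop-control", "/dev/loop0", "/dev/loop12"]:
    print(path, is_loop_device(path))
```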
* Switch KM_SLEEP to KM_PUSHPAGE (Massimo Maggi, 2012-10-15, 1 file, -2/+2)
| | | | | | | | | | In this particular instance the allocation occurred in the context of sys_msync()->...->zpl_putpage() where we must be careful not to initiate additional I/O. Signed-off-by: Massimo Maggi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1038
* Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE (Brian Behlendorf, 2012-10-15, 1 file, -2/+2)
| | | | | | | | Prevent users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. Signed-off-by: Brian Behlendorf <[email protected]> Closes #520
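The limit in the commit above amounts to clamping a user-supplied tunable. A minimal Python sketch, assuming SPA_MAXBLOCKSIZE is 128 KiB (its value is not stated in this log); the function name is hypothetical.

```python
SPA_MAXBLOCKSIZE = 128 * 1024   # assumption: 128 KiB maximum block size

def effective_aggregation_limit(tunable):
    """Never aggregate beyond the largest block the pool can hold."""
    return min(tunable, SPA_MAXBLOCKSIZE)

print(effective_aggregation_limit(64 * 1024))     # below the cap: unchanged
print(effective_aggregation_limit(1024 * 1024))   # above the cap: clamped
```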
* Return positive error number in zfsctl_shares_lookup. (Yuxuan Shui, 2012-10-15, 1 file, -1/+1)
| | | | | | | | Otherwise it will cause zpl_shares_lookup() to return an invalid pointer when an error occurs. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Yuxuan Shui <[email protected]> Closes #626 #885 #947 #977
* Disable ztest deadman timer (Brian Behlendorf, 2012-10-14, 1 file, -0/+4)
| | | | | | | | The ztest deadman timer has been causing false positives in the testing VMs. To make it easier to spot possible regressions I'm disabling this timer. The buildbot test infrastructure will still mark ztest instances which take too long to complete as failures. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1018
* Merge branch 'linux-3.6' (Brian Behlendorf, 2012-10-14, 10 files, -10/+115)
|\ | | | | | | | | | | | | | | This branch adds the required compatibility code to support the Linux 3.6 kernel. Signed-off-by: Brian Behlendorf <[email protected]> Closes #873
| * Linux 3.6 compat, iops->mkdir() (Richard Yao, 2012-10-14, 3 files, -7/+6)
| | | | | | | | Use .mkdir instead of .create in 3.3 compatibility check. Linux 3.6 modifies inode_operations->create's function prototype. This causes an autotools Linux 3.3 compatibility check for a function prototype change in create, mkdir and mknod to fail. Since mkdir and mknod are unchanged, we modify the check to examine mkdir instead. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
| * Linux 3.6 compat, iops->create() (Yuxuan Shui, 2012-10-14, 3 files, -0/+32)
| | | | | | | | As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the struct nameidata is no longer passed to iops->create. Instead only the result of (nameidata->flags & LOOKUP_EXCL) is passed. ZFS like almost all Linux filesystems never made use of this so only the prototype needs to be wrapped for compatibility. Signed-off-by: Yuxuan Shui <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
| * Linux 3.6 compat, iops->lookup() (Yuxuan Shui, 2012-10-14, 4 files, -0/+41)
| | | | | | | | As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the struct nameidata is no longer passed to iops->lookup. Instead only the nameidata->flags are passed. ZFS like almost all Linux filesystems never made use of this so only the prototype needs to be wrapped for compatibility. Signed-off-by: Yuxuan Shui <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
| * Linux 3.6 compat, sget() (Yuxuan Shui, 2012-10-14, 4 files, -2/+36)
| | | | | | | | | | | | | | | | | | | | | | | | | | As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the mount flags are now passed to sget() so they can be used when initializing a new superblock. ZFS never uses sget() in this fashion so we can simply pass a zero and add a zpl_sget() compatibility wrapper. Signed-off-by: Yuxuan Shui <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
| * Linux 3.6 compat, sops->write_super() removed (Yuxuan Shui, 2012-10-14, 1 file, -1/+0)
|/ | | | | | | The .write_super callback was removed from the super_operations structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e. All file systems are now expected to self manage writing any dirty state associated with their super block. ZFS never made use of this callback so it can simply be removed from the super_operations structure. Signed-off-by: Yuxuan Shui <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
* Don't ashift-align vdev read requests. (Etienne Dechamps, 2012-10-12, 1 file, -3/+9)
| | | | | | | | Currently, the size of read and write requests on vdevs is aligned according to the vdev's ashift, allocating a new ZIO buffer and padding if need be. This makes sense for write requests to prevent read/modify/write if the write happens to be smaller than the device's internal block size. For reads however, the rationale is less clear. It seems that the original code aligns reads because, on Solaris, device drivers will outright refuse unaligned requests. We don't have that issue on Linux. Indeed, Linux block devices are able to accept requests of any size, and take care of alignment issues themselves. As a result, there's no point in enforcing alignment for read requests on Linux. This is a nice optimization opportunity for two reasons: - We remove a memory allocation in a heavily-used code path; - The request gets aligned in the lowest layer possible, which shrinks the path that the additional, useless padding data has to travel. For example, when using 4k-sector drives that lie about their sector size, using 512b read requests instead of 4k means that there will be less data traveling down the ATA/SCSI interface, even though the drive actually reads 4k from the platter. The only exception is raidz, because raidz needs to read the whole allocated block for parity. This patch removes alignment enforcement for read requests, except on raidz. Note that we also remove an assertion that checks that we're aligning a top-level vdev I/O, because that's not the case anymore for repair writes that result from failed reads. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1022
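The alignment policy proposed in the commit above (pad writes and raidz reads up to the ashift boundary, pass other reads through unchanged) can be expressed as a small rounding function. A minimal Python sketch under that reading; `io_size` and its parameters are hypothetical names, and the revert entry dated 2012-10-24 in this log shows the change was later backed out.

```python
def io_size(size, ashift, is_read, is_raidz):
    """Return the request size after optional ashift alignment."""
    if is_read and not is_raidz:
        return size                       # reads pass through unaligned
    align = 1 << ashift
    # Round up to the next multiple of a power-of-two alignment.
    return (size + align - 1) & ~(align - 1)

print(io_size(512, 12, is_read=True, is_raidz=False))    # read: unchanged
print(io_size(512, 12, is_read=False, is_raidz=False))   # write: padded
print(io_size(512, 12, is_read=True, is_raidz=True))     # raidz read: padded
```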
* Remove vmem_size() consumersRichard Yao2012-10-121-16/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | There are currently three vmem_size() consumers all of which are part of the ARC implementation. However, since the expected behavior of the Linux and Solaris virtual memory subsystems is so different the behavior in each of these instances needs to be reevaluated. * arc_evict_needed() - This is actually dead code. Arena support was never added to the SPL and zio_arena is always NULL. This support isn't needed so we simply remove this dead code. * arc_memory_throttle() - On Solaris where virtual memory constitutes almost all of the address space we can reasonably expect there to be a fairly large amount free. However, on Linux by default we only have about 100MB total and that's heavily used by the ARC. So the expectation on Linux is that this will usually be a small value. Therefore we remove the vmem_size() check for i386 systems because the expectation is that it will be less than the zfs_write_limit_max. * arc_init() - Here vmem_size() is used to initially size the ARC. Since the ARC is currently backed by the virtual address space it makes sense to use this as a limit on the ARC for 32-bit systems. This code can be removed when the ARC is backed by the page cache. Signed-off-by: Brian Behlendorf <[email protected]> Closes #831
* Fix zfs_txg_timeout module parameterBrian Behlendorf2012-10-113-6/+9
| | | | | | | | | | | | | | | | Allow the zfs_txg_timeout variable to be dynamically tuned at run time. By pulling it down out of the variable declaration it will be evaluated each time through the loop. The zfs_txg_timeout variable is now declared extern in the common sys/txg.h header rather than locally in dsl_scan.c. This prevents potential type mismatches if the global variable needs to be used elsewhere. Move the module_param() code into the same source file where zfs_txg_timeout is declared. This is the most logical location. Signed-off-by: Brian Behlendorf <[email protected]>
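Why reading the tunable inside the loop matters can be shown with a small user-space sketch. Everything here is hypothetical and only mimics the pattern: `tunable_timeout` stands in for zfs_txg_timeout, and the `change_at`/`new_val` arguments simulate an administrator changing the value while the loop runs.

```c
/* Hypothetical stand-in for the zfs_txg_timeout tunable (seconds). */
static int tunable_timeout = 5;

/* Old pattern: the tunable is copied once, so later changes are missed. */
static int
total_wait_cached(int iterations, int change_at, int new_val)
{
	int timeout = tunable_timeout;	/* evaluated once, at declaration */
	int total = 0;

	for (int i = 0; i < iterations; i++) {
		if (i == change_at)
			tunable_timeout = new_val;	/* simulated run-time tuning */
		total += timeout;
	}
	return (total);
}

/* New pattern: the tunable is re-read on every pass through the loop. */
static int
total_wait_live(int iterations, int change_at, int new_val)
{
	int total = 0;

	for (int i = 0; i < iterations; i++) {
		if (i == change_at)
			tunable_timeout = new_val;	/* simulated run-time tuning */
		total += tunable_timeout;	/* evaluated each iteration */
	}
	return (total);
}
```

In the cached variant the loop keeps using the value it saw at entry; in the live variant a change takes effect on the very next iteration.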
* Fix zfs_write_limit_max integer size mismatch on 32-bit systemsRichard Yao2012-10-112-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c409e4647f221ab724a0bd10c480ac95447203c3 introduced a number of module parameters. This required several types to be changed to accommodate the required Linux module parameter macros. Unfortunately, arc.c contained its own extern definition of the zfs_write_limit_max variable and its type was not updated to be consistent with its dsl_pool.c counterpart. If the variable had been properly marked extern in a common header, then gcc would have generated a warning and this would not have slipped through. The result of this was that the ARC unconditionally expected zfs_write_limit_max to be 64-bit. Unfortunately, the largest size integer module parameter that Linux supports is unsigned long, which varies in size depending on the host system's native word size. The effect was that on 32-bit systems, ARC incorrectly performed 64-bit operations on a 32-bit value by reading the neighboring 32 bits as the upper 32 bits of the 64-bit value. We correct that by changing the extern declaration to use the unsigned long type and move these extern definitions into the common arc.h header. This should make ARC correctly treat zfs_write_limit_max as a 32-bit value on 32-bit systems. Reported-by: Jorgen Lundman <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #749
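The effect of the mismatched declaration can be imitated in user space. The sketch below is an assumption-laden illustration, not the ARC code: `struct adjacent_words` and both helper functions are hypothetical, and the struct simply places a 32-bit value next to an unrelated 32-bit word so that a 64-bit read picks up both.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical layout: a 32-bit tunable followed by whatever
 * happens to sit next to it in memory.
 */
struct adjacent_words {
	uint32_t limit;		/* the real 32-bit value */
	uint32_t neighbor;	/* unrelated data stored right after it */
};

/* Wrong: treats the 32-bit value as 64-bit and drags in the neighbor. */
static uint64_t
read_as_64bit(const struct adjacent_words *w)
{
	uint64_t v;

	memcpy(&v, w, sizeof (v));
	return (v);
}

/* Right: reads the value at its declared width. */
static uint64_t
read_as_32bit(const struct adjacent_words *w)
{
	return (w->limit);
}
```

With `limit` set to 0x1000 and a nonzero `neighbor`, the 64-bit read no longer returns 0x1000; the exact bogus value depends on the host byte order. Declaring the variable once in a shared header lets the compiler reject this kind of disagreement between source files.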
* Make zfs_immediate_write_sz a module parameterCyril Plisko2012-10-111-2/+7
| | | | | | | | The zfs_immediate_write_sz variable is a tunable but lacks proper module_param() instrumentation. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1032