summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* Change zfs-kmod-devel install pathBrian Behlendorf2013-03-136-6/+6
| | | | | | | | | | | | | | | Install the common zfs kernel development headers under /usr/src/zfs-<version>/ rather than in a kernel specific directory. The kernel specific build products such as zfs_config.h and Modules.symvers are left installed under /usr/src/zfs-<version>/<kernel>. This was done to be consistent with where dkms expects kernel module source to be packaged. It also allows for a common zfs-kmod-devel package which includes the headers, and per-kernel zfs-kmod-devel-<kernel> packages. Signed-off-by: Brian Behlendorf <[email protected]>
* Refresh links to web siteNed Bass2013-03-062-2/+2
| | | | | | | | A few files still refer to @behlendorf's private fork on github. Use the primary web site URL instead. Two typos are also corrected. Signed-off-by: Brian Behlendorf <[email protected]>
* Add snapdev=[hidden|visible] dataset propertyEric Dillmann2013-03-053-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new snapdev dataset property may be set to control the visibility of zvol snapshot devices. By default this value is set to 'hidden' which will prevent zvol snapshots from appearing under /dev/zvol/ and /dev/<dataset>/. When set to 'visible' all zvol snapshots for the dataset will be visible. This functionality was largely added because when automatic snapshoting is enabled large numbers of read-only zvol snapshots will be created. When creating these devices the kernel will attempt to read their partition tables, and blkid will attempt to identify any filesystems on those partitions. This leads to a variety of issues: 1) The zvol partition tables will be read in the context of the `modprobe zfs` for automatically imported pools. This is undesirable and should be done asynchronously, but for now reducing the number of visible devices helps. 2) Udev expects to be able to complete its work for a new block devices fairly quickly. When many zvol devices are added at the same time this is no longer be true. It can lead to udev timeouts and missing /dev/zvol links. 3) Simply having lots of devices in /dev/ can be aukward from a management standpoint. Hidding the devices your unlikely to ever use helps with this. Any snapshot device which is needed can be made visible by changing the snapdev property. NOTE: This patch changes the default behavior for zvols which was effectively 'snapdev=visible'. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1235 Closes #945 Issue #956 Issue #756
* Constify structures containing function pointersRichard Yao2013-03-047-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PaX team modified the kernel's modpost to report writeable function pointers as section mismatches because they are potential exploit targets. We could ignore the warnings, but their presence can obscure actual issues. Proper const correctness can also catch programming mistakes. Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2 kernel reports 133 section mismatches prior to this patch. This patch eliminates 130 of them. The quantity of writeable function pointers eliminated by constifying each structure is as follows: vdev_opts_t 52 zil_replay_func_t 24 zio_compress_info_t 24 zio_checksum_info_t 9 space_map_ops_t 7 arc_byteswap_func_t 5 The remaining 3 writeable function pointers cannot be addressed by this patch. 2 of them are in zpl_fs_type. The kernel's sget function requires that this be non-const. The final writeable function pointer is created by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and remove_shrinker() functions also require that this be non-const. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1300
* Fix hot sparesBrian Behlendorf2013-03-011-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue with hot spares in ZoL is because it opens all leaf vdevs exclusively (O_EXCL). On Linux, exclusive opens cause subsequent exclusive opens to fail with EBUSY. This could be resolved by not opening any of the devices exclusively, which is what Illumos does, but the additional protection offered by exclusive opens is desirable. It cleanly prevents you from accidentally adding an in-use non-ZFS device to your pool. To fix this we very slightly relaxed the usage of O_EXCL in the following ways. 1) Functions which open the device but only read had the O_EXCL flag removed and were updated to use O_RDONLY. 2) A common holder was added to the vdev disk code. This allow the ZFS code to internally open the device multiple times but non-ZFS callers may not. 3) An exception was added to make_disks() for hot spare when creating partition tables. For hot spare devices which are already opened exclusively we skip creating the partition table because this must already have been done when the disk was originally added as a hot spare. Additional minor changes include fixing check_in_use() to use a partition instead of a slice suffix. And is_spare() was moved above make_disks() to avoid adding a forward reference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #250
* Retire zpool_id infrastructureBrian Behlendorf2013-01-291-1/+1
| | | | | | | | | | | | | | | | | | | In the interest of maintaining only one udev helper to give vdevs user friendly names, the zpool_id and zpool_layout infrastructure is being retired. They are superseded by vdev_id which incorporates all the previous functionality. Documentation for the new vdev_id(8) helper and its configuration file, vdev_id.conf(5), can be found in their respective man pages. Several useful example files are installed under /etc/zfs/. /etc/zfs/vdev_id.conf.alias.example /etc/zfs/vdev_id.conf.multipath.example /etc/zfs/vdev_id.conf.sas_direct.example /etc/zfs/vdev_id.conf.sas_switch.example Signed-off-by: Brian Behlendorf <[email protected]> Closes #981
* Remove NPTL_GUARD_WITHIN_STACKBrian Behlendorf2013-01-291-6/+0
| | | | | | | | | | | | | | | | | Commit 4b2f65b253952c5103311cc8bb4b8cdc6836fd7e increased the user space stack by 4x to resolve certain stack overflows. As such it no longer makes sense to worry about a single extra page which might or might not be part of the process stack. There is now ample headroom for normal usage. By eliminating this configure check we are also resolving the following segfault which intentionally occurs at configure time and may be logged in dmesg. conftest[22156]: segfault at 7fbf18a47e48 ip 00000000004007fe sp 00007fbf18a4be50 error 6 in conftest[400000+1000] Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #3035 LZ4 compression support in ZFS and GRUBEric Dillmann2013-01-293-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3035 LZ4 compression support in ZFS and GRUB Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Christopher Siden <[email protected]> References: illumos/illumos-gate@a6f561b4aee75d0d028e7b36b151c8ed8a86bc76 https://www.illumos.org/issues/3035 http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS This patch has been slightly modified from the upstream Illumos version to be compatible with Linux. Due to the very limited stack space in the kernel a lz4 workspace kmem cache is used. Since we are using gcc we are also able to take advantage of the gcc optimized __builtin_ctz functions. Support for GRUB has been dropped from this patch. That code is available but those changes will need to made to the upstream GRUB package. Lastly, several hunks of dead code were dropped for clarity. They include the functions real_LZ4_uncompress(), LZ4_compressBound() and the Visual Studio specific hunks wrapped in _MSC_VER. Ported-by: Eric Dillmann <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1217
* Linux 2.6.26 compat, lookup_bdev()Brian Behlendorf2013-01-281-0/+9
| | | | | | | | | | | | | | | | | | | It's doubtful many people were impacted by this but commit 6c28567 accidentally broke ZFS builds for 2.6.26 and earlier kernels. This commit depends on the lookup_bdev() function which exists in 2.6.26 but wasn't exported until 2.6.27. The availability of the function isn't critical so a wrapper is introduced which returns ERR_PTR(-ENOTSUP) when the function isn't defined. This will have the effect of causing zvol_is_zvol() to always fail for 2.6.26 kernels. This in turn means vdevs will always get opened concurrently which is good for normal usage. This will only become an issue if your using a zvol as a vdev in another pool. In which case you really should be using a newer kernel anyway. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1205
* Use dsl_dataset_snap_lookup()Brian Behlendorf2013-01-252-1/+3
| | | | | | | | | | | | Retire the dmu_snapshot_id() function which was introduced in the initial .zfs control directory implementation. There is already an existing dsl_dataset_snap_lookup() which does exactly what we need, and the dmu_snapshot_id() function as implemented is racy. https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879 Signed-off-by: Brian Behlendorf <[email protected]> Closes #1238
* Add d_clear_d_op() compatibilityBrian Behlendorf2013-01-231-0/+20
| | | | | | | | | | | | | Added d_clear_d_op() helper function which clears some flags and the registered dentry->d_op table. This is required because d_set_d_op() issues a warning when the dentry operations table is already set. For the .zfs control directory to work properly we must be able to override the default operations table and register custom .d_automount and .d_revalidate callbacks. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #1230
* Fix 'zfs rollback' on mounted file systemsBrian Behlendorf2013-01-174-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rolling back a mounted filesystem with open file handles and cached dentries+inodes never worked properly in ZoL. The major issue was that Linux provides no easy mechanism for modules to invalidate the inode cache for a file system. Because of this it was possible that an inode from the previous filesystem would not get properly dropped from the cache during rolling back. Then a new inode with the same inode number would be create and collide with the existing cached inode. Ideally this would trigger an VERIFY() but in practice the error wasn't handled and it would just NULL reference. Luckily, this issue can be resolved by sprucing up the existing Solaris zfs_rezget() functionality for the Linux VFS. The way it works now is that when a file system is rolled back all the cached inodes will be traversed and refetched from disk. If a version of the cached inode exists on disk the in-core copy will be updated accordingly. If there is no match for that object on disk it will be unhashed from the inode cache and marked as stale. This will effectively make the inode unfindable for lookups allowing the inode number to be immediately recycled. The inode will then only be accessible from the cached dentries. Subsequent dentry lookups which reference a stale inode will result in the dentry being invalidated. Once invalidated the dentry will drop its reference on the inode allowing it to be safely pruned from the cache. Special care is taken for negative dentries since they do not reference any inode. These dentires will be invalidate based on when they were added to the dentry cache. Entries added before the last rollback will be invalidate to prevent them from masking real files in the dataset. Two nice side effects of this fix are: * Removes the dependency on spl_invalidate_inodes(), it can now be safely removed from the SPL when we choose to do so. * zfs_znode_alloc() no longer requires a dentry to be passed. This effectively reverts this portition of the code to its upstream counterpart. The dentry is not instantiated more correctly in the Linux ZPL layer. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #795
* Fix false ENOENT on snapshot control dentriesNed Bass2013-01-161-0/+25
| | | | | | | | | | | | | | | | | | | | | | Lookups in the snapshot control directory for an existing snapshot fail with ENOENT if an earlier lookup failed before the snapshot was created. This is because the earlier lookup causes a negative dentry to be cached which is never invalidated. The bug can be reproduced as follows (the second ls should succeed): $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory $ zfs snap tank@s $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory To remedy this, always invalidate cached dentries in the snapshot control directory. Since these entries never exist on disk there is no significant performance penalty for the extra lookups. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1192
* Illumos #3208 cross-endian incorrect user/group accountingMatthew Ahrens2013-01-141-1/+2
| | | | | | | | | | | | | | | | | | 3208 moving zpool cross-endian results in incorrect user/group accounting Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@e828a46d29ad418487f50d56b5c19e2a1f9033a7 illumos changeset: 13835:eea81edc4f14 https://www.illumos.org/issues/3208 Ported-by: Brian Behlendorf <[email protected]> Closes #627 Closes #1136
* Illumos #3145, #3212George Wilson2013-01-081-0/+1
| | | | | | | | | | | | | | | | | | | | | 3145 single-copy arc 3212 ztest: race condition between vdev_online() and spa_vdev_remove() Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Justin T. Gibbs <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos-gate/commit/9253d63df408bb48584e0b1abfcc24ef2472382e illumos changeset: 13840:97fd5cdf328a https://www.illumos.org/issues/3145 https://www.illumos.org/issues/3212 Ported-by: Brian Behlendorf <[email protected]> Closes #989 Closes #1137
* Illumos #3104: eliminate empty bpobjsMatthew Ahrens2013-01-085-0/+8
| | | | | | | | | | | | | | | | 3104 eliminate empty bpobjs Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@f17457368189aa911f774c38c1f21875a568bdca illumos changeset: 13782:8f78aae28a63 https://www.illumos.org/issues/3104 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #3086: unnecessarily setting DS_FLAG_INCONSISTENT on asyncMatthew Ahrens2013-01-084-2/+14
| | | | | | | | | | | | | | 3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets Reviewed by: Christopher Siden <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@ce636f8b38e8c9ff484e880d9abb27251a882860 illumos changeset: 13776:cd512c80fd75 https://www.illumos.org/issues/3086 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #2762: zpool command should have better support for feature flagsChristopher Siden2013-01-083-2/+4
| | | | | | | | | | | | | 2762 zpool command should have better support for feature flags Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@57221772c3fc05faba04bf48ddff45abf2bbf2bd https://www.illumos.org/issues/2762 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #3090 and #3102George Wilson2013-01-083-1/+3
| | | | | | | | | | | | | | | | | | | 3090 vdev_reopen() during reguid causes vdev to be treated as corrupt 3102 vdev_uberblock_load() and vdev_validate() may read the wrong label Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@dfbb943217bf8ab22a1a9d2e9dca01d4da95ee0b illumos changeset: 13777:b1e53580146d https://www.illumos.org/issues/3090 https://www.illumos.org/issues/3102 Ported-by: Brian Behlendorf <[email protected]> Closes #939
* Illumos #2619 and #2747Christopher Siden2013-01-0819-26/+449
| | | | | | | | | | | | | | | | | | | | | | 2619 asynchronous destruction of ZFS file systems 2747 SPA versioning with zfs feature flags Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Dan Kruchinin <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@53089ab7c84db6fb76c16ca50076c147cda11757 illumos/illumos-gate@ad135b5d644628e791c3188a6ecbd9c257961ef8 illumos changeset: 13700:2889e2596bd6 https://www.illumos.org/issues/2619 https://www.illumos.org/issues/2747 NOTE: The grub specific changes were not ported. This change must be made to the Linux grub packages. Ported-by: Brian Behlendorf <[email protected]>
* Use cv_wait_io() which will will account for iowaitMatt Johnston2013-01-071-0/+1
| | | | | | | Update zio_wait() to use cv_wait_io() to ensure the iowait time is properly accounted for. Signed-off-by: Brian Behlendorf <[email protected]>
* Revert "Remove TSD zfs_fsyncer_key"Brian Behlendorf2012-12-201-0/+2
| | | | | | | | This reverts commit 31f2b5abdf95d8426d8bfd66ca7f62ec70215e3c back to the original code until the fsync(2) performance regression can be addressed. Signed-off-by: Brian Behlendorf <[email protected]>
* Remove TSD zfs_fsyncer_keyBrian Behlendorf2012-12-191-2/+0
| | | | | | | | | | | | | | | | | | | | It's my understanding that the zfs_fsyncer_key TSD was added as a performance omtimization to reduce contention on the zl_lock from zil_commit(). This issue manifested itself as very long (100+ms) fsync() system call times for fsync() heavy workloads. However, under Linux I'm not seeing the same contention that was originally described. Therefore, I'm removing this code in order to ween ourselves off any dependence on TSD. If the original performance issue reappears on Linux we can revisit fixing it without resorting to TSD. This just leaves one small ZFS TSD consumer. If it can be cleanly removed from the code we'll be able to shed the SPL TSD implementation entirely. Signed-off-by: Brian Behlendorf <[email protected]> Closes zfsonlinux/spl#174
* Fix using zvol as slog deviceJorgen Lundman2012-12-182-1/+1
| | | | | | | | | | | | | | | During the original ZoL port the vdev_uses_zvols() function was disabled until it could be properly implemented. This prevented a zpool from use a zvol for its slog device. This patch implements that missing functionality by adding a zvol_is_zvol() function to zvol.c. Given the full path to a device it will lookup the device and verify its major number against the registered zvol major number for the system. If they match we know the device is a zvol. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1131
* Update SAs when an inode is dirtiedBrian Behlendorf2012-12-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Revert the portion of commit d3aa3ea which always resulted in the SAs being update when an mmap()'ed file was closed. That change accidentally resulted in unexpected ctime updates which upset tools like git. That was always a horrible hack and I'm happy it will never make it in to a tagged release. The right fix is something I initially resisted doing because I was worried about the additional overhead. However, in hindsight the overhead isn't as bad as I feared. This patch implemented the sops->dirty_inode() callback which is unsurprisingly called when an inode is dirtied. We leverage this callback to keep the znode SAs strictly in sync with the inode. However, for now we're going to go slowly to avoid introducing any new unexpected issues by only updating the atime, mtime, and ctime. This will cover the callpath of most concern to us. ->filemap_page_mkwrite->file_update_time->update_time-> mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode Signed-off-by: Brian Behlendorf <[email protected]> Closes #764 Closes #1140
* Linux 3.7 compat, schedule_delayed_work()Brian Behlendorf2012-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | Linux kernel commit d8e794d accidentally broke the delayed work APIs for non-GPL callers. While the APIs to schedule a delayed work item are still available to all callers, it is no longer possible to initialize the delayed work item. I'm cautiously optimistic we could get the delayed_work_timer_fn exported for all callers in the upstream kernel. But frankly the compatibility code to use this kernel interface has always been problematic. Therefore, this patch abandons direct use the of the Linux kernel interface in favor of the new delayed taskq interface. It provides roughly the same functionality as delayed work queues but it's a stable interface under our control. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1053
* Directory xattr znodes hold a reference on their parentBrian Behlendorf2012-12-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike normal file or directory znodes, an xattr znode is guaranteed to only have a single parent. Therefore, we can take a refernce on that parent if it is provided at create time and cache it. Additionally, we take care to cache it on any subsequent zfs_zaccess() where the parent is provided as an optimization. This allows us to avoid needing to do a zfs_zget() when setting up the SELinux security xattr in the create path. This is critical because a hash lookup on the directory will deadlock since it is locked. The zpl_xattr_security_init() call has also been moved up to the zpl layer to ensure TXs to create the required xattrs are performed after the create TX. Otherwise we run the risk of deadlocking on the open create TX. Ideally the security xattr should be fully constructed before the new inode is unlocked. However, doing so would require far more extensive changes to ZFS. This change may also have the benefitial side effect of ensuring xattr directory znodes are evicted from the cache before normal file or directory znodes due to the extra reference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #671
* Increase ZFS_OBJ_MTX_SZ to 256Brian Behlendorf2012-11-271-1/+1
| | | | | | | | | | | | | | | | | | Increasing this limit costs us 6144 bytes of memory per mounted filesystem, but this is small price to pay for accomplishing the following: * Allows for up to 256-way concurreny when performing lookups which helps performance when there are a large number of processes. * Minimizes the likelyhood of encountering the deadlock described in issue #1101. Because vmalloc() won't strictly honor __GFP_FS there is still a very remote chance of a deadlock. See the zfsonlinux/spl@043f9b57 commit. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1101
* Improve AF hard disk detectionBrian Behlendorf2012-11-151-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | Use the bdev_physical_block_size() interface to determine the minimize write size which can be issued without incurring a read-modify-write operation. This is used to set the ashift correctly to prevent a performance penalty when using AF hard disks. Unfortunately, this interface isn't entirely reliable because it's not uncommon for disks to misreport this value. For this reason you may still need to manually set your ashift with: zpool create -o ashift=12 ... The solution to this in the upstream Illumos source was to add a white list of known offending drives. Maintaining such a list will be a burden, but it still may be worth doing if we can detect a large number of these drives. This should be considered as future work. Reported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #916
* Illumos #2671: zpool import should not fail if vdev ashift has increasedGeorge Wilson2012-11-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Gordon Ross <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Richard Lowe <[email protected]> Refererces to Illumos issue: https://www.illumos.org/issues/2671 This patch has been slightly modified from the upstream Illumos version. In the upstream implementation a warning message is logged to the console. To prevent pointless console noise this notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift" event. The event indicates a non-optimial (but entirely safe) ashift value was used to create the pool. Depending on your workload this may impact pool performance. Unfortunately, the only way to correct the issue is to recreate the pool with a new ashift. NOTE: The unrelated fix to the comment in zpool_main.c appears in the upstream commit and was preserved for consistnecy. Ported-by: Cyril Plisko <[email protected]> Reworked-by: Brian Behlendorf <[email protected]> Closes #955
* Log I/Os longer than zio_delay_max (30s default)Brian Behlendorf2012-11-021-0/+2
| | | | | | | | | | | | | | | | | | | | | | | There have been reports of ZFS deadlocking due to what appears to be a lost IO. This patch addes some debugging to determine the exact state of the IO which neither 1) completed, 2) failed, or 3) timed out after zio_delay_max (30) seconds. This information will be logged using the ZFS FMA infrastructure as a 'delay' event and posted to the internal zevent log. By default the last 64 events will be kept in the log but the limit is configurable via the zfs_zevent_len_max module option. To dump the contents of the log use the 'zpool events -v' command and look for the resource.fs.zfs.delay event. It will include various information about the pool, vdev, and zio which may shed some light on the issue. In the context of this change the 120 second kernel blocked thread watchdog has been disabled for synchronous IOs. Signed-off-by: Brian Behlendorf <[email protected]> Issue #930
* Add txgs-<pool> kstat fileBrian Behlendorf2012-11-021-0/+15
| | | | | | | | | | | | | | | | | | | | Create a kstat file which contains useful statistics about the last N txgs processed. This can be helpful when analyzing pool performance. The new KSTAT_TYPE_TXG type was added for this purpose and it tracks the following statistics per-txg. txg - Unique txg number state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted birth; - Creation time nread - Bytes read nwritten; - Bytes written reads - IOPs read writes - IOPs write open_time; - Length in nanoseconds the txg was open quiesce_time - Length in nanoseconds the txg was quiescing sync_time; - Length in nanoseconds the txg was syncing Signed-off-by: Brian Behlendorf <[email protected]>
* Add ddt_object_count() error handlingBrian Behlendorf2012-10-291-3/+3
| | | | | | | | | | | | | | | | | | | The interface for the ddt_zap_count() function assumes it can never fail. However, internally ddt_zap_count() is implemented with zap_count() which can potentially fail. Now because there was no way to return the error to the caller a VERIFY was used to ensure this case never happens. Unfortunately, it has been observed that pools can be damaged in such a way that zap_count() fails. The result is that the pool can not be imported without hitting the VERIFY and crashing the system. This patch reworks ddt_object_count() so the error can be safely caught and returned to the caller. This allows a pool which has be damaged in this way to be safely rewound for import. Signed-off-by: Brian Behlendorf <[email protected]> Closes #910
* Allow 'zpool replace' to use short device namesBrian Behlendorf2012-10-221-16/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'zpool replace' command would fail when given a short name because unlike on other platforms the short name cannot be deterministically expanded to a single path. Multiple path prefixes must be checked and in addition the partition suffix for whole disks is determined by the prefix. To handle this complexity a zfs_strcmp_pathname() function was added which takes either a short or fully qualified device name. Short names will be expanded using the prefixes in the default import search path, or the ZPOOL_IMPORT_PATH environment variable if it's defined. All posible expansions are then compared against the comparison path. Care is taken to strip redundant slashes to ensure legitimate matches are not missed. In the context of this work the existing zfs_resolve_shortname() function was extended to consider the ZPOOL_IMPORT_PATH when set. The zfs_append_partition() interface was also simplified to take only a single buffer. The vast majority of these changes rework existing Linux specific code which was originally written to accomidate udev. However, there is some minimal cleanup which removes Illumos specific code. This was done to improve readability but the basic flow and intent of the upstream code was maintained. These changes are the logical conclusion of the previos work to adjust the 'zpool import' search behavior, see commit 44867b6a. Signed-off-by: Brian Behlendorf <[email protected]> Closes #544 Closes #976
* Add FASTWRITE algorithm for synchronous writes.Etienne Dechamps2012-10-175-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks accross vdevs is important for performance: indeed, using mutliple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implictly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that linguering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013
* Linux 3.6 compat, iops->mkdir()Richard Yao2012-10-141-1/+1
| | | | | | | | | | | | Use .mkdir instead of .create in 3.3 compatibility check. Linux 3.6 modifies inode_operations->create's function prototype. This causes an autotools Linux 3.3. compatibility check for a function prototype change in create, mkdir and mknode to fail. Since mkdir and mknode are unchanged, we modify the check to examine it instead. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
* Linux 3.6 compat, sget()Yuxuan Shui2012-10-141-0/+10
| | | | | | | | | | | | | As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the mount flags are now passed to sget() so they can be used when initializing a new superblock. ZFS never uses sget() in this fashion so we can simply pass a zero and add a zpl_sget() compatibility wrapper. Signed-off-by: Yuxuan Shui <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #873
* Fix zfs_txg_timeout module parameterBrian Behlendorf2012-10-111-0/+3
| | | | | | | | | | | | | | | | Allow the zfs_txg_timeout variable to be dynamically tuned at run time. By pulling it down out of the variable declaration it will be evaluted each time through the loop. The zfs_txg_timeout variable is now declared extern in a the common sys/txg.h header rather than locally in dsl_scan.c. This prevents potential type mismatches if the global variable needs to be used elsewhere. Move the module_param() code in to the same source file where zfs_txg_timeout is declared. This is the most logical location. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix zfs_write_limit_max integer size mismatch on 32-bit systemsRichard Yao2012-10-111-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c409e4647f221ab724a0bd10c480ac95447203c3 introduced a number of module parameters. This required several types to be changed to accomidate the required module parameters Linux macros. Unfortunately, arc.c contained its own extern definition of the zfs_write_limit_max variable and its type was not updated to be consistent with its dsl_pool.c counterpart. If the variable had been properly marked extern in a common header, then gcc would have generated a warning and this would not have slipped through. The result of this was that the ARC unconditionally expected zfs_write_limit_max to be 64-bit. Unfortunately, the largest size integer module parameter that Linux supports is unsigned long, which varies in size depending on the host system's native word size. The effect was that on 32-bit systems, ARC incorrectly performed 64-bit operations on a 32-bit value by reading the neighboring 32 bits as the upper 32 bits of the 64-bit value. We correct that by changing the extern declaration to use the unsigned long type and move these extern definitions in to the common arc.h header. This should make ARC correctly treat zfs_write_limit_max as a 32-bit value on 32-bit systems. Reported-by: Jorgen Lundman <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #749
* Illumos #3100: zvol rename fails with EBUSY when dirty.Matthew Ahrens2012-10-032-1/+1
| | | | | | | | | | | | | | | | | illumos/illumos-gate@2e2c135528b3edfe9aaf67d1f004dc0202fa1a54 Illumos changeset: 13780:6da32a929222 3100 zvol rename fails with EBUSY when dirty Reviewed by: Christopher Siden <[email protected]> Reviewed by: Adam H. Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Eric Schrock <[email protected]> Ported-by: Etienne Dechamps <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #995
* Fix VOP_CLOSE() in userspace.Etienne Dechamps2012-10-031-1/+1
| | | | | | | | | | | | | Currently, for unknown reasons, VOP_CLOSE() is a no-op in userspace. This causes file descriptor leaks. This is especially problematic with long ztest runs, since zpool.cache is opened repeatedly and never closed, resulting in resource exhaustion (EMFILE errors). This patch fixes the issue by making VOP_CLOSE() do what it is supposed to do. Signed-off-by: Brian Behlendorf <[email protected]> Issue #989
* Create threads in detached state in userspace.Etienne Dechamps2012-10-031-2/+2
| | | | | | | | | | | | | | | | | | | | | Currently, thread_create(), when called in userspace, creates a joinable (i.e. not detached thread). This is the pthread default. Unfortunately, this does not reproduce kthreads behavior (kthreads are always detached). In addition, this contradicts the original Solaris code which creates userspace threads in detached mode. These joinable threads are never joined, which leads to a leakage of pthread thread objects ("zombie threads"). This in turn results in excessive ressource consumption, and possible ressource exhaustion in extreme cases (e.g. long ztest runs). This patch fixes the issue by creating userspace threads in detached mode. The only exception is ztest worker threads which are meant to be joinable. Signed-off-by: Brian Behlendorf <[email protected]> Issue #989
* Illumos #2703: add mechanism to report ZFS send progressBill Pijewski2012-09-196-2/+43
| | | | | | | | | | | | | Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Robert Mustacchi <[email protected]> Reviewed by: Richard Lowe <[email protected]> Approved by: Eric Schrock <[email protected]> References: https://www.illumos.org/issues/2703 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #1948: zpool list should show more detailed pool infoChris Siden2012-09-193-3/+10
| | | | | | | | | | | | | | | | | | Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Albert Lee <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Eric Schrock <[email protected]> References: https://www.illumos.org/issues/1948 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #685
* Revert "Improve AF hard disk detection"Brian Behlendorf2012-09-111-19/+5
| | | | | | | | | | | | | | | | | | This reverts commit 395350c85d9903beba43bac7ae79092ae25f1526 which accidentally introduced issue #955. Pools using AF drives which were originally created with a sector size of 512 bytes will now be correctly detected to have physical sector size of 4096. This is desirable for a new pool, however for an existing pool abruptly changing the sector size causes problems. For this reason, this change is being reverted until the additional logic can be added to detect the existing pool case. Existing pools must use the ashift size stored in the label regardless of what the disk reports. This is critical for compatibility. Signed-off-by: Brian Behlendorf <[email protected]> Issue #955
* Switch KM_SLEEP to KM_PUSHPAGEBrian Behlendorf2012-09-041-0/+1
| | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* Improve AF hard disk detectionBrian Behlendorf2012-09-041-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | Use the bdev_physical_block_size() interface to determine the minimize write size which can be issued without incurring a read-modify-write operation. This is used to set the ashift correctly to prevent a performance penalty when using AF hard disks. Unfortunately, this interface isn't entirely reliable because it's not uncommon for disks to misreport this value. For this reason you may still need to manually set your ashift with: zpool create -o ashift=12 ... The solution to this in the upstream Illumos source was to add a while list of known offending drives. Maintaining such a list will be a burden, but it still may be worth doing if we can detect a large number of these drives. This should be considered as future work. Reported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #916
* Switch KM_SLEEP to KM_PUSHPAGERichard Yao2012-08-274-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: 21ade34 Disable direct reclaim for z_wr_* threads cfc9a5c Fix zpl_writepage() deadlock eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* Pre-allocate vdev I/O buffersBrian Behlendorf2012-08-272-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vdev queue layer may require a small number of buffers when attempting to create aggregate I/O requests. Rather than attempting to allocate them from the global zio buffers, which is slow under memory pressure, it makes sense to pre-allocate them because... 1) These buffers are short lived. They are only required for the life of a single I/O at which point they can be used by the next I/O. 2) The maximum number of concurrent buffers needed by a vdev is small. It's roughly limited by the zfs_vdev_max_pending tunable which defaults to 10. By keeping a small list of these buffer per-vdev we can ensure one is always available when we need it. This significantly reduces contention on the vq->vq_lock, because we no longer need to perform a slow allocation under this lock. This is particularly important when memory is already low on the system. It would probably be wise to extend the use of these buffers beyond aggregate I/O and in to the raidz implementation. The inability to quickly allocate buffer for the parity stripes could result in similiar problems. Signed-off-by: Brian Behlendorf <[email protected]>
* Revert Disable direct reclaim for z_wr_* threadsRichard Yao2012-08-271-1/+0
| | | | | | | | | | | | | | This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0ccf1fdc2ce852271d4d2f8b7a9c2c4be6db eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726