path: root/module
Commit log, newest first. Each entry shows the commit message, author, date, files changed, and lines removed/added.
* 5313 Allow I/Os to be aggregated across ZIO priority classes (Justin T. Gibbs, 2015-04-24, 1 file, -31/+48)
  Reviewed by: Andriy Gapon <[email protected]>
  Reviewed by: Will Andrews <[email protected]>
  Reviewed by: Matt Ahrens <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  References:
    https://www.illumos.org/issues/5313
    https://github.com/illumos/illumos-gate/commit/fe319232
  Ported-by: DHE <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3280

* Serialize access to spa->spa_feat_stats nvlist (Ned Bass, 2015-04-24, 2 files, -1/+8)
  The function spa_add_feature_stats() manipulates the shared nvlist
  spa->spa_feat_stats in an unsafe concurrent manner. Add a mutex to protect
  the list.

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3335

* Fix kernel panic due to tsd_exit in ZFS_EXIT(zsb) (Chunwei Chen, 2015-04-24, 1 file, -0/+2)
  The following panic would occur under certain heavy load:

    [ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held
    [ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P OE 3.18.10 #7

  The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which
  would purge all tsd data for the thread. However, ZFS_ENTER is designed to
  be reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data.
  Instead, we rely on the new behavior of tsd_set: when NULL is passed as the
  new value, tsd_set will automatically remove the tsd entry specified by the
  key for the current thread. rrw_tsd_key and zfs_allow_log_key already call
  tsd_set(key, NULL) when they're done. The zfs_fsyncer_key relied on
  ZFS_EXIT(zsb) calling tsd_exit() to do the cleanup; now we explicitly call
  tsd_set(key, NULL) on it as well.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3247

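  A minimal sketch of the per-key cleanup pattern this commit describes
  (kernel-style fragment, not the actual ZoL diff; the value stored under
  zfs_fsyncer_key is simplified here):

      /* Mark that an fsync is in progress for this thread. */
      tsd_set(zfs_fsyncer_key, (void *)(uintptr_t)1);

      /* ... do the work; ZFS_ENTER/ZFS_EXIT may nest while this is set ... */

      /*
       * Passing NULL removes only this thread's entry for this key, the
       * same per-key cleanup rrw_tsd_key and zfs_allow_log_key already do,
       * so ZFS_EXIT() no longer needs to purge everything via tsd_exit().
       */
      tsd_set(zfs_fsyncer_key, NULL);
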
* Extend PF_FSTRANS critical regions (Brian Behlendorf, 2015-04-24, 2 files, -6/+14)
  Additional testing has shown that the region covered by PF_FSTRANS needs to
  be extended to cover the zpl_xattr_security_init() and init_acl() functions.
  The zpl_mark_dirty() function can also recurse and therefore must always be
  protected.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #3331

* Mark additional functions as PF_FSTRANS (Brian Behlendorf, 2015-04-17, 4 files, -4/+53)
  Prevent deadlocks by disabling direct reclaim during all NFS, xattr, ctldir,
  and super function calls. This is related to 40d06e3.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Issue #3225

* Allocate zfs_znode_cache on the Linux slab (Tim Chase, 2015-04-14, 1 file, -2/+4)
  The Linux slab, in general, performs better than the SPL slab in cases where
  a lot of objects are allocated and fragmentation is likely present. This
  patch fixes pathologically bad behavior in cases where the ARC is filled
  with mostly metadata and a user program needs to allocate and dirty enough
  memory that only an insignificant amount of the ARC would need to be
  reclaimed. If zfs_znode_cache is on the SPL slab, the system may spin for a
  very long time trying to reclaim sufficient memory. If it is on the Linux
  slab, the behavior has been observed to be much more predictable; the memory
  is reclaimed more efficiently.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #3283

* Use vmem_alloc() in spa_config_write() (Brian Behlendorf, 2015-04-07, 1 file, -2/+2)
  The packed nvlist allocated in spa_config_write() may exceed the warning
  threshold for large configurations. Use the vmem interfaces for this
  short-lived allocation.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3251

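  A hedged before/after sketch of the change described above (fragment only;
  the real buffer variable and size computation in spa_config_write() may
  differ):

      /* before: kmem_alloc() warns once the packed nvlist passes the 32k threshold */
      buf = kmem_alloc(buflen, KM_SLEEP);
      /* ... write out the packed nvlist ... */
      kmem_free(buf, buflen);

      /* after: the short-lived, potentially large buffer comes from vmem */
      buf = vmem_alloc(buflen, KM_SLEEP);
      /* ... write out the packed nvlist ... */
      vmem_free(buf, buflen);
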
* Mark all ZPL and ioctl functions as PF_FSTRANS (Tim Chase, 2015-04-03, 6 files, -18/+96)
  Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl calls
  as well as the l2arc and adapt ARC threads. This obviates the need for
  MUTEX_FSTRANS so its previous uses and definition have been eliminated.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3225

* Illumos 5693 - ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override (Matthew Ahrens, 2015-03-27, 1 file, -2/+0)
  5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Christopher Siden <[email protected]>
  Reviewed by: Bayard Bell <[email protected]>
  Approved by: Dan McDonald <[email protected]>
  References:
    https://www.illumos.org/issues/5693
    https://github.com/illumos/illumos-gate/commit/7f7ace3
  Ported-by: Chris Dunlop <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3231

* Illumos 5694 - traverse_prefetcher does not prefetch enough (George Wilson, 2015-03-27, 1 file, -12/+11)
  5694 traverse_prefetcher does not prefetch enough
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Alex Reece <[email protected]>
  Reviewed by: Christopher Siden <[email protected]>
  Reviewed by: Josef 'Jeff' Sipek <[email protected]>
  Reviewed by: Bayard Bell <[email protected]>
  Approved by: Garrett D'Amore <[email protected]>
  References:
    https://www.illumos.org/issues/5694
    https://github.com/illumos/illumos-gate/commit/34d7ce05
  Ported-by: Chris Dunlop <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3230

* Align code with Illumos (Chris Dunlop, 2015-03-27, 1 file, -10/+9)
  Align code in traverse_visitbp() with that in Illumos in preparation for
  applying Illumos-5694. No functional change: use a temporary variable pd to
  replace multiple occurrences of td->td_pfd. This increases our stack use
  slightly more than normal because the function is called recursively.

  Signed-off-by: Chris Dunlop <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #3230

* Illumos 5695 - dmu_sync'ed holes do not retain birth time (Prakash Surya, 2015-03-27, 1 file, -1/+13)
  5695 dmu_sync'ed holes do not retain birth time
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Christopher Siden <[email protected]>
  Reviewed by: Bayard Bell <[email protected]>
  Approved by: Dan McDonald <[email protected]>
  References:
    https://www.illumos.org/issues/5695
    https://github.com/illumos/illumos-gate/commit/70163ac
  Ported-by: Chris Dunlop <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3229

* dbuf_free_range() overzealously frees dbufs (Ned Bass, 2015-03-25, 1 file, -1/+6)
  When called to free a spill block from a dnode, dbuf_free_range() has a bug
  that results in all dbufs for the dnode getting freed. A variety of problems
  may result from this bug, but a common one was a zap lookup tripping an
  ASSERT because the zap buffers had been zeroed out. This could happen on a
  dataset with xattr=sa set when extended attributes are written and removed
  on a directory concurrently with I/O to files in that directory.

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Fixes #3195
  Fixes #3204
  Fixes #3222

* Set the maximum ZVOL transfer size correctly (Tim Chase, 2015-03-25, 2 files, -2/+2)
  ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19 the
  kernel artificially capped it at 1024 (BLK_DEF_MAX_SECTORS). This cap was
  removed in torvalds/linux@34b48db. This patch changes it to DMU_MAX_ACCESS
  (in sectors) and also changes the ASSERT in dmu_tx_hold_write() to allow the
  maximum transfer size.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3212

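  For a sense of scale, a tiny standalone calculation of the new limit; the
  64 MiB value used for DMU_MAX_ACCESS below is an assumption for illustration
  only (the authoritative definition lives in dmu.h):

      #include <stdio.h>

      int
      main(void)
      {
          /* Assumed value, for illustration; see DMU_MAX_ACCESS in dmu.h. */
          unsigned long long dmu_max_access = 64ULL * 1024 * 1024;
          unsigned long long sector = 512;

          /* max_sectors is expressed in 512-byte sectors. */
          printf("new limit: %llu sectors\n", dmu_max_access / sector);
          /* The old artificial cap mentioned above. */
          printf("old cap:   1024 sectors (%llu bytes)\n", 1024 * sector);
          return (0);
      }
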
* zio_injection_enabled should not be a module option (Isaac Huang, 2015-03-24, 2 files, -10/+7)
  zio_inject.c keeps zio_injection_enabled as a counter of fault handlers, so
  it should not be exported to user space as a module option. Several
  EXPORT_SYMBOLs are moved from zio.c to zio_inject.c, where the symbols are
  defined.

  Signed-off-by: Isaac Huang <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3199

* Reduce size of zfs_sb_t: allocate z_hold_mtx separately (Chris Dunlop, 2015-03-24, 2 files, -0/+9)
  zfs_sb_t has grown to the point where using kmem_zalloc() for allocations is
  triggering the 32k warning threshold. We can't safely convert this entire
  allocation to use vmem_alloc() instead of kmem_alloc() because the
  backing_dev_info structure is embedded here. It depends on the
  bit_waitqueue() function, which won't behave properly when given a virtual
  address. Instead, use vmem_alloc() to allocate the z_hold_mtx array
  separately.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chris Dunlop <[email protected]>
  Closes #3178

* Fix arc_adjust_meta() behavior (Brian Behlendorf, 2015-03-20, 1 file, -23/+66)
  The goal of this function is to evict enough meta data buffers from the ARC
  to enforce the arc_meta_limit. Achieving this is slightly more complicated
  than it appears because it is common for data buffers to have holds on meta
  data buffers. In addition, dnode meta data buffers will be held by the
  dnodes in the block, preventing them from being freed. This means we can't
  simply traverse the ARC and expect to always find enough unheld meta data
  buffers to release.

  Therefore, this function has been updated to make alternating passes over
  the ARC, releasing data buffers and then newly unheld meta data buffers.
  This ensures forward progress is maintained and arc_meta_used will decrease.
  Normally this is sufficient, but if required the ARC will call the
  registered prune callbacks, causing dentries and inodes to be dropped from
  the VFS cache. This will make dnode meta data buffers available for reclaim.
  The total number of restarts is limited by zfs_arc_meta_adjust_restarts to
  prevent spinning in the rare case where all meta data is pinned.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Pavel Snajdr <[email protected]>
  Issue #3160

* Restructure per-filesystem reclaim (Brian Behlendorf, 2015-03-20, 3 files, -39/+45)
  Originally, when the ARC prune callback was introduced, the idea was to
  register a single callback for the ZPL. The ARC could invoke this callback
  if it needed the ZPL to drop dentries, inodes, or other cache objects which
  might be pinning buffers in the ARC. The ZPL would iterate over all ZFS
  super blocks and perform the reclaim.

  For the most part this design has worked well, but due to limitations in
  2.6.35 and earlier kernels there were some problems. This patch is designed
  to address those issues.

  1) iterate_supers_type() is not provided by all kernels, which makes it
     impossible to safely iterate over all zpl_fs_type filesystems in a single
     callback. The most straightforward and portable way to resolve this is to
     register a callback per-filesystem during mount. The
     arc_*_prune_callback() functions have always supported multiple callbacks
     so this is functionally a very small change.

  2) Commit 050d22b removed the non-portable shrink_dcache_memory() and
     shrink_icache_memory() functions and didn't replace them with equivalent
     functionality. This meant that for Linux 3.1 and older kernels the ARC
     had no mechanism to drop dentries and inodes from the caches if needed.
     This patch adds that missing functionality by calling
     shrink_dcache_parent() to release dentries which may be pinning inodes.
     This will result in all unused cache entries being dropped, which is a
     bit heavy handed, but it's the only interface available for old kernels.

  3) A zpl_drop_inode() callback is registered for kernels older than 2.6.35
     which do not support the .evict_inode callback. This ensures that when
     the last reference on an inode is dropped it is immediately removed from
     the cache. If this isn't done, inodes can end up on the global unused LRU
     with no mechanism available to ZFS to drop them. Since the ARC buffers
     are not dropped, the hottest inodes can still be recreated without
     performing disk IO.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Pavel Snajdr <[email protected]>
  Issue #3160

* Fix arc_meta_max accounting (Brian Behlendorf, 2015-03-20, 1 file, -3/+4)
  The arc_meta_max value should be increased when space is consumed, not when
  it is returned. This ensures that arc_meta_max is always up to date.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Pavel Snajdr <[email protected]>
  Issue #3160

* Use MUTEX_FSTRANS on l2arc_buflist_mtx (Chunwei Chen, 2015-03-18, 1 file, -1/+1)
  Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock
  scenario:

  1. arc_release() -> hash_lock -> l2arc_buflist_mtx
  2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) ->
     arc_buf_remove_ref() -> hash_lock

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Issue #3160

* Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption (Justin T. Gibbs, 2015-03-12, 3 files, -9/+60)
  5630 stale bonus buffer in recycled dnode_t leads to data corruption
  Author: Justin T. Gibbs <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Will Andrews <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  References:
    https://www.illumos.org/issues/5630
    https://github.com/illumos/illumos-gate/commit/cd485b4
  Ported-by: Chris Dunlop <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Issue #3172

* Illumos 5047 - don't use atomic_*_nv if you discard the return value (Josef 'Jeff' Sipek, 2015-03-12, 2 files, -5/+5)
  5047 don't use atomic_*_nv if you discard the return value
  Author: Josef 'Jeff' Sipek <[email protected]>
  Reviewed by: Garrett D'Amore <[email protected]>
  Reviewed by: Jason King <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  References:
    https://www.illumos.org/issues/5047
    https://github.com/illumos/illumos-gate/commit/640c167

  Porting Notes: Several hunks from the original patch were not specific to
  ZFS and thus were dropped.

  Ported-by: Chris Dunlop <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Issue #3172

* Mark zfs_inactive() with PF_FSTRANS (Brian Behlendorf, 2015-03-10, 1 file, -0/+8)
  Allowing direct reclaim to re-enter the VFS in the zfs_inactive() call path
  has historically been problematic for ZoL. Therefore, in order to avoid an
  entire class of current and future issues caused by this, PF_FSTRANS is set
  for all zfs_inactive() callers.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3163

* Use cached feature info in spa_add_feature_stats() (Ned Bass, 2015-03-05, 3 files, -11/+50)
  Avoid issuing I/O to the pool when retrieving feature flags information.
  Trying to read the ZAPs from disk means that zpool clear would hang if the
  pool is suspended and recovery would require a reboot. To keep the feature
  stats resident in memory, we hang a cached nvlist off of the spa. It is
  built up from disk the first time spa_add_feature_stats() is called, and
  refreshed thereafter using the cached feature reference counts.
  spa_add_feature_stats() gets called at pool import time so we can be sure
  the cached nvlist will be available if the pool is later suspended.

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3082

* Change ASSERT(!"...") to cmn_err(CE_PANIC, ...) (Brian Behlendorf, 2015-03-03, 5 files, -7/+11)
  There are a handful of ASSERT(!"...")'s throughout the code base for cases
  which should be impossible. This patch converts them to use
  cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
  additional debugging is logged if they were to occur.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #1445

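  The conversion pattern, sketched (fragment; the message text and the
  variable in the second line are illustrative, not taken from the actual
  patch):

      /* before: compiled out entirely when assertions are disabled */
      ASSERT(!"unexpected record type");

      /* after: always built in; panics and logs regardless of debug settings */
      cmn_err(CE_PANIC, "unexpected record type %d", type);
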
* Linux 4.0 compat: bdi_setup_and_register() (Brian Behlendorf, 2015-03-03, 1 file, -17/+4)
  The 'capabilities' argument which was passed to bdi_setup_and_register() has
  been removed. File systems should no longer pass BDI_CAP_MAP_COPY. For our
  purposes this means there are now three different interfaces which must be
  handled. A zpl_bdi_setup_and_register() wrapper function has been introduced
  to provide a single interface to the ZPL code.

  * 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported.
  * 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments.
  * 4.0 - x.y, bdi_setup_and_register() takes 2 arguments.

  I've also taken this opportunity to remove HAVE_BDI because kernels older
  than 2.6.32 are no longer supported. All kernels newer than this will have
  one of the above interfaces.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #3128

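  A hedged sketch of what such a wrapper can look like; the HAVE_* configure
  macro names and the fallback helper are placeholders for illustration, not
  necessarily the names used by the ZoL configure checks:

      static int
      zpl_bdi_setup_and_register(struct backing_dev_info *bdi, char *name)
      {
      #if defined(HAVE_BDI_SETUP_AND_REGISTER_3ARGS)
          /* 2.6.34 - 3.19: capabilities argument still present */
          return (bdi_setup_and_register(bdi, name, BDI_CAP_MAP_COPY));
      #elif defined(HAVE_BDI_SETUP_AND_REGISTER_2ARGS)
          /* 4.0 and newer: capabilities argument removed */
          return (bdi_setup_and_register(bdi, name));
      #else
          /* 2.6.32 - 2.6.33: symbol not exported, registration is open-coded */
          return (zpl_bdi_setup_and_register_impl(bdi, name));
      #endif
      }
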
* Use MUTEX_FSTRANS mutex type (Brian Behlendorf, 2015-03-03, 2 files, -2/+2)
  There are regions in the ZFS code where it is desirable to be able to set
  PF_FSTRANS while a specific mutex is held. The ZFS code could be updated to
  set/clear this flag in all the correct places, but this is undesirable for a
  few reasons.

  1) It would require changes to a significant amount of the ZFS code. This
     would complicate applying patches from upstream.

  2) It would be easy to accidentally miss a critical region in the initial
     patch or to have a future change introduce a new one.

  Both of these concerns can be addressed by using a new mutex type which is
  responsible for managing PF_FSTRANS, support for which was added to the SPL
  in commit zfsonlinux/spl@9099312 - Merge branch 'kmem-rework'.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #3050
  Closes #3055
  Closes #3062
  Closes #3132
  Closes #3142
  Closes #2983

* Fix deadlock between zpool export and zfs list (Isaac Huang, 2015-03-02, 1 file, -19/+20)
  The pool reference count is NOT checked in spa_export_common() if the pool
  has been imported readonly=on, i.e. spa->spa_sync_on is FALSE. Then zpool
  export and zfs list may deadlock:

  1. Pool A is imported readonly.
  2. zpool export A and zfs list are run concurrently.
  3. The zfs command gets a reference on the spa, which holds a dbuf on the
     MOS meta dnode.
  4. The zpool command grabs spa_namespace_lock, and tries to evict dbufs of
     the MOS meta dnode. The dbuf held by the zfs command can't be evicted as
     its reference count is not 0.
  5. The zpool command blocks in dnode_special_close() waiting for the MOS
     meta dnode reference count to drop to 0, with spa_namespace_lock held.
  6. The zfs command tries to get the spa_namespace_lock with a reference on
     the spa held, which holds a dbuf on the MOS meta dnode.
  7. Now the zpool command and the zfs command deadlock each other. Also, any
     further zfs/zpool command will block on spa_namespace_lock forever.

  The fix is to always check the pool reference count in spa_export_common(),
  no matter whether the pool was imported readonly or not.

  Signed-off-by: Isaac Huang <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2034

* Prevent "zpool destroy|export" when suspended (Brian Behlendorf, 2015-03-02, 1 file, -2/+2)
  Cleanly destroying or exporting a pool requires that the pool not be
  suspended. Therefore, set the POOL_CHECK_SUSPENDED flag for these ioctls so
  the utilities will output a descriptive error message rather than block.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #2878

* Retire spl_module_init()/spl_module_fini() (Brian Behlendorf, 2015-02-24, 6 files, -34/+58)
  In the original implementation of the SPL, wrappers were provided for module
  initialization and cleanup. This was done to abstract away any compatibility
  code which might be needed for the SPL. As it turned out, the only
  significant compatibility issue was that the default pwd during module load
  differed under Illumos and Linux. Since this is such a minor thing and the
  wrappers complicate the code, they are being retired.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2985

* Fix O_APPEND open(2) flag (Brian Behlendorf, 2015-02-24, 1 file, -0/+3)
  As described in the flags section of open(2):

    O_APPEND
      The file is opened in append mode. Before each write(2), the file
      offset is positioned at the end of the file, as if with lseek(2).
      O_APPEND may lead to corrupted files on NFS filesystems if more than
      one process appends data to a file at once. This is because NFS does
      not support appending to a file, so the client kernel has to simulate
      it, which can't be done without a race condition.

  This issue was originally overlooked because normally the generic VFS code
  handles this for a filesystem. However, because ZFS explicitly registers a
  zpl_write() function, it's responsible for the seek.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3124

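  A small standalone demonstration of the O_APPEND semantics being restored
  here (plain userspace C, not ZFS code): even after an explicit lseek() back
  to the start, every write() on an O_APPEND descriptor lands at the end of
  the file.

      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int
      main(void)
      {
          char buf[32] = { 0 };
          int fd = open("append_demo.txt", O_CREAT | O_TRUNC | O_WRONLY, 0644);

          write(fd, "hello", 5);
          close(fd);

          fd = open("append_demo.txt", O_WRONLY | O_APPEND);
          lseek(fd, 0, SEEK_SET);    /* ignored by the next write */
          write(fd, " world", 6);    /* offset repositioned to EOF first */
          close(fd);

          fd = open("append_demo.txt", O_RDONLY);
          read(fd, buf, sizeof (buf) - 1);
          close(fd);
          printf("%s\n", buf);       /* prints "hello world" */
          return (0);
      }
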
* Set zfs_autoimport_disable default value to 1 (Dan Swartzendruber, 2015-02-17, 1 file, -1/+3)
  When the ZFS kernel modules are loaded they should not populate the spa
  namespace using the cache file. This behavior isn't consistent with other
  Linux kernel modules and we need to move away from it. Removing this makes
  the whole startup process predictable, with four basic steps which are
  driven by the init system:

  1) modprobe
  2) zpool import
  3) zfs mount
  4) zfs share

  This change also helps lay the groundwork for eventually removing the kobj_*
  compatibility code on the kernel side. It may need to be preserved in
  userspace because libzfs_init() depends on it. This is why the conditional
  must be wrapped with an #ifdef _KERNEL.

  Signed-off-by: Dan Swartzendruber <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2820

* Skip bad DVAs during free by setting zfs_recover=1 (Brian Behlendorf, 2015-02-13, 1 file, -6/+4)
  When a bad DVA is encountered in metaslab_free_dva() the system should treat
  it as fatal. This indicates that somehow a damaged DVA was written to disk
  and that should be impossible. However, we have seen a handful of reports
  over the years of pools somehow being damaged in this way. Since this damage
  can render otherwise intact pools unimportable, and the consequence of
  skipping the bad DVA is only leaked free space, it makes sense to provide a
  mechanism to ignore the bad DVA. Setting the zfs_recover=1 module option
  will cause the DVA to be ignored, which may allow the pool to be imported.
  Since zfs_recover=0 by default, any pool attempting to free a bad DVA will
  treat it as a fatal error, preserving the current behavior.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3099
  Issue #3090
  Issue #2720

* Fix readdir for .zfs/snapshot directory (Andrey Vesnovaty, 2015-02-10, 1 file, -2/+5)
  dmu_snapshot_list_next() stores the index of the next snapshot entry into
  its offp argument, which zpl_snapdir_iterate() then uses for dir_emit().
  This results in an off-by-one error, so a temporary variable should be used
  instead. This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.

  Signed-off-by: Andrey Vesnovaty <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2930

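  A self-contained illustration of the pattern behind this fix (generic C, not
  the ZoL code): the iterator reports the index of the next entry through its
  offset argument, so the caller must keep the current entry's offset in a
  temporary rather than letting the iterator overwrite the value it is about
  to emit.

      #include <stdio.h>
      #include <string.h>

      /* Toy iterator: copies out the entry at *offp, then stores the index
       * of the *next* entry back into *offp.  Names are illustrative. */
      static int
      snap_list_next(const char **names, int count, char *out, unsigned long *offp)
      {
          if (*offp >= (unsigned long)count)
              return (-1);
          strcpy(out, names[*offp]);
          *offp += 1;                        /* now holds the next entry's index */
          return (0);
      }

      int
      main(void)
      {
          const char *snaps[] = { "snap0", "snap1", "snap2" };
          char name[16];
          unsigned long pos = 0;

          for (;;) {
              unsigned long next = pos;      /* temporary for the iterator */

              if (snap_list_next(snaps, 3, name, &next) != 0)
                  break;
              /* Emit the entry under its own offset; handing "next" to the
               * consumer here would tag each entry with its successor's
               * offset, the off-by-one this commit fixes. */
              printf("emit %s at offset %lu\n", name, pos);
              pos = next;                    /* advance only after emitting */
          }
          return (0);
      }
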
* Retire zio_cons()/zio_dest() (Brian Behlendorf, 2015-02-10, 1 file, -66/+15)
  The zio_cons() constructor and zio_dest() destructor don't exist in the
  upstream Illumos code. They were introduced as a workaround to avoid issue
  #2523. Since this issue has now been resolved, this code is being reverted
  to bring ZoL back in sync with Illumos.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Issue #3063

* Retire zio_bulk_flags (Brian Behlendorf, 2015-02-10, 1 file, -8/+3)
  Long ago the zio_bulk_flags module parameter was introduced to facilitate
  debugging and profiling the zio_buf_caches. Today this code works well and
  there's no compelling reason to keep this functionality. In fact it's
  preferable to revert this so the code is more consistent with other ZFS
  implementations.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Issue #3063

* Linux 3.19 compat: file_inode was added (Jörg Thalheim, 2015-02-10, 1 file, -2/+2)
  Direct struct access to f->f_dentry->d_inode was replaced by the accessor
  function file_inode(f).

  Signed-off-by: Joerg Thalheim <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3084

* Use vmem_alloc() for nvlists (Brian Behlendorf, 2015-02-10, 3 files, -8/+9)
  Several of the nvlist functions may perform allocations larger than the 32k
  warning threshold. Convert them to use vmem_alloc() so the best allocator is
  used.

  Commit efcd79a retired KM_NODEBUG, which was used to suppress large
  allocation warnings. Concurrently the large allocation warning threshold was
  increased from 8k to 32k. The goal was to identify the remaining locations,
  such as this one, where the allocation can be larger than 32k. This patch is
  part of the expected fine tuning resulting from the kmem-rework changes; see
  commit 6e9710f.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3057
  Closes #3079
  Closes #3081

* Revert "Don't read space maps during import for readonly pools" (Brian Behlendorf, 2015-02-09, 2 files, -9/+2)
  This reverts commit 7fc8c33ede10f7104ca0e91d690d3ebb5236887b which
  accidentally introduced a ztest failure:

    ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2
    child exited with code 3

  Signed-off-by: Brian Behlendorf <[email protected]>

* Don't read space maps during import for readonly pools (Brian Behlendorf, 2015-02-09, 2 files, -2/+9)
  Normally when importing a pool the space maps for all top level vdevs are
  read from disk. The space maps will be required later when an allocation is
  performed and free blocks need to be located. However, if the pool is
  imported readonly then we are guaranteed that no allocations can occur. In
  this case the space maps need not be loaded. A similar argument can be made
  for the DTLs (dirty time logs).

  Because a pool import will fail if the space maps cannot be read, the
  ability to safely ignore them makes it more likely that a damaged pool can
  be imported readonly to recover its contents.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #2831

* Illumos 5311 - traverse_dnode may report success when it should not (Justin T. Gibbs, 2015-02-06, 1 file, -1/+1)
  5311 traverse_dnode may report success when it should not
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Andriy Gapon <[email protected]>
  Reviewed by: Will Andrews <[email protected]>
  Approved by: Dan McDonald <[email protected]>
  References:
    https://github.com/illumos/illumos-gate/commit/2a89c2c
    https://www.illumos.org/issues/5311
  Ported by: DHE <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2970

* Fix SA header size accounting (Ned Bass, 2015-02-06, 1 file, -41/+44)
  The functions sa_find_sizes() and sa_build_layout() fail to account for the
  additional 2 bytes of SA header space when calculating whether a variable
  size attribute might spill over. They may consequently determine that an
  attribute will fit in the bonus buffer along with a spill block pointer,
  when in reality the attribute would be partially overwritten by the spill
  block pointer if spill over occurs. This also causes an inconsistency
  between the SA header size and the number of variable size attributes in the
  layout, tripping an assertion when debugging is on. The following reproducer
  demonstrates the problem:

    ln -s $(perl -e 'print "z" x 20') file
    setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file

  Even though sa_find_sizes() computes the index of the attribute where
  spill-over will occur, sa_build_layouts() discards the result and recomputes
  it itself. As it turns out, both functions get it wrong. Since this
  computation is awkward and, as history has shown, easy to screw up, let's
  just do it in one place. This patch fixes the bug in sa_find_sizes() and
  updates sa_build_layout() to use the result computed there. Also improve the
  comments in sa_find_sizes().

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #3070

* Skip evicting dbufs when walking the dbuf hash (Brian Behlendorf, 2015-02-06, 1 file, -3/+5)
  When a dbuf is in the DB_EVICTING state it may no longer be on the dn_dbufs
  list, in which case it's unsafe to call DB_DNODE_ENTER. Therefore, any dbuf
  found in this state must be skipped.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2553
  Closes #2495

* Spurious ENOMEM returns when reading dbufs kstat (Tim Chase, 2015-02-04, 1 file, -2/+6)
  Commit 7b2d78a046aa4695d434478a439a9438521d73af fixed some improper uses of
  snprintf(); however, in __dbuf_stats_hash_table_data() the return value of
  snprintf() is propagated to the caller. This caused spurious ENOMEM errors
  when reading the dbufs kstat. This commit causes the actual number of
  characters written to be returned.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3072

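  A standalone illustration of the snprintf() pitfall involved (plain
  userspace C, not the kstat code): the return value is the length the output
  would have had with an unbounded buffer, not the number of characters
  actually stored, so it must be clamped before being reported as the amount
  written.

      #include <stdio.h>

      int
      main(void)
      {
          char buf[8];
          int ret = snprintf(buf, sizeof (buf), "%s", "a rather long line");
          size_t written = (ret < 0) ? 0 :
              ((size_t)ret >= sizeof (buf) ? sizeof (buf) - 1 : (size_t)ret);

          printf("snprintf returned %d, actually wrote %zu (\"%s\")\n",
              ret, written, buf);
          return (0);
      }
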
* fix l2arc compression buffers leak (avg, 2015-02-03, 1 file, -10/+55)
  Commit log from FreeBSD:

  We have observed that arc_release() can be called concurrently with an l2arc
  in-flight write. Also, we have observed that arc_hdr_destroy() can be called
  from arc_write_done() for a zio with the ZIO_FLAG_IO_REWRITE flag in similar
  circumstances.

  Previously the l2arc headers would be freed while leaking their associated
  compression buffers. Now the buffers are placed on the l2arc_free_on_write
  list for delayed freeing. This is similar to what was already done to arc
  buffers that were supposed to be freed concurrently with in-flight writes of
  those buffers.

  In addition to fixing the discovered leaks, this change also adds some
  protective code to assert that a compression buffer associated with an l2arc
  header is never leaked. A new kstat l2_cdata_free_on_write is added. It
  keeps a count of delayed compression buffer frees which previously would
  have been leaks.

  Tested by: Vitalij Satanivskij <[email protected]> et al
  Requested by: many
  MFC after: 2 weeks
  Sponsored by: HybridCluster / ClusterHQ
  References:
    https://illumos.org/issues/5222
    https://github.com/freebsd/freebsd/commit/b98f85d
    http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781
    http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html
    http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html
  Ported-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3029

* Use zio buffers in zil_itx_create() (Brian Behlendorf, 2015-02-02, 1 file, -10/+6)
  The zil_itx_create() function uses the vmem_alloc() allocator for its
  buffers because when logging a write that buffer may be as large as 64K.
  This is non-optimal because we may need to allocate many of these buffers
  and this interface has the potential to be slow. Instead, use
  zio_data_buf_alloc() which is specifically designed to efficiently allocate
  a wide range of buffer sizes.

  In addition, do some cleanup and use the zil_itx_destroy() function to
  always free an itx structure. This way we're always sure the right
  allocation functions are used. Notice that in the current code kmem_free()
  and vmem_free() were both used; this happened to work because these wrappers
  map to the same internal SPL function.

  This was identified as a potential problem when a low-end memory constrained
  system began logging the following warnings. There was no deadlock here,
  just repeated allocation failures resulting in increased latency.

    Possible memory allocation deadlock: size=65792 lflags=0x42d0
    Pid: 20118, comm: kvm Tainted: P O 3.2.0-0.bpo.4-amd64
    Call Trace:
     [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl]
     [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl]
     [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs]
     [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs]
     [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs]
     [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs]
     [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde
     [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125
     [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a
     [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #3059

* Handle closing an unopened ZVOL (Brian Behlendorf, 2015-01-30, 1 file, -5/+5)
  Thanks to commit a4430fce691d492aec382de0dfa937c05ee16500 we're now
  correctly returning EROFS when opening a zvol on a read-only pool.
  Unfortunately, it looks like this causes us to trigger some unexpected
  behavior by __blkdev_get(). In the failure case it's possible __blkdev_get()
  will call __blkdev_put() for a bdev which was never successfully opened.
  This results in us trying to close the device again and hitting the NULL
  dereference.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1343

* Add zvol_open() error handling for readonly property (Brian Behlendorf, 2015-01-30, 1 file, -1/+4)
  Rather than ASSERT when, for some reason, the readonly property of a zvol
  can't be read, cleanly handle the failure.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #1343

* Fix removal of SA in sa_modify_attrs() (Tim Chase, 2015-01-21, 1 file, -3/+1)
  The sa_modify_attrs() function can add, remove or replace an SA. The main
  loop in the function uses the index "i" to iterate over the existing SAs and
  uses the index "j" for writing them into a new buffer via
  SA_ADD_BULK_ATTR(). The write index "j" is incremented on remove (SA_REMOVE)
  operations, which leads to corruption in the new SA buffer. This patch
  removes the increment for SA_REMOVE operations.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Closes #3028

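  The write-index rule at the heart of this fix, shown on a generic array
  rebuild (standalone C, not the SA code): the destination index must advance
  only for entries that are kept; bumping it on removals leaves stale slots in
  the new buffer, which is the corruption described above.

      #include <stdio.h>

      int
      main(void)
      {
          int attrs[] = { 10, 20, 30, 40 };
          int keep[]  = {  1,  0,  1,  1 };   /* attrs[1] is being removed */
          int out[4];
          int i, j = 0;

          for (i = 0; i < 4; i++) {
              if (!keep[i])
                  continue;                   /* do NOT advance j on removal */
              out[j] = attrs[i];
              j++;
          }

          for (i = 0; i < j; i++)
              printf("out[%d] = %d\n", i, out[i]);
          return (0);
      }
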
* Use kmem_vasprintf() in log_internal() (Richard Yao, 2015-01-21, 1 file, -10/+2)
  An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
  simplified by using kmem_asprintf(). It is not clear that switching to
  kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
  kmem_asprintf() is cleanup that simplifies debugging such that it would
  become clear that this is a bug in glibc should the issue persist. It also
  brings this function almost back in sync with Illumos.

  This was possible due to the recently reworked kmem code, which allows us to
  use KM_SLEEP in the same fashion as Illumos.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2791
  Issue #2781

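  The same simplification in userspace terms (a standalone analogue using
  libc's vasprintf() rather than the SPL's kmem_vasprintf(); the message
  format below is made up): formatting straight into a freshly allocated
  string replaces the separate size/allocate/format steps.

      #define _GNU_SOURCE
      #include <stdarg.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Build a log message in one step instead of sizing a buffer first
       * and formatting into it afterwards. */
      static char *
      make_message(const char *fmt, ...)
      {
          va_list ap;
          char *msg = NULL;

          va_start(ap, fmt);
          if (vasprintf(&msg, fmt, ap) < 0)
              msg = NULL;
          va_end(ap);
          return (msg);
      }

      int
      main(void)
      {
          char *msg = make_message("txg %llu: %s", 1234ULL, "internal event");

          if (msg != NULL) {
              printf("%s\n", msg);
              free(msg);
          }
          return (0);
      }
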