aboutsummaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Switch KM_SLEEP to KM_PUSHPAGEBrian Behlendorf2012-09-171-1/+1
| | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* ZFS replay transaction error 5Cyril Plisko2012-09-171-8/+7
| | | | | | | | | | | | | | | | | | | | When zfs_replay_write() replays TX_WRITE records from ZIL it calls zpl_write_common() to perform the actual write. zpl_write_common() returns the number of bytes written (similar to write() system call) or an (negative) error. However, the code expects the positive return value to be a residual counter. Thus when zpl_write_common() successfully completes it is mistakenly considered to be a partial write and the error code delivered further. At this point the ZIL processing is aborted with famous "ZFS replay transaction error 5" error message given to the message buffer. The fix is to compare the zpl_write_commmon() return value with the buffer size and flag error only when they disagree. Signed-off-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #933
* Clear PG_writeback for sync I/O error caseBrian Behlendorf2012-09-141-0/+9
| | | | | | | | | | | | | | | Commit 2b2861362f7dd09cc3167df8fddb6e2cb475018a accidentally introduced this issue by only conditionally registering the commit callback in the async case. The error handing code for the dmu_tx_assign() failure case relied on there always being a registered commit callback to clear the PG_writeback bit. Since that is no longer strictly true for the synchronous case we must explicitly invoke the callback. Signed-off-by: Brian Behlendorf <[email protected]> Closes #961
* Move iput() after zfs_inode_update()Brian Behlendorf2012-09-121-2/+2
| | | | | | | | | | | | | | | When replaying an unlink/remove operation via zfs_rmdir() the object being removed will be instantiated by a call to zfs_dirent_lock(). This means that there is a single reference protecting the object. Right before the call to zfs_inode_update() this reference is dropped which may cause the object to be destroyed. This will result in a NULL dereference as shown by the stack trace is issue #782. This likely isn't an issue during normal operation because there is always an additional reference held on the object by the VFS. Signed-off-by: Brian Behlendorf <[email protected]> Closes #782
* Remove zvol device nodeBrian Behlendorf2012-09-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | The 'zfs destroy' changes in 330d06f disrupted how zvol devices get removed on ZoL. However, it basically boils down to the fact that we are no longer reliably calling zvol_remove_minor() via zfs_ioc_destroy_snaps(). Therefore we add the missing call and handle things similarly to the existing zfs_unmount_snap() case. Ideally we would check if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the right thing as in zfs_ioc_destroy(). However, it looks like it would be fairly expensive to get the type, and it's harmless to simply attempt the umount and minor removal. This is also an issue in the latest FreeBSD and Illumos code. It was being tracked under the following issue, and we may want to refresh our code when they settle on what they want to do about it upstream. https://www.illumos.org/issues/3170 Signed-off-by: Brian Behlendorf <[email protected]> Issue #903
* Make ZFS filesystem id persistent across different machinesCyril Plisko2012-09-061-2/+4
| | | | | | | | | Use ZFS dataset fsid guid as a unique file system id, similar to what is done on Illumos/OpenSolaris. Signed-off-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #888
* Disable page allocation warnings for ARC buffersBrian Behlendorf2012-09-061-2/+3
| | | | | | | | | | | | | | | | | | | | | Buffers for the ARC are normally backed by the SPL virtual slab. However, if memory is low, AND no slab objects are available, AND a new slab cannot be quickly constructed a new emergency object will be directly allocated. These objects can be as large as order 5 on a system with 4k pages. And because they are allocated with KM_PUSHPAGE, to avoid a potential deadlock, they are not allowed to initiate I/O to satisfy the allocation. This can result in the occasional allocation failure. However, since these allocations are allowed to block and perform operations such as memory compaction they will eventually succeed. Since this is not unexpected (just unlikely) behavior this patch disables the warning for the allocation failure. Signed-off-by: Brian Behlendorf <[email protected]> Issue #465
* Switch KM_SLEEP to KM_PUSHPAGEBrian Behlendorf2012-09-052-3/+3
| | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* Switch KM_SLEEP to KM_PUSHPAGEBrian Behlendorf2012-09-041-2/+2
| | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* Switch KM_SLEEP to KM_PUSHPAGEBrian Behlendorf2012-09-041-1/+1
| | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* Switch KM_SLEEP to KM_PUSHPAGEChris Dunlop2012-09-022-6/+6
| | | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* Switch KM_SLEEP to KM_PUSHPAGEBrian Behlendorf2012-08-311-2/+2
| | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* Clear PG_writeback after zil_commit() for sync I/OBrian Behlendorf2012-08-301-3/+8
| | | | | | | | | | | | | | | | | When writing via ->writepage() the writeback bit was always cleared as part of the txg commit callback. However, when the I/O is also being written synchronsously to the zil we can immediately clear this bit. There is no need to wait for the subsequent TXG sync since the data is already safe on stable storage. This has been observed to reduce the msync(2) delay from up to 5 seconds down 10s of miliseconds. One workload which is expected to benefit from this are the intermittent samba hands described in issue #700. Signed-off-by: Brian Behlendorf <[email protected]> Closes #700 Closes #907
* Switch KM_SLEEP to KM_PUSHPAGERichard Yao2012-08-2741-170/+172
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: 21ade34 Disable direct reclaim for z_wr_* threads cfc9a5c Fix zpl_writepage() deadlock eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* mzap_upgrade() must use kmem_alloc()Brian Behlendorf2012-08-271-3/+3
| | | | | | | | | | These allocations in mzap_update() used to be kmem_alloc() but were changed to vmem_alloc() due to the size of the allocation. However, since it turns out this function may be called in the context of the txg_sync thread they must be changed back to use a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored. Signed-off-by: Brian Behlendorf <[email protected]>
* Annotate KM_PUSHPAGE call paths with PF_NOFSBrian Behlendorf2012-08-273-5/+42
| | | | | | | | | | | | | | | | | The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard() call paths must only use KM_PUSHPAGE to avoid potential deadlocks during direct reclaim. This patch annotates these call paths so any accidental use of KM_SLEEP will be quickly detected. In the interest of stability if debugging is disabled the offending allocation will have its GFP flags automatically corrected. When debugging is enabled any misuse will be treated as a fatal error. This patch is entirely for debugging. We should be careful to NOT become dependant on it fixing up the incorrect allocations. Signed-off-by: Brian Behlendorf <[email protected]>
* Pre-allocate vdev I/O buffersBrian Behlendorf2012-08-272-2/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vdev queue layer may require a small number of buffers when attempting to create aggregate I/O requests. Rather than attempting to allocate them from the global zio buffers, which is slow under memory pressure, it makes sense to pre-allocate them because... 1) These buffers are short lived. They are only required for the life of a single I/O at which point they can be used by the next I/O. 2) The maximum number of concurrent buffers needed by a vdev is small. It's roughly limited by the zfs_vdev_max_pending tunable which defaults to 10. By keeping a small list of these buffer per-vdev we can ensure one is always available when we need it. This significantly reduces contention on the vq->vq_lock, because we no longer need to perform a slow allocation under this lock. This is particularly important when memory is already low on the system. It would probably be wise to extend the use of these buffers beyond aggregate I/O and in to the raidz implementation. The inability to quickly allocate buffer for the parity stripes could result in similiar problems. Signed-off-by: Brian Behlendorf <[email protected]>
* Revert Disable direct reclaim for z_wr_* threadsRichard Yao2012-08-271-6/+3
| | | | | | | | | | | | | | This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0ccf1fdc2ce852271d4d2f8b7a9c2c4be6db eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* Revert Fix zpl_writepage() deadlockRichard Yao2012-08-271-14/+1
| | | | | | | | | | | | | | | | | | | | | The commit, cfc9a5c88f91f7b4d606fce89505e1f404691ea5, to fix deadlocks in zpl_writepage() relied on PF_MEMALLOC. That had the effect of disabling the direct reclaim path on all allocations originating from calls to this function, but it failed to address the actual cause of those deadlocks. This led to the same deadlocks being observed with swap on zvols, but not with swap on the loop device, which exercises this code. The use of PF_MEMALLOC also had the side effect of permitting allocations to be made from ZONE_DMA in instances that did not require it. This contributes to the possibility of panics caused by depletion of pages from ZONE_DMA. As such, we revert this patch in favor of a proper fix for both issues. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* Revert Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))Richard Yao2012-08-271-13/+0
| | | | | | | | | | | Commit eec8164771bee067c3cd55ed0a16dadeeba276de worked around an issue involving direct reclaim through the use of PF_MEMALLOC. Since we are reworking thing to use KM_PUSHPAGE so that swap works, we revert this patch in favor of the use of KM_PUSHPAGE in the affected areas. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* rmdir(2) should return ENOTEMPTYBrian Behlendorf2012-08-261-3/+3
| | | | | | | | | | | | | | | Under Solaris the behavior for rmdir(2) is to return EEXIST when a directory still contains entries. However, on Linux ENOTEMPTY is the expected return value with EEXIST being technically allowed. According to rmdir(2): ENOTEMPTY pathname contains entries other than . and .. ; or, pathname has .. as its final component. POSIX.1-2001 also allows EEXIST for this condition. Signed-off-by: Brian Behlendorf <[email protected]> Closes #895
* Illumos #3085: zfs diff panics, then panics in a loop on bootingChristopher Siden2012-08-251-4/+5
| | | | | | | | | | | Reviewed by: Matt Ahrens <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/3085 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #2901: zfs receive fails for exabyte sparse filesSimon Klinkert2012-08-251-0/+3
| | | | | | | | | | | Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/2901 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Drop spill buffer referenceJaven Wu2012-08-251-2/+22
| | | | | | | | | | | | | | | | | | | When calling sa_update() and friends it is possible that a spill buffer will be needed to accomidate the update. When this happens a hold is taken on the new dbuf and that hold must be released before calling dmu_tx_commit(). Failing to release the hold will cause a copy of the dbuf to be made in dbuf_sync_leaf(). This is done to ensure further updates to the dbuf never sneak in to the syncing txg. This could be left to the sa_update() caller. But then the caller would need to be aware of this internal SA implementation detail. It is therefore preferable to handle this all internally in the SA implementation. Signed-off-by: Brian Behlendorf <[email protected]> Closes #503 Closes #513
* Revert "Use SA_HDL_PRIVATE for SA xattrs"Brian Behlendorf2012-08-251-28/+5
| | | | | | | | | | | | This reverts commit ec2626ad3f695a2ced3946c4197ef64cbcac4959 which caused consistency problems between the shared and private handles. Reverting this change should resolve issues #709 and #727. It will also reintroduce an arc_anon memory leak which is addressed by the next commit. Signed-off-by: Brian Behlendorf <[email protected]> Closes #709 Closes #727
* Wrap smp_processor_id in kpreempt_[dis|en]ablePrakash Surya2012-08-242-2/+18
| | | | | | | | | | | | | After surveying the code, the few places where smp_processor_id is used were deemed to be safe to use with a preempt enabled kernel. As such, no core logic had to be changed. These smp_processor_id call sites are simply are wrapped in kpreempt_disable and kpreempt_enabled to prevent the Linux kernel from emitting scary warnings. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #83
* Export dmu_buf_rele() symbolBrian Behlendorf2012-08-141-0/+1
| | | | | | | | | While I'd like to remove the various pragmas in module/zfs/dbuf.c. There are consumers such as Lustre which still depend on dmu_buf_* versions of the symbols. Until all consumers can be converted to use only the dbuf_* names leave this symbol exported. Signed-off-by: Brian Behlendorf <[email protected]>
* Suppress 'zfs_sb_create' memory warningBrian Behlendorf2012-08-101-1/+1
| | | | | | | | | | When mutex debugging is enabled in your kernel the increased size of the mutex structures can push the zfs_sb_t type beyond the 8k warning threshold. This isn't harmful so we suppress the warning for this case. Signed-off-by: Brian Behlendorf <[email protected]> Closes #628
* Export dbuf_* symbolsBrian Behlendorf2012-08-101-1/+33
| | | | | | | | | | | | | Export these symbols so they may be used by other ZFS consumers besides the ZPL. Remove three stale prototype definites from dbuf.h. The actual implementations of these functions were removed/renamed long ago. It would be good in the long term to remove the existing pragmas we inherited from Solaris and simply use the dbuf_* names. Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #1693: persistent 'comment' field for a zpoolDan McDonald2012-08-083-1/+54
| | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Eric Schrock <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/1693 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #678
* Set zvol discard_granularity to the volblocksize.Etienne Dechamps2012-08-071-0/+1
| | | | | | | | | | | | | | | | | | Currently, zvols have a discard granularity set to 0, which suggests to the upper layer that discard requests of arbirarily small size and alignment can be made efficiently. In practice however, ZFS does not handle unaligned discard requests efficiently: indeed, it is unable to free a part of a block. It will write zeros to the specified range instead, which is both useless and inefficient (see dnode_free_range). With this patch, zvol block devices expose volblocksize as their discard granularity, so the upper layer is aware that it's not supposed to send discard requests smaller than volblocksize. Signed-off-by: Brian Behlendorf <[email protected]> Closes #862
* Limit the number of blocks to discard at once.Etienne Dechamps2012-07-311-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | The number of blocks that can be discarded in one BLKDISCARD ioctl on a zvol is currently unlimited. Some applications, such as mkfs, discard the whole volume at once and they use the maximum possible discard size to do that. As a result, several gigabytes discard requests are not uncommon. Unfortunately, if a large amount of data is allocated in the zvol, ZFS can be quite slow to process discard requests. This is especially true if the volblocksize is low (e.g. the 8K default). As a result, very large discard requests can take a very long time (seconds to minutes under heavy load) to complete. This can cause a number of problems, most notably if the zvol is accessed remotely (e.g. via iSCSI), in which case the client has a high probability of timing out on the request. This patch solves the issue by adding a new tunable module parameter: zvol_max_discard_blocks. This indicates the maximum possible range, in zvol blocks, of one discard operation. It is set by default to 16384 blocks, which appears to be a good tradeoff. Using the default volblocksize of 8K this is equivalent to 128 MB. When using the maximum volblocksize of 128K this is equivalent to 2 GB. Signed-off-by: Brian Behlendorf <[email protected]> Closes #858
* Illumos #1644, #1645, #1646, #1647, #1708Matthew Ahrens2012-07-3110-127/+594
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1644 add ZFS "clones" property 1645 add ZFS "written" and "written@..." properties 1646 "zfs send" should estimate size of stream 1647 "zfs destroy" should determine space reclaimed by destroying multiple snapshots 1708 adjust size of zpool history data References: https://www.illumos.org/issues/1644 https://www.illumos.org/issues/1645 https://www.illumos.org/issues/1646 https://www.illumos.org/issues/1647 https://www.illumos.org/issues/1708 This commit modifies the user to kernel space ioctl ABI. Extra care should be taken when updating to ensure both the kernel modules and utilities are updated. This change has reordered all of the new ioctl()s to the end of the list. This should help minimize this issue in the future. Reviewed by: Richard Lowe <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Albert Lee <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #826 Closes #664
* Add script for builtin module building.Etienne Dechamps2012-07-267-31/+18
| | | | | | | | | | | | | | | | | This commit introduces a "copy-builtin" script designed to prepare a kernel source tree for building ZFS as a builtin module. The script makes a full copy of all needed files, thus making the kernel source tree fully independent of the zfs source package. To achieve that, some compilation flags (-include, -I) have been moved to module/Makefile. This Makefile is only used when compiling external modules; when compiling builtin modules, a Kbuild file generated by the configure-builtin script is used instead. This makes sure Makefiles inside the kernel source tree does not contain references to the zfs source package. Signed-off-by: Brian Behlendorf <[email protected]> Issue #851
* Linux 3.5 compat, end_writeback() changed to clear_inode()Richard Yao2012-07-231-5/+10
| | | | | | | | | | | | | | | | The end_writeback() function was changed by moving the call to inode_sync_wait() earlier in to evict(). This effecitvely changes the ordering of the sync but it does not impact the details of the zfs implementation. However, as part of this change end_writeback() was renamed to clear_inode() to reflect the new semantics. This change does impact us and clear_inode() now maps to end_writeback() for kernels prior to 3.5. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #784
* Linux 3.5 compat, iops->truncate_range() removedRichard Yao2012-07-231-0/+4
| | | | | | | | | The vmtruncate_range() support has been removed from the kernel in favor of using the fallocate method in the file_operations table. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #784
* Linux 3.5 compat, eops->encode_fh() takes inodesRichard Yao2012-07-231-1/+6
| | | | | | | | | | | | | | The export_operations member ->encode_fh() has been updated to take both the child and parent inodes. This interface used to take the child dentry and a bool describing if the parent is needed. NOTE: While updating this code I noticed that we do not currently cleanly handle the case where we're passed a connectable parent. This code should be audited to make sure we're doing the right thing. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #784
* Disable .zfs directory on 32-bit systemsBrian Behlendorf2012-07-201-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | The .zfs control directory implementation currently relies on the fact that there is a direct 1:1 mapping from an object id to its inode number. This works well as long as the system uses a 64-bit value to store the inode number. Unfortunately, the Linux kernel defines the inode number as an 'unsigned long' type. This means that for 32-bit systems will only have 32-bit inode numbers but we still have 64-bit object ids. This problem is particularly acute for the .zfs directories which leverage those upper 32-bits. This is done to avoid conflicting with object ids which are allocated monotonically starting from 0. This is likely to also be a problem for datasets on 32-bit systems with more than ~2 billion files. The right long term fix must remove the simple 1:1 mapping. Until that's done the only safe thing to do is to disable the .zfs directory on 32-bit systems. Signed-off-by: Brian Behlendorf <[email protected]>
* Add ddt_object_load() error handlingBrian Behlendorf2012-07-201-1/+3
| | | | | | | | | Add the missing error handling to ddt_object_load(). There's no good reason this needs to be fatal. It is preferable that an error be returned. This will allow 'zpool import -FX' to safely attempt to rollback through previous txgs looking for a good one. Signed-off-by: Brian Behlendorf <[email protected]>
* Add 'inline' keywordBrian Behlendorf2012-07-191-4/+4
| | | | | | | | | | | | The '__attribute__((always_inline))' does not strictly imply 'inline'. Newer versions of gcc detect this misuse and issue the following warning. Including the missing 'inline' resolves the build warning. ./module/zfs/dsl_scan.c:758:1:error: always_inline function might not be inlinable [-Werror=attributes] Signed-off-by: Brian Behlendorf <[email protected]>
* Fix build failures on PaX/GRSecurity patched kernelsRichard Yao2012-07-172-27/+26
| | | | | | | | | | | | | | | | | | | | | | | Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a dialect of C that relies on a GCC plugin. In particular, struct file_operations has been marked do_const in the PaX/GRSecurity dialect, which causes GCC to consider all instances of it as const. This caused failures in the autotools checks and the ZFS source code. To address this, we modify the autotools checks to take into account differences between the PaX C dialect and the regular C dialect. We also modify struct zfs_acl's z_ops member to be a pointer to a function pointer table. Lastly, we modify zpl_put_link() to address a PaX change to the function prototype of nd_get_link(). This avoids compiler errors in the PaX/GRSecurity dialect. Note that the change in zpl_put_link() causes a warning that becomes a build failure when debugging is enabled. Fixing that warning requires ryao/spl@5ca50ef459c59bc74b7a7cd3af7311da2b1cd2c3. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #484
* Move partition scanning from userspace to module.Etienne Dechamps2012-07-171-2/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, zpool online -e (dynamic vdev expansion) doesn't work on whole disks because we're invoking ioctl(BLKRRPART) from userspace while ZFS still has a partition open on the disk, which results in EBUSY. This patch moves the BLKRRPART invocation from the zpool utility to the module. Specifically, this is done just before opening the device in vdev_disk_open() which is called inside vdev_reopen(). This requires jumping through some hoops to get to the disk device from the partition device, and to make sure we can still open the partition after the BLKRRPART call. Note that this new code path is triggered on dynamic vdev expansion only; other actions, like creating a new pool, are unchanged and still call BLKRRPART from userspace. This change also depends on API changes which are available in 2.6.37 and latter kernels. The build system has been updated to detect this, but there is no compatibility mode for older kernels. This means that online expansion will NOT be available in older kernels. However, it will still be possible to expand the vdev offline. Signed-off-by: Brian Behlendorf <[email protected]> Closes #808
* Illumos #1949, #1953George Wilson2012-07-112-6/+12
| | | | | | | | | | | | | | | | | | | | | | 1949 crash during reguid causes stale config 1953 allow and unallow missing from zpool history since removal of pyzfs Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Bill Pijewski <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Steve Gonczi <[email protected]> Approved by: Eric Schrock <[email protected]> References: https://www.illumos.org/issues/1949 https://www.illumos.org/issues/1953 Ported by: Brian Behlendorf <[email protected]> Closes #665
* Illumos #1748: desire support for reguid in zfsGarrett D'Amore2012-07-115-10/+83
| | | | | | | | | | | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: Alexander Eremin <[email protected]> Reviewed by: Alexander Stetsenko <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/1748 This commit modifies the user to kernel space ioctl ABI. Extra care should be taken when updating to ensure both the kernel modules and utilities are updated. If only the user space component is updated both the 'zpool events' command and the 'zpool reguid' command will not work until the kernel modules are updated. Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #665
* Update incorrect ddt_zap_lookup() assertionBrian Behlendorf2012-07-031-1/+1
| | | | | | | | | | | When the ddt_zap_lookup() function was updated to dynamically allocate memory for the cbuf variable, to save stack space, the 'csize <= sizeof (cbuf)' assertion was not updated. The result of this was that the size of the pointer was being used in the comparison rather than the buffer size. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Prakash Surya <[email protected]>
* Add ZIL statistics.Etienne Dechamps2012-06-291-2/+61
| | | | | | | | | | | | | | | | | The performance of the ZIL is usually the main bottleneck when dealing with synchronous, write-heavy workloads (e.g. databases). Understanding the behavior of the ZIL is required to diagnose performance issues for these workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly. This commit adds a new kstat page dedicated to the ZIL with some counters which, hopefully, scheds some light into what the ZIL is doing, and how it is doing it. Currently, these statistics are available in /proc/spl/kstat/zfs/zil. A description of the fields can be found in zil.h. Signed-off-by: Brian Behlendorf <[email protected]> Closes #786
* Speed up 'zfs list -t snapshot -o name -s name'Pawel Jakub Dawidek2012-06-141-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FreeBSD #xxx: Dramatically optimize listing snapshots when user requests only snapshot names and wants to sort them by name, ie. when executes: # zfs list -t snapshot -o name -s name Because only name is needed we don't have to read all snapshot properties. Below you can find how long does it take to list 34509 snapshots from a single disk pool before and after this change with cold and warm cache: before: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 525s warm cache: 218s after: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 1.7s warm cache: 1.1s NOTE: This patch only appears in FreeBSD. If/when Illumos picks up the change we may want to drop this patch and adopt their version. However, for now this addresses a real issue. Ported-by: Brian Behlendorf <[email protected]> Issue #450
* Add zvol_inhibit_dev module option.Darik Horn2012-06-131-0/+10
| | | | | | | | | | | | | | | | | ZoL can create more zvols at runtime than can be configured during system start, which hangs the init stack at reboot. When a slow system has more than a few hundred zvols, udev will fork bomb during system start and spend too much time in device detection routines, so upstart kills it. The zfs_inhibit_dev option allows an affected system to be rescued by skipping /dev/zd* creation and thereby avoiding the udev overload. All zvols are made inaccessible if this option is set, but the `zfs destroy` and `zfs send` commands still work, and ZFS filesystems can be mounted. Signed-off-by: Brian Behlendorf <[email protected]>
* Make zil_slog_limit a tunable module parameter.Etienne Dechamps2012-06-121-1/+4
| | | | | | | | | | | | | | | | | | zil_slog_limit specifies the maximum commit size to be written to the separate log device. Larger commits bypass the separate log device and go directly to the data devices. The optimal value for zil_slog_limit directly depends on the latency and throughput characteristics of both the separate log device and the data disks. Small synchronous writes are faster on low-latency separate log devices (e.g. SSDs) whereas large synchronous writes are faster on high-latency data disks (e.g. spindles) because of higher throughput, especially with a large array. The point is, the line between "small" and "large" synchronous writes in this scenario is heavily dependent on the hardware used. That's why it should be made configurable. Signed-off-by: Brian Behlendorf <[email protected]> Closes #783
* Linux 3.4 compat, d_make_root() replaces d_alloc_root()Richard Yao2012-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | torvalds/linux@adc0e91ab142abe93f5b0d7980ada8a7676231fe introduced introduced d_make_root() as a replacement for d_alloc_root(). Further commits appear to have removed d_alloc_root() from the Linux source tree. This causes the following failure: error: implicit declaration of function 'd_alloc_root' [-Werror=implicit-function-declaration] To correct this we update the code to use the current d_make_root() interface for readability. Then we introduce an autotools check to determine if d_make_root() is available. If it isn't then we define some compatibility logic which used the older d_alloc_root() interface. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #776