summaryrefslogtreecommitdiffstats
path: root/include/sys
Commit message (Collapse)AuthorAgeFilesLines
* Fix using zvol as slog deviceJorgen Lundman2012-12-182-1/+1
| | | | | | | | | | | | | | | During the original ZoL port the vdev_uses_zvols() function was disabled until it could be properly implemented. This prevented a zpool from use a zvol for its slog device. This patch implements that missing functionality by adding a zvol_is_zvol() function to zvol.c. Given the full path to a device it will lookup the device and verify its major number against the registered zvol major number for the system. If they match we know the device is a zvol. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1131
* Update SAs when an inode is dirtiedBrian Behlendorf2012-12-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Revert the portion of commit d3aa3ea which always resulted in the SAs being update when an mmap()'ed file was closed. That change accidentally resulted in unexpected ctime updates which upset tools like git. That was always a horrible hack and I'm happy it will never make it in to a tagged release. The right fix is something I initially resisted doing because I was worried about the additional overhead. However, in hindsight the overhead isn't as bad as I feared. This patch implemented the sops->dirty_inode() callback which is unsurprisingly called when an inode is dirtied. We leverage this callback to keep the znode SAs strictly in sync with the inode. However, for now we're going to go slowly to avoid introducing any new unexpected issues by only updating the atime, mtime, and ctime. This will cover the callpath of most concern to us. ->filemap_page_mkwrite->file_update_time->update_time-> mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode Signed-off-by: Brian Behlendorf <[email protected]> Closes #764 Closes #1140
* Linux 3.7 compat, schedule_delayed_work()Brian Behlendorf2012-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | Linux kernel commit d8e794d accidentally broke the delayed work APIs for non-GPL callers. While the APIs to schedule a delayed work item are still available to all callers, it is no longer possible to initialize the delayed work item. I'm cautiously optimistic we could get the delayed_work_timer_fn exported for all callers in the upstream kernel. But frankly the compatibility code to use this kernel interface has always been problematic. Therefore, this patch abandons direct use the of the Linux kernel interface in favor of the new delayed taskq interface. It provides roughly the same functionality as delayed work queues but it's a stable interface under our control. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1053
* Directory xattr znodes hold a reference on their parentBrian Behlendorf2012-12-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike normal file or directory znodes, an xattr znode is guaranteed to only have a single parent. Therefore, we can take a refernce on that parent if it is provided at create time and cache it. Additionally, we take care to cache it on any subsequent zfs_zaccess() where the parent is provided as an optimization. This allows us to avoid needing to do a zfs_zget() when setting up the SELinux security xattr in the create path. This is critical because a hash lookup on the directory will deadlock since it is locked. The zpl_xattr_security_init() call has also been moved up to the zpl layer to ensure TXs to create the required xattrs are performed after the create TX. Otherwise we run the risk of deadlocking on the open create TX. Ideally the security xattr should be fully constructed before the new inode is unlocked. However, doing so would require far more extensive changes to ZFS. This change may also have the benefitial side effect of ensuring xattr directory znodes are evicted from the cache before normal file or directory znodes due to the extra reference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #671
* Increase ZFS_OBJ_MTX_SZ to 256Brian Behlendorf2012-11-271-1/+1
| | | | | | | | | | | | | | | | | | Increasing this limit costs us 6144 bytes of memory per mounted filesystem, but this is small price to pay for accomplishing the following: * Allows for up to 256-way concurreny when performing lookups which helps performance when there are a large number of processes. * Minimizes the likelyhood of encountering the deadlock described in issue #1101. Because vmalloc() won't strictly honor __GFP_FS there is still a very remote chance of a deadlock. See the zfsonlinux/spl@043f9b57 commit. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1101
* Illumos #2671: zpool import should not fail if vdev ashift has increasedGeorge Wilson2012-11-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Gordon Ross <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Richard Lowe <[email protected]> Refererces to Illumos issue: https://www.illumos.org/issues/2671 This patch has been slightly modified from the upstream Illumos version. In the upstream implementation a warning message is logged to the console. To prevent pointless console noise this notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift" event. The event indicates a non-optimial (but entirely safe) ashift value was used to create the pool. Depending on your workload this may impact pool performance. Unfortunately, the only way to correct the issue is to recreate the pool with a new ashift. NOTE: The unrelated fix to the comment in zpool_main.c appears in the upstream commit and was preserved for consistnecy. Ported-by: Cyril Plisko <[email protected]> Reworked-by: Brian Behlendorf <[email protected]> Closes #955
* Log I/Os longer than zio_delay_max (30s default)Brian Behlendorf2012-11-021-0/+2
| | | | | | | | | | | | | | | | | | | | | | | There have been reports of ZFS deadlocking due to what appears to be a lost IO. This patch addes some debugging to determine the exact state of the IO which neither 1) completed, 2) failed, or 3) timed out after zio_delay_max (30) seconds. This information will be logged using the ZFS FMA infrastructure as a 'delay' event and posted to the internal zevent log. By default the last 64 events will be kept in the log but the limit is configurable via the zfs_zevent_len_max module option. To dump the contents of the log use the 'zpool events -v' command and look for the resource.fs.zfs.delay event. It will include various information about the pool, vdev, and zio which may shed some light on the issue. In the context of this change the 120 second kernel blocked thread watchdog has been disabled for synchronous IOs. Signed-off-by: Brian Behlendorf <[email protected]> Issue #930
* Add txgs-<pool> kstat fileBrian Behlendorf2012-11-021-0/+15
| | | | | | | | | | | | | | | | | | | | Create a kstat file which contains useful statistics about the last N txgs processed. This can be helpful when analyzing pool performance. The new KSTAT_TYPE_TXG type was added for this purpose and it tracks the following statistics per-txg. txg - Unique txg number state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted birth; - Creation time nread - Bytes read nwritten; - Bytes written reads - IOPs read writes - IOPs write open_time; - Length in nanoseconds the txg was open quiesce_time - Length in nanoseconds the txg was quiescing sync_time; - Length in nanoseconds the txg was syncing Signed-off-by: Brian Behlendorf <[email protected]>
* Add ddt_object_count() error handlingBrian Behlendorf2012-10-291-3/+3
| | | | | | | | | | | | | | | | | | | The interface for the ddt_zap_count() function assumes it can never fail. However, internally ddt_zap_count() is implemented with zap_count() which can potentially fail. Now because there was no way to return the error to the caller a VERIFY was used to ensure this case never happens. Unfortunately, it has been observed that pools can be damaged in such a way that zap_count() fails. The result is that the pool can not be imported without hitting the VERIFY and crashing the system. This patch reworks ddt_object_count() so the error can be safely caught and returned to the caller. This allows a pool which has be damaged in this way to be safely rewound for import. Signed-off-by: Brian Behlendorf <[email protected]> Closes #910
* Add FASTWRITE algorithm for synchronous writes.Etienne Dechamps2012-10-175-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks accross vdevs is important for performance: indeed, using mutliple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implictly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that linguering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1013
* Fix zfs_txg_timeout module parameterBrian Behlendorf2012-10-111-0/+3
| | | | | | | | | | | | | | | | Allow the zfs_txg_timeout variable to be dynamically tuned at run time. By pulling it down out of the variable declaration it will be evaluted each time through the loop. The zfs_txg_timeout variable is now declared extern in a the common sys/txg.h header rather than locally in dsl_scan.c. This prevents potential type mismatches if the global variable needs to be used elsewhere. Move the module_param() code in to the same source file where zfs_txg_timeout is declared. This is the most logical location. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix zfs_write_limit_max integer size mismatch on 32-bit systemsRichard Yao2012-10-111-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c409e4647f221ab724a0bd10c480ac95447203c3 introduced a number of module parameters. This required several types to be changed to accomidate the required module parameters Linux macros. Unfortunately, arc.c contained its own extern definition of the zfs_write_limit_max variable and its type was not updated to be consistent with its dsl_pool.c counterpart. If the variable had been properly marked extern in a common header, then gcc would have generated a warning and this would not have slipped through. The result of this was that the ARC unconditionally expected zfs_write_limit_max to be 64-bit. Unfortunately, the largest size integer module parameter that Linux supports is unsigned long, which varies in size depending on the host system's native word size. The effect was that on 32-bit systems, ARC incorrectly performed 64-bit operations on a 32-bit value by reading the neighboring 32 bits as the upper 32 bits of the 64-bit value. We correct that by changing the extern declaration to use the unsigned long type and move these extern definitions in to the common arc.h header. This should make ARC correctly treat zfs_write_limit_max as a 32-bit value on 32-bit systems. Reported-by: Jorgen Lundman <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #749
* Illumos #3100: zvol rename fails with EBUSY when dirty.Matthew Ahrens2012-10-032-1/+1
| | | | | | | | | | | | | | | | | illumos/illumos-gate@2e2c135528b3edfe9aaf67d1f004dc0202fa1a54 Illumos changeset: 13780:6da32a929222 3100 zvol rename fails with EBUSY when dirty Reviewed by: Christopher Siden <[email protected]> Reviewed by: Adam H. Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Eric Schrock <[email protected]> Ported-by: Etienne Dechamps <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #995
* Fix VOP_CLOSE() in userspace.Etienne Dechamps2012-10-031-1/+1
| | | | | | | | | | | | | Currently, for unknown reasons, VOP_CLOSE() is a no-op in userspace. This causes file descriptor leaks. This is especially problematic with long ztest runs, since zpool.cache is opened repeatedly and never closed, resulting in resource exhaustion (EMFILE errors). This patch fixes the issue by making VOP_CLOSE() do what it is supposed to do. Signed-off-by: Brian Behlendorf <[email protected]> Issue #989
* Create threads in detached state in userspace.Etienne Dechamps2012-10-031-2/+2
| | | | | | | | | | | | | | | | | | | | | Currently, thread_create(), when called in userspace, creates a joinable (i.e. not detached thread). This is the pthread default. Unfortunately, this does not reproduce kthreads behavior (kthreads are always detached). In addition, this contradicts the original Solaris code which creates userspace threads in detached mode. These joinable threads are never joined, which leads to a leakage of pthread thread objects ("zombie threads"). This in turn results in excessive ressource consumption, and possible ressource exhaustion in extreme cases (e.g. long ztest runs). This patch fixes the issue by creating userspace threads in detached mode. The only exception is ztest worker threads which are meant to be joinable. Signed-off-by: Brian Behlendorf <[email protected]> Issue #989
* Illumos #2703: add mechanism to report ZFS send progressBill Pijewski2012-09-195-2/+39
| | | | | | | | | | | | | Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Robert Mustacchi <[email protected]> Reviewed by: Richard Lowe <[email protected]> Approved by: Eric Schrock <[email protected]> References: https://www.illumos.org/issues/2703 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #1948: zpool list should show more detailed pool infoChris Siden2012-09-192-2/+8
| | | | | | | | | | | | | | | | | | Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Albert Lee <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Eric Schrock <[email protected]> References: https://www.illumos.org/issues/1948 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #685
* Switch KM_SLEEP to KM_PUSHPAGEBrian Behlendorf2012-09-041-0/+1
| | | | | | | | | | | | | This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <[email protected]> Issue #917
* Switch KM_SLEEP to KM_PUSHPAGERichard Yao2012-08-274-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: 21ade34 Disable direct reclaim for z_wr_* threads cfc9a5c Fix zpl_writepage() deadlock eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* Pre-allocate vdev I/O buffersBrian Behlendorf2012-08-272-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vdev queue layer may require a small number of buffers when attempting to create aggregate I/O requests. Rather than attempting to allocate them from the global zio buffers, which is slow under memory pressure, it makes sense to pre-allocate them because... 1) These buffers are short lived. They are only required for the life of a single I/O at which point they can be used by the next I/O. 2) The maximum number of concurrent buffers needed by a vdev is small. It's roughly limited by the zfs_vdev_max_pending tunable which defaults to 10. By keeping a small list of these buffer per-vdev we can ensure one is always available when we need it. This significantly reduces contention on the vq->vq_lock, because we no longer need to perform a slow allocation under this lock. This is particularly important when memory is already low on the system. It would probably be wise to extend the use of these buffers beyond aggregate I/O and in to the raidz implementation. The inability to quickly allocate buffer for the parity stripes could result in similiar problems. Signed-off-by: Brian Behlendorf <[email protected]>
* Revert Disable direct reclaim for z_wr_* threadsRichard Yao2012-08-271-1/+0
| | | | | | | | | | | | | | This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit 49be0ccf1fdc2ce852271d4d2f8b7a9c2c4be6db eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* Remove autotools productsBrian Behlendorf2012-08-274-2994/+0
| | | | | | | | Remove all of the generated autotools products from the repository and update the .gitignore files accordingly. Signed-off-by: Brian Behlendorf <[email protected]> Closes #718
* Wrap smp_processor_id in kpreempt_[dis|en]ablePrakash Surya2012-08-241-0/+4
| | | | | | | | | | | | | After surveying the code, the few places where smp_processor_id is used were deemed to be safe to use with a preempt enabled kernel. As such, no core logic had to be changed. These smp_processor_id call sites are simply are wrapped in kpreempt_disable and kpreempt_enabled to prevent the Linux kernel from emitting scary warnings. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #83
* Export dbuf_* symbolsBrian Behlendorf2012-08-101-3/+0
| | | | | | | | | | | | | Export these symbols so they may be used by other ZFS consumers besides the ZPL. Remove three stale prototype definites from dbuf.h. The actual implementations of these functions were removed/renamed long ago. It would be good in the long term to remove the existing pragmas we inherited from Solaris and simply use the dbuf_* names. Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #1693: persistent 'comment' field for a zpoolDan McDonald2012-08-083-0/+8
| | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Eric Schrock <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/1693 Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #678
* Set zvol discard_granularity to the volblocksize.Etienne Dechamps2012-08-074-0/+4
| | | | | | | | | | | | | | | | | | Currently, zvols have a discard granularity set to 0, which suggests to the upper layer that discard requests of arbirarily small size and alignment can be made efficiently. In practice however, ZFS does not handle unaligned discard requests efficiently: indeed, it is unable to free a part of a block. It will write zeros to the specified range instead, which is both useless and inefficient (see dnode_free_range). With this patch, zvol block devices expose volblocksize as their discard granularity, so the upper layer is aware that it's not supposed to send discard requests smaller than volblocksize. Signed-off-by: Brian Behlendorf <[email protected]> Closes #862
* Illumos #1644, #1645, #1646, #1647, #1708Matthew Ahrens2012-07-314-4/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1644 add ZFS "clones" property 1645 add ZFS "written" and "written@..." properties 1646 "zfs send" should estimate size of stream 1647 "zfs destroy" should determine space reclaimed by destroying multiple snapshots 1708 adjust size of zpool history data References: https://www.illumos.org/issues/1644 https://www.illumos.org/issues/1645 https://www.illumos.org/issues/1646 https://www.illumos.org/issues/1647 https://www.illumos.org/issues/1708 This commit modifies the user to kernel space ioctl ABI. Extra care should be taken when updating to ensure both the kernel modules and utilities are updated. This change has reordered all of the new ioctl()s to the end of the list. This should help minimize this issue in the future. Reviewed by: Richard Lowe <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Albert Lee <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #826 Closes #664
* Linux 3.5 compat, end_writeback() changed to clear_inode()Richard Yao2012-07-234-0/+4
| | | | | | | | | | | | | | | | The end_writeback() function was changed by moving the call to inode_sync_wait() earlier in to evict(). This effecitvely changes the ordering of the sync but it does not impact the details of the zfs implementation. However, as part of this change end_writeback() was renamed to clear_inode() to reflect the new semantics. This change does impact us and clear_inode() now maps to end_writeback() for kernels prior to 3.5. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #784
* Linux 3.5 compat, iops->truncate_range() removedRichard Yao2012-07-234-0/+4
| | | | | | | | | The vmtruncate_range() support has been removed from the kernel in favor of using the fallocate method in the file_operations table. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #784
* Linux 3.5 compat, eops->encode_fh() takes inodesRichard Yao2012-07-234-0/+4
| | | | | | | | | | | | | | The export_operations member ->encode_fh() has been updated to take both the child and parent inodes. This interface used to take the child dentry and a bool describing if the parent is needed. NOTE: While updating this code I noticed that we do not currently cleanly handle the case where we're passed a connectable parent. This code should be audited to make sure we're doing the right thing. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #784
* Fix build failures on PaX/GRSecurity patched kernelsRichard Yao2012-07-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a dialect of C that relies on a GCC plugin. In particular, struct file_operations has been marked do_const in the PaX/GRSecurity dialect, which causes GCC to consider all instances of it as const. This caused failures in the autotools checks and the ZFS source code. To address this, we modify the autotools checks to take into account differences between the PaX C dialect and the regular C dialect. We also modify struct zfs_acl's z_ops member to be a pointer to a function pointer table. Lastly, we modify zpl_put_link() to address a PaX change to the function prototype of nd_get_link(). This avoids compiler errors in the PaX/GRSecurity dialect. Note that the change in zpl_put_link() causes a warning that becomes a build failure when debugging is enabled. Fixing that warning requires ryao/spl@5ca50ef459c59bc74b7a7cd3af7311da2b1cd2c3. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #484
* Move partition scanning from userspace to module.Etienne Dechamps2012-07-175-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, zpool online -e (dynamic vdev expansion) doesn't work on whole disks because we're invoking ioctl(BLKRRPART) from userspace while ZFS still has a partition open on the disk, which results in EBUSY. This patch moves the BLKRRPART invocation from the zpool utility to the module. Specifically, this is done just before opening the device in vdev_disk_open() which is called inside vdev_reopen(). This requires jumping through some hoops to get to the disk device from the partition device, and to make sure we can still open the partition after the BLKRRPART call. Note that this new code path is triggered on dynamic vdev expansion only; other actions, like creating a new pool, are unchanged and still call BLKRRPART from userspace. This change also depends on API changes which are available in 2.6.37 and latter kernels. The build system has been updated to detect this, but there is no compatibility mode for older kernels. This means that online expansion will NOT be available in older kernels. However, it will still be possible to expand the vdev offline. Signed-off-by: Brian Behlendorf <[email protected]> Closes #808
* Illumos #1949, #1953George Wilson2012-07-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | 1949 crash during reguid causes stale config 1953 allow and unallow missing from zpool history since removal of pyzfs Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Bill Pijewski <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Steve Gonczi <[email protected]> Approved by: Eric Schrock <[email protected]> References: https://www.illumos.org/issues/1949 https://www.illumos.org/issues/1953 Ported by: Brian Behlendorf <[email protected]> Closes #665
* Illumos #1748: desire support for reguid in zfsGarrett D'Amore2012-07-114-1/+10
| | | | | | | | | | | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: Alexander Eremin <[email protected]> Reviewed by: Alexander Stetsenko <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/1748 This commit modifies the user to kernel space ioctl ABI. Extra care should be taken when updating to ensure both the kernel modules and utilities are updated. If only the user space component is updated both the 'zpool events' command and the 'zpool reguid' command will not work until the kernel modules are updated. Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #665
* Use ULL suffix in constantsRichard Yao2012-07-101-4/+4
| | | | | | | | | | | | | | | The lack of the ULL suffix causes warnings such as the following on 32-bit systems: In function 'zfsctl_is_snapdir': zfs-0.6.0//module/zfs/zfs_ctldir.c:151: warning: integer constant is too large for 'long' type We add the ULL suffix to fix that. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #813
* Add ZIL statistics.Etienne Dechamps2012-06-291-0/+59
| | | | | | | | | | | | | | | | | The performance of the ZIL is usually the main bottleneck when dealing with synchronous, write-heavy workloads (e.g. databases). Understanding the behavior of the ZIL is required to diagnose performance issues for these workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly. This commit adds a new kstat page dedicated to the ZIL with some counters which, hopefully, scheds some light into what the ZIL is doing, and how it is doing it. Currently, these statistics are available in /proc/spl/kstat/zfs/zil. A description of the fields can be found in zil.h. Signed-off-by: Brian Behlendorf <[email protected]> Closes #786
* Speed up 'zfs list -t snapshot -o name -s name'Pawel Jakub Dawidek2012-06-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FreeBSD #xxx: Dramatically optimize listing snapshots when user requests only snapshot names and wants to sort them by name, ie. when executes: # zfs list -t snapshot -o name -s name Because only name is needed we don't have to read all snapshot properties. Below you can find how long does it take to list 34509 snapshots from a single disk pool before and after this change with cold and warm cache: before: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 525s warm cache: 218s after: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 1.7s warm cache: 1.1s NOTE: This patch only appears in FreeBSD. If/when Illumos picks up the change we may want to drop this patch and adopt their version. However, for now this addresses a real issue. Ported-by: Brian Behlendorf <[email protected]> Issue #450
* Linux 3.4 compat, d_make_root() replaces d_alloc_root()Richard Yao2012-06-114-0/+4
| | | | | | | | | | | | | | | | | | | | torvalds/linux@adc0e91ab142abe93f5b0d7980ada8a7676231fe introduced introduced d_make_root() as a replacement for d_alloc_root(). Further commits appear to have removed d_alloc_root() from the Linux source tree. This causes the following failure: error: implicit declaration of function 'd_alloc_root' [-Werror=implicit-function-declaration] To correct this we update the code to use the current d_make_root() interface for readability. Then we introduce an autotools check to determine if d_make_root() is available. If it isn't then we define some compatibility logic which used the older d_alloc_root() interface. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #776
* Linux 3.3 compat, iops->create()/mkdir()/mknod()Brian Behlendorf2012-04-305-1/+5
| | | | | | | | | | The mode argument of iops->create()/mkdir()/mknod() was changed from an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf check was added to detect the API change and then correctly set a zpl_umode_t typedef. There is no functional change. Signed-off-by: Brian Behlendorf <[email protected]> Closes #701
* Illumos #1951: leaking a vdev when removing an l2cache deviceGeorge Wilson2012-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | 1952 memory leak when adding a file-based l2arc device 1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Bill Pijewski <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Eric Schrock <[email protected]> References to Illumos issues: https://www.illumos.org/issues/1951 https://www.illumos.org/issues/1952 https://www.illumos.org/issues/1954 Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #650
* Add --enable-debug-dmu-tx configure optionBrian Behlendorf2012-03-235-3/+7
| | | | | | | | | | | | | | | | | | Allow rigorous (and expensive) tx validation to be enabled/disabled indepentantly from the standard zfs debugging. When enabled these checks ensure that all txs are constructed properly and that a dbuf is never dirtied without taking the correct tx hold. This checking is particularly helpful when adding new dmu consumers like Lustre. However, for established consumers such as the zpl with no known outstanding tx construction problems this is just overhead. --enable-debug-dmu-tx - Enable/disable validation of each tx as --disable-debug-dmu-tx it is constructed. By default validation is disabled due to performance concerns. Signed-off-by: Brian Behlendorf <[email protected]>
* Add .zfs control directoryBrian Behlendorf2012-03-2210-18/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for the .zfs control directory. This was accomplished by leveraging as much of the existing ZFS infrastructure as posible and updating it for Linux as required. The bulk of the core functionality is now all there with the following limitations. *) The .zfs/snapshot directory automount support requires a 2.6.37 or newer kernel. The exception is RHEL6.2 which has backported the d_automount patches. *) Creating/destroying/renaming snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory works as expected. However, this functionality is only available to root until zfs delegations are finished. * mkdir - create a snapshot * rmdir - destroy a snapshot * mv - rename a snapshot The following issues are known defeciences, but we expect them to be addressed by future commits. *) Add automount support for kernels older the 2.6.37. This should be possible using follow_link() which is what Linux did before. *) Accessing the .zfs/snapshot directory via NFS is not yet possible. The majority of the ground work for this is complete. However, finishing this work will require resolving some lingering integration issues with the Linux NFS kernel server. *) The .zfs/shares directory exists but no futher smb functionality has yet been implemented. Contributions-by: Rohan Puri <[email protected]> Contributiobs-by: Andrew Barnes <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #173
* Add sa_spill_rele() interfaceBrian Behlendorf2012-03-071-0/+1
| | | | | | | | | | | Add a SA interface which allows us to release the spill block from a SA handle without destroying the handle. This is useful because we can then ensure that a copy of the dirty spill block is not made at sync time due to the extra hold. Susequent calls to sa_update() or sa_lookup() with transparently refetch the spill block dbuf from the ARC hash. Signed-off-by: Brian Behlendorf <[email protected]>
* Cleanly support debug packagesBrian Behlendorf2012-02-275-1/+5
| | | | | | | | | | | | | | | | | | | Allow a source rpm to be rebuilt with debugging enabled. This avoids the need to have to manually modify the spec file. By default debugging is still largely disabled. To enable specific debugging features use the following options with rpmbuild. '--with debug' - Enables ASSERTs # For example: $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm Additionally, ZFS_CONFIG has been added to zfs_config.h for packages which build against these headers. This is critical to ensure both zfs and the dependant package are using the same prototype and structure definitions. Signed-off-by: Brian Behlendorf <[email protected]>
* Add 'dmu_tx' kstats entryBrian Behlendorf2012-02-271-0/+28
| | | | | | | | | Keep counters for the various reasons that a thread may end up in txg_wait_open() waiting on a new txg. This can be useful when attempting to determine why a particular workload is under performing. Signed-off-by: Brian Behlendorf <[email protected]>
* Add support for DISCARD to ZVOLs.Etienne Dechamps2012-02-094-0/+4
| | | | | | | | | | | | | | | | | | | | DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, thus optimizing disk space usage by allowing a ZVOL to shrink instead of just grow. We can't use zfs_space() or zfs_freesp() here, since these functions only work on regular files, not volumes. Fortunately we can use the low-level function dmu_free_long_range() which does exactly what we want. Currently the discard operation is not added to the log. That's not a big deal since losing discard requests cannot result in data corruption. It would however result in disk space usage higher than it should be. Thus adding log support to zvol_discard() is probably a good idea for a future improvement. Signed-off-by: Brian Behlendorf <[email protected]>
* Support the fallocate() file operation.Etienne Dechamps2012-02-095-0/+7
| | | | | | | | | | | | | | Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is supported, since it's the only one that matches the behavior of zfs_space(). This makes it pretty much useless in its current form, but it's a start. To support other flag combinations we would need to modify zfs_space() to make it more flexible, or emulate the desired functionality in zpl_fallocate(). Signed-off-by: Brian Behlendorf <[email protected]> Issue #334
* Improve ZVOL queue behavior.Etienne Dechamps2012-02-074-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering. Detailed rationale: - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide. - max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide. - physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue #361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow. - The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix synchronicity for ZVOLs.Etienne Dechamps2012-02-074-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zvol_write() assumes that the write request must be written to stable storage if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does *not* mean what we think it means in the context of the Linux block layer. This is well explained in linux/fs.h: WRITE: A normal async write. Device will be plugged. WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down the hint that someone will be waiting on this IO shortly. WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush. WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on non-volatile media on completion. In other words, SYNC does not *mean* that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs. The request for stable storage is expressed by the FUA flag, so we must commit the ZIL after the write if the FUA flag is set. In addition, we must commit the ZIL before the write if the FLUSH flag is set. Also, we must inform the block layer that we actually support FLUSH and FUA. Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.3 compat, sops->show_options()Brian Behlendorf2012-02-034-0/+4
| | | | | | | | | | The second argument of sops->show_options() was changed from a 'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check to detect the API change and then conditionally define the expected interface. In either case we are only interested in the zfs_sb_t. Signed-off-by: Brian Behlendorf <[email protected]> Closes #549