path: root/module
Commit message [Author, Age, Files, Lines]
* Tear down and flush the mmap region [Prasad Joshi, 2011-06-27, 1, -2/+2]
| Inode eviction should unmap the pages associated with the inode. These pages should also be flushed to disk to avoid data loss. Therefore, use truncate_setsize() in evict_inode() to release the pagecache. The truncate_setsize() API was added in the 2.6.35 kernel. To ensure compatibility with older kernels, the patch defines its own truncate_setsize() function.
| Signed-off-by: Prasad Joshi <[email protected]>
| Closes #255
* Linux 3.0 compat, shrinker compatibility [Brian Behlendorf, 2011-06-21, 1, -3/+5]
| To accommodate the updated Linux 3.0 shrinker API the spl shrinker compatibility code was updated. Unfortunately, this couldn't be done cleanly without slightly adjusting the compat API. See spl commit a55bcaad181096d764e12d847e3091cd7b15509a. This commit updates the ZFS code to use the slightly modified API. You must use the latest SPL if you're building ZFS.
* Fix unlink/xattr deadlock [Gunnar Beutner, 2011-06-20, 2, -55/+90]
| The problem here is that prune_icache() tries to evict/delete both the xattr directory inode as well as at least one xattr inode contained in that directory. Here's what happens:
|   1. File is created.
|   2. xattr is created for that file (behind the scenes a xattr directory and a file in that xattr directory are created)
|   3. File is deleted.
|   4. Both the xattr directory inode and at least one xattr inode from that directory are evicted by prune_icache(); prune_icache() acquires a lock on both inodes before it calls ->evict() on the inodes
| When the xattr directory inode is evicted zfs_zinactive attempts to delete the xattr files contained in that directory. While enumerating these files zfs_zget() is called to obtain a reference to the xattr file znode - which tries to lock the xattr inode. However that very same xattr inode was already locked by prune_icache() further up the call stack, thus leading to a deadlock.
| This can be reliably reproduced like this:
|   $ touch test
|   $ attr -s a -V b test
|   $ rm test
|   $ echo 3 > /proc/sys/vm/drop_caches
| This patch fixes the deadlock by moving the zfs_purgedir() call to zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the xattr dir is empty and leaves the xattr dir in the unlinked set if it finds any xattrs.
| To ensure zfs_unlinked_drain() never accesses a stale super block zfsvfs_teardown() has been updated to block until the iput taskq has been drained. This avoids a potential race where a file with an xattr directory is removed and the file system is immediately unmounted.
| Signed-off-by: Brian Behlendorf <[email protected]>
| Closes #266
* Removed erroneous zfs_inode_destroy() calls from zfs_rmnode(). [Gunnar Beutner, 2011-06-20, 1, -3/+0]
| iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy() for us after zfs_zinactive(), thus making sure that the inode is properly cleaned up. The zfs_inode_destroy() calls in zfs_rmnode() would lead to a double-free.
| Fixes #282
* Add "ashift" property to zpool create [Christian Kohlschütter, 2011-06-17, 1, -0/+4]
| Some disks with internal sectors larger than 512 bytes (e.g., 4k) can suffer from bad write performance when ashift is not configured correctly. This is caused by the disk not reporting its actual sector size, but a sector size of 512 bytes. The drive may behave this way for compatibility reasons. For example, the WDC WD20EARS disks are known to exhibit this behavior.
| When creating a zpool, ZFS takes that wrong sector size and sets the "ashift" property accordingly (to 9: 1<<9=512), whereas it should be set to 12 for 4k sectors (1<<12=4096).
| This patch allows an administrator to manually specify the known correct ashift size at 'zpool create' time. This can significantly improve performance in certain cases. However, it will have an impact on your total pool capacity. See the updated ashift property description in the zpool.8 man page for additional details.
| Valid values for the ashift property range from 9 to 17 (512B-128KB). Additionally, you may set the ashift to 0 if you wish to auto-detect the sector size based on what the disk reports; this is the default behavior. The most common ashift values are 9 and 12.
| Example:
|   zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
| Closes #280
| Original-patch-by: Richard Laager <[email protected]>
| Signed-off-by: Brian Behlendorf <[email protected]>
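The ashift arithmetic described in the entry above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the helper names and the ASHIFT_MIN/ASHIFT_MAX macros are hypothetical and are not taken from the ZFS source; only the 9 to 17 range and the meaning of 0 come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Range quoted in the commit message: 9 (512B) through 17 (128KB). */
#define ASHIFT_MIN 9u
#define ASHIFT_MAX 17u

/* Hypothetical helper: 0 requests auto-detection, otherwise 9..17 is accepted. */
static bool ashift_is_valid(unsigned ashift)
{
	return ashift == 0 || (ashift >= ASHIFT_MIN && ashift <= ASHIFT_MAX);
}

/* Hypothetical helper: sector size in bytes for an explicit ashift value. */
static uint64_t ashift_to_bytes(unsigned ashift)
{
	assert(ashift >= ASHIFT_MIN && ashift <= ASHIFT_MAX);
	return UINT64_C(1) << ashift;
}
```

With these helpers, an ashift of 9 maps to 512 bytes and an ashift of 12 maps to 4096 bytes, matching the 1<<9 and 1<<12 figures quoted in the message.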
* Linux 2.6.37 compat, WRITE_FLUSH_FUA [Brian Behlendorf, 2011-06-17, 1, -1/+1]
| The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been introduced as a replacement for WRITE_BARRIER. This was done to allow richer semantics to be expressed to the block layer. It is the block layer's responsibility to choose the correct way to implement these semantics.
| This change simply updates the bios to use the new kernel API, which should be absolutely safe. However, since ZFS depends entirely on this working as designed for correctness we do want to be careful.
| Closes #281
* Fix stack ddt_class_contains() [Brian Behlendorf, 2011-05-31, 1, -5/+11]
| Stack usage for ddt_class_contains() reduced from 524 bytes to 68 bytes. This large stack allocation significantly contributed to the likelihood of a stack overflow when scrubbing/resilvering dedup pools.
* Fix stack ddt_zap_lookup() [Brian Behlendorf, 2011-05-31, 1, -4/+8]
| Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120 bytes. This large stack allocation significantly contributed to the likelihood of a stack overflow when scrubbing/resilvering dedup pools.
* Revert "Fix stack traverse_visitbp()" [Brian Behlendorf, 2011-05-31, 1, -177/+98]
| This abomination is no longer required because the zio's issued during this recursive call path will now be handled asynchronously by the taskq thread pool.
| This reverts commit 6656bf56216f36805731298ee0f4de87ae6b6b3d.
* Make tgx_sync_thread zio's async [Brian Behlendorf, 2011-05-31, 1, -4/+13]
| The majority of the recursive operations performed by the dsl are done either in the context of the tgx_sync_thread or during pool import. It is these recursive operations which contribute greatly to the stack depth. When this recursion is coupled with a synchronous I/O in the same context, overflow becomes possible.
| Previously, to handle this case, I have focused on keeping the individual stack frames as light as possible. This is a good idea as long as it can be done in a way which doesn't overly complicate the code.
| However, there is a better solution. If we treat all zio's issued by the tgx_sync_thread as async then we can use the tgx_sync_thread stack for the recursive parts, and the zio_* threads for the I/O parts. This effectively doubles our available stack space with the only drawback being a small delay to schedule the I/O. However, in practice the scheduling time is so much smaller than the actual I/O time that this isn't an issue. Another benefit of making the zio async is that the zio pipeline is now parallel. That should mean that performance may improve for CPU intensive pipelines such as compression or dedup.
| With this change in place the worst case stack usage observed so far is 6902 bytes. This is still higher than I'd like but significantly improved. Additional changes to specific functions should improve this further. This change allows us to revert commit 6656bf5 which did some horrible things to the recursive traverse_visitbp() callpath in the name of saving stack.
* Fix 4K sector support [Brian Behlendorf, 2011-05-27, 1, -7/+6]
| Yesterday I ran across a 3TB drive which exposed 4K sectors to Linux. While I thought I had gotten this support correct it turns out there were 2 subtle bugs which prevented it from working.
|   sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
|   cannot create 'large-sector': one or more devices is currently unavailable
| 1) The first issue was that it was possible that bdev_capacity() would return the number of 512 byte sectors rather than the number of 4096 byte sectors. Internally, certain Linux functions only operate with 512 byte sectors so you need to be careful. To avoid any confusion in the future I've updated bdev_capacity() to simply return the device (or partition) capacity in bytes. The higher levels of ZFS want the value in bytes anyway so this is cleaner.
| 2) When creating a bio the ->bi_sector count must always be expressed in 512 byte sectors. The existing code would scale the byte offset by the logical sector size. Until now this was always 512 so it never caused problems. Trying a 4K sector drive clearly exposed the issue. The problem has been fixed by hard coding the 512 byte sector which is exactly what the bio code does internally.
| With these changes I'm now able to create ZFS pools using 4K sector drives. No issues were observed during fairly extensive testing. This is also a low risk change if you're using 512b sector devices because none of the logic changes.
| Closes #256
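The second bug in the entry above comes down to which divisor is applied to a byte offset. The following sketch contrasts the two conversions; the function names and the BIO_SECTOR_SHIFT macro are hypothetical stand-ins chosen for illustration, not the identifiers used in the actual patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed name: bio sector numbers always count 512-byte units (1 << 9). */
#define BIO_SECTOR_SHIFT 9

/* Correct conversion: divide the byte offset by 512, whatever the device reports. */
static uint64_t byte_offset_to_bio_sector(uint64_t offset)
{
	/* The offset is expected to be 512-byte aligned. */
	assert((offset & ((UINT64_C(1) << BIO_SECTOR_SHIFT) - 1)) == 0);
	return offset >> BIO_SECTOR_SHIFT;
}

/* The behavior the message describes as buggy: scaling by the logical sector size. */
static uint64_t scale_by_logical_sector(uint64_t offset, uint64_t logical_size)
{
	assert(logical_size != 0);
	return offset / logical_size;
}
```

For a 1 MiB offset both functions agree on a 512-byte device (sector 2048), but on a 4096-byte device the second one yields 256, which is eight times too small.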
* Use vmem_alloc() for zfs_ioc_userspace_many() [Brian Behlendorf, 2011-05-20, 1, -2/+2]
| The default buffer size when requesting multiple quota entries is 100 times the zfs_useracct_t size. In practice this works out to exactly 27200 bytes. Since this will be a short lived buffer in a non-performance critical path it is preferable to vmem_alloc() the needed memory.
* Pass caller's credential in zfsdev_ioctl() [Brian Behlendorf, 2011-05-20, 1, -1/+1]
| Initially when zfsdev_ioctl() was ported to Linux we didn't have any credential support implemented. So at the time we simply passed NULL, which wasn't much of a problem since most of the secpolicy code was disabled. However, one exception is quota handling, which does require the credential. Now that proper credentials are supported we can safely start passing the caller's credential. This is also an initial step towards fully implementing the zfs secpolicy.
* Fix 'negative objects to delete' warning [Brian Behlendorf, 2011-05-18, 1, -6/+12]
| Normally when the arc_shrinker_func() function is called the return value should be:
|   >=0 - To indicate the number of freeable objects in the cache, or
|   -1  - To indicate this cache should be skipped
| However, when the shrinker callback is called with 'nr_to_scan' equal to zero, the caller simply wants the number of freeable objects in the cache and we must never return -1. This patch reorders the first two conditionals in arc_shrinker_func() to ensure this behavior.
| This patch also now explicitly casts arc_size and arc_c_min to signed int64_t types so MAX(x, 0) works as expected. As unsigned types we would never see a negative value, which defeated the purpose of the MAX() lower bound and broke the shrinker logic.
| Finally, when nr_to_scan is non-zero we explicitly prevent all reclaim below arc_c_min. This is done to prevent the Linux page cache from completely crowding out the ARC. This limit is tunable and some experimentation is likely going to be required to set it exactly right. For now we're sticking with the OpenSolaris defaults.
| Closes #218
| Closes #243
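The signed-cast point in the entry above is easy to demonstrate in isolation. The sketch below is not the arc_shrinker_func() code; the function names are hypothetical and the MAX() macro is defined locally so the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Unsigned arithmetic: size - lower wraps around when size < lower,
 * so the MAX() lower bound of zero never takes effect. */
static uint64_t reclaimable_unsigned(uint64_t size, uint64_t lower)
{
	return MAX(size - lower, 0);
}

/* Signed arithmetic, as the message describes: a negative difference
 * is clamped to zero by MAX(). */
static int64_t reclaimable_signed(uint64_t size, uint64_t lower)
{
	assert(size <= INT64_MAX && lower <= INT64_MAX);
	return MAX((int64_t)size - (int64_t)lower, 0);
}
```

When the size is below the lower bound, the unsigned variant returns a value close to 2^64 while the signed variant returns 0, which is the behavior the MAX(x, 0) guard was meant to provide.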
* Update synchronous open zfs_close() comment [Brian Behlendorf, 2011-05-13, 1, -1/+5]
| The comment in zfs_close() pertaining to decrementing the synchronous open count needs to be updated for Linux. The code was already updated to be correct, but the comment was missed and is now misleading. Under Linux the zfs_close() hook is only called once when the final reference is dropped. This differs from Solaris where zfs_close() is called for each close.
| Closes #237
* Merge pull request #235 from nedbass/rdev [Brian Behlendorf, 2011-05-09, 2, -7/+16]
|\
| | Don't store rdev in SA for FIFOs and sockets
| * Don't store rdev in SA for FIFOs and sockets [Ned A. Bass, 2011-05-09, 2, -7/+16]
| | Update the handling of named pipes and sockets to be consistent with other platforms with regard to the rdev attribute. While all ZFS implementations store the rdev for device files in a system attribute (SA), this is not the case for FIFOs and sockets. Indeed, Linux always passes rdev=0 to mknod() for FIFOs and sockets, so the value is not needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if the expected behavior ever changes.
| | Closes #216
* | Disable direct reclaim for z_wr_* threads [Brian Behlendorf, 2011-05-06, 1, -3/+6]
|/
| The direct reclaim path in the z_wr_* threads must be disabled to ensure forward progress is always maintained for txg processing. This ensures that a txg will never get stuck waiting on itself because it entered the following memory reclaim callpath.
|   ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
|   ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()
| It would be preferable to target this exact code path but the kernel offers no way to do this without custom patches. To avoid this we are forced to disable all reclaim for these threads. It should not be necessary to do this for other z_* threads because they will not hold a txg open.
| Closes #232
* Handle NULL in nfsd .fsync() hook [Brian Behlendorf, 2011-05-06, 1, -2/+21]
| How nfsd handles .fsync() has been changed a couple of times in recent kernels. But basically there are three cases we need to consider.
|   Linux 2.6.12 - 2.6.33
|   * The .fsync() hook takes 3 arguments
|   * The nfsd will call .fsync() with a NULL file struct pointer.
|   Linux 2.6.34
|   * The .fsync() hook takes 3 arguments
|   * The nfsd no longer calls .fsync() but instead uses sync_inode()
|   Linux 2.6.35 - 2.6.x
|   * The .fsync() hook takes 2 arguments
|   * The nfsd no longer calls .fsync() but instead uses sync_inode()
| For once it looks like we've gotten lucky. The first two cases can actually be collapsed into one if we stop using the file struct pointer entirely. Since the dentry is still passed in both cases this is possible. The last case can then be safely handled by unconditionally using the dentry in the file struct pointer now that we know the nfsd caller has been removed.
| Closes #230
* Use vmem_alloc() for zfs_ioc_pool_get_history() [Brian Behlendorf, 2011-05-06, 1, -2/+2]
| The default buffer size when requesting history is 128k. This is far too large for a kmem_alloc() so instead use the slower vmem_alloc(). This path has no performance concerns and the buffer is immediately freed after its contents are copied to the user space buffer.
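The size-based choice between kmem_alloc() and vmem_alloc() that this log keeps returning to can be sketched as a simple predicate. This is a rough illustration only: the prefers_vmem() helper is hypothetical, and the 8 KiB cutoff is an assumption borrowed from the "larger than 8k" allocation warning mentioned elsewhere in this log, not a value confirmed by the SPL source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed cutoff, based on the 8k warning threshold mentioned in this log. */
#define LARGE_ALLOC_THRESHOLD ((size_t)8 * 1024)

/* Hypothetical helper: true when a buffer is large enough that a
 * virtually contiguous allocation is the safer choice. */
static bool prefers_vmem(size_t size)
{
	assert(size > 0);
	return size > LARGE_ALLOC_THRESHOLD;
}
```

Under that assumption both the 128k history buffer and the 27200-byte quota buffer fall on the vmem_alloc() side, while a 4 KiB buffer does not.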
* Add missing ZFS tunables [Brian Behlendorf, 2011-05-04, 16, -41/+176]
| This commit adds module options for all existing zfs tunables. Ideally the average user should never need to modify any of these values. However, in practice sometimes you do need to tweak these values for one reason or another. In those cases it's nice not to have to resort to rebuilding from source. All tunables are visible to modinfo and the list is as follows:
|   $ modinfo module/zfs/zfs.ko
|   filename: module/zfs/zfs.ko
|   license: CDDL
|   author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
|   description: ZFS
|   srcversion: 8EAB1D71DACE05B5AA61567
|   depends: spl,znvpair,zcommon,zunicode,zavl
|   vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
|   parm: zvol_major:Major number for zvol device (uint)
|   parm: zvol_threads:Number of threads for zvol device (uint)
|   parm: zio_injection_enabled:Enable fault injection (int)
|   parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
|   parm: zio_delay_max:Max zio millisec delay before posting event (int)
|   parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
|   parm: zil_replay_disable:Disable intent logging replay (int)
|   parm: zfs_nocacheflush:Disable cache flushes (bool)
|   parm: zfs_read_chunk_size:Bytes to read per chunk (long)
|   parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
|   parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
|   parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
|   parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
|   parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
|   parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
|   parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
|   parm: zfs_vdev_scheduler:I/O scheduler (charp)
|   parm: zfs_vdev_cache_max:Inflate reads small than max (int)
|   parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
|   parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
|   parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
|   parm: zfs_recover:Set to attempt to recover from fatal errors (int)
|   parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
|   parm: zfs_zevent_len_max:Max event queue length (int)
|   parm: zfs_zevent_cols:Max event column width (int)
|   parm: zfs_zevent_console:Log events to the console (int)
|   parm: zfs_top_maxinflight:Max I/Os per top-level (int)
|   parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
|   parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
|   parm: zfs_scan_idle:Idle window in clock ticks (int)
|   parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
|   parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
|   parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
|   parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
|   parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
|   parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
|   parm: zfs_no_write_throttle:Disable write throttling (int)
|   parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
|   parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
|   parm: zfs_write_limit_min:Min tgx write limit (ulong)
|   parm: zfs_write_limit_max:Max tgx write limit (ulong)
|   parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
|   parm: zfs_write_limit_override:Override tgx write limit (ulong)
|   parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
|   parm: zfetch_max_streams:Max number of streams per zfetch (uint)
|   parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
|   parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
|   parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
|   parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
|   parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
|   parm: zfs_arc_min:Min arc size (ulong)
|   parm: zfs_arc_max:Max arc size (ulong)
|   parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
|   parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
|   parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
|   parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
|   parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
* Fully update inode when created [Brian Behlendorf, 2011-05-02, 1, -2/+1]
| When a new znode/inode pair is created both the znode and the inode should be immediately updated to the correct values. This was done for the znode and for most of the values in the inode, but not all of them. This normally wasn't a problem because most subsequent operations would cause the inode to be immediately updated. This change ensures the inode is now fully updated before it is inserted into the inode hash.
| Closes #116
| Closes #146
| Closes #164
* Fix 'zfs set volsize=N pool/dataset' [Brian Behlendorf, 2011-05-02, 1, -11/+19]
| This change fixes a kernel panic which would occur when resizing a dataset which was not open. The objset_t stored in the zvol_state_t will be set to NULL when the block device is closed. To avoid this issue we pass the correct objset_t as the third arg.
| The code has also been updated to correctly notify the kernel when the block device capacity changes. For 2.6.28 and newer kernels the capacity change will be immediately detected. For earlier kernels the capacity change will be detected when the device is next opened. This is a known limitation of older kernels.
| Online ext3 resize test case passes on 2.6.28+ kernels:
|   $ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
|   $ zpool create tank /tmp/zvol
|   $ zfs create -V 500M tank/zd0
|   $ mkfs.ext3 /dev/zd0
|   $ mkdir /mnt/zd0
|   $ mount /dev/zd0 /mnt/zd0
|   $ df -h /mnt/zd0
|   $ zfs set volsize=800M tank/zd0
|   $ resize2fs /dev/zd0
|   $ df -h /mnt/zd0
| Original-patch-by: Fajar A. Nugraha <[email protected]>
| Closes #68
| Closes #84
* Implemented NFS export_operations. [Gunnar Beutner, 2011-04-29, 4, -11/+123]
| Implemented the required NFS operations for exporting ZFS datasets using the in-kernel NFS daemon.
* Suppress 'vdev_metaslab_init' memory warning [Brian Behlendorf, 2011-04-27, 1, -1/+1]
| The vdev_metaslab_init() function has been observed to allocate larger than 8k chunks. However, they are not much larger than 8k and it does this infrequently so it is allowed and the warning is suppressed.
* Conserve stack in dsl_scan_visit() [Brian Behlendorf, 2011-04-26, 1, -10/+15]
| The dsl_scan_visit() function is a little heavy weight, taking 464 bytes on the stack. This can be easily reduced for little cost by moving zap_cursor_t and zap_attribute_t off the stack and on to the heap. After this change dsl_scan_visit() has been reduced in size by 320 bytes.
| This change was made to reduce stack usage in the dsl_scan_sync() callpath which is recursive and has been observed to overflow the stack.
| Issue #174
* Conserve stack in dsl_scan_visitbp() [Brian Behlendorf, 2011-04-26, 1, -5/+12]
| This function is called recursively so everything possible must be done to limit its stack consumption. The dprintf_bp() debugging function adds 30 bytes of local variables to the function which we cannot afford. By commenting out this debugging we save 30 bytes per recursion, and depths of 13 are not uncommon. This yields a total stack saving of 390 bytes on our 8k stack.
| Issue #174
* Conserve stack in dsl_scan_visitbp() [Brian Behlendorf, 2011-04-26, 1, -2/+2]
| The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() -> dsl_scan_visitdnode() -> dsl_scan_visitbp() has been observed to consume considerable stack, resulting in a stack overflow (>8k). The cleanest way I see to fix this with minimal impact to the existing flow of code, and with the fewest performance concerns, is to always inline dsl_scan_recurse() and dsl_scan_visitdnode(). While this will increase the function size of dsl_scan_visitbp() by 4660 bytes, it also reduces the stack requirements by removing the function call overhead.
| Issue #174
* Fix zvol deadlock [Brian Behlendorf, 2011-04-26, 1, -1/+2]
| It's possible for a zvol_write thread to enter direct memory reclaim while holding open a transaction group. This results in the system attempting to write out data to the disk to free memory. Unfortunately, this can't succeed because the thread doing reclaim is holding open the txg which must be closed to be synced to disk. To prevent this the offending allocation is marked KM_PUSHPAGE which will prevent it from attempting writeback.
| Closes #191
* Fix spurious -EFAULT when setting I/O scheduler [Brian Behlendorf, 2011-04-22, 1, -17/+12]
| Occasionally we would see an -EFAULT returned when setting the I/O scheduler on a vdev. This was caused by an improperly formatted user mode helper command. This commit restructures the command to something simpler, allocates space for it dynamically to save stack, and removes the retry logic which is no longer needed.
| Closes #169
* Enforce ARC meta-data limits [Brian Behlendorf, 2011-04-21, 1, -2/+15]
| This change ensures the ARC meta-data limits are enforced. Without this enforcement meta-data can grow to consume all of the ARC cache, pushing out data and hurting performance. The cache is aggressively reclaimed but this is a soft and not a hard limit. The cache may exceed the set limit briefly before being brought under control.
| By default 25% of the ARC capacity can be used for meta-data. This limit can be tuned by setting the 'zfs_arc_meta_limit' module option. Once this limit is exceeded meta-data reclaim will occur in 3 percent chunks, or may be tuned using 'arc_reduce_dnlc_percent'.
| Closes #193
* Fixed a use-after-free bug in zfs_zget(). [Gunnar Beutner, 2011-04-21, 1, -1/+23]
| Fixed a bug where zfs_zget could access a stale znode pointer when the inode had already been removed from the inode cache via iput -> iput_final -> ... -> zfs_zinactive but the corresponding SA handle was still alive.
| Signed-off-by: Brian Behlendorf <[email protected]>
| Closes #180
* Suppress 'zfs receive' memory warning [Brian Behlendorf, 2011-04-20, 1, -1/+1]
| As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel which is 17808 bytes in size. This sort of thing in general should be avoided. However, since this should be an infrequent event, for now we allow it and simply suppress the warning with the KM_NODEBUG flag. This can be revisited later if/when it becomes an issue.
| Closes #178
* Truncate the xattr znode when updating existing attributes. [Gunnar Beutner, 2011-04-19, 1, -1/+7]
| If the attribute's new value was shorter than the old one the old code would leave parts of the old value in the xattr znode.
| Signed-off-by: Brian Behlendorf <[email protected]>
| Closes #203
* Added missing initialization for va.va_dentry in zfs_get_xattrdir. [Gunnar Beutner, 2011-04-19, 1, -0/+1]
| Without this we may mistakenly believe we have a dentry and try to d_instantiate() it. This will result in the following BUG. It's important to note that while the xattr directory has an inode associated with it we never create a dentry for it.
|   kernel BUG at fs/dcache.c:1418!
| Signed-off-by: Brian Behlendorf <[email protected]>
| Closes #202
* Fix gcc compiler warning, dsl_pool_create() [Brian Behlendorf, 2011-04-19, 1, -2/+2]
| When compiling ZFS in user space gcc-4.6.0 correctly identifies the variable 'os' as being set but never used. This generates a warning and a build failure when using --enable-debug. However, the code is correct; we only want to use 'os' for the kernel space builds. To suppress the warning the call was wrapped with a VERIFY() which has the nice side effect of ensuring that 'os' is never actually NULL. This was observed under Fedora 15.
|   module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
|   module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used [-Werror=unused-but-set-variable]
* Linux 2.6.39 compat, invalidate_inodes() [Brian Behlendorf, 2011-04-19, 1, -1/+1]
| Update code to use the spl_invalidate_inodes() wrapper. This hides some of the complexity of determining if invalidate_inodes() was exported, and if so what its prototype is. The second argument of spl_invalidate_inodes() determines how dirty inodes are handled. By passing a zero we are indicating that we want those inodes to be treated as busy and skipped.
* Linux 2.6.29 compat, credentials [Brian Behlendorf, 2011-04-07, 1, -3/+3]
| The .sync_fs fix as applied did not use the updated SPL credential API. This broke builds on Debian Lenny; this change applies the needed fix to use the portable API. The original credential changes are part of commit 81e97e21872a9c38ad66c37fafe1436ee25abee3.
* Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) [Brian Behlendorf, 2011-04-07, 1, -0/+13]
| Disable the normal reclaim path for the txg_sync thread. This ensures the thread will never enter dmu_tx_assign() which can otherwise occur due to direct reclaim. If this is allowed to happen the system can deadlock. Direct reclaim call path:
|   ->shrink_icache_memory->prune_icache->dispose_list->
|   clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
* Add direct+indirect ARC reclaim [Brian Behlendorf, 2011-04-07, 1, -0/+59]
| Under OpenSolaris all memory reclaim is done asynchronously. Under Linux memory reclaim is done asynchronously _and_ synchronously. When a process allocates memory with GFP_KERNEL it explicitly allows the kernel to do reclaim on its behalf to satisfy the allocation. If that GFP_KERNEL allocation fails the kernel may take more drastic measures to reclaim the memory such as killing user space processes.
| This was observed to happen with ZFS because the ARC could consume a large fraction of the system memory but no synchronous reclaim could be performed on it. The result was GFP_KERNEL allocations could fail resulting in OOM events, and only moments later the arc_reclaim thread would free unused memory from the ARC.
| This change leaves the arc_thread in place to manage the fundamental ARC behavior. But it adds a synchronous (direct) reclaim path for the ARC which can be called when memory is badly needed. It also adds an asynchronous (indirect) reclaim path which is called much more frequently to prune the ARC slab caches.
* Add missing arcstats [Brian Behlendorf, 2011-04-07, 1, -8/+20]
| The following useful values were missing from the arcstats. This change adds them to provide greater visibility into the ARC's behavior.
|   arc_no_grow       4 0
|   arc_tempreserve   4 0
|   arc_loaned_bytes  4 0
|   arc_meta_used     4 624774592
|   arc_meta_limit    4 400785408
|   arc_meta_max      4 625594176
* Call d_instantiate before unlocking inode [Brian Behlendorf, 2011-04-07, 3, -27/+12]
| Under Linux a dentry referencing an inode must be instantiated before the inode is unlocked. To accomplish this without overly modifying the core ZFS code the dentry is passed via the vattr_t. There are cases, such as replay, when a dentry is not available. In that case it is obviously not initialized at inode creation time; if a dentry is needed it will be spliced in when required via d_lookup().
* Fix `make distclean` for `./configure --with-config=user` [Brian Behlendorf, 2011-04-05, 1, -1/+4]
|   Making distclean in module
|   make[1]: Entering directory `/zfs/module'
|   make -C SUBDIRS=`pwd` clean
|   make: Entering an unknown directory
|   make: *** SUBDIRS=/zfs/module: No such file or directory. Stop.
| When using --with-config=user the 'distclean' target would fail because it assumes the kernel configuration infrastructure is set up. This is not the case, nor does it need to be, because the '--with-config=user' option will prune the entire ./module subtree from SUBDIRS. This prevents most build rules from operating in the ./module directory.
| However, the 'dist*' rules will still traverse this directory because it is listed in DIST_SUBDIRS. This is correct because we need to ensure the dist rules package the directory contents regardless of the configuration for the 'dist' rule.
| The correct way to handle this is to only invoke the kernel build system as part of the 'clean' rule when CONFIG_KERNEL_TRUE is set.
| Initial fix provided by Darik Horn <[email protected]>. This commit is a slightly refined form of the original.
* Fix inflated load average [Brian Behlendorf, 2011-03-31, 1, -2/+2]
| Kernel threads which sleep uninterruptibly on Linux are marked in the (D) state. These threads are usually in the process of performing IO and are thus counted against the load average. The txg_quiesce and txg_sync threads were always sleeping uninterruptibly and thus inflating the load average.
| This change makes them sleep interruptibly. Some care is required however because these threads may now be woken early by signals. In this case the callers are all careful to check that the required conditions are met after waking up. If they are woken early due to a signal they will simply go back to sleep, so these changes are safe.
| Closes #175
* Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs [Brian Behlendorf, 2011-03-22, 1, -2/+0]
| The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29. Since these hooks are currently unused they are being removed to allow support of older kernels.
* Linux 2.6.29 compat, credentials [Brian Behlendorf, 2011-03-22, 3, -81/+81]
| As of Linux 2.6.29 a clean credential API was added to the Linux kernel. Previously the credential was embedded in the task_struct. Because the SPL already has considerable support for handling this API change the ZPL code has been updated to use the Solaris credential API.
* Fix evict() deadlock [Brian Behlendorf, 2011-03-22, 1, -4/+20]
| Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility of synchronous reclaim deadlocks. These deadlocks never existed in the original OpenSolaris code because all memory reclaim on Solaris is done asynchronously. Linux does both synchronous (direct) and asynchronous (indirect) reclaim.
| This commit addresses a deadlock caused by inode eviction. A KM_SLEEP allocation may trigger direct memory reclaim and shrink the inode cache. This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()->zfs_zinactive() call path the same mutex may be reacquired resulting in a deadlock. To avoid this deadlock the process must not reacquire the mutex when it is already holding it.
| This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD mutex locking should be reevaluated. This infrastructure already prevents us from ever using the Linux lock dependency analysis tools, and it may limit scalability.
* Use KM_PUSHPAGE instead of KM_SLEEP [Brian Behlendorf, 2011-03-22, 3, -9/+9]
| It used to be the case that all KM_SLEEP allocations were GFP_NOFS. Unfortunately this often resulted in the kernel being unable to reclaim the ARC, inode, and dentry caches in a timely manner. The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.
| However, this increases the possibility of deadlocking the system on a zfs write thread. If a zfs write thread attempts to perform an allocation it may trigger synchronous reclaim. This reclaim may attempt to flush dirty data/inode to disk to free memory. Unfortunately, this write cannot finish because the write thread which would handle it is holding the previous transaction open. Deadlock.
| To avoid this all allocations in the zfs write thread path must use KM_PUSHPAGE which prohibits synchronous reclaim for that thread. In this way forward progress is ensured. The risk with this change is that I missed updating an allocation for the write threads, leaving an increased possibility of deadlock. If any deadlocks remain they will be unlikely but we'll have to make sure they all get fixed.
* Register .remount_fs handler [Brian Behlendorf, 2011-03-15, 2, -1/+51]
| Register the missing .remount_fs handler. This handler isn't strictly required because the VFS does a pretty good job updating most of the MS_* flags. However, there's no harm in using the hook to call the registered zpl callback for various MS_* flags. Additionally, this allows us to lay the groundwork for more complicated argument parsing in the future.
* Register .sync_fs handler [Brian Behlendorf, 2011-03-15, 2, -8/+26]
| Register the missing .sync_fs handler. This is a noop in most cases because the usual requirement is that sync just be initiated. As part of the DMU's normal transaction processing txgs will be frequently synced. However, when the 'wait' flag is set the requirement is that .sync_fs must not return until the data is safe on disk. With the addition of the .sync_fs handler this is now properly implemented.