aboutsummaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Fix locking order in zfs_zget()Richard Yao2014-04-041-1/+1
| | | | | Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Fix deadlock in zfs_zget()Richard Yao2014-04-041-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zfsonlinux/zfs#180 occurred because of a race between inode eviction and zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call to the VFS to learn whether an inode is being evicted. If it was being evicted the operation was retried after dropping and reacquiring the relevant resources. Unfortunately, this introduced another deadlock. INFO: task kworker/u24:6:891 blocked for more than 120 seconds. Tainted: P O 3.13.6 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u24:6 D ffff88107fcd2e80 0 891 2 0x00000000 Workqueue: writeback bdi_writeback_workfn (flush-zfs-5) ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80 ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098 ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000 Call Trace: [<ffffffff813dd1d4>] schedule+0x24/0x70 [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0 [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0 [<ffffffff811608c5>] ilookup+0x65/0xd0 [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs] [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs] [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs] [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs] [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs] [<ffffffff811071ec>] do_writepages+0x1c/0x40 [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0 [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420 [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320 [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490 [<ffffffff8106072c>] process_one_work+0x16c/0x490 [<ffffffff810613f3>] worker_thread+0x113/0x390 [<ffffffff81066edf>] kthread+0xdf/0x100 This patch implements the original fix in a slightly different manner in order to avoid both deadlocks. Instead of relying on a call to ilookup() which can block in __wait_on_freeing_inode() the return value from igrab() is used. This gives us the information that ilookup() provided without the risk of a deadlock. Alternately, this race could be closed by registering an sops->drop_inode() callback. The callback would need to detect the active SA hold thereby informing the VFS that this inode should not be evicted. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #180
* Revert "Fixed a use-after-free bug in zfs_zget()."Brian Behlendorf2014-04-031-23/+1
| | | | | | This reverts commit 36df284366caa77cb40083d2e6bcce02274e2f05. Signed-off-by: Brian Behlendorf <[email protected]>
* Add automatic hot spare functionalityBrian Behlendorf2014-04-022-9/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a vdev starts getting I/O or checksum errors it is now possible to automatically rebuild to a hot spare device. To cleanly support this functionality in a shell script some additional information was added to all zevent ereports which include a vdev. This covers both io and checksum zevents but may be used but other scripts. In the Illumos FMA solution the same information is required but it is retrieved through the libzfs library interface. Specifically the following members were added: vdev_spare_paths - List of vdev paths for all hot spares. vdev_spare_guids - List of vdev guids for all hot spares. vdev_read_errors - Read errors for the problematic vdev vdev_write_errors - Write errors for the problematic vdev vdev_cksum_errors - Checksum errors for the problematic vdev. By default the required hot spare scripts are installed but this functionality is disabled. To enable hot sparing uncomment the ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the /etc/zfs/zed.d/zed.rc configuration file. These scripts do no add support for the autoexpand property. At a minimum this requires adding a new udev rule to detect when a new device is added to the system. It also requires that the autoexpand policy be ported from Illumos, see: https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c Support for detecting the correct name of a vdev when it's not a whole disk was added by Turbo Fredriksson. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlap <[email protected]> Signed-off-by: Turbo Fredriksson <[email protected]> Issue #2
* Clarify zpool_events_next() commentBrian Behlendorf2014-03-311-1/+1
| | | | | | | | | | | | | | | | | | Due to the very poorly chosen argument name 'cleanup_fd' it was completely unclear that this file descriptor is used to track the current cursor location. When the file descriptor is created by opening ZFS_DEV a private cursor is created in the kernel for the returned file descriptor. Subsequent calls to zpool_events_next() and zpool_events_seek() then require the file descriptor as an argument to reposition the cursor. When the file descriptor is closed the kernel state tracking the cursor is destroyed. This patch contains no functional change, it just changes a few variable names and clarifies the documentation. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlap <[email protected]> Issue #2
* Add zpool_events_seek() functionalityBrian Behlendorf2014-03-312-0/+85
| | | | | | | | | | | | | The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers to seek around the zevent file descriptor by EID. When a specific EID is passed and it exists the cursor will be positioned there. If the EID is no longer cached by the kernel ENOENT is returned. The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek to those respective locations. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlap <[email protected]> Issue #2
* Add a unique "eid" value to all zeventsBrian Behlendorf2014-03-311-0/+16
| | | | | | | | | | | | Tagging each zevent with a unique monotonically increasing EID (Event IDentifier) provides the required infrastructure for a user space daemon to reliably process zevents. By writing the EID to persistent storage the daemon can safely resume where it left off in the event stream when it's restarted. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlap <[email protected]> Issue #2
* Illumos #4089 NULL pointer dereference in arc_read()Boris Protopopov2014-03-241-9/+11
| | | | | | | | | | | | | | | | | | 4089 NULL pointer dereference in arc_read() Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4089 illumos/illumos-gate@57815f6b95a743697e148327725b7f568e75e6ea Signed-off-by: Brian Behlendorf <[email protected]> Issue #2171 Issue #2165 Closes #2198
* Implement -t option to zpool import for temporary pool namesRichard Yao2014-03-201-2/+26
| | | | | | | | | | | | | | | | Originally, users had to handle spa namespace collisions by either exporting the already imported pool or by specifying a new name for the pool with a conflicting name. In the case of root pools from virtual guests, neither approach to collision resolution is reasonable. This is addressed by extending the new name syntax with a -t option to specify that the new name is temporary. When specified, this sets an internal flag that is passed into the kernel to tell it that all label updates should refer to the name used in the original label. Consequently, the original pool name will be retained on export. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2189
* Fix regression introduced in port of Illumos #3744Andrey Vesnovaty2014-03-201-2/+4
| | | | | | | | | | | | | | Remove the redundant call to zfs_unmount_snap() which was being done after char array was freed, This fixes an upstream regression that was introduced in commit zfsonlinux/zfs@d09f25dc66774959499a89bf3680d09c6e541ce8, which ported the Illumos 3744 changes. Signed-off-by: Andrey Vesnovaty <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #2156
* Illumos #4088 use after free in arc_release()Boris Protopopov2014-03-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | 4088 use after free in arc_release() Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4088 illumos/illumos-gate@ccc22e130479b5bd7c0002267fee1e0602d3f772 From the illumos issue: A race-induced use after free occurs in arc_release() where the ARC header is used outside the critical section protected by the hash_lock. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #2162
* Use KM_PUSHPAGE in spa_add() for spa_label_features.Tim Chase2014-03-101-1/+1
| | | | | | | | | The spa_label_features nvlist is used in the sync context during pool version upgrade. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2168
* Export symbols dsl_sync_task{_nowait}Brian Behlendorf2014-03-071-0/+2
| | | | | | | | These are needed by consumers (i.e. Lustre) who wish to perform a callback in the syncing context. Both a blocking and non-blocking version are available to callers. Signed-off-by: Brian Behlendorf <[email protected]>
* Improve reporting of tx assignment wait timesNed Bass2014-03-041-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call dmu_tx_wait() themselves before retrying if the assignment fails. The wait times for such callers are not accounted for in the dmu_tx_assign kstat histogram, because the histogram only records time spent in dmu_tx_assign(). This change moves the histogram update to dmu_tx_wait() to properly account for all time spent there. One downside of this approach is that it is possible to call dmu_tx_wait() multiple times before successfully assigning a transaction, in which case the cumulative wait time would not be recorded. However, this case should not often arise in practice, because most callers currently use one of these forms: dmu_tx_assign(tx, TXG_WAIT); dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); The first form should make just one call to dmu_tx_delay() inside of dmu_tx_assign(). The second form retries with TXG_WAITED if the first assignment fails and incurs a delay, in which case no further waiting is performed. Therefore transaction delays normally occur in one call to dmu_tx_wait() so the histogram should be fairly accurate. Another possible downside of this approach is that the histogram will no longer record overhead outside of dmu_tx_wait() such as in dmu_tx_try_assign(). While I'm not aware of any reason for concern on this point, it is conceivable that lock contention, long list traversal, etc. could cause assignment delays that would not be reflected in the histogram. Therefore the histogram should strictly be used for visibility in to the normal delay mechanisms and not as a profiling tool for code performance. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1915
* replace nreserved with ndirty in txgs kstatNed Bass2014-03-042-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | The nreserved column in the txgs kstat file always contains 0 following the write throttle restructuring of commit e8b96c6007bf97cdf34869c1ffbd0ce753873a3d. Prior to that commit, the nreserved column showed the number of bytes temporarily reserved in the pool by a transaction group at sync time. The new write throttle did away with temporary reservations and uses the amount of dirty data instead. To approximate the old output of the txgs kstat, the number of dirty bytes per-txg was passed in as the nreserved value to spa_txg_history_set_io(). This approach did not work as intended, because the per-txg dirty value is decremented as data is written out to disk, so it is zero by the time we call spa_txg_history_set_io(). To fix this, save the number of dirty bytes before calling spa_sync(), and pass this value in to spa_txg_history_set_io(). Also, since the name "nreserved" is now a misnomer, the column heading is now labeled "ndirty". Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1696
* dmu_tx kstat cleanupNed Bass2014-03-042-2/+2
| | | | | | | | | | | | | | | | | | | | | | A few counters in the dmu_tx kstats are obsolete or no longer bumped properly. - The sync task restructuring commit 13fe019870c8779bf2f5b3ff731b512cf89133ef removed the code that bumpted dmu_tx_quota. The counter is now bumped in two cases, instead of just the one case as before (after the result of dsl_dataset_check_quota call). The second case is where we check the requested reservation against the actual pool size, as this is an implicit quota of sorts. - The write throttle restructuring commit e8b96c6007bf97cdf34869c1ffbd0ce753873a3d makes dmu_tx_how and dmu_tx_inflight obsolete, so they are removed. Signed-off-by: Kohsuke Kawaguchi <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1914
* Invalidate Linux buffer cache on vdevs upon each flushRichard Yao2014-03-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Userland tools such as blkid, grub2-probe and zdb will go through the buffer cache. However, ZFS uses on submit_bio() to bypass the buffer cache when performing IO operations on vdevs for efficiency purposes. This permits the on-disk state and buffer cache to fall out of synchronization. That causes seemingly random failures when tools reading stale metadata from the buffer cache try to access references to data that is no longer there. A particularly bad failure this causes involves grub2-probe, which is used by grub2-mkconfig. Ordinarily, a rootfs might be called rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe, grub2-mkconfig will generate a configuration file containing /ROOT/gentoo, which omits the pool name and causes a boot failure. This is avoidable by calling invalidate_bdev() on each flush, which is a simple way to ensure that all non-dirty pages are wiped. Since userland tools rarely access vdevs directly, this should be a fancy noop >99.999% of the time and have little impact on IO. We could have tried a finer grained approach for the rare instances in which the vdevs are accessed frequently by userland. However, that would require consideration of corner cases and it is not worth the effort. Memory-wise, it would have been better to use a Linux kernel API hook to disable the buffer cache on such devices, but it provides us no way of doing that, so we opt for this approach instead. We should revisit that idea in the future when higher priority issues have been tackled. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2150
* Illumos #4574 get_clones_stat does not call zap_count in non-debug kernelAlexander Stetsenko2014-03-041-1/+2
| | | | | | | | | | | | | | | | 4574 get_clones_stat does not call zap_count in non-debug kernel Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Marcel Telka <[email protected]> Approved by: Gordon Ross <[email protected]> References: https://www.illumos.org/issues/4574 illumos/illumos-gate@03d1795fa6f720eafbee821ad37f4343c391cfe4 Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2147
* Fix zap_lookup() in feature_is_supported().Tim Chase2014-03-041-1/+1
| | | | | | | | | | The length (number of integers) argument passed to zap_lookup was wrong; likely as a result of performing stack-reduction on the function. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2141
* Remove recursion from dsl_dir_willuse_space()Andrew Barnes2014-03-041-11/+17
| | | | | | | | | | | Remove recursion from dsl_dir_willuse_space() to reduce stack usage. Issues with stack overflow were observed in zfs recv of zvols, likelihood of an overflow is proportional to the depth of the dataset as dsl_dir_willuse_space() recurses to parent datasets. Signed-off-by: Andrew Barnes <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2069
* Set "arc_meta_limit" to 3/4 arc_c_max by defaultPrakash Surya2014-02-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately, this change is an cheap attempt to work around a pathological workload for the ARC. A "real" solution still needs to be fleshed out, so this patch is intended to alleviate the situation in the meantime. Let me try and describe the problem.. Data buffers residing in the dbuf hash table (dbuf cache) will keep a hold on their respective dnode, this dnode will in turn keep a hold on its backing dbuf (the physical block of the dnode object backing it). Since the dnode has a hold on its backing dbuf, the arc buffer for this dbuf is unevictable. What this essentially boils down to, "data" buffers have the potential to pin "metadata" in the arc (as a result of these dnode object buffers being unevictable). This scenario becomes a real problem when the workload consists of many small files (e.g. creating millions of 4K files). With this workload, the arc's "arc_meta_used" space get filled up with buffers for any resident directories as well as buffers for the objset's dnode object. Once the "arc_meta_limit" is reached, the directory buffers will be evicted and only the unevictable dnode object buffers will reside. If the workload is simply creating new small files, these dnode object buffers will never even be needed again, whereas the directory buffers will be used constantly until the creates move to a new directory. If "arc_c" and "arc_meta_limit" are sized appropriately, this situation wont occur. This is because as the data buffers accumulate, "arc_size" will eventually approach "arc_c" (before "arc_meta_used" reaches "arc_meta_limit"); at that point the data buffers will be evicted, which releases the hold on the dnode, which releases the hold on the dnode object's dbuf, which allows that buffer to be evicted from the arc in preference to more "useful" metadata. So, to side step the issue, we simply need to ensure "arc_size" reaches "arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to pick a proper limit, we have to do some math. To make things a little easier to follow, it is assumed that there will only be a single data buffer per file (which is probably always the case for "small" files anyways). Based on the current internals of the arc, if N files residing in the dbuf cache all pin a single dnode buffer (i.e. their dnodes all share the same physical dnode object block), then the following amount of "arc_meta_used" space will be consumed: - 16K for the dnode object's block - [ 16384 bytes] - N * sizeof(dnode_t) -------------- [ N * 928 bytes] - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes] - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes] - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes] To simplify, these N files will pin the following amount of "arc_meta_used" space as unevictable: Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280) Pinned "arc_meta_used" bytes = 17000 + N * 1544 This pinned space is regardless of the size of the files, and is only dependent on the number of pinned dnodes sharing a physical block (i.e. N). For example, 32 512b files sharing a single dnode object block would consume the same "arc_meta_used" space as 32 4K files sharing a single dnode object block. Now, given a files size of S, we can determine the total amount of space that will be consumed in the arc: Total = 17000 + N * 1544 + S * N ^^^^^^^^^^^^^^^^ ^^^^^ metadata data So, given these formulas, we can generate a table which states the ratio of pinned metadata to total arc (meta + data) using different values of N (number of pinned dnodes per pinned physical dnode block) and S (size of the file). File Sizes (S) | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | ---+----------+----------+----------+----------+----------+----------+ 1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 | 2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 | N 4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 | 8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 | 16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 | 32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 | As you can see, if we wanted to support the absolute worst case of 1 dnode per physical dnode block and 512b files, we would have to set the "arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At that point, it essentially defeats the purpose of having an "arc_meta_limit" at all. This patch changes the default value of "arc_meta_limit" to be 75% of "arc_c_max", which should be good enough for "most" workloads (I think). Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Split "data_size" into "meta" and "data"Prakash Surya2014-02-211-14/+24
| | | | | | | | | | | | | | | | | Previously, the "data_size" field in the arcstats kstat contained the amount of cached "metadata" and "data" in the ARC. The problem is this then made it difficult to extract out just the "metadata" size, or just the "data" size. To make it easier to distinguish the two values, "data_size" has been modified to count only buffers of type ARC_BUFC_DATA, and "meta_size" was added to count only buffers of type ARC_BUFC_METADATA. If one wants the old "data_size" value, simply sum the new "data_size" and "meta_size" values. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Prioritize "metadata" in arc_get_data_bufPrakash Surya2014-02-211-22/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the arc is at it's size limit and a new buffer is added, data will be evicted (or recycled) from the arc to make room for this new buffer. As far as I can tell, this is to try and keep the arc from over stepping it's bounds (i.e. keep it below the size limitation placed on it). This makes sense conceptually, but there appears to be a subtle flaw in its current implementation, resulting in metadata buffers being throttled. When it evicts from the arc's lists, it also passes in a "type" so as to remove a buffer of the same type that it is adding. The problem with this is that once the size limit is hit, the ratio of "metadata" to "data" contained in the arc essentially becomes fixed. For example, consider the following scenario: * the size of the arc is capped at 10G * the meta_limit is capped at 4G * 9G of the arc contains "data" * 1G of the arc contains "metadata" Now, every time a new "metadata" buffer is created and added to the arc, an older "metadata" buffer(s) will be removed from the arc; preserving the 9G "data" to 1G "metadata" ratio that was in-place when the size limit was reached. This occurs even though the amount of "metadata" is far below the "metadata" limit. This can result in the arc behaving pathologically for certain workloads. To fix this, the arc_get_data_buf function was modified to evict "data" from the arc even when adding a "metadata" buffer; unless it's at the "metadata" limit. In addition, arc_evict now more closely resembles arc_evict_ghost; such that when evicting "data" from the arc, it may make a second pass over the arc lists and evict "metadata" if it cannot meet the eviction size the first time around. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Remove "arc_meta_used" from arc_adjust calculationPrakash Surya2014-02-211-2/+1
| | | | | | | | | | | | | Using "arc_meta_used" to determine if the arc's mru list is over it's target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Prune metadata from ghost lists in arc_adjust_metaPrakash Surya2014-02-212-20/+49
| | | | | | | | | | | | | | | | | | | | | | | | To maintain a strict limit on the metadata contained in the arc, while preventing the arc buffer headers from completely consuming the "arc_meta_used" space, we need to evict metadata buffers from the arc's ghost lists along with the regular lists. This change modifies arc_adjust_meta such that it more closely models the adjustments made in arc_adjust. "arc_meta_used" is used similarly to "arc_size", and "arc_meta_limit" is used similarly to "arc_c". Testing metadata intensive workloads (e.g. creating, copying, and removing millions of small files and/or directories) has shown this change to make a dramatic improvement to the hit rate maintained in the arc. While I think there is still room for improvement, this is a big step in the right direction. In addition, zpl_free_cached_objects was made into a no-op as I'm not yet sure how to properly implement that function. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Revert "Return -1 from arc_shrinker_func()"Prakash Surya2014-02-211-1/+3
| | | | | | | | | | This reverts commit c11a12bc3b2e5ee9a6bd74e26f1a396b6025fbd4. Out of memory events were fixed by reverting this patch. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Disable arc_p adapt dampener by defaultPrakash Surya2014-02-211-2/+12
| | | | | | | | | | | It's unclear why adjustments to arc_p need to be dampened as they are in arc_adjust. With that said, it's removal significantly improves the arc's ability to "warm up" to a given workload. Thus, I'm disabling by default until its usefulness is better understood. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Allow "arc_p" to drop to zero or grow to "arc_c"Prakash Surya2014-02-211-13/+4
| | | | | | | | | | | | | | | | | | Setting a limit on the minimum value of "arc_p" has been shown to have detrimental effects on the arc hit rate for certain "metadata" intensive workloads. Specifically, this has been exhibited with a workload that constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). What is seen is that the new anon data throttles the mfu list to a negligible size (because arc_p > anon + mru in arc_get_data_buf), even though the mfu ghost list receives a constant stream of hits. To remedy this, arc_p is now allowed to drop to zero if the algorithm deems it necessary. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Disable aggressive arc_p growth by defaultPrakash Surya2014-02-211-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For specific workloads consisting mainly of mfu data and new anon data buffers, the aggressive growth of arc_p found in the arc_get_data_buf() function can have detrimental effects on the mfu list size and ghost list hit rate. Running a workload consisting of two processes: * Process 1 is creating many small files * Process 2 is tar'ing a directory consisting of many small files I've seen arc_p and the mru grow to their maximum size, while the mru ghost list receives 100K times fewer hits than the mfu ghost list. Ideally, as the mfu ghost list receives hits, arc_p should be driven down and the size of the mfu should increase. Given the specific workload I was testing with, the mfu list size should grow to a point where almost no mfu ghost list hits would occur. Unfortunately, this does not happen because the newly dirtied anon buffers constancy drive arc_p to its maximum value and keep it there (effectively prioritizing the mru list and starving the mfu list down to a negligible size). The logic to increment arc_p from within the arc_get_data_buf() function was introduced many years ago in this upstream commit: commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc Author: maybee <none@none> Date: Wed Dec 20 15:46:12 2006 -0800 6505658 target MRU size (arc.p) needs to be adjusted more aggressively and since I don't fully understand the motivation for the change, I am reluctant to completely remove it. As a way to test out how it's removal might affect performance, I've disabled that code by default, but left it tunable via a module option. Thus, if its removal is found to be grossly detrimental for certain workloads, it can be re-enabled on the fly, without a code change. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Adjust arc_p based on "bytes" in arc_shrinkPrakash Surya2014-02-211-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | In an attempt to prevent arc_c from collapsing "too fast", the arc_shrink() function was updated to take a "bytes" parameter by this change: commit 302f753f1657c05a4287226eeda1f53ae431b8a7 Author: Brian Behlendorf <[email protected]> Date: Tue Mar 13 14:29:16 2012 -0700 Integrate ARC more tightly with Linux Unfortunately, that change failed to make a similar change to the way that arc_p was updated. So, there still exists the possibility for arc_p to collapse to near 0 when the kernel start calling the arc's shrinkers. This change attempts to fix this, by decrementing arc_p by the "bytes" parameter in the same way that arc_c is updated. In addition, a minimum value of arc_p is attempted to be maintained, similar to the way a minimum arc_p value is maintained in arc_adapt(). Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2110
* Set zfs_arc_min to 4MBBrian Behlendorf2014-02-211-6/+3
| | | | | | | | | | | | | | | | | | | | | Decrease the mimimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB. 1) Large systems with over a 1TB of memory are being deployed and reserving 1/32 of this memory (32GB) as the mimimum requirement is overkill. 2) Tiny systems like the raspberry pi may only have 256MB of memory in which case 64MB is far too large. The ARC should be reclaimable if the VFS determines it needs the memory for some other purpose. If you want to ensure the ARC is never completely reclaimed due to memory pressure you may still set a larger value with zfs_arc_min. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Issue #2110
* Add erratum for issue #2094Richard Yao2014-02-211-0/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ZoL commit 1421c89 unintentionally changed the disk format in a forward- compatible, but not backward compatible way. This was accomplished by adding an entry to zbookmark_t, which is included in a couple of on-disk structures. That lead to the creation of pools with incorrect dsl_scan_phys_t objects that could only be imported by versions of ZoL containing that commit. Such pools cannot be imported by other versions of ZFS or past versions of ZoL. The additional field has been removed by the previous commit. However, affected pools must be imported and scrubbed using a version of ZoL with this commit applied. This will return the pools to a state in which they may be imported by other implementations. The 'zpool import' or 'zpool status' command can be used to determine if a pool is impacted. A message similar to one of the following means your pool must be scrubbed to restore compatibility. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #1 detected. action: The pool can be imported using its name or numeric identifier, however there is a compatibility issue which should be corrected by running 'zpool scrub' see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... $ zpool status pool: zol-0.6.2-173 state: ONLINE scan: pool compatibility issue detected. see: https://github.com/zfsonlinux/zfs/issues/2094 action: To correct the issue run 'zpool scrub'. config: ... If there was an async destroy in progress 'zpool import' will prevent the pool from being imported. Further advice on how to proceed will be provided by the error message as follows. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #2 detected. action: The pool can not be imported with this version of ZFS due to an active asynchronous destroy. Revert to an earlier version and allow the destroy to complete before updating. see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... Pools affected by the damaged dsl_scan_phys_t can be detected prior to an upgrade by running the following command as root: zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w Note that `poolname` must be replaced with the name of the pool you wish to check. A value of 25 indicates the dsl_scan_phys_t has been damaged. A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0 indicates that there has never been a scrub run on the pool. The regression caused by the change to zbookmark_t never made it into a tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL stable respositorys. Only those using the HEAD version directly from Github after the 0.6.2 but before the 0.6.3 tag are affected. This patch does have one limitation that should be mentioned. It will not detect errata #2 on a pool unless errata #1 is also present. It expected this will not be a significant problem because pools impacted by errata #2 have a high probably of being impacted by errata #1. End users can ensure they do no hit this unlikely case by waiting for all asynchronous destroy operations to complete before updating ZoL. The presence of any background destroys on any imported pools can be checked by running `zpool get freeing` as root. This will display a non-zero value for any pool with an active asynchronous destroy. Lastly, it is expected that no user data has been lost as a result of this erratum. Original-patch-by: Tim Chase <[email protected]> Reworked-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2094
* Add generic errata infrastructureBrian Behlendorf2014-02-212-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From time to time it may be necessary to inform the pool administrator about an errata which impacts their pool. These errata will by shown to the administrator through the 'zpool status' and 'zpool import' output as appropriate. The errata must clearly describe the issue detected, how the pool is impacted, and what action should be taken to resolve the situation. Additional information for each errata will be provided at http://zfsonlinux.org/msg/ZFS-8000-ER. To accomplish the above this patch adds the required infrastructure to allow the kernel modules to notify the utilities that an errata has been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t which has been added to the pool configuration nvlist. To add a new errata the following changes must be made: * A new errata identifier must be assigned by adding a new enum value to the zpool_errata_t type. New enums must be added to the end to preserve the existing ordering. * Code must be added to detect the issue. This does not strictly need to be done at pool import time but doing so will make the errata visible in 'zpool import' as well as 'zpool status'. Once detected the spa->spa_errata member should be set to the new enum. * If possible code should be added to clear the spa->spa_errata member once the errata has been resolved. * The show_import() and status_callback() functions must be updated to include an informational message describing the errata. This should include an action message describing what an administrator should do to address the errata. * The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be updated to describe the errata. This space can be used to provide as much additional information as needed to fully describe the errata. A link to this documentation will be automatically generated in the output of 'zpool import' and 'zpool status'. Original-idea-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Richard Yao <[email protected] Issue #2094
* Revert changes to zbookmark_tRichard Yao2014-02-211-1/+0
| | | | | | | | | | | | | | | | Commit 1421c89142376bfd41e4de22ed7c7846b9e41f95 added a field to zbookmark_t that unintentinoally caused a disk format change. This negatively affected backward compatibility and platform portability. Therefore, this field is being removed. The function that field permitted is left unimplemented until a later patch that will reimplement the field in a way that does not affect the disk format. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2094
* Propagate errors when registering "relatime" property callback.Tim Chase2014-02-121-1/+1
| | | | | | | | | | | | Various errors can occur when registering property callbacks. As the author's comments indicate, the code is very paranoid about preserving the first-seen error when registering callbacks. This patch causes an error seen while registering the "relatime" callback to not clobber a previously-seen error. Reported-by: Jorgen Lundman <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2117
* Fix corrupted l2_asize in arcstatsBrian Behlendorf2014-02-051-1/+1
| | | | | | | | | | | | | | | | | | | Commit e0b0ca9 accidentally corrupted the l2_asize displayed in arcstats. This was caused by changing the l2arc_buf_hdr.b_asize member from an int to uint32_t type. There are places in the code where this field is cast to a uint64_t resulting in the b_hits member being treated as part of b_asize. To resolve the issue the type has been changed to a uint64_t, and the b_hits member is placed after the enum to prevent the size of the structure from increasing. This is a good example of exactly why it's a bad idea to use ambiguous types (int) in these structures. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1990
* 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0Matthew Ahrens2014-01-311-1/+8
| | | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> Refences: https://www.illumos.org/issues/4188 illumos/illumos-gate@bb411a08b05466bfe0c7095b6373bbc1587e259a Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2091
* Illumos 4504 traverse_visitbp: visit group before userMatthew Ahrens2014-01-291-6/+6
| | | | | | | | | | | | | | | | 4504 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> References: https://illumos.org/issues/4504 http://code.delphix.com/illumos-4504 http://svnweb.freebsd.org/base?view=revision&revision=260812 Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #2079
* Implement relatime.Tim Chase2014-01-293-9/+88
| | | | | | | | | | | | Add the "relatime" property. When set to "on", a file's atime will only be updated if the existing atime at least a day old or if the existing ctime or mtime has been updated since the last access. This behavior is compatible with the Linux "relatime" mount option. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2064 Closes #1917
* Call gethrtime() only once per new txg creationCyril Plisko2014-01-232-4/+4
| | | | | | | | | | | | | | | | | | | | | | | When transitioning current open TXG into QUIESCE state and opening a new one txg_quiesce() calls gethrtime(): - to mark the birth time of the new TXG - to record the SPA txg history kstat - implicitely inside spa_txg_history_add() These timestamps are practically the same, so that the first one can be used instead of the other two. The only visible difference is that inside spa_txg_history_add() the time spent in kmem_zalloc() will be counted towards the opened TXG. Since at this point the new TXG already exists (tx->tx_open_txg has been already incremented) it is actually a correct accounting. In any case this extra work is only happening when spa_txg_history kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect the normal processing in any way. Signed-off-by: Cyril Plisko <[email protected]> Issue #2075
* Add additional state TXG_STATE_WAIT_FOR_SYNC for txg.Igor Lvovsky2014-01-232-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | In several cases when digging into kstats we can found two txgs in SYNC state, e.g. txg birth state nreserved nread nwritten ... 985452 258127184872561 C 0 373948416 2376272384 ... 985453 258129016180616 C 0 378173440 28793344 ... 985454 258129016271523 S 0 0 0 ... 985455 258130864245986 S 0 0 0 ... 985456 258130867458851 O 0 0 0 ... However only first txg (985454) is really syncing at this moment. The other one (985455) marked as SYNCED is actually in a post-QUIESCED state and waiting to start sync. So, the new TXG_STATE_WAIT_FOR_SYNC state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to reveal this situation. txg birth state nreserved nread nwritten ... 1086896 235261068743969 C 0 163577856 8437248 ... 1086897 235262870830801 C 0 280625152 822594048 ... 1086898 235264172219064 S 0 0 0 ... 1086899 235264936134407 W 0 0 0 ... 1086900 235264936296156 O 0 0 0 ... Signed-off-by: Igor Lvovsky <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2075
* Use enum type(zfetch_dirn_t) insteadShen Yan2014-01-231-2/+4
| | | | | | | Fix code with zfetch_dirn_t, which is more readable and clear. Signed-off-by: Brian Behlendorf <[email protected]> Closes #2068
* Allow chown/chgrp when no ACL SAs exist.Tim Chase2014-01-231-0/+11
| | | | | | | | | | | | | | | From the comment in the commit: Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES SA in which case ENOENT is returned from zfs_acl_node_read() when the SA can't be located. Allow chown/chgrp to succeed in these cases rather than returning an error that makes no sense in the context of the caller. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue zfs-osx/zfs#86 Closes #1911 Closes #2029
* vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE)Ned Bass2014-01-231-1/+1
| | | | | | | | | | | | The vdev_file_io_start() function may be processing a zio that the txg_sync thread is waiting on. In this case it is not safe to perform memory allocations that may generate new I/O since this could cause a deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of TQ_SLEEP. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1928
* Use long holds in zvol_set_volsize()Brian Behlendorf2014-01-141-55/+67
| | | | | | | | | | | | | | | | | | | Under Linux the zvol_set_volsize() function was originally written to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the dmu_objset_own()/dmu_objset_disown() interfaces were added but the ZVOL code wasn't updated to take advantage of them. This was never an issue but after the dsl_pool_config changes the code now takes the config lock twice. The cleanest solution is to shift to using dmu_objset_own() which takes a long hold on the dataset and does not hold the dsl pool lock. This patch also slightly restructures the existing code such that it more closely resembles the upstream Illumos code. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2039
* Drain iput taskq outside z_teardown_lockBrian Behlendorf2014-01-091-8/+8
| | | | | | | | | | | | | | It's unsafe to drain the iput taskq while holding the z_teardown_lock as a writer. This is because when the last reference on an inode is dropped it may still have pages which need to be written to disk. This will be done through zpl_writepages which will acquire the z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're holding the lock as a writer in zfs_sb_teardown the unmount will deadlock. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #1988
* Force LZ4_FORCE_SW_BITCOUNT for SparcBrian Behlendorf2014-01-091-0/+3
| | | | | | | | | | | | | | This change was proposed for Sparc but it's not clear to me why it's required. Proper support exists in the lz4 code to detect the endianness and the required builtins are available for gcc. Still I'm including the patch because it will only impact Sparc and it may resolve a case which hasn't occured to me. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: marku89 <[email protected]> Issue #1700
* Fix zfs_getattr_fast typesBrian Behlendorf2014-01-091-1/+6
| | | | | | | | | | | | | On Sparc sp->blksize will be a 64-bit value which is then cast incorrectly to a 32-bit value. For big endian systems this results in an incorrect value for sp->blksize. To resolve the problem local variables of the correct size are used and then assigned to sp->blksize. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: marku89 <[email protected]> Issue #1700
* Fix nvlist 'Bus Error' for SparcBrian Behlendorf2014-01-091-2/+4
| | | | | | | | | | | | | The mis-aligned memory accesses in nvpair_native_embedded() and nvpair_native_embedded_array() will cause a 'Bus Error' for architectures such as Sparc which not fully byte addressible. To avoid this issue care is taken to avoid dereferencing the potentially mis-aligned packed nvlist_t. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: marku89 <[email protected]> Issue #1700
* Use local variable to read zp->z_modeBrian Behlendorf2014-01-091-1/+4
| | | | | | | | | | | | | | | | | | When accessing the zp->z_mode through the SA bulk interface we expect that 64-bits are available to hold the result. However, on 32-bit platforms mode_t will only be 32-bits so we cannot pass it to SA_ADD_BULK_ATTR(). Instead a local uint64_t variable must be used and the result assigned to zp->z_mode. This went unnoticed on 32-bit little endian platforms because the bytes happen to end up in the correct 32-bits. But on big endian platforms like Sparc the zp->z_mode will always end up set to zero. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: marku89 <[email protected]> Issue #1700