summaryrefslogtreecommitdiffstats
path: root/module/zfs/dnode.c
Commit message (Collapse)AuthorAgeFilesLines
* Improved dnode allocation and dmu_hold_impl() (#6611)Giuseppe Di Natale2017-09-131-216/+306
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the code, fix errors introduced by commit dbeb879 (PR #6117) interacting badly with large dnodes, and improve performance. * When allocating a new dnode in dmu_object_alloc_dnsize(), update the percpu object ID for the core's metadnode chunk immediately. This eliminates most lock contention when taking the hold and creating the dnode. * Correct detection of the chunk boundary to work properly with large dnodes. * Separate the dmu_hold_impl() code for the FREE case from the code for the ALLOCATED case to make it easier to read. * Fully populate the dnode handle array immediately after reading a block of the metadnode from disk. Subsequently the dnode handle array provides enough information to determine which dnode slots are in use and which are free. * Add several kstats to allow the behavior of the code to be examined. * Verify dnode packing in large_dnode_008_pos.ksh. Since the test is purely creates, it should leave very few holes in the metadnode. * Add test large_dnode_009_pos.ksh, which performs concurrent creates and deletes, to complement existing test which does only creates. With the above fixes, there is very little contention in a test of about 200,000 racing dnode allocations produced by tests 'large_dnode_008_pos' and 'large_dnode_009_pos'. name type data dnode_hold_dbuf_hold 4 0 dnode_hold_dbuf_read 4 0 dnode_hold_alloc_hits 4 3804690 dnode_hold_alloc_misses 4 216 dnode_hold_alloc_interior 4 3 dnode_hold_alloc_lock_retry 4 0 dnode_hold_alloc_lock_misses 4 0 dnode_hold_alloc_type_none 4 0 dnode_hold_free_hits 4 203105 dnode_hold_free_misses 4 4 dnode_hold_free_lock_misses 4 0 dnode_hold_free_lock_retry 4 0 dnode_hold_free_overflow 4 0 dnode_hold_free_refcount 4 57 dnode_hold_free_txg 4 0 dnode_allocate 4 203154 dnode_reallocate 4 0 dnode_buf_evict 4 23918 dnode_alloc_next_chunk 4 4887 dnode_alloc_race 4 0 dnode_alloc_next_block 4 18 The performance is slightly improved for concurrent creates with 16+ threads, and unchanged for low thread counts. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]>
* Fix dnode allocation raceBrian Behlendorf2017-08-081-57/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When performing concurrent object allocations using the new multi-threaded allocator and large dnodes it's possible to allocate overlapping large dnodes. This case should have been handled by detecting an error returned by dnode_hold_impl(). But that logic only checked the returned dnp was not-NULL, and the dnp variable was not reset to NULL when retrying. Resolve this issue by properly checking the return value of dnode_hold_impl(). Additionally, it was possible that dnode_hold_impl() would misreport a dnode as free when it was in fact in use. This could occurs for two reasons: * The per-slot zrl_lock must be held over the entire critical section which includes the alloc/free until the new dnode is assigned to children_dnodes. Additionally, all of the zrl_lock's in the range must be held to protect moving dnodes. * The dn->dn_ot_type cannot be solely relied upon to check the type. When allocating a new dnode its type will be DMU_OT_NONE after dnode_create(). Only latter when dnode_allocate() is called will it transition to the new type. This means there's a window when allocating where it can mistaken for a free dnode. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6414 Closes #6439
* Clean up large dnode codeMatthew Ahrens2017-06-291-1/+2
| | | | | | | | | | | | Resolves issues discovered when porting to OpenZFS. * Lint warnings. * Made dnode_move_impl() large dnode aware. This functionality is currently unused on Linux. Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #6262
* OpenZFS 7968 - multi-threaded spa_sync()Matthew Ahrens2017-03-201-29/+7
| | | | | | | | | | | | | | | | | | | | | | | Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported-by: Matthew Ahrens <[email protected]> spa_sync() iterates over all the dirty dnodes and processes each of them by calling dnode_sync(). If there are many dirty dnodes (e.g. because we created or removed a lot of files), the single thread of spa_sync() calling dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied concurrently in open context (e.g. due to concurrent file creation), the os_lock will experience lock contention via dnode_setdirty(). The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to use separate threads to process each of the sublists in the multilist. OpenZFS-issue: https://www.illumos.org/issues/7968 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4a2a54c Closes #5752
* OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_spaceBrian Behlendorf2017-03-071-19/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: Steve Gonczi <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() — typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_*() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_* accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_*()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_*(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This logic attempted to predict what will be on disk when this txg syncs, to know if the overwritten block will be freed (i.e. exists, and has no snapshots). OpenZFS-issue: https://www.illumos.org/issues/7793 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a Upstream bugs: DLPX-32883a Closes #5804 Porting notes: - DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(), Using the default dnode size would be slightly better. - DEBUG_DMU_TX wrappers and configure option removed. - Resolved _by_dnode() conflicts these changes have not yet been applied to OpenZFS.
* OpenZFS 7500 - Simplify dbuf_free_range by removing dn_unlisted_l0_blkidGeorge Melikov2017-01-261-6/+1
| | | | | | | | | | | | Authored by: Stephen Blinick <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Approved by: Gordon Ross <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: George Melikov <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7500 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/653af1b Closes #5639
* OpenZFS 6676 - Race between unique_insert() and unique_remove() causes ZFS ↵George Melikov2017-01-261-3/+3
| | | | | | | | | | | | | | | | | | fsid change Authored by: Josef 'Jeff' Sipek <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Sanjay Nadkarni <[email protected]> Reviewed by: Dan Vatca <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: George Melikov <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6676 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40510e8 Closes #5667
* Fix unallocated object detection for large_dnode datasetsLOLi2017-01-131-0/+1
| | | | | | | | | | | | | | | | | Fix dmu_object_next() to correctly handle unallocated objects on large_dnode datasets. We implement this by scanning the dnode block until we find the correct offset to be used in dnode_next_offset(). This is necessary because we can't assume *objectp is a hole even if dmu_object_info() returns ENOENT. This fixes a couple of issues with zfs receive on large_dnode datasets. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #5027 Closes #5532
* Use cstyle -cpP in `make cstyle` checkBrian Behlendorf2016-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes. Non-whitespace changes are as follows: * 8 times ; to { } in for/while loop * fix missing ; in cmd/zed/agents/zfs_diagnosis.c * comment (confim -> confirm) * change endline , to ; in cmd/zpool/zpool_main.c * a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks * /* CSTYLED */ markers * change == 0 to ! * ulong to unsigned long in module/zfs/dsl_scan.c * rearrangement of module_param lines in module/zfs/metaslab.c * add { } block around statement after for_each_online_node Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Håkan Johansson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5465
* OpenZFS 6988 spa_sync() spends half its time in dmu_objset_do_userquota_updatesJinshan Xiong2016-10-071-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using a benchmark which creates 2 million files in one TXG, I observe that the thread running spa_sync() is on CPU almost the entire time we are syncing, and therefore can be a performance bottleneck. About 50% of the time in spa_sync() is in dmu_objset_do_userquota_updates(). The problem is that dmu_objset_do_userquota_updates() calls zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was modified (or created). In this benchmark, all the files are owned by the same user/group, so all 2 million calls to zap_increment_int() are modifying the same entry in the zap. The same issue exists for the DMU_GROUPUSED_OBJECT. We should keep an in-memory map from user to space delta while we are syncing, and when we finish, iterate over the in-memory map and modify the ZAP once per entry. This reduces the number of calls to zap_increment_int() from "number of objects modified" to "number of owners/groups of modified files". This reduced the time spent in spa_sync() in the file create benchmark by ~33%, from 11 seconds to 7 seconds. Upstream bugs: DLPX-44799 Ported by: Ned Bass <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6988 ZFSonLinux-issue: https://github.com/zfsonlinux/zfs/issues/4642 OpenZFS-commit: unmerged Porting notes: - Added curly braces around declaration of userquota_cache_t cache to quiet compiler warning; - Handled the userobj accounting the same way it proposed in this path. Signed-off-by: Jinshan Xiong <[email protected]>
* fix: Shift exponent too largeGvozden Neskovic2016-09-291-2/+17
| | | | | | | | | | | | Undefined operation is reported by running ztest (or zloop) compiled with GCC UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection with large enough offset values. Logically, left shift operation would work, but bit shift semantics in C, and limitation of uint64_t, do not produce desired result. Issue #5059, #4883 Signed-off-by: Gvozden Neskovic <[email protected]>
* OpenZFS 6950 - ARC should cache compressed dataGeorge Wilson2016-09-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently access data uncompressed. I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block. Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one. Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata. When an l2arc is inuse it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled that the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining it's valid. OpenZFS-issue: https://www.illumos.org/issues/6950 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0 Issue #5078
* Performance optimization of AVL tree comparator functionsGvozden Neskovic2016-08-311-19/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Richard Elling <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5033
* Limit the amount of dnode metadata in the ARCTim Chase2016-07-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_mata_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4345 Issue #4512 Issue #4773 Closes #4858
* Implement large_dnode pool featureNed Bass2016-06-241-42/+196
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3542
* Illumos 5987 - zfs prefetch code needs workMatthew Ahrens2016-01-121-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5987 zfs prefetch code needs work Reviewed by: Adam Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Gordon Ross <[email protected]> References: https://www.illumos.org/issues/5987 zfs prefetch code needs work illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work Porting notes: - [module/zfs/dbuf.c] - 5f6d0b6 Handle block pointers with a corrupt logical size - [module/zfs/dmu_zfetch.c] - c65aa5b Fix gcc missing parenthesis warnings - 428870f Update core ZFS code from build 121 to build 141. - 79c76d5 Change KM_PUSHPAGE -> KM_SLEEP - b8d06fc Switch KM_SLEEP to KM_PUSHPAGE - Account for ISO C90 - mixed declarations and code - warnings - Module parameters (new/changed): - Replaced zfetch_block_cap with zfetch_max_distance (Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024)) - Preserved zfs_prefetch_disable as 'int' for consistency with existing Linux module options. - [include/sys/trace_arc.h] - Added new tracepoints - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async); - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch); - [man/man5/zfs-module-parameters.5] - Updated man page Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5960, 5925Paul Dagnelie2016-01-081-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5960 zfs recv should prefetch indirect blocks 5925 zfs receive -o origin= Reviewed by: Prakash Surya <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> References: https://www.illumos.org/issues/5960 https://www.illumos.org/issues/5925 https://github.com/illumos/illumos-gate/commit/a2cdcdd Porting notes: - [lib/libzfs/libzfs_sendrecv.c] - b8864a2 Fix gcc cast warnings - 325f023 Add linux kernel device support - 5c3f61e Increase Linux pipe buffer size on 'zfs receive' - [module/zfs/zfs_vnops.c] - 3558fd7 Prototype/structure update for Linux - c12e3a5 Restructure zfs_readdir() to fix regressions - [module/zfs/zvol.c] - Function @zvol_map_block() isn't needed in ZoL - 9965059 Prefetch start and end of volumes - [module/zfs/dmu.c] - Fixed ISO C90 - mixed declarations and code - Function dmu_prefetch() 'int i' is initialized before the following code block (c90 vs. c99) - [module/zfs/dbuf.c] - fc5bb51 Fix stack dbuf_hold_impl() - 9b67f60 Illumos 4757, 4913 - 34229a2 Reduce stack usage for recursive traverse_visitbp() - [module/zfs/dmu_send.c] - Fixed ISO C90 - mixed declarations and code - b58986e Use large stacks when available - 241b541 Illumos 5959 - clean up per-dataset feature count code - 77aef6f Use vmem_alloc() for nvlists - 00b4602 Add linux kernel memory support Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]>
* Identify locks flagged by lockdepOlaf Faaland2015-12-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible recursive locking in some cases and possible circular locking dependency in others, within the SPL and ZFS modules. This patch uses a mutex type defined in SPL, MUTEX_NOLOCKDEP, to mark such mutexes when they are initialized. This mutex type causes attempts to take or release those locks to be wrapped in lockdep_off() and lockdep_on() calls to silence the dependency checker and allow the use of lock_stats to examine contention. For RW locks, it uses an analogous lock type, RW_NOLOCKDEP. The goal is that these locks are ultimately changed back to type MUTEX_DEFAULT or RW_DEFAULT, after the locks are annotated to reflect their relationship (e.g. z_name_lock below) or any real problem with the lock dependencies are fixed. Some of the affected locks are: tc_open_lock: ============= This is an array of locks, all with same name, which txg_quiesce must take all of in order to move txg to next state. All default to the same lockdep class, and so to lockdep appears recursive. zp->z_name_lock: ================ In zfs_rmdir, dzp = znode for the directory (input to zfs_dirent_lock) zp = znode for the entry being removed (output of zfs_dirent_lock) zfs_rmdir()->zfs_dirent_lock() takes z_name_lock in dzp zfs_rmdir() takes z_name_lock in zp Since both dzp and zp are type znode_t, the locks have the same default class, and lockdep considers it a possible recursive lock attempt. l->l_rwlock: ============ zap_expand_leaf() sometimes creates two new zap leaf structures, via these call paths: zap_deref_leaf()->zap_get_leaf_byblk()->zap_leaf_open() zap_expand_leaf()->zap_create_leaf()->zap_expand_leaf()->zap_create_leaf() Because both zap_leaf_open() and zap_create_leaf() initialize l->l_rwlock in their (separate) leaf structures, the lockdep class is the same, and the linux kernel believes these might both be the same lock, and emits a possible recursive lock warning. Signed-off-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3895
* Illumos 5911 - ZFS "hangs" while deleting fileMatthew Ahrens2015-07-061-14/+65
| | | | | | | | | | | | | | | | | | | | | | 5911 ZFS "hangs" while deleting file Reviewed by: Bayard Bell <[email protected]> Reviewed by: Alek Pinchuk <[email protected]> Reviewed by: Simon Klinkert <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/5911 https://github.com/illumos/illumos-gate/commit/46e1baa Porting notes: Resolved ISO C90 forbids mixed declarations and code wanting in the dnode_free_range() function. Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3554
* Illumos 5422 - preserve AVL invariants in dn_dbufsAlex Reece2015-05-111-6/+4
| | | | | | | | | | | | | | | | | Author: Alex Reece <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Reviewed by: Albert Lee <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5422 https://github.com/illumos/illumos-gate/commit/a846f19 Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3381
* Illumos 5027 - zfs large block supportMatthew Ahrens2015-05-111-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5027 zfs large block support Reviewed by: Alek Pinchuk <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5027 https://github.com/illumos/illumos-gate/commit/b515258 Porting Notes: * Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255. * Unlike the upstream Illumos commit this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option. * By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement but the benefits of going beyond this for the majority of workloads are less clear. * The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M. * On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this one the ABD patches are merged. Ported-by: Brian Behlendorf <[email protected]> Closes #354
* Illumos 5056 - ZFS deadlock on db_mtx and dn_holdsJustin T. Gibbs2015-04-281-30/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5056 ZFS deadlock on db_mtx and dn_holds Author: Justin Gibbs <[email protected]> Reviewed by: Will Andrews <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5056 https://github.com/illumos/illumos-gate/commit/bc9014e Porting Notes: sa_handle_get_from_db(): - the original patch includes an otherwise unmentioned fix for a possible usage of an uninitialised variable dmu_objset_open_impl(): - Under Illumos list_link_init() is the same as filling a list_node_t with NULLs, so they don't notice if they miss doing list_link_init() on a zero'd containing structure (e.g. allocated with kmem_zalloc as here). Under Linux, not so much: an uninitialised list_node_t goes "Boom!" some time later when it's used or destroyed. dmu_objset_evict_dbufs(): - reduce stack usage using kmem_alloc() Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5314 - Remove "dbuf phys" db->db_data pointer aliases in ZFSJustin T. Gibbs2015-04-281-1/+1
| | | | | | | | | | | | | | | | 5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS Author: Justin T. Gibbs <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Will Andrews <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5314 https://github.com/illumos/illumos-gate/commit/c137962 Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5095 - panic when adding a duplicate dbuf to dn_dbufsAlex Reece2015-04-281-16/+18
| | | | | | | | | | | | | | | | | | | 5095 panic when adding a duplicate dbuf to dn_dbufs Author: Alex Reece <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Mattew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Josef Sipek <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5095 https://github.com/illumos/illumos-gate/commit/86bb58a Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5038 - Remove "old-style" flexible array usage in ZFS.Justin T. Gibbs2015-04-281-3/+3
| | | | | | | | | | | | | | | 5038 Remove "old-style" flexible array usage in ZFS. Author: Justin T. Gibbs <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/5038 https://github.com/illumos/illumos-gate/commit/7f18da4 Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 4873 - zvol unmap calls can take a very long time for larger datasetsAlex Reece2015-04-281-11/+48
| | | | | | | | | | | | | | | | | | | | | | | | | 4873 zvol unmap calls can take a very long time for larger datasets Author: Alex Reece <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Basil Crow <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/4873 https://github.com/illumos/illumos-gate/commit/0f6d88a Porting Notes: dbuf_free_range(): - reduce stack usage using kmem_alloc() - the sorted AVL tree will handle the spill block case correctly without all the special handling in the for() loop Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 4975 - missing mutex_destroy() calls in zfsJorgen Lundman2015-04-281-0/+5
| | | | | | | | | | | | | | | | | | | 4975 missing mutex_destroy() calls in zfs Author: Jorgen Lundman <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Rich Lowe <[email protected]> Reviewed by: Seth Nimbosa <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Don Brady <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4975 https://github.com/illumos/illumos-gate/commit/d2b3cbb Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruptionJustin T. Gibbs2015-03-121-1/+7
| | | | | | | | | | | | | | | | | | 5630 stale bonus buffer in recycled dnode_t leads to data corruption Author: Justin T. Gibbs <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Will Andrews <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5630 https://github.com/illumos/illumos-gate/commit/cd485b4 Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #3172
* Change KM_PUSHPAGE -> KM_SLEEPBrian Behlendorf2015-01-161-2/+2
| | | | | | | | | | | | | | | By marking DMU transaction processing contexts with PF_FSTRANS we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings us back in line with upstream. In some cases this means simply swapping the flags back. For others fnvlist_alloc() was replaced by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to fnvlist_alloc() which assumes KM_SLEEP. The one place KM_PUSHPAGE is kept is when allocating ARC buffers which allows us to dip in to reserved memory. This is again the same as upstream. Signed-off-by: Brian Behlendorf <[email protected]>
* Retire KM_NODEBUGBrian Behlendorf2015-01-161-2/+1
| | | | | | | | | | | | | | | Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress the large allocation warning have been replaced by vmem_alloc() as appropriate. The updated vmem_alloc() call will not print a warning regardless of the size of the allocation. A careful reader will notice that not all callers have been changed to vmem_alloc(). Some have only had the KM_NODEBUG flag removed. This was possible because the default warning threshold has been increased to 32k. This is desirable because it minimizes the need for Linux specific code changes. Signed-off-by: Brian Behlendorf <[email protected]>
* Remove duplicate typedefs from trace.hNed Bass2015-01-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate typedef declarations with the same type. The trace.h header contains some typedefs to avoid 'unknown type' errors for C files that haven't declared the type in question. But this causes build failures for C files that have already declared the type. Newer versions of GCC (e.g. v4.6) allow duplicate typedefs with the same type unless pedantic error checking is in force. To support the older versions we need to remove the duplicate typedefs. Removal of the typedefs means we can't built tracepoints code using those types unless the required headers have been included. To facilitate this, all tracepoint event declarations have been moved out of trace.h into separate headers. Each new header is explicitly included from the C file that uses the events defined therein. The trace.h header is still indirectly included form zfs_context.h and provides the implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces. This makes those interfaces readily available throughout the code base. The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also still provided by trace.h, so it is a prerequisite for the other trace_*.h headers. These new Linux implementation-specific headers do introduce a small divergence from upstream ZFS in several core C files, but this should not present a significant maintenance burden. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2953
* Illumos 5139 - SEEK_HOLE failed to report a hole at end of fileMatthew Ahrens2014-09-231-0/+9
| | | | | | | | | | | | | | | | | | | | 5139 SEEK_HOLE failed to report a hole at end of file Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: Peng Dai <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5139 https://github.com/illumos/illumos-gate/commit/0fbc0cd Ported by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2714
* Illumos 4757, 4913Matthew Ahrens2014-08-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4757 https://www.illumos.org/issues/4913 https://github.com/illumos/illumos-gate/commit/5d7b4d4 Porting notes: For compatibility with the fastpath code the zio_done() function needed to be updated. Because embedded-data block pointers do not require DVAs to be allocated the associated vdevs will not be marked and therefore should not be unmarked. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2544
* Illumos #4374Matthew Ahrens2014-07-301-105/+20
| | | | | | | | | | | | | | | | | | | 4374 dn_free_ranges should use range_tree_t Reviewed by: George Wilson <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: Christopher Siden <[email protected] Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4374 https://github.com/illumos/illumos-gate/commit/bf16b11 Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2531
* Illumos 4370, 4371Max Grossman2014-07-281-31/+22
| | | | | | | | | | | | | | | | | | | | 4370 avoid transmitting holes during zfs send 4371 DMU code clean up Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Approved by: Garrett D'Amore <[email protected]>a References: https://www.illumos.org/issues/4370 https://www.illumos.org/issues/4371 https://github.com/illumos/illumos-gate/commit/43466aa Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2529
* Illumos #4101, #4102, #4103, #4105, #4106George Wilson2014-07-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Garrett D'Amore <[email protected]> Prior to this patch, space_maps were preferred solely based on the amount of free space left in each. Unfortunately, this heuristic didn't contain any information about the make-up of that free space, which meant we could keep preferring and loading a highly fragmented space map that wouldn't actually have enough contiguous space to satisfy the allocation; then unloading that space_map and repeating the process. This change modifies the space_map's to store additional information about the contiguous space in the space_map, so that we can use this information to make a better decision about which space_map to load. This requires reallocating all space_map objects to increase their bonus buffer size sizes enough to fit the new metadata. The above feature can be enabled via a new feature flag introduced by this change: com.delphix:spacemap_histogram In addition to the above, this patch allows the space_map block size to be increase. Currently the block size is set to be 4K in size, which has certain implications including the following: * 4K sector devices will not see any compression benefit * large space_maps require more metadata on-disk * large space_maps require more time to load (typically random reads) Now the space_map block size can adjust as needed up to the maximum size set via the space_map_max_blksz variable. A bug was fixed which resulted in potentially leaking an object when removing a mirrored log device. The previous logic for vdev_remove() did not deal with removing top-level vdevs that are interior vdevs (i.e. mirror) correctly. The problem would occur when removing a mirrored log device, and result in the DTL space map object being leaked; because top-level vdevs don't have DTL space map objects associated with them. References: https://www.illumos.org/issues/4101 https://www.illumos.org/issues/4102 https://www.illumos.org/issues/4103 https://www.illumos.org/issues/4105 https://www.illumos.org/issues/4106 https://github.com/illumos/illumos-gate/commit/0713e23 Porting notes: A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also, the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary. Ported-by: Tim Chase <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2488
* Use default slab typesBrian Behlendorf2014-05-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should not override the default memory type of the kmem cache. This was done previously to force certain objects which were slightly over object size limit cut off in to KMC_KMEM caches for better performance. The zfsonlinux/spl#356 patch slightly increases the default cut off from 511 bytes 1024 bytes for x86_64. This means there is long longer a need to override the default for the caches. And since the default values are now being used the new spl_kmem_cache_slab_limit and spl_kmem_cache_kmem_limit tunables will apply to all kmem caches. The following is a list of caches that will be impacted: | object size | forced type | default type ----------------- | ------------- | ------------- | -------------- dnode_t | 936 bytes | KMC_KMEM | KMC_KMEM zio_cache | 1104 bytes | *KMC_KMEM | *KMC_VMEM zio_link_cache | 48 bytes | KMC_KMEM | KMC_KMEM zio_vdev_cache | 131088 bytes | KMC_VMEM | KMC_VMEM zio_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM zio_data_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM zio_buf_1024 | 1024 bytes | KMC_KMEM | KMC_KMEM zio_data_buf_1024 | 1024 bytes | +KMC_VMEM | +KMC_KMEM * Cache memory type will change from KMC_KMEM to KMC_VMEM. + Cache memory type will change from KMC_VMEM to KMC_KMEM. This patch removes another slight point of divergence between ZoL and Illumos. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #2337
* Illumos #4045 write throttle & i/o scheduler performance workMatthew Ahrens2013-12-061-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4045 zfs write throttle & i/o scheduler performance work 1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithem (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details. 2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details. This diff has several other effects, including: * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead. * the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this. * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up). --matt APPENDIX: problems with the current i/o scheduler The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds). If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes). Notes on porting to ZFS on Linux: - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields. - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose. - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above. - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic. - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page. spa_asize_inflation zfs_deadman_synctime_ms zfs_vdev_max_active zfs_vdev_async_write_active_min_dirty_percent zfs_vdev_async_write_active_max_dirty_percent zfs_vdev_async_read_max_active zfs_vdev_async_read_min_active zfs_vdev_async_write_max_active zfs_vdev_async_write_min_active zfs_vdev_scrub_max_active zfs_vdev_scrub_min_active zfs_vdev_sync_read_max_active zfs_vdev_sync_read_min_active zfs_vdev_sync_write_max_active zfs_vdev_sync_write_min_active zfs_dirty_data_max_percent zfs_delay_min_dirty_percent zfs_dirty_data_max_max_percent zfs_dirty_data_max zfs_dirty_data_max_max zfs_dirty_data_sync zfs_delay_scale The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures. The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults. - Fixed reversed logic in comment above zfs_delay_scale declaration. - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect. - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen). - The original patch would have regressed the bug fixed in zfsonlinux/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate(). - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack. Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Ned Bass <[email protected]> Reviewed by: Brendan Gregg <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: http://www.illumos.org/issues/4045 illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e Ported-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1913
* Illumos #4047Matthew Ahrens2013-11-051-1/+6
| | | | | | | | | | | | | | | | | | | | | 4047 panic from dbuf_free_range() from dmu_free_object() while doing zfs receive Reviewed by: Adam Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4047 illumos/illumos-gate@713d6c208802cfbb806329ec0d154b641b80c355 Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1775 Porting notes: 1. The exported symbol dmu_free_object() was renamed to dmu_free_long_object() in Illumos.
* Illumos #3742Will Andrews2013-11-041-8/+10
| | | | | | | | | | | | | | | | | | | | | 3742 zfs comments need cleaner, more consistent style Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Eric Schrock <[email protected]> Approved by: Christopher Siden <[email protected]> References: https://www.illumos.org/issues/3742 illumos/illumos-gate@f7170741490edba9d1d9c697c177c887172bc741 Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1775 Porting notes: 1. The change to zfs_vfsops.c was dropped because it involves zfs_mount_label_policy, which does not exist in the Linux port.
* Illumos #3598Matthew Ahrens2013-10-311-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | 3598 want to dtrace when errors are generated in zfs Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/3598 illumos/illumos-gate@be6fd75a69ae679453d9cda5bff3326111e6d1ca Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1775 Porting notes: 1. include/sys/zfs_context.h has been modified to render some new macros inert until dtrace is available on Linux. 2. Linux-specific changes have been adapted to use SET_ERROR(). 3. I'm NOT happy about this change. It does nothing but ugly up the code under Linux. Unfortunately we need to take it to avoid more merge conflicts in the future. -Brian
* Illumos #3464Matthew Ahrens2013-09-041-1/+5
| | | | | | | | | | | | | | | | | 3464 zfs synctask code needs restructuring Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/3464 illumos/illumos-gate@3b2aab18808792cbd248a12f1edf139b89833c13 Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1495
* Illumos #3006Madhav Suresh2013-06-191-38/+38
| | | | | | | | | | | | | | | | | | | | 3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument is zero Reviewed by Matt Ahrens <[email protected]> Reviewed by George Wilson <[email protected]> Approved by Eric Schrock <[email protected]> References: illumos/illumos-gate@fb09f5aad449c97fe309678f3f604982b563a96f https://illumos.org/issues/3006 Requires: zfsonlinux/spl@1c6d149feb4033e4a56fb987004edc5d45288bcb Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1509
* Illumos #2619 and #2747Christopher Siden2013-01-081-8/+11
| | | | | | | | | | | | | | | | | | | | | | 2619 asynchronous destruction of ZFS file systems 2747 SPA versioning with zfs feature flags Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Dan Kruchinin <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@53089ab7c84db6fb76c16ca50076c147cda11757 illumos/illumos-gate@ad135b5d644628e791c3188a6ecbd9c257961ef8 illumos changeset: 13700:2889e2596bd6 https://www.illumos.org/issues/2619 https://www.illumos.org/issues/2747 NOTE: The grub specific changes were not ported. This change must be made to the Linux grub packages. Ported-by: Brian Behlendorf <[email protected]>
* Switch KM_SLEEP to KM_PUSHPAGERichard Yao2012-08-271-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: 21ade34 Disable direct reclaim for z_wr_* threads cfc9a5c Fix zpl_writepage() deadlock eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* Improve meta data performanceBrian Behlendorf2011-11-031-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Profiling the system during meta data intensive workloads such as creating/removing millions of files, revealed that the system was cpu bound. A large fraction of that cpu time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size. For large objects virtual memory is usually preferable and for small object physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object. For the following caches it turns out they were just over the line: * dnode_cache * zio_cache * zio_link_cache * zio_buf_512_cache * zfs_data_buf_512_cache Now because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized. We can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance since unrelated processes could also block of this same global lock. Finally, we may see less cpu time being burned in the arc_reclaim and txg_sync_threads. Signed-off-by: Brian Behlendorf <[email protected]> Issue #258
* Fix zio_execute() deadlockBrian Behlendorf2011-07-191-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To avoid deadlocking the system it is crucial that all memory allocations performed in the zio_execute() call path are marked KM_PUSHPAGE (GFP_NOFS). This ensures that while a z_wr_iss thread is processing the syncing transaction group it does not re-enter the filesystem code and deadlock on itself. Call Trace: [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl] [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs] [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs] [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs] [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff81159932>] kmem_getpages+0x62/0x170 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160 [<ffffffff8115b059>] __kmalloc+0x189/0x220 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl] [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs] [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs] [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs] [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs] [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs] [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs] [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs] [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs] [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl] [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <[email protected]> Issue #291
* Quiet down very frequent large allocation warning in ZFS.Ricardo M. Correia2010-09-171-1/+1
| | | | | | | | | In my machine, dnode_hold_impl() allocates 9992 bytes in DEBUG mode and it causes a large stream of stack traces in the logs. Instead, use KM_NODEBUG to quiet down this known large alloc. Signed-off-by: Ricardo M. Correia <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Fix dnode_move() scopeBrian Behlendorf2010-08-311-1/+3
| | | | | | The dnode_move() functionality is only used in the kernel build. As such we should be careful to wrap all of the related code with '#ifdef _KERNEL' to avoid gcc warnings about unused code.
* Fix gcc unused variable warningsBrian Behlendorf2010-08-311-1/+1
| | | | | | Gcc -Wall warn: 'unused variable' Signed-off-by: Brian Behlendorf <[email protected]>