| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4328
|
|
|
|
|
|
|
|
| |
The libzfs_graph.c source file should have been removed in 330d06f,
it is entirely unused.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4766
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands. In addition, non-
privileged users should be able to run all of the following commands:
* zpool [list | iostat | status | get]
* zfs [list | get]
Historically this functionality was not available on Linux. In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability. Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.
Even with this change some limitations remain. Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace). This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code. It
may be possible to add this functionality in the future.
This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite. These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.
Two minor bug fixes were required for test-running.py. First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it. Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.
Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #362
Closes #434
Closes #4100
Closes #4394
Closes #4410
Closes #4487
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.
New zcommon module parameters:
- zfs_fletcher_4_impl (str): selects the implementation to use.
"fastest" - use the fastest version available
"cycle" - cycle trough all available impl for ztest
"scalar" - use the original version
"avx2" - new AVX2 implementation if available
Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar: 4216 MB/s
- AVX2: 14499 MB/s
See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Jinshan Xiong <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4330
|
|
|
|
|
|
|
|
|
| |
Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4717
Issue #4665
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130
Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.
$ zpool iostat -r
mypool sync_read sync_write async_read async_write scrub
req_size ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 0 0 0 0 0 0 530 0 0 0
1K 0 0 260 0 0 0 116 246 0 0
2K 0 0 0 0 0 0 0 431 0 0
4K 0 0 0 0 0 0 3 107 0 0
8K 15 0 35 0 0 0 0 6 0 0
16K 0 0 0 0 0 0 0 39 0 0
32K 0 0 0 0 0 0 0 0 0 0
64K 20 0 40 0 0 0 0 0 0 0
128K 0 0 20 0 0 0 0 0 0 0
256K 0 0 0 0 0 0 0 0 0 0
512K 0 0 0 0 0 0 0 0 0 0
1M 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 155 19 0 0
8M 0 0 0 0 0 0 0 811 0 0
16M 0 0 0 0 0 0 0 68 0 0
--------------------------------------------------------------------------------
Also rename the stray "-G" in the man page to be "-w" for latency histograms.
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #4659
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Register iterate_shared if it exists so the kernel will used shared
lock and allowing concurrent readdir.
Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE
to allow concurrent seeking.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4664
Closes #4665
|
|
|
|
|
|
| |
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4665
|
|
|
|
|
|
| |
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4665
|
|
|
|
|
|
|
|
|
|
|
|
| |
This field is a duplicate of the inode->i_generation, so just
kill it.
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4538
Closes #4654
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 83025286175d1ee1c29b842531070f3250a172ba and
ebecfcd6991bebe71511cb8fd409112798f203b2 which broke the build.
While these patches do apply cleanly and passed previous test
runs they need to be updated to account for the changes made in
commit 241b5415748859a3c272fc8f570f2368e93adde9.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3878
|
|
|
|
|
| |
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3878
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest. This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4023
Issue #4024
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
struct zvol_state contains a dummy znode, which is around 1KB on x64,
only for zfs_range_lock. But in reality, other than z_range_lock and
z_range_avl, zfs_range_lock only need znode on regular file, which
means we add 1KB on a structure and gain nothing.
In this patch, we remove the dummy znode for zvol_state. In order to
do that, we also need to refactor zfs_range_lock a bit. We move
z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t.
This new struct replaces znode_t as the main handle inside the range
lock functions.
We also add pointers to z_size, z_blksz, and z_max_blksz so range lock
code doesn't depend on znode_t. This allows non-ZPL consumers like
Lustre to use the range locks with their equivalent znode_t structure.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4510
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Userland version of cv_timedwait_hires() always assumes absolute time.
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Robert Mustacchi <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported by: Denys Rtveliashvili <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/6739
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/41c6413
Porting Notes:
The ported change has revealed a number of problems in the Linux-specific code,
as it was expecting incorrect return codes from pthread_* functions.
Reviewed and improved the usage of pthread_* function in lib/libzpool/kernel.c.
|
|
|
|
|
|
|
|
|
|
|
| |
The was originally using interruptible cv_timedwait_sig, but was changed
to uninterruptible cv_timedwait_hires in ae6d0c6. Use _sig_hires instead
to allow interruptible sleep.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4633
Closes #4634
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 4cd77889b684fd0dd1a0a995b692dda3db76a9ac. The
i_generation field in the inode is 32-bit and the SA code expects
64-bit fixed values. Revert this optimization for now until
this is cleanly addressed.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4538
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Update the zfs module to collect statistics on average latencies, queue sizes,
and keep an internal histogram of all IO latencies. Along with this, update
"zpool iostat" with some new options to print out the stats:
-l: Include average IO latencies stats:
total_wait disk_wait syncq_wait asyncq_wait scrub
read write read write read write read write wait
----- ----- ----- ----- ----- ----- ----- ----- -----
- 41ms - 2ms - 46ms - 4ms -
- 5ms - 1ms - 1us - 4ms -
- 5ms - 1ms - 1us - 4ms -
- - - - - - - - -
- 49ms - 2ms - 47ms - - -
- - - - - - - - -
- 2ms - 1ms - - - 1ms -
----- ----- ----- ----- ----- ----- ----- ----- -----
1ms 1ms 1ms 413us 16us 25us - 5ms -
1ms 1ms 1ms 413us 16us 25us - 5ms -
2ms 1ms 2ms 412us 26us 25us - 5ms -
- 1ms - 413us - 25us - 5ms -
- 1ms - 460us - 29us - 5ms -
196us 1ms 196us 370us 7us 23us - 5ms -
----- ----- ----- ----- ----- ----- ----- ----- -----
-w: Print out latency histograms:
sdb total disk sync_queue async_queue
latency read write read write read write read write scrub
------- ------ ------ ------ ------ ------ ------ ------ ------ ------
1ns 0 0 0 0 0 0 0 0 0
...
33us 0 0 0 0 0 0 0 0 0
66us 0 0 107 2486 2 788 12 12 0
131us 2 797 359 4499 10 558 184 184 6
262us 22 801 264 1563 10 286 287 287 24
524us 87 575 71 52086 15 1063 136 136 92
1ms 152 1190 5 41292 4 1693 252 252 141
2ms 245 2018 0 50007 0 2322 371 371 220
4ms 189 7455 22 162957 0 3912 6726 6726 199
8ms 108 9461 0 102320 0 5775 2526 2526 86
17ms 23 11287 0 37142 0 8043 1813 1813 19
34ms 0 14725 0 24015 0 11732 3071 3071 0
67ms 0 23597 0 7914 0 18113 5025 5025 0
134ms 0 33798 0 254 0 25755 7326 7326 0
268ms 0 51780 0 12 0 41593 10002 10002 0
537ms 0 77808 0 0 0 64255 13120 13120 0
1s 0 105281 0 0 0 83805 20841 20841 0
2s 0 88248 0 0 0 73772 14006 14006 0
4s 0 47266 0 0 0 29783 17176 17176 0
9s 0 10460 0 0 0 4130 6295 6295 0
17s 0 0 0 0 0 0 0 0 0
34s 0 0 0 0 0 0 0 0 0
69s 0 0 0 0 0 0 0 0 0
137s 0 0 0 0 0 0 0 0 0
-------------------------------------------------------------------------------
-h: Help
-H: Scripted mode. Do not display headers, and separate fields by a single
tab instead of arbitrary space.
-q: Include current number of entries in sync & async read/write queues,
and scrub queue:
syncq_read syncq_write asyncq_read asyncq_write scrubq_read
pend activ pend activ pend activ pend activ pend activ
----- ----- ----- ----- ----- ----- ----- ----- ----- -----
0 0 0 0 78 29 0 0 0 0
0 0 0 0 78 29 0 0 0 0
0 0 0 0 0 0 0 0 0 0
- - - - - - - - - -
0 0 0 0 0 0 0 0 0 0
- - - - - - - - - -
0 0 0 0 0 0 0 0 0 0
----- ----- ----- ----- ----- ----- ----- ----- ----- -----
0 0 227 394 0 19 0 0 0 0
0 0 227 394 0 19 0 0 0 0
0 0 108 98 0 19 0 0 0 0
0 0 19 98 0 0 0 0 0 0
0 0 78 98 0 0 0 0 0 0
0 0 19 88 0 0 0 0 0 0
----- ----- ----- ----- ----- ----- ----- ----- ----- -----
-p: Display numbers in parseable (exact) values.
Also, update iostat syntax to allow the user to specify specific vdevs
to show statistics for. The three options for choosing pools/vdevs are:
Display a list of pools:
zpool iostat ... [pool ...]
Display a list of vdevs from a specific pool:
zpool iostat ... [pool vdev ...]
Display a list of vdevs from any pools:
zpool iostat ... [vdev ...]
Lastly, allow zpool command "interval" value to be floating point:
zpool iostat -v 0.5
Signed-off-by: Tony Hutter <[email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4433
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3993 zpool(1M) and zfs(1M) should support -p for "list" and "get"
4700 "zpool get" doesn't support -H or -o options
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/3993
OpenZFS-issue: https://www.illumos.org/issues/4700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352
Porting notes:
I removed ZoL's zpool_get_prop_literal() in favor of
zpool_get_prop(..., boolean_t literal) since that's what OpenZFS
uses. The functionality is the same.
|
|
|
|
|
|
|
|
|
|
|
| |
6544 incorrect comment in libzfs.h about offline status
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/6544
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb605c4
Closes #4595
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
6736 ZFS per-vdev ZAPs
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Don Brady <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/6736
https://github.com/openzfs/openzfs/commit/215198a
Ported-by: Don Brady <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4515
|
|
|
|
|
|
|
|
| |
This field is a duplicate of the inode->i_generation, so just kill it
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4538
|
|
|
|
|
|
|
|
|
|
|
| |
As described in torvalds/linux@5f3a4a2 the &init_user_ns, and
not the current user_ns, should be passed to posix_acl_from_xattr()
and posix_acl_to_xattr(). Conveniently the init_user_ns is
available through the init credential (kcred).
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Massimo Maggi <[email protected]>
Closes #4177
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
6844 dnode_next_offset can detect fictional holes
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
dnode_next_offset is used in a variety of places to iterate over the
holes or allocated blocks in a dnode. It operates under the premise that
it can iterate over the blockpointers of a dnode in open context while
holding only the dn_struct_rwlock as reader. Unfortunately, this premise
does not hold.
When we create the zio for a dbuf, we pass in the actual block pointer
in the indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the
bp is now inconsistent from the perspective of dnode_next_offset: the bp
will appear to be a hole until zio_dva_allocate finally finishes filling
it in. In the meantime, dnode_next_offset can detect a hole in the dnode
when none exists.
I was able to experimentally demonstrate this behavior with the
following setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the
first L0 block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the
middle of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent
hole.
The fix is to ensure that it is valid to iterate over the indirect
blocks in a dnode while holding the dn_struct_rwlock by passing the zio
a copy of the BP and updating the actual BP in dbuf_write_ready while
holding the lock.
References:
https://www.illumos.org/issues/6844
https://github.com/openzfs/openzfs/pull/82
DLPX-35372
Ported-by: Brian Behlendorf <[email protected]>
Closes #4548
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and
pn_free() functions must be implemented. The existing illumos
implementation were used for this purpose.
The `flags` argument which was used in places wrapped by the
HAVE_PN_UTILS condition has beed added back to zfs_remove() and
zfs_link() functions. This removes a small point of divergence
between the ZoL code and upstream.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4522
|
|
|
|
|
|
|
|
| |
Also enable lazytime in mount.zfs
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4482
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The problem for atime:
We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its
handling is a mess. A huge part of mess regarding atime comes from
zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave
inconsistently with those three values.
zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you
don't pass ATTR_ATIME. Which means every write(2) operation which only updates
ctime and mtime will cause atime changes to not be written to disk.
Also zfs_inode_update from write(2) will replace inode->i_atime with what's
inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2).
You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0.
Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's
inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new),
SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll
leave with a stale atime.
The problem for relatime:
We do have a relatime config inside ZFS dataset, but how it should interact
with the mount flag MS_RELATIME is not well defined. It seems it wanted
relatime mount option to override the dataset config by showing it as
temporary in `zfs get`. But at the same time, `zfs set relatime=on|off` would
also seems to want to override the mount option. Not to mention that
MS_RELATIME flag is actually never passed into ZFS, so it never really worked.
How Linux handles atime:
The Linux kernel actually handles atime completely in VFS, except for writing
it to disk. So if we remove the atime handling in ZFS, things would just work,
no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever
VFS updates the i_atime, it will notify the underlying filesystem via
sb->dirty_inode().
And also there's one thing to note about atime flags like MS_RELATIME and
other flags like MS_NODEV, etc. They are mount point flags rather than
filesystem(sb) flags. Since native linux filesystem can be mounted at multiple
places at the same time, they can all have different atime settings. So these
flags are never passed down to filesystem drivers.
What this patch tries to do:
We remove znode->z_atime, since we won't gain anything from it. We remove most
of the atime handling and leave it to VFS. The only thing we do with atime is
to write it when dirty_inode() or setattr() is called. We also add
file_accessed() in zpl_read() since it's not provided in vfs_read().
After this patch, only the MS_RELATIME flag will have effect. The setting in
dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME
set according to the setting in dataset in future patch.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4482
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is foundational work for ZED.
Updates a leaf vdev's persistent device strings on Linux platform
* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES
Some examples:
path: '/dev/sdb1'
devid: 'scsi-350000394a8ca4fbc-part1'
phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'
path: '/dev/mapper/mpatha'
devid: 'dm-uuid-mpath-35000c5006304de3f'
Signed-off-by: Don Brady <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2856
Closes #3978
Closes #4416
|
|
|
|
|
|
|
|
|
|
| |
Effectively provide our own version of assert()/verify() for use
in user space. This minimizes our dependencies and aligns the
user space assertion handling with what's used in the kernel.
Signed-off-by: Carlo Landmeter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4449
|
|
|
|
|
|
|
|
| |
When compiling with musl libc the return type will be incorrect.
Signed-off-by: Carlo Landmeter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4454
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is initial support for x86 vectorized implementations of ZFS parity
and checksum algorithms.
For the compilation phase, configure step checks if toolchain supports relevant
instruction sets. Each implementation must ensure that the code is not passed
to compiler if relevant instruction set is not supported. For this purpose,
following new defines are provided if instruction set is supported:
- HAVE_SSE,
- HAVE_SSE2,
- HAVE_SSE3,
- HAVE_SSSE3,
- HAVE_SSE4_1,
- HAVE_SSE4_2,
- HAVE_AVX,
- HAVE_AVX2.
For detecting if an instruction set can be used in runtime, following functions
are provided in (include/linux/simd_x86.h):
- zfs_sse_available()
- zfs_sse2_available()
- zfs_sse3_available()
- zfs_ssse3_available()
- zfs_sse4_1_available()
- zfs_sse4_2_available()
- zfs_avx_available()
- zfs_avx2_available()
- zfs_bmi1_available()
- zfs_bmi2_available()
These function should be called once, on module load, or initialization.
They are safe to use from user and kernel space.
If an implementation is using more than single instruction set, both compiler
and runtime support for all relevant instruction sets should be checked.
Kernel fpu methods:
- kfpu_begin()
- kfpu_end()
Use __get_cpuid_max and __cpuid_count from <cpuid.h>
Both gcc and clang have support for these. They also handle ebx register
in case it is used for PIC code.
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4381
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add the ZFS Test Suite and test-runner framework from illumos.
This is a continuation of the work done by Turbo Fredriksson to
port the ZFS Test Suite to Linux. While this work was originally
conceived as a stand alone project integrating it directly with
the ZoL source tree has several advantages:
* Allows the ZFS Test Suite to be packaged in zfs-test package.
* Facilitates easy integration with the CI testing.
* Users can locally run the ZFS Test Suite to validate ZFS.
This testing should ONLY be done on a dedicated test system
because the ZFS Test Suite in its current form is destructive.
* Allows the ZFS Test Suite to be run directly in the ZoL source
tree enabled developers to iterate quickly during development.
* Developers can easily add/modify tests in the framework as
features are added or functionality is changed. The tests
will then always be in sync with the implementation.
Full documentation for how to run the ZFS Test Suite is available
in the tests/README.md file.
Warning: This test suite is designed to be run on a dedicated test
system. It will make modifications to the system including, but
not limited to, the following.
* Adding new users
* Adding new groups
* Modifying the following /proc files:
* /proc/sys/kernel/core_pattern
* /proc/sys/kernel/core_uses_pid
* Creating directories under /
Notes:
* Not all of the test cases are expected to pass and by default
these test cases are disabled. The failures are primarily due
to assumption made for illumos which are invalid under Linux.
* When updating these test cases it should be done in as generic
a way as possible so the patch can be submitted back upstream.
Most existing library functions have been updated to be Linux
aware, and the following functions and variables have been added.
* Functions:
* is_linux - Used to wrap a Linux specific section.
* block_device_wait - Waits for block devices to be added to /dev/.
* Variables: Linux Illumos
* ZVOL_DEVDIR "/dev/zvol" "/dev/zvol/dsk"
* ZVOL_RDEVDIR "/dev/zvol" "/dev/zvol/rdsk"
* DEV_DSKDIR "/dev" "/dev/dsk"
* DEV_RDSKDIR "/dev" "/dev/rdsk"
* NEWFS_DEFAULT_FS "ext2" "ufs"
* Many of the disabled test cases fail because 'zfs/zpool destroy'
returns EBUSY. This is largely causes by the asynchronous nature
of device handling on Linux and is expected, the impacted test
cases will need to be updated to handle this.
* There are several test cases which have been disabled because
they can trigger a deadlock. A primary example of this is to
recursively create zpools within zpools. These tests have been
disabled until the root issue can be addressed.
* Illumos specific utilities such as (mkfile) should be added to
the tests/zfs-tests/cmd/ directory. Custom programs required by
the test scripts can also be added here.
* SELinux should be either is permissive mode or disabled when
running the tests. The test cases should be updated to conform
to a standard policy.
* Redundant test functionality has been removed (zfault.sh).
* Existing test scripts (zconfig.sh) should be migrated to use
the framework for consistency and ease of testing.
* The DISKS environment variable currently only supports loopback
devices because of how the ZFS Test Suite expects partitions to
be named (p1, p2, etc). Support must be added to generate the
correct partition name based on the device location and name.
* The ZFS Test Suite is part of the illumos code base at:
https://github.com/illumos/illumos-gate/tree/master/usr/src/test
Original-patch-by: Turbo Fredriksson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #6
Closes #1534
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset
zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()
Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.
* Each taskq must be single threaded to ensure tasks are always
processed in the order in which they were dispatched.
* There is a taskq per-pool in order to keep the pools independent.
This way if one pool is suspended it will not impact another.
* The preferred location to dispatch a zvol minor task is a sync
task. In this context there is easy access to the spa_t and
minimal error handling is required because the sync task must
succeed.
Support for asynchronous zvol minor operations address issue #3681.
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2217
Closes #3678
Closes #3681
|
|
|
|
|
|
|
|
|
| |
Added by-partlabel and by-partuuid to the default device search
path. Made made device names in by-label more preferable.
Signed-off-by: Thijs Cramer <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3892
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Historically libblkid support was detected as part of configure
and optionally enabled. This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.
This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools. For distributions
which include a modern version of libblkid there is no change in
behavior. Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.
Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2448
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
locality information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
References:
https://github.com/freebsd/freebsd@5c7a6f5d
https://github.com/freebsd/freebsd@31b7f68d
https://github.com/freebsd/freebsd@e186f564
Performance Testing:
https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141
Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc946f82c76e0b34c2f0370ed9aea022 by my
collegue Andriy Gapon has been included. It applied perfectly, but
added a cstyle regression.
- This replaces 556011dbec2d10579819078559a77630fc559112 entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4334
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following options have been added to the zpool add, iostat,
list, status, and split subcommands. The default behavior was
not modified, from zfs(8).
-g Display vdev GUIDs instead of the normal short
device names. These GUIDs can be used in-place of
device names for the zpool detach/off‐
line/remove/replace commands.
-L Display real paths for vdevs resolving all symbolic
links. This can be used to lookup the current block
device name regardless of the /dev/disk/ path used
to open it.
-p Display full paths for vdevs instead of only the
last component of the path. This can be used in
conjunction with the -L flag.
This behavior may also be enabled using the following environment
variables.
ZPOOL_VDEV_NAME_GUID
ZPOOL_VDEV_NAME_FOLLOW_LINKS
ZPOOL_VDEV_NAME_PATH
This change is based on worked originally started by Richard Yao
to add a -g option. Then extended by @ilovezfs to add a -L option
for openzfsonosx. Those changes have been merged, re-factored,
a -p option added and extended to all relevant zpool subcommands.
Original-patch-by: Richard Yao <[email protected]>
Extended-by: ilovezfs <[email protected]>
Extended-by: Brian Behlendorf <[email protected]>
Signed-off-by: ilovezfs <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2011
Closes #4341
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://www.illumos.org/issues/6414
https://github.com/illumos/illumos-gate/commit/eb5bb58
Ported-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
6815179 zpool import with a large number of LUNs is too slow
6844191 zpool import, scanning of disks should be multi-threaded
References:
https://github.com/illumos/illumos-gate/commit/4f67d75
Porting notes:
- This change was originally never ported to Linux due to it
dependence on the thread pool interface. This patch solves
that issue by switching the code to use the existing taskq
implementation which provides the same basic functionality.
However, in order for this to work properly thread_init()
and thread_fini() must be called around to taskq consumer
to perform the needed thread initialization.
- The check_one_slice, nozpool_all_slices, and check_slices
functions have been disabled for Linux. They are difficult,
but possible, to implement for Linux due to how partitions
are get names. Since this is only an optimization this code
can be added at a latter date.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4950 files sometimes can't be removed from a full filesystem
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Reviewed by: Boris Protopopov <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/4950
https://github.com/illumos/illumos-gate/commit/4bb7380
Porting notes:
- ZoL currently does not log discards to zvols, so the portion of
this patch that modifies the discard logging to mark it as
freeing space has been discarded.
2. may_delete_now had been removed from zfs_remove() in ZoL.
It has been reintroduced.
3. We do not try to emulate vnodes, so the following lines are
not valid on Linux:
mutex_enter(&vp->v_lock);
may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp);
mutex_exit(&vp->v_lock);
This has been replaced with:
mutex_enter(&zp->z_lock);
may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped);
mutex_exit(&zp->z_lock);
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The registered xattr .list handler was simplified in the 4.5 kernel
to only perform a permission check. Given a dentry for the file it
must return a boolean indicating if the name is visible. This
differs slightly from the previous APIs which also required the
function to copy the name in to the provided list and return its
size. That is now all the responsibility of the caller.
This should be straight forward change to make to ZoL since we've
always required the caller to make the copy. However, this was
slightly complicated by the need to support 3 older APIs. Yes,
between 2.6.32 and 4.5 there are 4 versions of this interface!
Therefore, while the functional change in this patch is small it
includes significant cleanup to make the code understandable and
maintainable. These changes include:
- Improved configure checks for .list, .get, and .set interfaces.
- Interfaces checked from newest to oldest.
- Strict checking for each possible known interface.
- Configure fails when no known interface is available.
- HAVE_*_XATTR_LIST renamed HAVE_XATTR_LIST_* for consistency
with similar iops and fops configure checks.
- POSIX_ACL_XATTR_{DEFAULT|ACCESS} were removed forcing callers to
move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT|ACCESS}.
Compatibility wrapper were added for old kernels.
- ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing
ZPL_XATTR_{GET|SET} WRAPPERs. Only the inode is guaranteed to be
a valid pointer, passing NULL for the 'list' and 'name' variables
is allowed and must be checked for. All .list functions were
updated to use the wrapper to aid readability.
- zpl_xattr_filldir() updated to use the .list function for its
permission check which is consistent with the updated Linux 4.5
interface. If a .list function is registered it should return 0
to indicate a name should be skipped, if there is no registered
function the name will be added.
- Additional documentation from xattr(7) describing the correct
behavior for each namespace was added before the relevant handlers.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Issue #4228
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5045 use atomic_{inc,dec}_* instead of atomic_add_*
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://www.illumos.org/issues/5045
https://github.com/illumos/illumos-gate/commit/1a5e258
Porting notes:
- All changes to non-ZFS files dropped.
- Changes to zfs_vfsops.c dropped because they were Illumos specific.
Ported-by: Brian Behlendorf <[email protected]>
Closes #4220
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
6298 zfs_create_008_neg and zpool_create_023_neg need to be updated
for large block support
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://www.illumos.org/issues/6298
https://github.com/illumos/illumos-gate/commit/e9316f7
Ported-by: Brian Behlendorf <[email protected]>
Closes #4217
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When debugging with tracepoints, we see string pointers:
zfs 3017 [006] 8878.728915: zfs:zfs_set__error: ffffffffa0eec3fc:3013:ffffffffa0ebcd60(): error 0x2
ffffffffa0e1ca43 spa_open_common (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffffa0e1cbe3 spa_open (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffffa0e6f6ef zfs_ioc_stable (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffffa0e6f2a9 zfsdev_ioctl (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffff811909dd do_vfs_ioctl ([kernel.kallsyms])
ffffffff81190c41 sys_ioctl ([kernel.kallsyms])
ffffffff8156e2e9 system_call_fastpath ([kernel.kallsyms])
7ff7d8be69c7 __GI___ioctl (/lib64/libc-2.19.so)
7ff7d90cac53 lzc_ioctl.constprop.3 (/lib64/libzfs_core.so.1.0.0)
636f695f637a6c00 [unknown] ([unknown])
Printing the actual strings is more convenient:
zfs 3461 [001] 10599.847692: zfs:zfs_set__error: spa.c:3013:spa_open_common(): error 0x2
ffffffffa116ba43 spa_open_common (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffffa116bbe3 spa_open (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffffa11be8df zfs_ioc_stable (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffffa11be499 zfsdev_ioctl (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
ffffffff811909dd do_vfs_ioctl ([kernel.kallsyms])
ffffffff81190c41 sys_ioctl ([kernel.kallsyms])
ffffffff8156e2e9 system_call_fastpath ([kernel.kallsyms])
7f11b843c9c7 __GI___ioctl (/lib64/libc-2.19.so)
7f11b8920c53 lzc_ioctl.constprop.3 (/lib64/libzfs_core.so.1.0.0)
636f695f637a6c00 [unknown] ([unknown])
A few other tracepoints have strings as well, so switch to printing
the actual string values at the same time.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4212
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The fast write mutex is intended to protect accounting, but it is
redundant because all accounting is performed through atomic operations.
It also serializes all metaslab IO behind a mutex, which introduces a
theoretical scaling regression that the Illumos developers did not like
when we showed this to them. Removing it makes the selection of the
metaslab_group lock free as it is on Illumos. The selection is not quite
the same without the lock because the loop races with IO completions,
but any imbalances caused by this are likely to be corrected by
subsequent metaslab group selections.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3643
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed. This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist. Therefore the locking is handled externally
with an array of mutexs and AVLs trees which contain per-object locks.
In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held. In
zfs_znode_hold_exit() the process is reversed. The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.
This scheme has two important properties:
1) No memory allocations are performed while holding one of the z_hold_locks.
This ensures evict(), which can be called from direct memory reclaim, will
never block waiting on a z_hold_locks which just happens to have hashed
to the same index.
2) All locks used to serialize access to an object are per-object and never
shared. This minimizes lock contention without creating a large number
of dedicated locks.
On the downside it does require znode_lock_t structures to be frequently
allocated and freed. However, because these are backed by a kmem cache
and very short lived this cost is minimal.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4106
|
|
|
|
|
|
|
|
|
|
|
| |
Add a zfs_object_mutex_size module option to facilitate resizing the
the per-dataset znode mutex array. Increasing this value may help
make the deadlock described in #4106 less common, but this is not a
proper fix. This patch is primarily to aid debugging and analysis.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issue #4106
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3465 ::walk ... | ::<dcmd> misinterprets input as symbol names
3466 ::tsd should handle missing/NULL values better
3467 mdb_ctf_vread() could be more useful
3468 mdb enhancements for zfs development
3470 ::whatis does not print callers from KMF_LITE
3473 mdb_get_module() returns wrong module
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Robert Mustacchi <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/3468
https://github.com/illumos/illumos-gate/commit/28e4da2
Porting notes:
- The only portion of this patch which applies to ZoL is a small
change to types used in the refcount structure.
Ported-by: Brian Behlendorf <[email protected]>
Closes #4216
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Under RHEL6/CentOS6 the default stack size must be increased to 32K
to prevent overflowing the stack when running ztest. This isn't an
issue for other distributions due to either the version of pthreads
or perhaps the compiler. Doubling the stack size resolves the
issue safely for all distribution and leaves us some headroom.
$ sudo -E ztest -V -T 300 -f /var/tmp/
5 vdevs, 7 datasets, 23 threads, 300 seconds...
loading space map for vdev 0 of 1, metaslab 0 of 30 ...
...
loading space map for vdev 0 of 1, metaslab 14 of 30 ...
child died with signal 11
Exited ztest with error 3
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4215
|