| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This includes code that was mistakenly left out of the 7dae2c8 merge into
0.6.5.10. Its inclusion fixes a kernel warning on Kubuntu 17.04:
WARN_ON(sb->s_bdi != &noop_backing_dev_info);
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6089
Closes #6324
(backported from zfs upstream commit 7dae2c81e7b2e68a596c5b431444be0fae308156)
Signed-off-by: Colin Ian King <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 4.11 introduces a new type, refcount_t, which conflicts with the
type of the same name defined within ZFS.
Rename the ZFS type zfs_refcount_t. Within the ZFS code, use a macro to
cause references to refcount_t to be changed to zfs_refcount_t at
compile time. This reduces conflicts when later landing OpenZFS
patches.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #5823
Closes #5842
|
|
|
|
|
|
|
|
| |
Also enable lazytime in mount.zfs
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4482
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The problem for atime:
We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its
handling is a mess. A huge part of mess regarding atime comes from
zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave
inconsistently with those three values.
zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you
don't pass ATTR_ATIME. Which means every write(2) operation which only updates
ctime and mtime will cause atime changes to not be written to disk.
Also zfs_inode_update from write(2) will replace inode->i_atime with what's
inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2).
You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0.
Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's
inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new),
SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll
leave with a stale atime.
The problem for relatime:
We do have a relatime config inside ZFS dataset, but how it should interact
with the mount flag MS_RELATIME is not well defined. It seems it wanted
relatime mount option to override the dataset config by showing it as
temporary in `zfs get`. But at the same time, `zfs set relatime=on|off` would
also seems to want to override the mount option. Not to mention that
MS_RELATIME flag is actually never passed into ZFS, so it never really worked.
How Linux handles atime:
The Linux kernel actually handles atime completely in VFS, except for writing
it to disk. So if we remove the atime handling in ZFS, things would just work,
no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever
VFS updates the i_atime, it will notify the underlying filesystem via
sb->dirty_inode().
And also there's one thing to note about atime flags like MS_RELATIME and
other flags like MS_NODEV, etc. They are mount point flags rather than
filesystem(sb) flags. Since native linux filesystem can be mounted at multiple
places at the same time, they can all have different atime settings. So these
flags are never passed down to filesystem drivers.
What this patch tries to do:
We remove znode->z_atime, since we won't gain anything from it. We remove most
of the atime handling and leave it to VFS. The only thing we do with atime is
to write it when dirty_inode() or setattr() is called. We also add
file_accessed() in zpl_read() since it's not provided in vfs_read().
After this patch, only the MS_RELATIME flag will have effect. The setting in
dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME
set according to the setting in dataset in future patch.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4482
|
|
|
|
| |
Signed-off-by: Chunwei Chen <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Non-Linux OpenZFS implementations require additional support to be
used a root pool. This code should simply be removed to avoid
confusion and improve readability.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4951
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to b39c22b, which was first generally available in the 0.6.5
release as b39c22b, ZoL never actually submitted synchronous read or write
requests to the Linux block layer. This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.
In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b. In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.
The original rationale for introducing synchronous operations in b39c22b
was to hurry certains requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency. This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.
To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.
For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
ise used for the same purpose.
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4858
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come
from setxattr, which will handle by the acl xattr_handler, and we already
handles that well. However, nfsd will directly calls inode->set_acl or
return error if it doesn't exists.
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Massimo Maggi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #5371
Closes #5375
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
New REQ_OP_* definitions have been introduced to separate the
WRITE, READ, and DISCARD operations from the flags. This included
changing the encoding of bi_rw. It places REQ_OP_* in high order
bits and other stuff in low order bits. This encoding is done
through the new helper function bio_set_op_attrs. For complete
details refer to:
https://github.com/torvalds/linux/commit/f215082
https://github.com/torvalds/linux/commit/4e1b2d5
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4892
Closes #4899
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.
OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zp->z_xattr_parent will pin the parent. This will cause huge issue
when unlink a file with xattr. Because the unlinked file is pinned, it
will never get purged immediately. And because of that, the xattr
stuff will never be marked as unlinked. So the whole unlinked stuff
will stay there until shrink cache or umount.
This change partially reverts e89260a. This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
|
|
|
|
|
|
|
|
|
|
|
| |
The ZIO_CHECKSUM_IS_ZERO macro was added in master commit:
37f8a88 Illumos 5746 - more checksumming in zfs send
That whole patch is not suitable for the release branch
but some other backported patches on that macro.
Signed-off-by: Ned Bass <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Register iterate_shared if it exists so the kernel will used shared
lock and allowing concurrent readdir.
Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE
to allow concurrent seeking.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4664
Closes #4665
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
struct zvol_state contains a dummy znode, which is around 1KB on x64,
only for zfs_range_lock. But in reality, other than z_range_lock and
z_range_avl, zfs_range_lock only need znode on regular file, which
means we add 1KB on a structure and gain nothing.
In this patch, we remove the dummy znode for zvol_state. In order to
do that, we also need to refactor zfs_range_lock a bit. We move
z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t.
This new struct replaces znode_t as the main handle inside the range
lock functions.
We also add pointers to z_size, z_blksz, and z_max_blksz so range lock
code doesn't depend on znode_t. This allows non-ZPL consumers like
Lustre to use the range locks with their equivalent znode_t structure.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4510
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset
zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()
Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.
* Each taskq must be single threaded to ensure tasks are always
processed in the order in which they were dispatched.
* There is a taskq per-pool in order to keep the pools independent.
This way if one pool is suspended it will not impact another.
* The preferred location to dispatch a zvol minor task is a sync
task. In this context there is easy access to the spa_t and
minimal error handling is required because the sync task must
succeed.
Support for asynchronous zvol minor operations address issue #3681.
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2217
Closes #3678
Closes #3681
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed. This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist. Therefore the locking is handled externally
with an array of mutexs and AVLs trees which contain per-object locks.
In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held. In
zfs_znode_hold_exit() the process is reversed. The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.
This scheme has two important properties:
1) No memory allocations are performed while holding one of the z_hold_locks.
This ensures evict(), which can be called from direct memory reclaim, will
never block waiting on a z_hold_locks which just happens to have hashed
to the same index.
2) All locks used to serialize access to an object are per-object and never
shared. This minimizes lock contention without creating a large number
of dedicated locks.
On the downside it does require znode_lock_t structures to be frequently
allocated and freed. However, because these are backed by a kmem cache
and very short lived this cost is minimal.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4106
|
|
|
|
|
|
|
|
|
|
|
| |
Add a zfs_object_mutex_size module option to facilitate resizing the
the per-dataset znode mutex array. Increasing this value may help
make the deadlock described in #4106 less common, but this is not a
proper fix. This patch is primarily to aid debugging and analysis.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issue #4106
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The function sa_update() accepts a 32-bit length parameter and
assigns it to a 16-bit field in sa_bulk_attr_t, potentially
truncating the passed-in value. This could lead to corrupt system
attribute (SA) records getting written to the pool. Add a VERIFY to
sa_update() to detect cases where overflow would occur. The SA length
is limited to 16-bit values by the on-disk format defined by
sa_hdr_phys_t.
The function zfs_sa_set_xattr() is vulnerable to this bug if the
unpacked nvlist of xattrs is less than 64k in size but the packed
size is greater than 64k. Fix this by appropriately checking the
size of the packed nvlist before calling sa_update(). Add error
handling to zpl_xattr_set_sa() to keep the cached list of SA-based
xattrs consistent with the data on disk.
Lastly, zfs_sa_set_xattr() calls dmu_tx_abort() on an assigned
transaction if sa_update() returns an error, but the DMU only allows
unassigned transactions to be aborted. Wrap the sa_update() call in a
VERIFY0, remove the transaction abort, and call dmu_tx_commit()
unconditionally. This is consistent practice with other callers
of sa_update().
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #4150
|
|
|
|
|
|
|
|
|
|
|
| |
Provide a generic interface to prefetch ZAP entries by name. This
functionality is being added for external consumers such as Lustre.
It is based of the existing zap_prefetch_uint64() version which is
used by the deduplication code.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #4061
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
objsetid is not unique across pool, so using it solely as key would cause
panic when automounting two snapshot on different pools with the same
objsetid. We fix this by adding spa pointer as additional key.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Issue #3948
Issue #3786
Issue #3887
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline. This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread. However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.
To resolve this issue zio_buf_alloc_flags() is introduced which
allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer. Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly. Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.
* z_wr_iss
zio_vdev_io_start
vdev_queue_io -> Takes vq->vq_lock
vdev_queue_io_to_issue
vdev_queue_aggregate
zio_buf_alloc -> Waiting on spl_kmem_cache process
* z_wr_int
zio_vdev_io_done
vdev_queue_io_done
mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss
* txg_sync
spa_sync
dsl_pool_sync
zio_wait -> Waiting on zio being handled by z_wr_int
* spl_kmem_cache
spl_cache_grow_work
kv_alloc
spl_vmalloc
...
evict
zpl_evict_inode
zfs_inactive
dmu_tx_wait
txg_wait_open -> Waiting on txg_sync
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #3808
Closes #3867
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <[email protected]>
Reviewed by: Xin LI <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/6267
https://github.com/illumos/illumos-gate/commit/d205810
Signed-off-by: Brian Behlendorf <[email protected]>
Ported-by: Ned Bass <[email protected]>
Issue #3865
Issue #3443
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e7409e6d85289f4955adf07fac834f5. There are two issues with the
mount option handling:
* Libzfs has historically included "xattr" in its list of default mount
options. This overrides the dataset property, so the dataset is always
configured to use directory-based xattrs even when the xattr dataset
property is set to off or sa. Address this by removing "xattr" from
the set of default mount options in libzfs.
* There was no way to enable system attribute-based extended attributes
using temporary mount options. Add the mount options "saxattr" and
"dirxattr" which enable the xattr behavior their names suggest. This
approach has the advantages of mirroring the valid xattr dataset
property values and following existing conventions for mount option
names.
Issue #3787
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
6214 zpools going south
Reviewed by: Igor Kozhukhov <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Saso Kiselkov <[email protected]>
References:
https://www.illumos.org/issues/6214
http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/
Porting Notes:
Reintroduce b_compress to the l2arc_buf_hdr_t. In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t. This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held. See Illumos 6214
for a detailed analysis of the race.
HDR_GET_COMPRESS() macro was removed from arc_buf_info().
Ported-by: Brian Behlendorf <[email protected]>
Closes #3757
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Internally ZFS keeps a small log to facilitate debugging. By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.
$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp message
1441141408 spa=tank async request task=1
1441141408 txg 70 open pool version 5000; software version 5000/5; ...
1441141409 spa=tank async request task=32
1441141409 txg 72 import pool version 5000; software version 5000/5; ...
1441141414 command: lt-zpool import tank
Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log. As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.
$ ZFS_DEBUG=on ./cmd/ztest/ztest -V
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #3728
|
|\
| |
| |
| |
| |
| |
| |
| | |
Performance improvements for zvols.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3720
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Signed-off-by: Richard Yao <[email protected]>
|
|/
|
|
|
|
|
|
|
|
|
|
|
| |
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #985
Closes #3351
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 49ddb315066e372f31bda29a5c546a9eccc8b418 added the
zfs_arc_average_blocksize parameter to allow control over the size of
the arc hash table. The dbuf hash table's size should be determined
similarly.
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3721
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3712
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels. And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients. This patch makes
the following core improvements.
* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.
* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process. This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.
* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.
* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds. This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.
* Snapshots in active use will not be automatically unmounted. As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.
* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated. This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots. This patch modifies the
behavior such that only negative dentries are invalidated. This solves
the issue and may result in a performance improvement.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3589
Closes #3344
Closes #3295
Closes #3257
Closes #3243
Closes #3030
Closes #2841
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create. Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.
In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3591
|
|
|
|
|
|
|
|
| |
See also #3546, commit c1718e9
Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3673
|
|
|
|
|
|
| |
Signed-off-by: Frédéric VANNIÈRE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3546
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority. Non-I/O filesystem
processes run with the default priority. ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active. The priorities have been updated to the following:
$ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
- TS 10743 19 -20 [spl_kmem_cache]
- TS 10744 19 -20 [spl_system_task]
- TS 10745 19 -20 [spl_dynamic_tas]
- TS 10764 19 0 [dbu_evict]
- TS 10765 19 0 [arc_prune]
- TS 10766 19 0 [arc_reclaim]
- TS 10767 19 0 [arc_user_evicts]
- TS 10768 19 0 [l2arc_feed]
- TS 10769 39 0 [z_unmount]
- TS 10770 39 -20 [zvol]
- TS 11011 39 -20 [z_null_iss]
- TS 11012 39 -20 [z_null_int]
- TS 11013 39 -20 [z_rd_iss]
- TS 11014 39 -20 [z_rd_int_0]
- TS 11022 38 -19 [z_wr_iss]
- TS 11023 39 -20 [z_wr_iss_h]
- TS 11024 39 -20 [z_wr_int_0]
- TS 11032 39 -20 [z_wr_int_h]
- TS 11033 39 -20 [z_fr_iss_0]
- TS 11041 39 -20 [z_fr_int]
- TS 11042 39 -20 [z_cl_iss]
- TS 11043 39 -20 [z_cl_int]
- TS 11044 39 -20 [z_ioctl_iss]
- TS 11045 39 -20 [z_ioctl_int]
- TS 11046 39 -20 [metaslab_group_]
- TS 11050 19 0 [z_iput]
- TS 11121 38 -19 [z_wr_iss]
Note that under Linux the meaning of a processes priority is inverted
with respect to illumos. High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
This patch depends on https://github.com/zfsonlinux/spl/pull/466.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #3607
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Alex Reece <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/5817
https://github.com/illumos/illumos-gate/commit/2fd872a
Ported-by: Brian Behlendorf <[email protected]>
Issue #3533
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Steven Hartland <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/5376
https://github.com/illumos/illumos-gate/commit/2ec99e3
Porting Notes:
The good news is that many of the recent changes made upstream to the
ARC tackled issues previously observed by ZoL with similar solutions.
The bad news is those solution weren't identical to the ones we applied.
This patch is designed to split the difference and apply as much of the
upstream work as possible.
* The arc_available_memory() function was removed previous in ZoL but
due to the upstream changes it makes sense to add it back. This function
has been customized for Linux so that it can be used to determine a low
memory. This provides the same basic functionality as the illumos version
allowing us to minimize changes through the rest of the code base. The
exact mechanism used to detect a low memory state remains unchanged so
this change isn't a significant as it might first appear.
* This patch includes the long standing fix for arc_shrink() which was
originally proposed in #2167. Since there were related changes to this
function it made sense to include that work.
* The arc_init() function has been re-factored. As before it sets sane
default values for the ARC but then calls arc_tuning_update() to apply
user specific tuning made via module options. The arc_tuning_update()
function is then called periodically by the arc_reclaim_thread() to
apply changes to the tunings made during normal operation.
Ported-by: Brian Behlendorf <[email protected]>
Closes #3616
Closes #2167
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5347 idle pool may run itself out of space
Reviewed by: Alex Reece <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Steven Hartland <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://github.com/illumos/illumos-gate/commit/231aab8
https://github.com/illumos/illumos-gate/commit/4a92375 3642
https://www.illumos.org/issues/5347
https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix)
https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390
https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643
Porting notes:
This is completing the partial fix from FreeBSD
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3586
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Josef 'Jeff' Sipek <[email protected]>
Reviewed by: Xin LI <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://github.com/illumos/illumos-gate/commit/db1741f
https://www.illumos.org/issues/5661
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3571
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4745 fix AVL code misspellings
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Richard Lowe <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://github.com/illumos/illumos-gate/commit/6907ca4
https://www.illumos.org/issues/4745
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3565
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5008 lock contention (rrw_exit) while running a read only load
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Alex Reece <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Richard Yao <[email protected]>
Reviewed by: Saso Kiselkov <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
Porting notes:
This patch ported perfectly cleanly to ZoL. During testing 100% cached
small-block reads, extreme contention was noticed on rrl->rr_lock from
rrw_exit() due to the frequent entering and leaving ZPL. Illumos picked
up this patch from FreeBSD and it also helps under Linux.
On a 1-minute 4K cached read test with 10 fio processes pinned to a single
socket on a 4-socket (10 thread per socket) NUMA system, contentions on
rrl->rr_lock were reduced from 508799 to 43085.
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3555
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <[email protected]>
Reviewed by: Alek Pinchuk <[email protected]>
Reviewed by: Simon Klinkert <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/5911
https://github.com/illumos/illumos-gate/commit/46e1baa
Porting notes:
Resolved ISO C90 forbids mixed declarations and code wanting in
the dnode_free_range() function.
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3554
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5981 Deadlock in dmu_objset_find_dp
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://www.illumos.org/issues/5981
https://github.com/illumos/illumos-gate/commit/1d3f896
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3553
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5175 implement dmu_read_uio_dbuf() to improve cached read performance
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Alex Reece <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://www.illumos.org/issues/5175
https://github.com/illumos/illumos-gate/commit/f8554bb
Porting notes:
This patch doesn't include the changes for the COMSTAR (Common
Multiprotocol SCSI Target) - since it's not available for ZoL.
http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html
Ported by: kernelOfTruth <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3392
|
|
|
|
|
|
|
|
|
|
| |
Illumos 5701 (zpool list reports incorrect "alloc" value for cache
devices) removed l2ad_evict from l2arc_dev_t. It should also be removed
from the zfs_l2arc_evict_class event class.
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3534
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5163 arc should reap range_seg_cache
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Saso Kiselkov <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/5163
https://github.com/illumos/illumos-gate/commit/83803b5
Porting Notes:
Added umem_cache_reap_now() wrapped to suppress unused variable
warning for user space build in arc_kmem_reap_now().
Ported-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Over the years the default values for the taskqs used on Linux have
differed slightly from illumos. In the vast majority of cases this
was done to avoid creating an obnoxious number of idle threads which
would pollute the process listing.
With the addition of support for dynamic taskqs all multi-threaded
queues should be created as dynamic taskqs. This allows us to get
the best of both worlds.
* The illumos default values for the I/O pipeline can be restored.
These values are known to work well for most workloads. The only
exception is the zio write interrupt taskq which is changed to
ZTI_P(12, 8). At least under Linux more threads has been shown
to improve performance, see commit 7e55f4e.
* Reduces the number of idle threads on the system when it's not
under heavy load. The maximum number of threads will only be
created when they are required.
* Remove the vdev_file_taskq and rely on the system_taskq instead
which is now dynamic and may have up to 64-threads. Again this
brings us back inline with upstream.
* Tasks dispatched with taskq_dispatch_ent() are allowed to use
dynamic taskqs. The Linux taskq implementation supports this.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #3507
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Alek Pinchuk <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/5701
https://github.com/illumos/illumos-gate/commit/a52fc31
Porting Notes:
arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr).
Ported by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unit testing at ClusterHQ found that passing an invalid file handle to
zfs_ioc_hold results in a NULL pointer dereference on a system without
assertions:
IP: [<ffffffffa0218aa0>] zfsdev_getminor+0x10/0x20 [zfs]
Call Trace:
[<ffffffffa021b4b0>] zfs_onexit_fd_hold+0x20/0x40 [zfs]
[<ffffffffa0214043>] zfs_ioc_hold+0x93/0xd0 [zfs]
[<ffffffffa0215890>] zfsdev_ioctl+0x200/0x500 [zfs]
An assertion would have caught this had they been enabled, but this is
something that the kernel module should handle without failing. We
resolve this by searching the linked list to ensure that the file
handle's private_data points to a valid zfsdev_state_t.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Andriy Gapon <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3506
|
|
|
|
|
|
|
|
|
|
|
| |
This is the counterpart to zfsonlinux/spl@2345368 which replaces the
cv_wait_interruptible() function with cv_wait_sig(). There is no
functional change to patch merely brings the function names in to
sync to maximize portability.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3450
Issue #3402
|