| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3104 eliminate empty bpobjs
Reviewed by: George Wilson <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Approved by: Eric Schrock <[email protected]>
References:
illumos/illumos-gate@f17457368189aa911f774c38c1f21875a568bdca
illumos changeset: 13782:8f78aae28a63
https://www.illumos.org/issues/3104
Ported-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To save valuable stack all zio's were made asynchronous when in the
tgx_sync_thread context or during pool initialization. See commit
2fac4c2 for the original patch and motivation.
Unfortuantely, the changes to dsl_pool_sync_context() made by the
feature flags broke this logic causing in __zio_execute() to dispatch
itself infinitely when called during pool initialization. This
commit refines the existing logic to specificly target only the two
cases we care about.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Richard Lowe <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
illumos/illumos-gate@25345e466695fbe736faa53b8f3413d8e8f81981
https://www.illumos.org/issues/3349
Ported-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async
destroyed datasets
Reviewed by: Christopher Siden <[email protected]>
Approved by: Eric Schrock <[email protected]>
References:
illumos/illumos-gate@ce636f8b38e8c9ff484e880d9abb27251a882860
illumos changeset: 13776:cd512c80fd75
https://www.illumos.org/issues/3086
Ported-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Eric Schrock <[email protected]>
References:
illumos/illumos-gate@57221772c3fc05faba04bf48ddff45abf2bbf2bd
https://www.illumos.org/issues/2762
Ported-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Approved by: Eric Schrock <[email protected]>
References:
illumos/illumos-gate@dfbb943217bf8ab22a1a9d2e9dca01d4da95ee0b
illumos changeset: 13777:b1e53580146d
https://www.illumos.org/issues/3090
https://www.illumos.org/issues/3102
Ported-by: Brian Behlendorf <[email protected]>
Closes #939
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Richard Lowe <[email protected]>
Reviewed by: Dan Kruchinin <[email protected]>
Approved by: Eric Schrock <[email protected]>
References:
illumos/illumos-gate@53089ab7c84db6fb76c16ca50076c147cda11757
illumos/illumos-gate@ad135b5d644628e791c3188a6ecbd9c257961ef8
illumos changeset: 13700:2889e2596bd6
https://www.illumos.org/issues/2619
https://www.illumos.org/issues/2747
NOTE: The grub specific changes were not ported. This change
must be made to the Linux grub packages.
Ported-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c
} else {
int i;
ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
/* the blkptrs we are losing better be unallocated */
for (i = dn->dn_next_nblkptr[txgoff];
i < dnp->dn_nblkptr; i++)
ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
This usage is in fact safe, since the ASSERT ensures the index does
not exceed to maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning. To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body. Since a dnode contains
at most 3 block pointers this overhead should be negligible.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #950
|
|
|
|
|
|
|
| |
Update zio_wait() to use cv_wait_io() to ensure the iowait time
is properly accounted for.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 9dcb97198338ba2d8764dd5604b278118612f74
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several 100 seconds.
The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device. This issue was mitigated somewhat by
commit 84daaddedbfc9cf4bd1490d8a6f4b2967051e308 which ensures
the I/O elevator gets changed even for DM style devices.
This change isn't in any way harmful but it does conflict with
a required change to properly account from I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely on io_schedule() via cv_wait_io().
The additional debugging information which was added to the
delay event has been intentionally left in place.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock. But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.
To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock(). Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.
When it is contended we risk a lock inversion if we were to
block waiting for the lock. Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again. This process can be repeated safely until both
locks are acquired.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jorgen Lundman <[email protected]>
Closes #612
|
|
|
|
|
|
|
|
| |
This reverts commit 31f2b5abdf95d8426d8bfd66ca7f62ec70215e3c back
to the original code until the fsync(2) performance regression
can be addressed.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit(). This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.
However, under Linux I'm not seeing the same contention that
was originally described. Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD. If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.
This just leaves one small ZFS TSD consumer. If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes zfsonlinux/spl#174
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current state of udev and devicer-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:
$ ls -d /dev/dm-*
/dev/dm-0
/dev/dm-1
/dev/dm-2
/dev/dm-3
it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:
$ ls -d /dev/sd*
/dev/sda
/dev/sda1
/dev/sdb
/dev/sdb1
Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.
As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.
This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.
Signed-off-by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1149
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented. This prevented
a zpool from use a zvol for its slog device.
This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c. Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system. If
they match we know the device is a zvol.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1131
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed. That change
accidentally resulted in unexpected ctime updates which upset tools
like git. That was always a horrible hack and I'm happy it will
never make it in to a tagged release.
The right fix is something I initially resisted doing because I
was worried about the additional overhead. However, in hindsight
the overhead isn't as bad as I feared.
This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied. We leverage this
callback to keep the znode SAs strictly in sync with the inode.
However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime. This will cover the callpath of most concern to us.
->filemap_page_mkwrite->file_update_time->update_time->
mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #764
Closes #1140
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #816
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers. While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.
I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel. But frankly
the compatibility code to use this kernel interface has always
been problematic.
Therefore, this patch abandons direct use the of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1053
|
|
|
|
|
|
|
|
|
|
|
| |
When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning. See commit
b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1138
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit b00131d43ca344d4b205a03ab3eb771a060e5087 which
is no longer needed due to e89260a1c8851ce05ea04b23606ba438b271d890.
This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently. This effectively prevents
the deadlock condition from ever occuring.
Therefore we can safely revert back to the upstream synchronous
cleanup code. This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #457
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().
But more importantly than preserving the exact time add the missing
call to sa_bulk_update(). This ensures that the znode modifications
are written to disk as part of the transaction. Without this the
inode may mistaken rollback to the previous on-disk znode state.
Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode. This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.
Original-patch-by: Richard Yao <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #764
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent. Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it. Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.
This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.
The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX. Otherwise we
run the risk of deadlocking on the open create TX.
Ideally the security xattr should be fully constructed
before the new inode is unlocked. However, doing so would
require far more extensive changes to ZFS.
This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #671
|
|
|
|
|
|
|
|
|
|
|
| |
Add the missing error handling to load_nvlist(). There's no good
reason this needs to be fatal. All callers of load_nvlist() do
correctly handle an error condition and it is preferable that an
error be returned. This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1120
|
|
|
|
|
|
|
|
|
|
| |
Due to the slightly increased size of the ZFS super block
caused by 30315d2 there are now allocation warnings. The
allocation size is still small (just over 8k) and super
blocks are rarely allocated so we suppress the warning.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1101
|
|
|
|
|
|
|
|
|
|
| |
If zvol_alloc() fails zv will be set to NULL and dereferenced
in out_dmu_objset_disown. To avoid this entirely the zv->objset
line is moved up in to the success block.
Original-patch-by: Jorgen Lundman <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1109
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Gordon Ross <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Approved by: Richard Lowe <[email protected]>
Refererces to Illumos issue:
https://www.illumos.org/issues/2671
This patch has been slightly modified from the upstream Illumos
version. In the upstream implementation a warning message is
logged to the console. To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.
The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool. Depending on your workload
this may impact pool performance. Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.
NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.
Ported-by: Cyril Plisko <[email protected]>
Reworked-by: Brian Behlendorf <[email protected]>
Closes #955
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().
Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record. The mmap
case which relies on zfs_putpage() was not taking the range
lock. This code was accidentally dropped when the function
was rewritten for the Linux VFS.
This patch adds the required range locking to zfs_putpage().
It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upsteam
code and there's no harm in being extra careful here.
Original-patch-by: Gunnar Beutner <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #541
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When using a zvol to back a btrfs filesystem the btrfs mount
would hang. This was due to the bio completion callback used
in btrfs assuming that lower level drivers would never modify
the bio->bi_io_vecs after they were submitted via bio_submit().
If they are modified btrfs will miscalculate which pages need
to be unlocked resulting in a hang.
It's worth mentioning that other file systems such as ext[234]
and xfs work fine because they do not make the same assumption
in the bio completion callback.
The most straight forward way to fix the issue is to present
the semantics expected by btrfs. This is done by cloning the
bios attached to each request and then using the clones bvecs
to perform the required accounting. The clones are freed after
each read/write and the original unmodified bios are linked back
in to the request.
Signed-off-by: Chris Wedgwood <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #469
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There have been reports of ZFS deadlocking due to what appears to
be a lost IO. This patch addes some debugging to determine the
exact state of the IO which neither 1) completed, 2) failed, or
3) timed out after zio_delay_max (30) seconds.
This information will be logged using the ZFS FMA infrastructure
as a 'delay' event and posted to the internal zevent log. By
default the last 64 events will be kept in the log but the limit
is configurable via the zfs_zevent_len_max module option.
To dump the contents of the log use the 'zpool events -v' command
and look for the resource.fs.zfs.delay event. It will include
various information about the pool, vdev, and zio which may shed
some light on the issue.
In the context of this change the 120 second kernel blocked thread
watchdog has been disabled for synchronous IOs.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #930
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Create a kstat file which contains useful statistics about the
last N txgs processed. This can be helpful when analyzing pool
performance. The new KSTAT_TYPE_TXG type was added for this
purpose and it tracks the following statistics per-txg.
txg - Unique txg number
state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
birth; - Creation time
nread - Bytes read
nwritten; - Bytes written
reads - IOPs read
writes - IOPs write
open_time; - Length in nanoseconds the txg was open
quiesce_time - Length in nanoseconds the txg was quiescing
sync_time; - Length in nanoseconds the txg was syncing
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The interface for the ddt_zap_count() function assumes it can
never fail. However, internally ddt_zap_count() is implemented
with zap_count() which can potentially fail. Now because there
was no way to return the error to the caller a VERIFY was used
to ensure this case never happens.
Unfortunately, it has been observed that pools can be damaged in
such a way that zap_count() fails. The result is that the pool can
not be imported without hitting the VERIFY and crashing the system.
This patch reworks ddt_object_count() so the error can be safely
caught and returned to the caller. This allows a pool which has
be damaged in this way to be safely rewound for import.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #910
|
|
|
|
|
|
|
|
|
| |
This reverts commit a5c20e2a0a9046c06d86615fbf51dc04f12bba14 which
accidentally introduced a regression for real 4k sector devices.
See issue #1065 for details.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1065
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following warning was originally added to provide visibility
in to how often a dio gets heavily fragmented in to over 16 bios.
This can happen due to constraints imposed by the block device
and may have a negitive impact on performance but is otherwise
harmless. To prevent needless confusion and worry the message
has been removed.
kernel: WARNING: Resized bio's/dio to 32
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
When automounting a snapshot in the .zfs/snapshot directory
make sure to quote both the dataset name and the mount point.
This ensures that if either component contains spaces, which
are allowed, they get handled correctly.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1027
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the current code, logbias=throughput implies the following:
1) All synchronous writes are logged in indirect mode.
2) The slog is not used.
(1) makes sense because it avoids writing the data twice, which is
obviously a good thing when the user wants maximum pool throughput.
(2), however, is a surprising decision. Considering all writes are
indirect, the log record doesn't contain the actual data, only pointers
to DMU blocks. As a result, log records written in logbias=throughput
mode are quite small, and as such, it doesn't make any sense to write
them to the main pool since slogs are usually optimized for small
synchronous writes.
In fact, the current behavior is actually harmful for performance,
because log blocks and data blocks from dmu_sync() seldom have the same
allocation size and as a result are usually allocated from different
metaslabs. This means that if a spindle has to write both log blocks and
DMU blocks (which is likely to happen under heavy load), it will have to
seek between the two. Allocating the log blocks from the slog pool
instead of the main pool avoids these unnecessary seeks.
This commit makes ZFS use the slog on datasets with logbias=throughput.
Real-life performance testing shows a 50% synchronous write performance
increase with some large commit sizes, and no negative effect in other
cases.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1013
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, ZIL blocks are spread over vdevs using hint block pointers
managed by the ZIL commit code and passed to metaslab_alloc(). Spreading
log blocks accross vdevs is important for performance: indeed, using
mutliple disks in parallel decreases the ZIL commit latency, which is
the main performance metric for synchronous writes. However, the current
implementation suffers from the following issues:
1) It would be best if the ZIL module was not aware of such low-level
details. They should be handled by the ZIO and metaslab modules;
2) Because the hint block pointer is managed per log, simultaneous
commits from multiple logs might use the same vdevs at the same time,
which is inefficient;
3) Because dmu_write() does not honor the block pointer hint, indirect
writes are not spread.
The naive solution of rotating the metaslab rotor each time a block is
allocated for the ZIL or dmu_sync() doesn't work in practice because the
first ZIL block to be written is actually allocated during the previous
commit. Consequently, when metaslab_alloc() decides the vdev for this
block, it will do so while a bunch of other allocations are happening at
the same time (from dmu_sync() and other ZILs). This means the vdev for
this block is chosen more or less at random. When the next commit
happens, there is a high chance (especially when the number of blocks
per commit is slightly less than the number of the disks) that one disk
will have to write two blocks (with a potential seek) while other disks
are sitting idle, which defeats spreading and increases the commit
latency.
This commit introduces a new concept in the metaslab allocator:
fastwrites. Basically, each top-level vdev maintains a counter
indicating the number of synchronous writes (from dmu_sync() and the
ZIL) which have been allocated but not yet completed. When the metaslab
is called with the FASTWRITE flag, it will choose the vdev with the
least amount of pending synchronous writes. If there are multiple vdevs
with the same value, the first matching vdev (starting from the rotor)
is used. Once metaslab_alloc() has decided which vdev the block is
allocated to, it updates the fastwrite counter for this vdev.
The rationale goes like this: when an allocation is done with
FASTWRITE, it "reserves" the vdev until the data is written. Until then,
all future allocations will naturally avoid this vdev, even after a full
rotation of the rotor. As a result, pending synchronous writes at a
given point in time will be nicely spread over all vdevs. This contrasts
with the previous algorithm, which is based on the implicit assumption
that blocks are written instantaneously after they're allocated.
metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to
manually increase or decrease fastwrite counters, respectively. They
should be used with caution, as there is no per-BP tracking of fastwrite
information, so leaks and "double-unmarks" are possible. There is,
however, an assert in the vdev teardown code which will fire if the
fastwrite counters are not zero when the pool is exported or the vdev
removed. Note that as stated above, marking is also done implictly by
metaslab_alloc().
ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to
the metaslab when allocating (assuming ZIO does the allocation, which is
only true in the case of dmu_sync). This flag will also trigger an
unmark when zio_done() fires.
A side-effect of the new algorithm is that when a ZIL stops being used,
its last block can stay in the pending state (allocated but not yet
written) for a long time, polluting the fastwrite counters. To avoid
that, I've implemented a somewhat crude but working solution which
unmarks these pending blocks in zil_sync(), thus guaranteeing that
linguering fastwrites will get pruned at each sync event.
The best performance improvements are observed with pools using a large
number of top-level vdevs and heavy synchronous write workflows
(especially indirect writes and concurrent writes from multiple ZILs).
Real-life testing shows a 200% to 300% performance increase with
indirect writes and various commit sizes.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1013
|
|
|
|
|
|
|
|
| |
The following incorrect usage of cv_broadcast() was caught by
code inspection. The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
| |
The following incorrect usage of cv_signal and cv_broadcast()
was caught by code inspection. The cv_signal and cv_broadcast()
functions must be called under the associated mutex to preventing
racing with cv_wait().
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
| |
The following incorrect usage of cv_broadcast() was caught by
code inspection. The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.
Signed-off-by: Massimo Maggi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1038
|
|
|
|
|
|
|
|
| |
Prevent users from setting the zfs_vdev_aggregation_limit tuning
larger than SPA_MAXBLOCKSIZE.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #520
|
|
|
|
|
|
|
|
|
| |
Otherwise it will cause zpl_shares_lookup() to return a invalid
pointer when an error occurs.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Yuxuan Shui <[email protected]>
Closes #626 #885 #947 #977
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create. Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #873
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup. Instead
only the inamedata->flags are passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #873
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.
ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.
Signed-off-by: Yuxuan Shui <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #873
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The .write_super callback was removed the the super_operations
structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e.
All file systems are now expected to self manage writing any dirty
state assoicated with their super block.
ZFS never made use of this callback so it can simply be removed
from the super_operations structure.
Signed-off-by: Yuxuan Shui <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #873
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, the size of read and write requests on vdevs is aligned
according to the vdev's ashift, allocating a new ZIO buffer and padding
if need be.
This makes sense for write requests to prevent read/modify/write if the
write happens to be smaller than the device's internal block size.
For reads however, the rationale is less clear. It seems that the
original code aligns reads because, on Solaris, device drivers will
outright refuse unaligned requests.
We don't have that issue on Linux. Indeed, Linux block devices are able
to accept requests of any size, and take care of alignment issues
themselves.
As a result, there's no point in enforcing alignment for read requests
on Linux. This is a nice optimization opportunity for two reasons:
- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks
the path that the additional, useless padding data has to travel.
For example, when using 4k-sector drives that lie about their sector
size, using 512b read requests instead of 4k means that there will
be less data traveling down the ATA/SCSI interface, even though the
drive actually reads 4k from the platter.
The only exception is raidz, because raidz needs to read the whole
allocated block for parity.
This patch removes alignment enforcement for read requests, except on
raidz. Note that we also remove an assertion that checks that we're
aligning a top-level vdev I/O, because that's not the case anymore for
repair writes that results from failed reads.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1022
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are currently three vmem_size() consumers all of which are
part of the ARC implemention. However, since the expected behavior
of the Linux and Solaris virtual memory subsystems are so different
the behavior in each of these instances needs to be reevaluated.
* arc_evict_needed() - This is actually dead code. Arena support
was never added to the SPL and zio_arena is always NULL. This
support isn't needed so we simply remove this dead code.
* arc_memory_throttle() - On Solaris where virtual memory constitutes
almost all of the address space we can reasonably expect there to be
a fairly large amount free. However, on Linux by default we only
have about 100MB total and that's heavily used by the ARC. So the
expectation on Linux is that this will usually be a small value.
Therefore we remove the vmem_size() check for i386 systems because
the expectation is that it will be less than the zfs_write_limit_max.
* arc_init() - Here vmem_size() is used to initially size the ARC.
Since the ARC is currently backed by the virtual address space it
makes sense to use this as a limit on the ARC for 32-bit systems.
This code can be removed when the ARC is backed by the page cache.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #831
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Allow the zfs_txg_timeout variable to be dynamically tuned at run
time. By pulling it down out of the variable declaration it will
be evaluted each time through the loop.
The zfs_txg_timeout variable is now declared extern in a the common
sys/txg.h header rather than locally in dsl_scan.c. This prevents
potential type mismatches if the global variable needs to be used
elsewhere.
Move the module_param() code in to the same source file where
zfs_txg_timeout is declared. This is the most logical location.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit c409e4647f221ab724a0bd10c480ac95447203c3 introduced a
number of module parameters. This required several types to be
changed to accomidate the required module parameters Linux macros.
Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart. If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.
The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.
We correct that by changing the extern declaration to use the unsigned
long type and move these extern definitions in to the common arc.h
header. This should make ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.
Reported-by: Jorgen Lundman <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #749
|