| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3875 panic in zfs_root() after failed rollback
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Gordon Ross <[email protected]>
References:
https://www.illumos.org/issues/3875
illumos/illumos-gate@91948b51b8e978ddc88a36b2bc3ae83c20cdc9aa
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3888 zfs recv -F should destroy any snapshots created since
the incremental source
Reviewed by: George Wilson <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Peng Dai <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/3888
illumos/illumos-gate@34f2f8cf94052481c81be2e134b94a00b501bf21
Porting notes:
1. Commit 1fde1e37208c2f56c72c70a06676676f04b65998 wrapped a
declaration in dsl_dataset_modified_since_lastsnap in ASSERTV().
The ASSERTV() and local variable have been removed to avoid an
unused variable warning.
Signed-off-by: Brian Behlendorf <[email protected]>
Ported-by: Richard Yao <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3894 zfs should not allow snapshot of inconsistent dataset
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Gordon Ross <[email protected]>
References:
https://www.illumos.org/issues/3894
illumos/illumos-gate@ca48f36f20f6098ceb19d5b084b6b3d4b8eca9fa
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Christopher Siden <[email protected]>
References:
https://www.illumos.org/issues/3740
illumos/illumos-gate@a7a845e4bf22fd1b2a284729ccd95c7370a0438c
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting notes:
1. 13fe019870c8779bf2f5b3ff731b512cf89133ef introduced a merge conflict
in dsl_dataset_user_release_tmp where some variables were moved
outside of the preprocessor directive.
2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
because this commit refactors the code, adding a new KM_SLEEP
allocation. It is not clear to me whether this should be converted
to KM_PUSHPAGE.
3. We had a merge conflict in libzfs_sendrecv.c because of copyright
notices.
4. Several small C99 compatibility fixed were made.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3744 zfs shouldn't ignore errors unmounting snapshots
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Christopher Siden <[email protected]>
References:
https://www.illumos.org/issues/3744
illumos/illumos-gate@fc7a6e3fefc649cb65c8e2a35d194781445008b0
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting notes:
1. There is no clear way to distinguish between a failure when we
tried to unmount the snapdir of a zvol (which does not exist)
and the failure when we try to unmount a snapdir of a dataset,
so the changes to zfs_unmount_snap() were dropped in favor of
an altered Linux function that unconditionally returns 0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Approved by: Christopher Siden <[email protected]>
References:
https://www.illumos.org/issues/3742
illumos/illumos-gate@f7170741490edba9d1d9c697c177c887172bc741
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting notes:
1. The change to zfs_vfsops.c was dropped because it involves
zfs_mount_label_policy, which does not exist in the Linux port.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3741 zfs needs better comments
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Approved by: Christopher Siden <[email protected]>
References:
https://www.illumos.org/issues/3741
illumos/illumos-gate@3e30c24aeefdee1631958ecf17f18da671781956
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/3582
illumos/illumos-gate@0689f76
Ported by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Approved by: Gordon Ross <[email protected]>
References:
https://www.illumos.org/issues/3642
https://www.illumos.org/issues/3643
illumos/illumos-gate@4a92375985c37d61406d66cd2b10ee642eb1f5e7
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting Notes:
1. The alignment assumptions for the tx_cpu structure assume that
a kmutex_t is 8 bytes. This isn't true under Linux but tc_pad[]
was adjusted anyway for consistency since this structure was
never carefully aligned in ZoL. If careful alignment does impact
performance significantly this should be reworked to be portable.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/3598
illumos/illumos-gate@be6fd75a69ae679453d9cda5bff3326111e6d1ca
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting notes:
1. include/sys/zfs_context.h has been modified to render some new
macros inert until dtrace is available on Linux.
2. Linux-specific changes have been adapted to use SET_ERROR().
3. I'm NOT happy about this change. It does nothing but ugly
up the code under Linux. Unfortunately we need to take it to
avoid more merge conflicts in the future. -Brian
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3588 provide zfs properties for logical (uncompressed) space
used and referenced
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/3588
illumos/illumos-gate@77372cb0f35e8d3615ca2e16044f033397e88e21
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3537 want pool io kstats
Reviewed by: George Wilson <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: Sa?o Kiselkov <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Reviewed by: Brendan Gregg <[email protected]>
Approved by: Gordon Ross <[email protected]>
References:
http://www.illumos.org/issues/3537
illumos/illumos-gate@c3a6601
Ported by: Cyril Plisko <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Porting Notes:
1. The patch was restructured to take advantage of the existing
spa statistics infrastructure. To accomplish this the kstat
was moved in to spa->io_stats and the init/destroy code moved
to spa_stats.c.
2. The I/O kstat was simply named <pool> which conflicted with the
pool directory we had already created. Therefore it was renamed
to <pool>/io
3. An update handler was added to allow the kstat to be zeroed.
|
|
|
|
|
|
| |
This is required to make Illumos 3962 merge.
Signed-off-by: Richard Yao <[email protected]>
|
|
|
|
|
|
|
|
| |
This resolves merge conflicts when merging Illumos #3588 and Illumos #4047.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems. Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set. The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.
By default Posix ACLs are disabled but they may be enabled with
the new 'aclmode=noacl|posixacl' property. Set the property to
'posixacl' to enable them. If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.
This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).
http://www.tuxera.com/community/posix-test-suite/
Signed-off-by: Massimo Maggi <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #170
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently there is no mechanism to inspect which dbufs are being
cached by the system. There are some coarse counters in arcstats
by they only give a rough idea of what's being cached. This patch
aims to improve the current situation by adding a new dbufs kstat.
When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash. For each dbuf it will dump detailed information
about the buffer. It will also dump additional information about
the referenced arc buffer and its related dnode. This provides a
more complete view in to exactly what is being cached.
With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working. For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming. Or a
similar list could be generated based on dnode type. Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Prakash Surya <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
This change adds a new kstat to gain some visibility into the
amount of time spent in each call to dmu_tx_assign. A histogram
is exported via the new dmu_tx_assign file. The information
contained in this histogram is the frequency dmu_tx_assign
took to complete given an interval range.
Signed-off-by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change is an attempt to add visibility in to how txgs are being
formed on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to txg. These entries are then exported through the kstat
interface, which can then be interpreted in userspace.
For each txg, the following information is exported:
* Unique txg number (uint64_t)
* The time the txd was born (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted)
* The number of reserved bytes for the txg (uint64_t)
* The number of bytes read during the txg (uint64_t)
* The number of bytes written during the txg (uint64_t)
* The number of read operations during the txg (uint64_t)
* The number of write operations during the txg (uint64_t)
* The time the txg was closed (hrtime_t)
* The time the txg was quiesced (hrtime_t)
* The time the txg was synced (hrtime_t)
Note that while the raw kstat now stores relative hrtimes for the
open, quiesce, and sync times. Those relative times are used to
calculate how long each state took and these deltas and printed by
output handlers.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
| |
This reverts commit e95853a331529a6cb96fdf10476c53441e59f4e1.
|
|
|
|
|
|
| |
This reverts commit 92334b14ec378b1693573b52c09816bbade9cf3e.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict
whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the
kernel's .config. We rename the symbols to avoid the conflict.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1789
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3464 zfs synctask code needs restructuring
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/3464
illumos/illumos-gate@3b2aab18808792cbd248a12f1edf139b89833c13
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1495
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once
Reviewed by: George Wilson <[email protected]>
Reviewed by: Chris Siden <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Reviewed by: Bill Pijewski <[email protected]>
Reviewed by: Dan Kruchinin <[email protected]>
Approved by: Eric Schrock <[email protected]>
References:
https://www.illumos.org/issues/2882
https://www.illumos.org/issues/2883
https://www.illumos.org/issues/2900
illumos/illumos-gate@4445fffbbb1ea25fd0e9ea68b9380dd7a6709025
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1293
Porting notes:
WARNING: This patch changes the user/kernel ABI. That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules. Ensure you load the matching kernel
modules from master after updating the utilities. Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:
$ zpool list
failed to read pool configuration: bad address
no pools available
$ zfs list
no datasets available
Add zvol minor device creation to the new zfs_snapshot_nvl function.
Remove the logging of the "release" operation in
dsl_dataset_user_release_sync(). The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos. This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).
Squash some "may be used uninitialized" warning/erorrs.
Fix some printf format warnings for %lld and %llu.
Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.
Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit torvalds/linux@2233f31aade393641f0eaed43a71110e629bb900
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.
To handle this the code was reworked to use the new ->iterate
interface. Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.
Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions. Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.
Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function. The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names. Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1653
Issue #1591
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
http://www.illumos.org/issues/3618
illumos/illumos-gate@c55e05cb35da47582b7afd38734d2f0d9c6deb40
Notes on porting to ZFS on Linux:
The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.
One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:
zfs_vdev_time_shift = 6
to
zfs_vdev_time_shift = 29
The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.
(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)
Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).
Ported-by: Cyril Plisko <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1643
|
|
|
|
|
|
|
|
|
|
|
|
| |
3964 L2ARC should always compress metadata buffers
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
References:
https://www.illumos.org/issues/3964
Ported-by: Brian Behlendorf <[email protected]>
Closes #1379
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3137 L2ARC compression
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
illumos/illumos-gate@aad02571bc59671aa3103bb070ae365f531b0b62
https://www.illumos.org/issues/3137
http://wiki.illumos.org/display/illumos/L2ARC+Compression
Notes for Linux port:
A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set. This allows the legacy behavior
to be preserved.
Ported by: James H <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1379
|
|
|
|
|
|
|
|
|
|
| |
These functions are used in neither Illumos nor ZFSOnLinux. They appear
to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove
them.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1614
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change adds a new kstat to gain some visibility into the amount of
time spent in each call to dmu_tx_assign. A histogram is exported via
a new dmu_tx_assign_histogram-$POOLNAME file. The information contained
in this histogram is the frequency dmu_tx_assign took to complete given
an interval range. For example, given the below histogram file:
$ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank
12 1 0x01 32 1536 19792068076691 20516481514522
name type data
1 us 4 859
2 us 4 252
4 us 4 171
8 us 4 2
16 us 4 0
32 us 4 2
64 us 4 0
128 us 4 0
256 us 4 0
512 us 4 0
1024 us 4 0
2048 us 4 0
4096 us 4 0
8192 us 4 0
16384 us 4 0
32768 us 4 1
65536 us 4 1
131072 us 4 1
262144 us 4 4
524288 us 4 0
1048576 us 4 0
2097152 us 4 0
4194304 us 4 0
8388608 us 4 0
16777216 us 4 0
33554432 us 4 0
67108864 us 4 0
134217728 us 4 0
268435456 us 4 0
536870912 us 4 0
1073741824 us 4 0
2147483648 us 4 0
one can see most calls to dmu_tx_assign completed in 32us or less, but a
few outliers did not. Specifically, 4 of the calls took between 262144us
and 131072us. This information is difficult, if not impossible, to gather
without this change.
Signed-off-by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1584
|
|
|
|
|
|
|
| |
The path to code is also changed in zfsonlinux.
Signed-off-by: Brian Behlendorf <[email protected]>
Issues #1566
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
illumos/illumos-gate@1b912ec7100c10e7243bf0879af0fe580e08c73d
https://www.illumos.org/issues/3498
Ported-by: Brian Behlendorf <[email protected]>
Closes #1249
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3122 zfs destroy filesystem should prefetch blocks
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
illumos/illumos-gate@b4709335aa83dcbfd0dba33c9be21fcabebd28e4
https://www.illumos.org/issues/3122
Ported-by: Brian Behlendorf <[email protected]>
Closes #1565
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.
Tested with xfstests 285 and 286. All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.
Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.
An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1384
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The non-blocking allocation handlers in nvlist_alloc() would be
mistakenly assigned if any flags other than KM_SLEEP were passed.
This meant that nvlists allocated with KM_PUSHPUSH or other KM_*
debug flags were effectively always using atomic allocations.
While these failures were unlikely it could lead to assertions
because KM_PUSHPAGE allocations in particular are guaranteed to
succeed or block. They must never fail.
Since the existing API does not allow us to pass allocation
flags to the private allocators the cleanest thing to do is to
add a KM_PUSHPAGE allocator.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes zfsonlinux/spl#249
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Will Andrews <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1503
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing)
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
illumos/illumos-gate@16a4a8074274d2d7cc408589cf6359f4a378c861
https://www.illumos.org/issues/3552
https://www.illumos.org/issues/3564
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1513
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
argument is zero
Reviewed by Matt Ahrens <[email protected]>
Reviewed by George Wilson <[email protected]>
Approved by Eric Schrock <[email protected]>
References:
illumos/illumos-gate@fb09f5aad449c97fe309678f3f604982b563a96f
https://illumos.org/issues/3006
Requires:
zfsonlinux/spl@1c6d149feb4033e4a56fb987004edc5d45288bcb
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1509
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The vn_rdwr() function performs I/O by calling the vfs_write() or
vfs_read() functions. These functions reside just below the system
call layer and the expectation is they have almost the entire 8k of
stack space to work with. In fact, certain layered configurations
such as ext+lvm+md+multipath require the majority of this stack to
avoid stack overflows.
To avoid this posibility the vn_rdwr() call in dump_bytes() has been
moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be
performed with the majority of the stack space available. This ends
up being very similiar to as if the I/O were issued via sys_write()
or sys_read().
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1399
Closes #1423
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Gordon Ross <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
illumos/illumos-gate@ec94d32
https://illumos.org/issues/3581
Notes for Linux port:
Earlier commit 08d08eb reduced contention on this taskq lock by simply
reducing the number of z_fr_iss threads from 100 to one-per-CPU. We
also optimized the taskq implementation in zfsonlinux/spl@3c6ed54.
These changes significantly improved unlink performance to acceptable
levels.
This patch further reduces time spent spinning on this lock by
randomly dispatching the work items over multiple independent task
queues. The Illumos ZFS developers stated that this lock contention
only arose after "3329 spa_sync() spends 10-20% of its time in
spa_free_sync_cb()" was landed. It's not clear if 3329 affects the
Linux port or not. I didn't see spa_free_sync_cb() show up in
oprofile sessions while unlinking large files, but I may just not
have used the right test case.
I tested unlinking a 1 TB of data with and without the patch and
didn't observe a meaningful difference in elapsed time. However,
oprofile showed that the percent time spent in taskq_thread() was
reduced from about 16% to about 5%. Aside from a possible slight
performance benefit this may be worth landing if only for the sake of
maintaining consistency with upstream.
Ported-by: Ned Bass <[email protected]>
Closes #1327
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: Richard Lowe <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Approved by: Eric Schrock <[email protected]>
References:
illumos/illumos-gate@01f55e48fb4d524eaf70687728aa51b7762e2e97
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335
Ported-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation
*) Usage of the cyclic interface was replaced by the delayed taskq
interface. This avoids the need to implement new compatibility
code and allows us to rely on the existing taskq implementation.
*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
because declaring externs in source files as was done in the
original patch is just plain wrong.
*) Instead of panicing the system when the deadman triggers a
zevent describing the blocked vdev and the first pending I/O
is posted. If the panic behavior is desired Linux provides
other generic methods to panic the system when threads are
observed to hang.
*) For reference, to delay zios by 30 seconds for testing you can
use zinject as follows: 'zinject -d <vdev> -D30 <pool>'
References:
illumos/illumos-gate@283b84606b6fc326692c03273de1774e8c122f9a
https://www.illumos.org/issues/3246
Ported-by: Brian Behlendorf <[email protected]>
Closes #1396
|
|
|
|
|
| |
Signed-off-by: Jan Engelhardt <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Install the common zfs kernel development headers under
/usr/src/zfs-<version>/ rather than in a kernel specific
directory. The kernel specific build products such as
zfs_config.h and Modules.symvers are left installed under
/usr/src/zfs-<version>/<kernel>.
This was done to be consistent with where dkms expects
kernel module source to be packaged. It also allows for
a common zfs-kmod-devel package which includes the headers,
and per-kernel zfs-kmod-devel-<kernel> packages.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices. By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/. When set to
'visible' all zvol snapshots for the dataset will be visible.
This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created. When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions. This leads
to a variety of issues:
1) The zvol partition tables will be read in the context of
the `modprobe zfs` for automatically imported pools. This
is undesirable and should be done asynchronously, but for
now reducing the number of visible devices helps.
2) Udev expects to be able to complete its work for a new
block devices fairly quickly. When many zvol devices are
added at the same time this is no longer be true. It can
lead to udev timeouts and missing /dev/zvol links.
3) Simply having lots of devices in /dev/ can be aukward from
a management standpoint. Hidding the devices your unlikely
to ever use helps with this. Any snapshot device which is
needed can be made visible by changing the snapdev property.
NOTE: This patch changes the default behavior for zvols which
was effectively 'snapdev=visible'.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1235
Closes #945
Issue #956
Issue #756
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.
Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:
vdev_opts_t 52
zil_replay_func_t 24
zio_compress_info_t 24
zio_checksum_info_t 9
space_map_ops_t 7
arc_byteswap_func_t 5
The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1300
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 4b2f65b253952c5103311cc8bb4b8cdc6836fd7e increased the user
space stack by 4x to resolve certain stack overflows. As such it
no longer makes sense to worry about a single extra page which
might or might not be part of the process stack. There is now
ample headroom for normal usage.
By eliminating this configure check we are also resolving the
following segfault which intentionally occurs at configure time
and may be logged in dmesg.
conftest[22156]: segfault at 7fbf18a47e48 ip 00000000004007fe
sp 00007fbf18a4be50 error 6 in conftest[400000+1000]
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3035 LZ4 compression support in ZFS and GRUB
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Christopher Siden <[email protected]>
References:
illumos/illumos-gate@a6f561b4aee75d0d028e7b36b151c8ed8a86bc76
https://www.illumos.org/issues/3035
http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS
This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux. Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.
Support for GRUB has been dropped from this patch. That code
is available but those changes will need to made to the upstream
GRUB package.
Lastly, several hunks of dead code were dropped for clarity. They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.
Ported-by: Eric Dillmann <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1217
|
|
|
|
|
|
|
|
|
|
|
|
| |
Retire the dmu_snapshot_id() function which was introduced in the
initial .zfs control directory implementation. There is already
an existing dsl_dataset_snap_lookup() which does exactly what we
need, and the dmu_snapshot_id() function as implemented is racy.
https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1238
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL. The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.
Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back. Then a new inode with the same inode number would
be create and collide with the existing cached inode. Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.
Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.
The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly. If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.
This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled. The inode
will then only be accessible from the cached dentries. Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated. Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.
Special care is taken for negative dentries since they do not
reference any inode. These dentires will be invalidate based
on when they were added to the dentry cache. Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.
Two nice side effects of this fix are:
* Removes the dependency on spl_invalidate_inodes(), it can now
be safely removed from the SPL when we choose to do so.
* zfs_znode_alloc() no longer requires a dentry to be passed.
This effectively reverts this portition of the code to its
upstream counterpart. The dentry is not instantiated more
correctly in the Linux ZPL layer.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #795
|