| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zfsonlinux issue #3681 - lock order inversion between zvol_open() and
dsl_pool_sync()...zvol_rename_minors()
Remove trylock of spa_namespace_lock as it is no longer needed when
zvol minor operations are performed in a separate context with no
prior locking state; the spa_namespace_lock is no longer held
when bdev->bd_mutex or zfs_state_lock might be taken in the code
paths originating from the zvol minor operation callbacks.
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3681
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset
zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()
Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.
* Each taskq must be single threaded to ensure tasks are always
processed in the order in which they were dispatched.
* There is a taskq per-pool in order to keep the pools independent.
This way if one pool is suspended it will not impact another.
* The preferred location to dispatch a zvol minor task is a sync
task. In this context there is easy access to the spa_t and
minimal error handling is required because the sync task must
succeed.
Support for asynchronous zvol minor operations address issue #3681.
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2217
Closes #3678
Closes #3681
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a race condition when new transaction group is added
to dp->dp_dirty_datasets list by the zap_update in the zvol_update_volsize.
Meanwhile, before these dirty data are synchronized, the receive process
can cause that dmu_recv_end_sync is executed. Then finally dirty data
are going to be synchronized but the synchronization ends with the NULL
pointer dereference error.
Signed-off-by: ab-oe <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4116
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Close the race window in zvol_open() to prevent removal of
zvol_state in the 'first open' code path. Move the call to
check_disk_change() under zvol_state_lock to make sure the
zvol_media_changed() and zvol_revalidate_disk() called by
check_disk_change() are invoked with positive zv_open_count.
Skip opened zvols when removing minors and set private_data
to NULL for zvols that are not in use whose minors are being
removed, to indicate to zvol_open() that the state is gone.
Skip opened zvols when renaming minors to avoid modifying
zv_name that might be in use, e.g. in zvol_ioctl().
Drop zvol_state_lock before calling add_disk() when creating
minors to avoid deadlocks with zvol_open().
Wrap dmu_objset_find() with spl_fstran_mark()/unmark().
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #4344
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is
that the former takes a hold while the latter uses an existing hold.
`zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while
our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no
apparent reason, we inherited a `zvol_read()` function from
OpenSolaris that does `dmu_read_uio()`. illumos-gate also still
uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`,
which is more performant.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4316
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In illumos-gate, `zvol_read` and `zvol_write` are both passed uio_t
rather than bio_t. Since we are translating from bio to uio for both, we
might as well unify the logic and have code more similar to its illumos
counterpart. At the same time, we can fix some regressions that occurred
versus the original code from illumos-gate.
We refactor zvol_write to take uio and also correct the
following problems:
1. We did `dnode_hold()` on each IO when we already had a hold.
2. We would attempt to send writes that exceeded `DMU_MAX_ACCESS` to the
DMU.
3. We could call `zil_commit()` twice. In this case, this is because
Linux uses the `->write` function to send flushes and can aggregate the
flush with a write. If a synchronous write occurred with the flush, we
effectively flushed twice when there is no need to do that.
zvol_read also suffers from the first two problems. Other platforms
suffer from the first, so we leave that for a second patch so that there
is a discrete patch for them to cherry-pick.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4316
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As the comments in zvol_discard() suggested, the discard operation
could be logged to the zil. This is a port of the relevant code from
Nexenta as it was added in "701 UNMAP support for COMSTAR" and has been
attributed to the author of that commit.
References:
https://github.com/Nexenta/illumos-nexenta/commit/b77b923
https://github.com/zfsonlinux/zfs/blob/089fa91b/module/zfs/zvol.c#L637
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4950 files sometimes can't be removed from a full filesystem
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Reviewed by: Boris Protopopov <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/4950
https://github.com/illumos/illumos-gate/commit/4bb7380
Porting notes:
- ZoL currently does not log discards to zvols, so the portion of
this patch that modifies the discard logging to mark it as
freeing space has been discarded.
2. may_delete_now had been removed from zfs_remove() in ZoL.
It has been reintroduced.
3. We do not try to emulate vnodes, so the following lines are
not valid on Linux:
mutex_enter(&vp->v_lock);
may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp);
mutex_exit(&vp->v_lock);
This has been replaced with:
mutex_enter(&zp->z_lock);
may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped);
mutex_exit(&zp->z_lock);
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3557 dumpvp_size is not updated correctly when a dump zvol's size is changed
3558 setting the volsize on a dump device does not return back ENOSPC
3559 setting a volsize larger than the space available sometimes succeeds
3560 dumpadm should be able to remove a dump device
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Approved by: Albert Lee <[email protected]>
References:
https://www.illumos.org/issues/3559
https://github.com/illumos/illumos-gate/commit/c61ea56
Porting notes:
- Internal zvol.c changes not applied due to implementation differences.
The external interface and behavior was already consistent with the
latest upstream code.
- Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check. All
supported kernels (2.6.32 and newer) provide this interface.
Ported-by: Brian Behlendorf <[email protected]>
Closes #4217
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
References:
https://www.illumos.org/issues/5960
https://www.illumos.org/issues/5925
https://github.com/illumos/illumos-gate/commit/a2cdcdd
Porting notes:
- [lib/libzfs/libzfs_sendrecv.c]
- b8864a2 Fix gcc cast warnings
- 325f023 Add linux kernel device support
- 5c3f61e Increase Linux pipe buffer size on 'zfs receive'
- [module/zfs/zfs_vnops.c]
- 3558fd7 Prototype/structure update for Linux
- c12e3a5 Restructure zfs_readdir() to fix regressions
- [module/zfs/zvol.c]
- Function @zvol_map_block() isn't needed in ZoL
- 9965059 Prefetch start and end of volumes
- [module/zfs/dmu.c]
- Fixed ISO C90 - mixed declarations and code
- Function dmu_prefetch() 'int i' is initialized before
the following code block (c90 vs. c99)
- [module/zfs/dbuf.c]
- fc5bb51 Fix stack dbuf_hold_impl()
- 9b67f60 Illumos 4757, 4913
- 34229a2 Reduce stack usage for recursive traverse_visitbp()
- [module/zfs/dmu_send.c]
- Fixed ISO C90 - mixed declarations and code
- b58986e Use large stacks when available
- 241b541 Illumos 5959 - clean up per-dataset feature count code
- 77aef6f Use vmem_alloc() for nvlists
- 00b4602 Add linux kernel memory support
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Since uio now supports bvec, we can convert bio into uio and reuse
dmu_{read,write}_uio. This way, we can remove some duplicate code.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4078
|
|
|
|
|
|
|
|
|
|
|
| |
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4021
|
|
|
|
|
|
|
|
|
| |
All users of zv_lock were removed by 37f9dac, but we forgot to remove
it. Lets remove it as clean up.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3858
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit torvalds/linux@4246a0b63bd8f56a1469b12eafeb875b1041a451
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.
bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592bf ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Issue #3799
|
|
|
|
|
|
|
|
|
|
|
|
| |
37f9dac592bf5889c3efb305c48ac39b4c7dd140 replaced the end-start
calculation with a cached value, but neglected to update it on discard
operations. This can cause us to discard data not requested, causing
data loss on zvols.
Reported-by: Richard Connon <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3798
|
|
|
|
|
|
|
|
|
|
|
|
| |
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume. Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3659
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zfsonlinux/zfs@e20cd6f7a8922709b1aa2ecefd783390102d79e0 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503bc40e32d7f54a9b817264e81ce131b4 provided
suitable symbols for restoring this functionality 4 months later. We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3741
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.
Signed-off-by: Richard Yao <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Signed-off-by: Richard Yao <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
zvols should not be an entropy source for the kernel. Disable it to be
consistent with the upstream kernel.
torvalds/linux@b277da0a8a594308e17881f4926879bd5fca2a2d
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3713
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create. Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.
In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3591
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.
This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
64K array for the dbufs. The solution is to restore the maximum
size to ~10M. This patch specifically changes it to 16M which is
close enough.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3684
|
|
|
|
|
|
|
|
|
|
| |
The zvol_threads module option should be bounded to a reasonable
range. The taskq must have at least 1 thread and shouldn't have
more than 1,024 at most. The default value of 32 is a reasonable
default.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3614
|
|
|
|
|
|
|
|
|
|
|
| |
As of gcc version 5.1.1 a new warning has been added to detect the
use of a boolean in a switch statement (-Wswitch-bool). Resolve the
warning by explicitly casting the value to an integer type.
zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request':
error: switch condition has boolean value [-Werror=switch-bool]
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Over the years the default values for the taskqs used on Linux have
differed slightly from illumos. In the vast majority of cases this
was done to avoid creating an obnoxious number of idle threads which
would pollute the process listing.
With the addition of support for dynamic taskqs all multi-threaded
queues should be created as dynamic taskqs. This allows us to get
the best of both worlds.
* The illumos default values for the I/O pipeline can be restored.
These values are known to work well for most workloads. The only
exception is the zio write interrupt taskq which is changed to
ZTI_P(12, 8). At least under Linux more threads has been shown
to improve performance, see commit 7e55f4e.
* Reduces the number of idle threads on the system when it's not
under heavy load. The maximum number of threads will only be
created when they are required.
* Remove the vdev_file_taskq and rely on the system_taskq instead
which is now dynamic and may have up to 64-threads. Again this
brings us back inline with upstream.
* Tasks dispatched with taskq_dispatch_ent() are allowed to use
dynamic taskqs. The Linux taskq implementation supports this.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #3507
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19, it
the kernel artifically capped it at 1024 (BLK_DEF_MAX_SECTORS).
This cap was removed in torvalds/linux@34b48db. This patch changes
it to DMU_MAX_ACCESS (in sectors) and also changes the ASSERT in
dmu_tx_hold_write() to allow the maximum transfer size.
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3212
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Thank to commit a4430fce691d492aec382de0dfa937c05ee16500 we're
now correctly returning EROFS when opening a zvol on a read-only
pool. Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().
In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1343
|
|
|
|
|
|
|
|
| |
Rather than ASSERT when for some reason the readonly property of
a zvol can't be read cleanly handle the failure.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1343
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings
us back in line with upstream. In some cases this means simply
swapping the flags back. For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.
The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory. This is again the
same as upstream.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim. This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction. For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Under Linux the zvol_set_volsize() function was originally written
to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the
dmu_objset_own()/dmu_objset_disown() interfaces were added but
the ZVOL code wasn't updated to take advantage of them.
This was never an issue but after the dsl_pool_config changes
the code now takes the config lock twice. The cleanest solution
is to shift to using dmu_objset_own() which takes a long hold
on the dataset and does not hold the dsl pool lock.
This patch also slightly restructures the existing code such
that it more closely resembles the upstream Illumos code.
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2039
|
|
|
|
|
|
|
|
|
|
|
| |
Update zvol.c to conform to the style guidelines, verified by
running cstyle.pl on the source file. This patch contains
no functional changes.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issue #1821
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Early versions of ZFS coordinated the creation and destruction
of device minors from userspace. This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel. This significantly simplified the code.
However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux. This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel. Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes. This
will minimize, but not eliminate, the disruption to end users.
Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code. While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.
NOTES:
1) This patch introduces one subtle change in behavior which
could not be easily avoided. Prior to this change callers
of 'zfs create -V ...' were guaranteed that upon exit the
/dev/zvol/ block device link would be created or an error
returned. That's no longer the case. The utilities will no
longer block waiting for the symlink to be created. Callers
are now responsible for blocking, this is why a 'udev_wait'
call was added to the 'label' function in scripts/common.sh.
2) The read-only behavior of a ZVOL now solely depends on if
the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant
policy setting in the gendisk structure was removed. This
both simplifies the code and allows us to safely leverage
set_disk_ro() to issue a KOBJ_CHANGE uevent. See the
comment in the code for futher details on this.
3) Because __zvol_create_minor() and zvol_alloc() may now be
called in a sync task they must use KM_PUSHPAGE.
References:
illumos/illumos-gate@681d9761e8516a7dc5ab6589e2dfe717777e1123
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #1969
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3236 zio nop-write
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
illumos/illumos-gate@80901aea8e78a2c20751f61f01bebd1d5b5c2ba5
https://www.illumos.org/issues/3236
Porting Notes
1. This patch is being merged dispite an increased instance of
https://www.illumos.org/issues/3113 being triggered by ztest.
Ported-by: Brian Behlendorf <[email protected]>
Issue #1489
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/3598
illumos/illumos-gate@be6fd75a69ae679453d9cda5bff3326111e6d1ca
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting notes:
1. include/sys/zfs_context.h has been modified to render some new
macros inert until dtrace is available on Linux.
2. Linux-specific changes have been adapted to use SET_ERROR().
3. I'm NOT happy about this change. It does nothing but ugly
up the code under Linux. Unfortunately we need to take it to
avoid more merge conflicts in the future. -Brian
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3464 zfs synctask code needs restructuring
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/3464
illumos/illumos-gate@3b2aab18808792cbd248a12f1edf139b89833c13
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1495
|
|
|
|
|
|
|
|
|
|
| |
Linux kernel commit torvalds/linux@db2a144 changed the return type
of block_device_operations->release() to void. Detect the expected
prototype and defined our callout accordingly.
Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1494
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
One of the side effects of calling zvol_create_minors() in
zvol_init() is that all pools listed in the cache file will
be opened. Depending on the state and contents of your pool
this operation can take a considerable length of time.
Doing this at load time is undesirable because the kernel
is holding a global module lock. This prevents other modules
from loading and can serialize an otherwise parallel boot
process. Doing this after module inititialization also
reduces the chances of accidentally introducing a race
during module init.
To ensure that /dev/zvol/<pool>/<dataset> devices are
still automatically created after the module load completes
a udev rules has been added. When udev notices that the
/dev/zfs device has been create the 'zpool list' command
will be run. This then will cause all the pools listed
in the zpool.cache file to be opened.
Because this process in now driven asynchronously by udev
there is the risk of problems in downstream distributions.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #756
Issue #1020
Issue #1234
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following error will occur on some (possibly all) kernels
because blk_init_queue() will try to take the spinlock before
we initialize it.
BUG: spinlock bad magic on CPU#0, zpool/4054
lock: 0xffff88021a73de60, .magic: 00000000,
.owner: <none>/-1, .owner_cpu: 0
Pid: 4054, comm: zpool Not tainted 3.9.3 #11
Call Trace:
[<ffffffff81478ef8>] spin_dump+0x8c/0x91
[<ffffffff81478f1e>] spin_bug+0x21/0x26
[<ffffffff812da097>] do_raw_spin_lock+0x127/0x130
[<ffffffff8147d851>] _raw_spin_lock_irq+0x21/0x30
[<ffffffff812c2c1e>] cfq_init_queue+0x1fe/0x350
[<ffffffff812aacb8>] elevator_init+0x78/0x140
[<ffffffff812b2677>] blk_init_allocated_queue+0x87/0xb0
[<ffffffff812b26d5>] blk_init_queue_node+0x35/0x70
[<ffffffff812b271e>] blk_init_queue+0xe/0x10
[<ffffffff8125211b>] __zvol_create_minor+0x24b/0x620
[<ffffffff81253264>] zvol_create_minors_cb+0x24/0x30
[<ffffffff811bd9ca>] dmu_objset_find_spa+0xea/0x510
[<ffffffff811bda71>] dmu_objset_find_spa+0x191/0x510
[<ffffffff81253ea2>] zvol_create_minors+0x92/0x180
[<ffffffff811f8d80>] spa_open_common+0x250/0x380
[<ffffffff811f8ece>] spa_open+0xe/0x10
[<ffffffff8122817e>] pool_status_check.part.22+0x1e/0x80
[<ffffffff81228a55>] zfsdev_ioctl+0x155/0x190
[<ffffffff8116a695>] do_vfs_ioctl+0x325/0x5a0
[<ffffffff8116a950>] sys_ioctl+0x40/0x80
[<ffffffff814812c9>] ? do_page_fault+0x9/0x10
[<ffffffff81483929>] system_call_fastpath+0x16/0x1b
zd0: unknown partition table
We fix this by calling spin_lock_init before blk_init_queue.
The manner in which zvol_init() initializes structures is
suspectible to a race between initialization and a probe on
a zvol. We reorganize zvol_init() to prevent that.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
| |
By definitition these allocations will never fail. For
consistency with the rest of the code remove this dead error
handling code.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1558
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices. By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/. When set to
'visible' all zvol snapshots for the dataset will be visible.
This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created. When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions. This leads
to a variety of issues:
1) The zvol partition tables will be read in the context of
the `modprobe zfs` for automatically imported pools. This
is undesirable and should be done asynchronously, but for
now reducing the number of visible devices helps.
2) Udev expects to be able to complete its work for a new
block devices fairly quickly. When many zvol devices are
added at the same time this is no longer be true. It can
lead to udev timeouts and missing /dev/zvol links.
3) Simply having lots of devices in /dev/ can be aukward from
a management standpoint. Hidding the devices your unlikely
to ever use helps with this. Any snapshot device which is
needed can be made visible by changing the snapdev property.
NOTE: This patch changes the default behavior for zvols which
was effectively 'snapdev=visible'.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1235
Closes #945
Issue #956
Issue #756
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The changes to zvol.c were never merged from the last onnv_147
bulk update. This was because zvol.c was largely rewritten
for Linux making it fairly easy to miss these sorts of changes.
This causes a regression when importing a zpool with zvols
read-only. This does not impact pool which only contain
filesystem datasets.
References:
illumos/illumos-gate@f9af39b
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1332
Closes #1333
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.
Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:
vdev_opts_t 52
zil_replay_func_t 24
zio_compress_info_t 24
zio_checksum_info_t 9
space_map_ops_t 7
arc_byteswap_func_t 5
The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1300
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 65d56083b4617a4cade0cff68cbbaf68114169d6 fixes the lock
inversion between spa_namespace_lock and bdev->bd_mutex but only
for the first user of spa_namespace_lock: dmu_objset_own().
Later spa_namespace_lock gets acquired by dsl_prop_get_integer()
though dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()->
spa_open()->spa_open_common() without this "protection". By
moving the mutex release after this second use, even this
acquisition of the lock is "protected" by the ERESTARTSYS trick.
Signed-off-by: Massimo Maggi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1220
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock. But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.
To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock(). Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.
When it is contended we risk a lock inversion if we were to
block waiting for the lock. Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again. This process can be repeated safely until both
locks are acquired.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jorgen Lundman <[email protected]>
Closes #612
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented. This prevented
a zpool from use a zvol for its slog device.
This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c. Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system. If
they match we know the device is a zvol.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1131
|
|
|
|
|
|
|
|
|
|
| |
If zvol_alloc() fails zv will be set to NULL and dereferenced
in out_dmu_objset_disown. To avoid this entirely the zv->objset
line is moved up in to the success block.
Original-patch-by: Jorgen Lundman <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1109
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It doesn't make sense for a zvol to use the default system I/O
scheduler because it is a virtual device. Therefore, we change
the default scheduler to 'noop' for zvols provided that the
elevator_change() function is available. This interface has
been available since Linux 2.6.36 and appears in the RHEL 6.x
kernels.
We deliberately do not implement the method for older kernels
because it was racy and could result in system crashes. It's
better to simply manually tune the scheduler for these kernels.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1017
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, when processing DISCARD requests, zvol_discard() calls
dmu_free_long_range() with the precise offset and size of the request.
Unfortunately, this is not optimal for requests that are not aligned to
the zvol block boundaries. Indeed, in the case of an unaligned range,
dnode_free_range() will zero out the unaligned parts. Not only is this
useless since we are not freeing any space by doing so, it is also slow
because it translates to a read-modify-write operation.
This patch fixes the issue by rounding up the discard start offset to
the next volume block boundary, and rounding down the discard end
offset to the previous volume block boundary.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1010
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
illumos/illumos-gate@2e2c135528b3edfe9aaf67d1f004dc0202fa1a54
Illumos changeset: 13780:6da32a929222
3100 zvol rename fails with EBUSY when dirty
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Adam H. Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Garrett D'Amore <[email protected]>
Approved by: Eric Schrock <[email protected]>
Ported-by: Etienne Dechamps <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #995
|