| Commit message (Collapse) | Author | Age | Files | Lines |
| |
This is foundational work for ZED.
Updates a leaf vdev's persistent device strings on the Linux platform
* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES
Some examples:
path: '/dev/sdb1'
devid: 'scsi-350000394a8ca4fbc-part1'
phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'
path: '/dev/mapper/mpatha'
devid: 'dm-uuid-mpath-35000c5006304de3f'
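How the stored strings might be used for matching can be sketched in Python. This is only an illustration under stated assumptions: the `LeafLabel` type, the `match_device` helper, and the preference order (devid, then phys_path, then path) are hypothetical and are not taken from the ZFS sources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeafLabel:
    """Hypothetical view of the strings stored in a leaf vdev label."""
    path: str
    devid: Optional[str] = None
    phys_path: Optional[str] = None


def match_device(label, devid=None, phys_path=None, path=None):
    """Return which stored string matched the arriving device, or None.

    The preference order is an assumption made for this sketch.
    """
    if label.devid and devid == label.devid:
        return "devid"
    if label.phys_path and phys_path == label.phys_path:
        return "phys_path"
    if path is not None and path == label.path:
        return "path"
    return None
```

A persistent string such as the devid still matches when the kernel hands the disk a different `/dev/sdX` name, which is the situation auto-online and auto-replace have to deal with.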
Signed-off-by: Don Brady <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2856
Closes #3978
Closes #4416
| |
Signed-off-by: Richard Laager <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4471
| |
Historically libblkid support was detected as part of configure
and optionally enabled. This was done because at the time support
for detecting ZFS pool vdevs had just been added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.
This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools. For distributions
which include a modern version of libblkid there is no change in
behavior. Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.
Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2448
| |
locality information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into account the following additional factors:
* Load of the vdevs (number of outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered, such as whether the device backing the vdev is a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
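The load and locality idea described above can be sketched in Python. This is a rough illustration, not the ported code: the `Child` type, the helper names, and the weight values are assumptions chosen for the example, loosely named after the tunables listed later in this message.

```python
from dataclasses import dataclass

# Illustrative weights only; the real tunables and defaults may differ.
ROTATING_INC = 0
ROTATING_SEEK_INC = 5
ROTATING_SEEK_OFFSET = 1024 * 1024
NON_ROTATING_INC = 0
NON_ROTATING_SEEK_INC = 1


@dataclass
class Child:
    name: str
    pending: int       # number of outstanding I/O requests
    last_offset: int   # offset of the last queued I/O
    rotating: bool     # True for rotating media


def child_load(child, offset):
    """Combine queue depth with a locality penalty for this request."""
    load = child.pending
    if not child.rotating:
        same = offset == child.last_offset
        return load + (NON_ROTATING_INC if same else NON_ROTATING_SEEK_INC)
    if offset == child.last_offset:
        return load + ROTATING_INC            # sequential, no seek
    if abs(offset - child.last_offset) < ROTATING_SEEK_OFFSET:
        return load + ROTATING_SEEK_INC // 2  # short seek
    return load + ROTATING_SEEK_INC           # long seek


def pick_child(children, offset):
    """Choose the mirror member with the lowest computed load."""
    return min(children, key=lambda c: child_load(c, offset))
```

With equal queue depths a random read lands on the SSD, while a sequential stream stays on the disk whose last I/O ended at the requested offset.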
The following are results from a setup with a 3-way mirror with 2 x HDs and
1 x SSD from a basic test running multiple parallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdev loads are only calculated from the set of valid candidates.
The following additional sysctls were added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes were based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
References:
https://github.com/freebsd/freebsd@5c7a6f5d
https://github.com/freebsd/freebsd@31b7f68d
https://github.com/freebsd/freebsd@e186f564
Performance Testing:
https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141
Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc946f82c76e0b34c2f0370ed9aea022 by my
colleague Andriy Gapon has been included. It applied perfectly, but
added a cstyle regression.
- This replaces 556011dbec2d10579819078559a77630fc559112 entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4334
| |
Commit d2f3e29 introduced the -p option which outputs full paths
for vdevs to multiple zpool subcommands. When this was merged
there was no conflict for this flag letter. However it's certain
there will be a conflict with the -p (parsable) flag used by other
subcommands. Therefore, -p is being changed to -P to avoid this.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4368
| |
The following options have been added to the zpool add, iostat,
list, status, and split subcommands. The default behavior was
not modified, from zfs(8).
-g Display vdev GUIDs instead of the normal short
device names. These GUIDs can be used in-place of
device names for the zpool
detach/offline/remove/replace commands.
-L Display real paths for vdevs resolving all symbolic
links. This can be used to lookup the current block
device name regardless of the /dev/disk/ path used
to open it.
-p Display full paths for vdevs instead of only the
last component of the path. This can be used in
conjunction with the -L flag.
This behavior may also be enabled using the following environment
variables.
ZPOOL_VDEV_NAME_GUID
ZPOOL_VDEV_NAME_FOLLOW_LINKS
ZPOOL_VDEV_NAME_PATH
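How the three display options combine can be sketched as follows. This is a minimal Python illustration; `vdev_display_name` and its keyword arguments are hypothetical names, and the precedence of the GUID option over the other two is an assumption.

```python
import os


def vdev_display_name(path, guid, show_guid=False, follow_links=False,
                      full_path=False):
    """Pick the name shown for a vdev (hypothetical helper).

    show_guid    ~ -g: print the GUID instead of a device name
    follow_links ~ -L: resolve symbolic links to the real device
    full_path    ~ -p: keep the full path, not only the last component
    """
    if show_guid:
        return str(guid)
    if follow_links:
        path = os.path.realpath(path)
    return path if full_path else os.path.basename(path)
```

By default only the last path component is shown, which matches the unchanged behavior described above.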
This change is based on work originally started by Richard Yao
to add a -g option, then extended by @ilovezfs to add a -L option
for openzfsonosx. Those changes have been merged and refactored,
a -p option has been added, and the options have been extended to
all relevant zpool subcommands.
Original-patch-by: Richard Yao <[email protected]>
Extended-by: ilovezfs <[email protected]>
Extended-by: Brian Behlendorf <[email protected]>
Signed-off-by: ilovezfs <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2011
Closes #4341
| |
Set a limit for the largest compressed block which can be written
to an L2ARC device. By default this limit is set to 16M so there
is no change in behavior.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Elling <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #4323
| |
Reintroduce a slightly adapted version of the Illumos logic for
synchronous unlinks. The basic idea here is that only files
smaller than zfs_delete_blocks (20480) blocks should be deleted
synchronously. Unlinking larger files should be handled
asynchronously to minimize impact to the caller.
To accomplish this iput() which is responsible for calling
zfs_znode_delete() on Linux is only called in the delete_now
path. Otherwise zfs_async_iput() is used which allows the
last reference to be dropped by a taskq thread effectively
making the removal asynchronous.
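The size-based decision can be sketched in a few lines of Python. This is an illustration only: `unlink_mode` is a hypothetical helper, and how the block count of a file is derived is left out.

```python
ZFS_DELETE_BLOCKS = 20480  # default value quoted in this commit message


def unlink_mode(file_blocks, delete_blocks=ZFS_DELETE_BLOCKS):
    """Return 'sync' for small files and 'async' for larger ones."""
    return "sync" if file_blocks < delete_blocks else "async"
```

Lowering `delete_blocks` pushes more unlinks onto the asynchronous path, which is the effect the module option is meant to expose for performance analysis.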
Porting notes:
- Add zfs_delete_blocks module option for performance analysis.
The default value is DMU_MAX_DELETEBLKCNT which is the same
as upstream. Reducing this value means that smaller files
will be unlinked asynchronously like large files.
- All occurrences of zfsvfs changed to zsb.
Ported-by: KernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
| |
6251 - add tunable to disable free_bpobj processing
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Simon Klinkert <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Albert Lee <[email protected]>
Reviewed by: Xin Li <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/6251
https://github.com/illumos/illumos-gate/commit/139510f
Porting notes:
- Added as module option declaration.
- Added to zfs-module-parameters.5 man page.
Ported-by: Signed-off-by: Brian Behlendorf <[email protected]>
| |
Under RHEL6/CentOS6 the default stack size must be increased to 32K
to prevent overflowing the stack when running ztest. This isn't an
issue for other distributions due to either the version of pthreads
or perhaps the compiler. Doubling the stack size resolves the
issue safely for all distributions and leaves us some headroom.
$ sudo -E ztest -V -T 300 -f /var/tmp/
5 vdevs, 7 datasets, 23 threads, 300 seconds...
loading space map for vdev 0 of 1, metaslab 0 of 30 ...
...
loading space map for vdev 0 of 1, metaslab 14 of 30 ...
child died with signal 11
Exited ztest with error 3
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4215
| |
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Approved by: Gordon Ross <[email protected]>
References:
https://www.illumos.org/issues/5987 zfs prefetch code needs work
illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work
Porting notes:
- [module/zfs/dbuf.c]
- 5f6d0b6 Handle block pointers with a corrupt logical size
- [module/zfs/dmu_zfetch.c]
- c65aa5b Fix gcc missing parenthesis warnings
- 428870f Update core ZFS code from build 121 to build 141.
- 79c76d5 Change KM_PUSHPAGE -> KM_SLEEP
- b8d06fc Switch KM_SLEEP to KM_PUSHPAGE
- Account for ISO C90 - mixed declarations and code - warnings
- Module parameters (new/changed):
- Replaced zfetch_block_cap with zfetch_max_distance
(Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024))
- Preserved zfs_prefetch_disable as 'int' for consistency with
existing Linux module options.
- [include/sys/trace_arc.h]
- Added new tracepoints
- DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async);
- DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch);
- [man/man5/zfs-module-parameters.5]
- Updated man page
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
| |
4891 want zdb option to dump all metadata
Reviewed by: Sonu Pillai <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Richard Lowe <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
We'd like a way for zdb to dump metadata in a machine-readable
format, so that we can bring that back from a customer site for
in-house diagnosis. Think of it as a crash dump for zpools,
which can be used for post-mortem analysis of a malfunctioning
pool.
References:
https://www.illumos.org/issues/4891
https://github.com/illumos/illumos-gate/commit/df15e41
Porting notes:
- [cmd/zdb/zdb.c]
- a5778ea zdb: Introduce -V for verbatim import
- In main() getopt 'opt' variable removed and the code was
brought back in line with illumos.
- [lib/libzpool/kernel.c]
- 1e33ac1 Fix Solaris thread dependency by using pthreads
- f0e324f Update utsname support
- 4d58b69 Fix vn_open/vn_rdwr error handling
- In vn_open() allocate 'dumppath' on heap instead of stack
- Properly handle 'dump_fd == -1' error path
- Free 'realpath' after added vn_dumpdir_code block
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
| |
Signed-off-by: nathancheek <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4184
| |
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
References:
https://www.illumos.org/issues/5960
https://www.illumos.org/issues/5925
https://github.com/illumos/illumos-gate/commit/a2cdcdd
Porting notes:
- [lib/libzfs/libzfs_sendrecv.c]
- b8864a2 Fix gcc cast warnings
- 325f023 Add linux kernel device support
- 5c3f61e Increase Linux pipe buffer size on 'zfs receive'
- [module/zfs/zfs_vnops.c]
- 3558fd7 Prototype/structure update for Linux
- c12e3a5 Restructure zfs_readdir() to fix regressions
- [module/zfs/zvol.c]
- Function @zvol_map_block() isn't needed in ZoL
- 9965059 Prefetch start and end of volumes
- [module/zfs/dmu.c]
- Fixed ISO C90 - mixed declarations and code
- Function dmu_prefetch() 'int i' is initialized before
the following code block (c90 vs. c99)
- [module/zfs/dbuf.c]
- fc5bb51 Fix stack dbuf_hold_impl()
- 9b67f60 Illumos 4757, 4913
- 34229a2 Reduce stack usage for recursive traverse_visitbp()
- [module/zfs/dmu_send.c]
- Fixed ISO C90 - mixed declarations and code
- b58986e Use large stacks when available
- 241b541 Illumos 5959 - clean up per-dataset feature count code
- 77aef6f Use vmem_alloc() for nvlists
- 00b4602 Add linux kernel memory support
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
| |
5745 zfs set allows only one dataset property to be set at a time
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Bayard Bell <[email protected]>
Reviewed by: Richard PALO <[email protected]>
Reviewed by: Steven Hartland <[email protected]>
Approved by: Rich Lowe <[email protected]>
References:
https://www.illumos.org/issues/5745
https://github.com/illumos/illumos-gate/commit/3092556
Porting notes:
- Fix the missing braces around initializer, zfs_cmd_t zc = {"\0"};
- Remove extra format argument in zfs_do_set()
- Declare at the top:
- zfs_prop_t prop;
- nvpair_t *elem;
- nvpair_t *next;
- int i;
- Additionally initialize:
- int added_resv = 0;
- zfs_prop_t prop = 0;
- Assign 0 instead of NULL for uint64_t types.
- zc->zc_nvlist_conf = '\0';
- zc->zc_nvlist_src = '\0';
- zc->zc_nvlist_dst = '\0';
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3574
| |
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.
Signed-off-by: DHE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4110
| |
Correct some misspelled words and grammatical errors, and remove
trailing white space in the man pages.
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4115
| |
Signed-off-by: Turbo Fredriksson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
| |
Signed-off-by: Turbo Fredriksson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes: #3762
| |
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. Specifying a ZFS
block size for zpios is optional; it defaults to 128K if no
block size is specified.
`zpios ... -S size | --blocksize size ...`
This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`
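The accepted format can be illustrated with a small Python parser. This is a sketch of the rules stated above, not the zpios implementation: `parse_block_sizes` is a hypothetical name, and only the `K` and `M` suffixes are handled.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20}
MIN_BLOCK = 128 << 10   # 128K
MAX_BLOCK = 16 << 20    # 16M


def parse_block_sizes(spec):
    """Parse a list such as '128K,256K,1M' into sizes in bytes."""
    sizes = []
    for token in spec.split(","):
        token = token.strip().upper()
        if not token:
            raise ValueError("empty block size")
        if token[-1] in UNITS:
            value = int(token[:-1]) * UNITS[token[-1]]
        else:
            value = int(token)
        power_of_two = value > 0 and (value & (value - 1)) == 0
        if not power_of_two or not (MIN_BLOCK <= value <= MAX_BLOCK):
            raise ValueError(f"unsupported block size: {token}")
        sizes.append(value)
    return sizes
```

Each parsed size would then drive one test pass, which is why the example run reports one row per thread count and block size.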
Example run below
(non realistic results from a VM and output abbreviated for space)
```
--regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
--threaddelay=0 --cleanup --human-readable --verbose --cleanup
--blocksize=128K,256K,512K,1M
th-cnt rg-cnt rg-sz ch-sz blksz wr-data wr-bw rd-data rd-bw
---------------------------------------------------------------------
4 750 8m 1m 128k 5g 90.06m 5g 93.37m
4 750 8m 1m 256k 5g 79.71m 5g 99.81m
4 750 8m 1m 512k 5g 42.20m 5g 93.14m
4 750 8m 1m 1m 5g 35.51m 5g 89.36m
8 750 8m 1m 128k 5g 85.49m 5g 90.81m
8 750 8m 1m 256k 5g 61.42m 5g 99.24m
8 750 8m 1m 512k 5g 49.09m 5g 108.78m
16 750 8m 1m 128k 5g 86.28m 5g 88.73m
16 750 8m 1m 256k 5g 64.34m 5g 93.47m
16 750 8m 1m 512k 5g 68.84m 5g 124.47m
16 750 8m 1m 1m 5g 53.97m 5g 97.20m
---------------------------------------------------------------------
```
Signed-off-by: Don Brady <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3795
Closes #2071
| |
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e7409e6d85289f4955adf07fac834f5. There are two issues with the
mount option handling:
* Libzfs has historically included "xattr" in its list of default mount
options. This overrides the dataset property, so the dataset is always
configured to use directory-based xattrs even when the xattr dataset
property is set to off or sa. Address this by removing "xattr" from
the set of default mount options in libzfs.
* There was no way to enable system attribute-based extended attributes
using temporary mount options. Add the mount options "saxattr" and
"dirxattr" which enable the xattr behavior their names suggest. This
approach has the advantages of mirroring the valid xattr dataset
property values and following existing conventions for mount option
names.
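The intended precedence can be sketched in Python. This is an illustration under assumptions: `effective_xattr` is a hypothetical helper, the property value `on` is taken to mean directory-based xattrs, and a plain `xattr` mount option is treated as selecting the directory-based form.

```python
def effective_xattr(dataset_property, mount_options=()):
    """Resolve the xattr mode from the property and temporary options.

    Returns 'dir', 'sa', or 'off'. Temporary mount options, when
    present, override the dataset property.
    """
    mode = {"on": "dir", "sa": "sa", "off": "off"}[dataset_property]
    for opt in mount_options:
        if opt == "saxattr":
            mode = "sa"
        elif opt in ("dirxattr", "xattr"):
            mode = "dir"
    return mode
```

Passing `["xattr"]` as the options reproduces the old behavior, where the libzfs default silently overrode `xattr=sa`. With the default removed, the property value is honored.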
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3787
| |
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume. Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.
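The two regions can be computed as in the following Python sketch. `prefetch_ranges` is a hypothetical helper; the clamping for volumes smaller than the prefetch size is an assumption of this example.

```python
def prefetch_ranges(volsize, prefetch_bytes):
    """Return (offset, length) pairs covering the head and tail of a volume."""
    length = min(prefetch_bytes, volsize)
    if length <= 0:
        return []
    head = (0, length)
    tail = (volsize - length, length)
    # A very small volume yields a single range covering all of it.
    return [head] if tail == head else [head, tail]
```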
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3659
| |
Internally ZFS keeps a small log to facilitate debugging. By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.
$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp message
1441141408 spa=tank async request task=1
1441141408 txg 70 open pool version 5000; software version 5000/5; ...
1441141409 spa=tank async request task=32
1441141409 txg 72 import pool version 5000; software version 5000/5; ...
1441141414 command: lt-zpool import tank
Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log. As mentioned above the kernel debug log can be accessed
through the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.
$ ZFS_DEBUG=on ./cmd/ztest/ztest -V
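For post-processing, the kstat text shown above can be split into timestamp and message pairs. The following Python sketch assumes the layout of the sample output (a kstat header line, a `timestamp message` heading, then one entry per line); `parse_dbgmsg` is a hypothetical helper.

```python
def parse_dbgmsg(text):
    """Return a list of (timestamp, message) tuples from dbgmsg output."""
    entries = []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not in_body:
            # Skip everything up to and including the column heading.
            in_body = line.split()[:2] == ["timestamp", "message"]
            continue
        stamp, _, message = line.partition(" ")
        if stamp.isdigit():
            entries.append((int(stamp), message.strip()))
    return entries
```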
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #3728
| |
This patch is based on the previous work done by @andrey-ve and
@yshui. It triggers the automount by using kern_path() to traverse
to the known snapshot mount point. Once the snapshot is mounted
NFS can access the contents of the snapshot.
Allowing NFS clients access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server. To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionality. System administrators desiring this
functionality must explicitly enable it.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2797
Closes #1655
Closes #616
|
|\
| | |
Performance improvements for zvols.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3720
|
| | |
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the generic_make_request()
layer is that we no longer call the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Signed-off-by: Richard Yao <[email protected]>
|
|/
| |
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #985
Closes #3351
| |
Allow for easy tuning of a pool's reserved free space. Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pool's capacity
in reserve. Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.
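The relationship between the shift and the reserved fraction is simple arithmetic, sketched here in Python. `reserved_bytes` is a hypothetical helper, and any minimum or maximum the real code may apply to the reservation is ignored.

```python
def reserved_bytes(pool_bytes, spa_slop_shift=5):
    """Reserve pool_bytes / 2**spa_slop_shift (1/32 for 5, 1/64 for 6)."""
    return pool_bytes >> spa_slop_shift
```

For a 64 GiB pool this gives 2 GiB at the new default and 1 GiB with `spa_slop_shift=6`.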
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3724
| |
Add a new keyword 'slot' to vdev_id.conf.
This selects where to get the slot number for a SAS/SATA disk.
It is needed to enable access to the physical position of a disk in a
Supermicro 2027R-AR24NV.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #3693
| |
This brings the behavior of arc_memory_throttle() back in sync with
illumos. The updated memory throttling policy roughly goes like this:
* Never throttle if more than 10% of memory is free. This threshold
is configurable with the zfs_arc_lotsfree_percent module option.
* Minimize any throttling of kswapd even when free memory is below
the set threshold. Allow it to write out pages as quickly as
possible to help alleviate the memory pressure.
* Delay all other threads when free memory is below the set threshold
in order to avoid compounding the memory pressure. Buffers will be
evicted from the ARC to reduce the issue.
The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.
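The policy above can be summarized as a small decision function. This Python sketch mirrors the three bullet points only; `throttle_action` and its return labels are hypothetical, and the real function returns error codes rather than strings.

```python
def throttle_action(free_bytes, total_bytes, is_kswapd, lotsfree_percent=10):
    """Classify how a caller should be treated under memory pressure."""
    threshold = total_bytes * lotsfree_percent // 100
    if free_bytes > threshold:
        return "none"      # plenty of free memory, never throttle
    if is_kswapd:
        return "minimal"   # let kswapd write out pages quickly
    return "delay"         # hold back other threads
```

With `lotsfree_percent=0` the threshold is zero, so any system with free memory is never throttled. This is why the separate disable option became redundant.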
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3637
| |
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages. This information should
be incorporated in to the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim. Conceptually
this brings arc_available_memory() back in sync with illumos.
It is also desirable that the target amount of free memory be tunable
on a system. While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.
zfs_arc_sys_free - The target number of bytes the ARC should leave
as free memory on the system. This value can
be checked in /proc/spl/kstat/zfs/arcstats and
setting this module option will override the
default value.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3637
| |
The zvol_threads module option should be bounded to a reasonable
range. The taskq must have at least 1 thread and shouldn't have
more than 1,024. The default value of 32 is a reasonable
default.
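The bounding itself amounts to a clamp, shown here as a Python sketch with a hypothetical helper name:

```python
def clamp_zvol_threads(requested, low=1, high=1024):
    """Force the requested thread count into the range [low, high]."""
    return max(low, min(requested, high))
```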
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3614
| |
Commit f521ce1 removed the minimum value for "arc_p" allowing it to
drop to zero or grow to "arc_c". This was done to improve a specific
workload which constantly dirties new "metadata" but also frequently
touches a "small" amount of mfu data (e.g. mkdir's).
This change may still be desirable but it needs to be re-investigated
in the context of the recent ARC changes from upstream. Therefore
this code is being restored to facilitate benchmarking. By setting
"zfs_arc_p_min_shift=64" we can easily compare the performance.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3533
| |
5764 "zfs send -nv" directs output to stderr
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Basil Crow <[email protected]>
Reviewed by: Steven Hartland <[email protected]>
Reviewed by: Bayard Bell <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://github.com/illumos/illumos-gate/commit/dc5f28a
https://www.illumos.org/issues/5764
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3585
| |
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Josef 'Jeff' Sipek <[email protected]>
Reviewed by: Xin LI <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
https://github.com/illumos/illumos-gate/commit/db1741f
https://www.illumos.org/issues/5661
Ported-by: kernelOfTruth [email protected]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3571
| |
A few minor mistakes that should be fixed:
zpool:
compatability -> compatibility
zfs:
accessable -> accessible
availible -> available
zfs-events:
availible -> available
zfs-module-parameters:
proceding -> proceeding
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3544
| |
This seems generally useful. metaslab_aliquot is the ZFS allocation
granularity, which is roughly equivalent to what is called the stripe
size in traditional RAID arrays. It seems relevant to performance
tuning.
Signed-off-by: Etienne Dechamps <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
| |
sysstat's iostat omits the first report when the -y option is used.
This patch adds that functionality and omits the first report with
statistics since system boot.
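The effect of the option can be sketched as skipping the first element of the report stream. This Python illustration uses a hypothetical `reports` helper and plain strings in place of real statistics.

```python
from itertools import islice


def reports(samples, omit_first=False):
    """Yield reports, optionally dropping the since-boot summary."""
    return islice(samples, 1, None) if omit_first else iter(samples)
```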
Signed-off-by: Hajo Möller <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3439
| |
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Dan McDonald <[email protected]>
Porting notes and other significant code changes:
The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.
Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.
The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.
Notable ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:
302f753 - Integrate ARC more tightly with Linux
39e055c - Adjust arc_p based on "bytes" in arc_shrink
f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
77765b5 - Remove "arc_meta_used" from arc_adjust calculation
94520ca - Prune metadata from ghost lists in arc_adjust_meta
Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:
fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }
The following arcstats have been removed:
recycle_miss - Used by arcstat.py and arc_summary.py, both of which
have been updated appropriately.
l2_writes_hdr_miss
The following arcstats have been added:
evict_not_enough - Number of times arc_evict_state() was unable to
evict enough buffers to reach its target amount.
evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
because it was being written to the l2arc.
l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times
l2arc_write_done() failed to acquire hash_lock (and re-tries).
arc_meta_min - Shows the value of the zfs_arc_meta_min module
parameter (see below).
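
A hedged sketch of how a script in the style of arcstat.py might pick up the new counters. It assumes the kstat text layout of two header lines followed by `name type data` rows. The sample text and all values in it are invented for illustration, and `parse_kstat` is a hypothetical helper, not the code used by arcstat.py.

```python
def parse_kstat(text: str) -> dict:
    """Parse 'name type data' rows into a dict of integers.

    Assumes the first two lines are headers and skips them.
    """
    stats = {}
    for line in text.splitlines()[2:]:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed rows
        name, _kind, data = parts
        stats[name] = int(data)
    return stats


# Invented sample; real values come from the running module.
SAMPLE = """\
13 1 0x01 91 4368 0 0
name                            type data
evict_not_enough                4    7
evict_l2_skip                   4    0
l2_writes_lock_retry            4    2
arc_meta_min                    4    16777216
"""

stats = parse_kstat(SAMPLE)
```

A consumer that still looks up `recycle_miss` or `l2_writes_hdr_miss` would need to tolerate their absence, for example with `stats.get(name, 0)`.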
The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme. Additional
multilist-related stats could be added in the future but would likely
require extensions to the multilist API.
The following module parameters have been added:
zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
before moving on to the next sub-list.
zfs_arc_meta_min - Enforce a floor on the amount of metadata in
the ARC.
zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
ARC state.
zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
the target size to be considered "overflowing".
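
To make the role of zfs_arc_overflow_shift concrete, here is a simplified Python model. It assumes the allowed excess is the target size shifted right by the parameter, which is one reading of the description above. The real check may apply additional bounds, and the shift value of 8 used in the example is chosen for illustration rather than taken from this commit.

```python
def arc_is_overflowing(arc_size: int, arc_c: int, overflow_shift: int) -> bool:
    """Simplified model: overflowing once size exceeds target by arc_c >> shift."""
    overflow = arc_c >> overflow_shift
    return arc_size >= arc_c + overflow


GIB = 1 << 30
MIB = 1 << 20

target = 1 * GIB   # hypothetical ARC target size (arc_c)
shift = 8          # illustrative value, an assumption
slack = target >> shift  # 4 MiB of permitted excess in this model

just_under = arc_is_overflowing(target + slack - 1, target, shift)
at_limit = arc_is_overflowing(target + slack, target, shift)
```

A larger shift shrinks the permitted excess, so the ARC is treated as overflowing sooner.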
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
* Add information about the 'zpool events' command in zpool(8).
* More events and payloads defined in zfs-events(5).
* I/O Stages and I/O Flags sections added.
* Remove unused legacy "zio_deadline" payload define.
Signed-off-by: Turbo Fredriksson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3467
|
|
|
|
|
|
|
|
|
| |
If the pool/dataset command-line argument is specified with a trailing
slash, for example, "tank/", it is interpreted as the root dataset.
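
The rule can be sketched as a small normalization step. This is a Python illustration with a hypothetical function name, not the C code of the patch. It handles only the case described above, a pool name followed by a single trailing slash, and leaves every other input untouched because the commit message does not say how those are treated.

```python
def normalize_dataset_arg(name: str) -> str:
    """Map 'pool/' to 'pool'; return any other argument unchanged."""
    if name.endswith("/"):
        stem = name[:-1]
        # Only the bare "pool/" form is covered by the description.
        if stem and "/" not in stem:
            return stem
    return name


root = normalize_dataset_arg("tank/")
child = normalize_dataset_arg("tank/home")
```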
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3415
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit updates the copyright boilerplate within the ZED subtree.
The instructions for appending a contributor copyright line have
been removed. Manually maintaining copyright notices in this
manner is error-prone, imprecise at a file-scope granularity, and
oftentimes inaccurate. These lines can become a pernicious source of
merge conflicts. A commit log is better suited to maintaining this
information. Consequently, a line has been added to the boilerplate
to refer to the git commit log for authoritative copyright attribution.
To account for the scenario where a file may become separated from
the codebase and commit history (i.e., it is copied somewhere else),
a line has been added to identify the file's origin.
http://softwarefreedom.org/resources/2012/ManagingCopyrightInformation.html
Signed-off-by: Chris Dunlap <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3384
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5027 zfs large block support
Reviewed by: Alek Pinchuk <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Josef 'Jeff' Sipek <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Saso Kiselkov <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/5027
https://github.com/illumos/illumos-gate/commit/b515258
Porting Notes:
* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.
* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes. Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.
* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize. This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.
* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX. This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M.
* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space. We should be able to relax
this once the ABD patches are merged.
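
The size limits in the notes above can be summarized in a short Python sketch. The constant and function names are illustrative, the 512-byte minimum is an assumption not stated in this commit, and `isp2` is a power-of-two test in the spirit of ISP2() that additionally rejects zero.

```python
MIN_BLOCKSIZE = 512                  # assumption: not stated in the notes above
MAX_ONDISK_BLOCKSIZE = 16 * 1024 * 1024  # 16M, largest size the on-disk format supports
DEFAULT_MAX_RECORDSIZE = 1024 * 1024     # 1M, default of zfs_max_recordsize
LIMIT_32BIT = 1024 * 1024                # 1M hard limit on 32-bit platforms


def isp2(value: int) -> bool:
    """True if value is a positive power of two."""
    return value > 0 and (value & (value - 1)) == 0


def recordsize_is_valid(size: int,
                        max_recordsize: int = DEFAULT_MAX_RECORDSIZE,
                        is_32bit: bool = False) -> bool:
    """Check a block size against the limits described in the porting notes."""
    limit = min(max_recordsize, MAX_ONDISK_BLOCKSIZE)
    if is_32bit:
        limit = min(limit, LIMIT_32BIT)
    return isp2(size) and MIN_BLOCKSIZE <= size <= limit


default_ok = recordsize_is_valid(128 * 1024)
two_m_default = recordsize_is_valid(2 * 1024 * 1024)
two_m_raised = recordsize_is_valid(2 * 1024 * 1024,
                                   max_recordsize=16 * 1024 * 1024)
```

In this model a 2M block is rejected under the default 1M limit and accepted once the limit is raised, while a 32-bit platform stays capped at 1M regardless of the module option.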
Ported-by: Brian Behlendorf <[email protected]>
Closes #354
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Support exporting all imported pools in one go, using 'zpool export -a'.
This is accomplished by moving the export parts from zpool_do_export()
into the new function zpool_export_one(). The for_each_pool() function
is used to enumerate the list of pools to be exported. Passing an argc
of 0 implies the function should be called on all pools.
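
The dispatch pattern can be sketched in Python as follows. This is a deliberately simplified model: the real for_each_pool() is a C function with a different signature, and the pool list, `export_one` callback, and return convention here are hypothetical. It only illustrates that an empty argument list means "visit every pool".

```python
from typing import Callable, List

# Hypothetical stand-in for the pools currently imported on the system.
IMPORTED_POOLS = ["tank", "backup"]


def for_each_pool(names: List[str], func: Callable[[str], int]) -> int:
    """Call func on each named pool, or on all pools when names is empty."""
    targets = names if names else list(IMPORTED_POOLS)
    ret = 0
    for name in targets:
        ret |= func(name)  # accumulate any non-zero status
    return ret


exported = []


def export_one(name: str) -> int:
    """Record the pool as exported; 0 signals success in this sketch."""
    exported.append(name)
    return 0


# Equivalent of 'zpool export -a': no names given, so every pool is visited.
status = for_each_pool([], export_one)
```

Keeping the per-pool work in a separate callback lets the same code path serve both a single named pool and the all-pools case.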
Signed-off-by: Turbo Fredriksson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes: #3203
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3897 zfs filesystem and snapshot limits
Author: Jerry Jelinek <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Christopher Siden <[email protected]>
References:
https://www.illumos.org/issues/3897
https://github.com/illumos/illumos-gate/commit/a2afb61
Porting Notes:
dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc().
Ported-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
5410 Document -S option to zfs inherit
5412 Mention -S option when zfs inherit fails on quota
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/5410
https://github.com/illumos/illumos-gate/commit/5ff8cfa9
Ported-by: DHE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3279
|
|
|
|
|
|
|
|
| |
The default was changed in #2820.
Signed-off-by: cburroughs <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3341
|
|
|
|
|
|
|
|
|
|
| |
Commit b1a3e93217e6e474e86345010469994c066cf875 accidentally
introduced an indentation error between the 'zfs receive'
and 'zfs allow' detailed documentation sections.
Signed-off-by: DHE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3312
|
|
|
|
|
|
|
|
| |
Add a basic summary to zfs(8) describing bookmarks.
Signed-off-by: Turbo Fredriksson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3268
|
|
|
|
|
|
|
| |
Commit b738bc5 should have updated the default value of zfs_pd_bytes_max
in the zfs(8) man page. The correct default value is 50*1024*1024.
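
For reference, the expression above works out as follows. This is a short Python check, and the constant name is only illustrative.

```python
# Default quoted above, written out as a byte count.
ZFS_PD_BYTES_MAX_DEFAULT = 50 * 1024 * 1024

in_bytes = ZFS_PD_BYTES_MAX_DEFAULT
in_mib = ZFS_PD_BYTES_MAX_DEFAULT // (1024 * 1024)
```

That is 52428800 bytes, or 50 MiB.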
Signed-off-by: Brian Behlendorf <[email protected]>
|