| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When building modules (as well as the kernel) with ARCH=um, the options
-Dsetjmp=kernel_setjmp and -Dlongjmp=kernel_longjmp are passed to the C
preprocessor for C files. This causes the setjmp and longjmp used in
module/lua/ldo.c to be kernel_setjmp and kernel_longjmp respectively in
the object file. However, the setjmp and longjmp that is intended to be
called is defined in an architecture dependent assembly file under the
directory module/lua/setjmp. Since it is an assembly and not a C file,
the preprocessor define is not given and the names do not change. This
becomes an issue when modpost is trying to create the Module.symvers
and sees no defined symbol for kernel_setjmp and kernel_longjmp. To fix
this, if the macro CONFIG_UML is defined, then setjmp and longjmp
macros are undefined.
When building with ARCH=um for x86 sub-architectures, CONFIG_X86 is not
defined. Instead, CONFIG_UML_X86 is defined. Despite this, the UML x86
sub-architecture can use the same object files as the x86 architectures
because the x86 sub-architecture UML kernel is running with the same
instruction set as CONFIG_X86. So the modules/Kbuild build file is
updated to add the same object files that CONFIG_X86 would add when
CONFIG_UML_X86 is defined.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Glenn Washburn <[email protected]>
Closes #13547
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Sponsored-by: Klara Inc.
Closes #12676
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows ZFS datasets to be delegated to a user/mount namespace
Within that namespace, only the delegated datasets are visible
Works very similarly to Zones/Jailes on other ZFS OSes
As a user:
```
$ unshare -Um
$ zfs list
no datasets available
$ echo $$
1234
```
As root:
```
# zfs list
NAME ZONED MOUNTPOINT
containers off /containers
containers/host off /containers/host
containers/host/child off /containers/host/child
containers/host/child/gchild off /containers/host/child/gchild
containers/unpriv on /unpriv
containers/unpriv/child on /unpriv/child
containers/unpriv/child/gchild on /unpriv/child/gchild
# zfs zone /proc/1234/ns/user containers/unpriv
```
Back to the user namespace:
```
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
containers 129M 47.8G 24K /containers
containers/unpriv 128M 47.8G 24K /unpriv
containers/unpriv/child 128M 47.8G 128M /unpriv/child
```
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Will Andrews <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Sponsored-by: Buddy <https://buddy.works>
Closes #12263
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When read and writing the UID/GID, we always want the value
relative to the root user namespace, the kernel will take care
of remapping this to the user namespace for us.
Calling from_kuid(user_ns, uid) with a unmapped uid will return -1
as that uid is outside of the scope of that namespace, and will result
in the files inside the namespace all being owned by 'nobody' and not
being allowed to call chmod or chown on them.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Closes #12263
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add support for the kernel's block multiqueue (blk-mq) interface in
the zvol block driver. blk-mq creates multiple request queues on
different CPUs rather than having a single request queue. This can
improve zvol performance with multithreaded reads/writes.
This implementation uses the blk-mq interfaces on 4.13 or newer
kernels. Building against older kernels will fall back to the
older BIO interfaces.
Note that you must set the `zvol_use_blk_mq` module param to
enable the blk-mq API. It is disabled by default.
In addition, this commit lets the zvol blk-mq layer process whole
`struct request` IOs at a time, rather than breaking them down
into their individual BIOs. This reduces dbuf lock contention
and overhead versus the legacy zvol submit_bio() codepath.
sequential dd to one zvol, 8k volblocksize, no O_DIRECT:
legacy submit_bio() 292MB/s write 453MB/s read
this commit 453MB/s write 885MB/s read
It also introduces a new `zvol_blk_mq_chunks_per_thread` module
parameter. This parameter represents how many volblocksize'd chunks
to process per each zvol thread. It can be used to tune your zvols
for better read vs write performance (higher values favor write,
lower favor read).
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ahelenia Ziemiańska <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #13148
Issue #12483
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <[email protected]>
Reviewed-by: Ahelenia Ziemiańska <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Co-authored-by: Rich Ercolani <[email protected]>
Closes #10058
Closes #12918
|
|
|
|
|
|
|
|
|
|
|
| |
As of the Linux 5.19 kernel the asm/fpu/internal.h header was
entirely removed. It has been effectively empty since the 5.16
kernel and provides no required functionality.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Attila Fülöp <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13529
|
|
|
|
|
|
|
|
|
|
|
|
| |
As of the Linux 5.19 kernel an identically named zap_flags_t typedef
is declared in the include/linux/mm_types.h linux header. Sadly,
the inclusion of this header cannot be easily avoided. To resolve
the conflict a #define is used to remap the name in the OpenZFS
sources when building against the Linux kernel.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13515
|
|
|
|
|
|
|
|
|
| |
As of the Linux 5.19 kernel the disk_*_io_acct() helper functions
have been replaced by the bdev_*_io_acct() functions.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13515
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 5.19 commit torvalds/linux@44abff2c0 removed the
blk_queue_secure_erase() helper function. The preferred
interface is to now use the bdev_max_secure_erase_sectors()
function to check for discard support.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13515
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 5.19 commit torvalds/linux@70200574cc removed the
blk_queue_discard() helper function. The preferred interface
is to now use the bdev_max_discard_sectors() function to check
for discard support.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13515
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In current zil_commit() process, transaction lwb_tx is assigned in
zil_lwb_write_issue(), and is committed in zil_lwb_flush_vdevs_done().
Thus, during lwb write out process, the txg is held in open or quiesing
state, until zil_lwb_flush_vdevs_done() is called. If the zil's zio
latency is high, it will cause txg_sync_thread() to starve.
The goal here is to defer waiting for zil_lwb_flush_vdevs_done to the
'syncing' txg state. That is, in zil_sync().
In this patch, it achieves the goal without holding transaction.
A new function zil_lwb_flush_wait_all() is introduced. It waits for
the completion of all the zil_lwb_flush_vdevs_done() by given txg.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Signed-off-by: jxdking <[email protected]>
Closes #12321
|
|
|
|
|
|
|
|
|
|
|
| |
This was breaking the kmod port build on FreeBSD with Clang 13.
Use the same trick as we do for ASSERT() to make DNODE_VERIFY() use
its parameter at compile time without actually using it at run time
in non-debug builds.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #13507
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Make prefetch distance adaptive: up to 4MB prefetch doubles for
every, hit same as before, but after that it grows by 1/8 every time
the prefetch read does not complete in time to satisfy the demand.
My tests show that 4MB is sufficient for wide NVMe pool to saturate
single reader thread at 2.5GB/s, while new 64MB maximum allows the
same thread to reach 1.5GB/s on wide HDD pool. Further distance
increase may increase speed even more, but less dramatic and with
higher latency.
- Allow early reuse of inactive prefetch streams: streams that never
saw hits can be reused immediately if there is a demand, while others
can be reused after 1s of inactivity, starting with the oldest. After
2s of inactivity streams are deleted to free resources same as before.
This allows by several times increase strided read performance on HDD
pool in presence of simultaneous random reads, previously filling the
zfetch_max_streams limit for seconds and so blocking most of prefetch.
- Always issue intermediate indirect block reads with SYNC priority.
Each of those reads if delayed for longer may delay up to 1024 other
block prefetches, that may be not good for wide pools.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #13452
|
|
|
|
|
|
|
|
| |
It turns out that wrapping the atomic macro in () breaks build
on Linux/SPARC64. Oops.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes #13506
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Original Log Size Limit implementation blocked all writes in case of
limit reached until the TXG is committed and the log is freed. It
caused huge delays and following speed spikes in application writes.
This implementation instead smoothly throttles writes, using exactly
the same mechanism as used for dirty data.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: jxdking <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Issue #12284
Closes #13476
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It turns out that "do LZ4 and zstd-1 both fail" is a great heuristic
for "don't even bother trying higher zstd tiers".
By way of illustration:
$ cat /incompress | mbuffer | zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_normal
summary: 39.8 GiByte in 3min 40.2sec - average of 185 MiB/s
$ echo 3 | sudo tee /sys/module/zzstd/parameters/zstd_lz4_pass
3
$ cat /incompress | mbuffer -m 4G | zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_patched
summary: 39.8 GiByte in 48.6sec - average of 839 MiB/s
$ sudo zfs list -p -o name,used,lused,ratio evenfaster/lowcomp_1M_zstd12_normal evenfaster/lowcomp_1M_zstd12_patched
NAME USED LUSED RATIO
evenfaster/lowcomp_1M_zstd12_normal 39549931520 42721221632 1.08
evenfaster/lowcomp_1M_zstd12_patched 39626399744 42721217536 1.07
$ python3 -c "print(39626399744 - 39549931520)"
76468224
$
I'll take 76 MB out of 42 GB for > 4x speedup.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Reviewed-by: Ahelenia Ziemiańska <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes #13244
|
|
|
|
|
|
|
|
|
| |
This also expands the zfs version output from 127 characters to However
Many Are Actually Set
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13330
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The short-path is now one access() call,
we always modprobe zfs (ZFS_MODULE_LOADING which doesn't use the libzfs
boolean parsing is gone),
and we use a simple inotify IN_CREATE loop with a timerfd timeout
rather than 10ms kernel-style polling
There's one substantial difference: ZFS_MODULE_TIMEOUT=-1
now means "never give up", rather than "wait 10 minutes"
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13330
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13447
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are times when end-users may wish to have
a fast and convenient method to get zpool guid
without having to use libzfs. This commit
exposes the zpool guid via kstats in similar
manner to the zpool state.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Andrew Walker <[email protected]>
Closes #13466
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13413
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13434
|
|
|
|
|
|
|
|
|
|
| |
With the additional benefit of removing all the _all() functions and
treating a NULL list as "all" ‒ the remaining all function is for all
/datasets/, which is consistent with the rest of the API
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13165
|
|
|
|
|
|
|
|
| |
Still descend, but only once: we get a lot of mileage out of nodist_
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13316
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Holding a dbuf is a common operation which can become highly contended
in dbuf_find() when acquiring the dbuf hash mutex. This is particularly
true on Linux when reading/writing volumes since by default up to 32
threads from the zvol_taskq may be taking a hold of the same dbuf.
This should also be observable on FreeBSD as long as there are enough
processes accessing the volume concurrently.
This is further aggregrated by the fact that only the block id will
be unique when calculating the dbuf hash for a single volume. The
objset id, object id, and level will be the same for data blocks.
This has been observed to result in a somehwat less than uniform hash
distribution and a longer than expected max hash chain depth (~20)
on a large memory system (256 GB) using volumes.
This commit improves the siutation by switching the hash mutex to
an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS
from 2048 to 8192 to further reduce the odds of a hash collision.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13405
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.
This commit deals with 2 cases:
- No page writeback is active. A WB_SYNC_ALL page writeback starts
and even completes. But when it's about to check if the PG_writeback
bit has been cleared, another writeback with WB_SYNC_NONE starts.
The sync page writeback ends up waiting for the non-sync page
writeback to complete.
- A page writeback with WB_SYNC_NONE is already active when a
WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
waiting for the WB_SYNC_NONE writeback.
The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes #12662
Closes #12790
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Prefer O_* flags over F* flags that mostly mirror O_* flags anyway,
but O_* flags seem to be preferred.
- Simplify the code as all the F*SYNC flags were defined as FFSYNC flag.
- Don't define FRSYNC flag, so we don't generate unnecessary ZIL commits.
- Remove EXCL define, FreeBSD ignores the excl argument for zfs_create()
anyway.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #13400
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previous flushing algorithm limited only total number of log blocks to
the minimum of 256K and 4x number of metaslabs in the pool. As result,
system with 1500 disks with 1000 metaslabs each, touching several new
metaslabs each TXG could grow spacemap log to huge size without much
benefits. We've observed one of such systems importing pool for about
45 minutes.
This patch improves the situation from five sides:
- By limiting maximum period for each metaslab to be flushed to 1000
TXGs, that effectively limits maximum number of per-TXG spacemap logs
to load to the same number.
- By making flushing more smooth via accounting number of metaslabs
that were touched after the last flush and actually need another flush,
not just ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making log spacemap load process for wide HDD pool CPU-bound,
accelerating it by many times.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
single-threaded by nature log processing time from ~10 to ~5 minutes.
As further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring new pool feature.
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #12789
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, determining which datasets are affected by corruption is
a manual process.
The primary difficulty in reporting the list of affected snapshots is
that since the error was initially found, the snapshot where the error
originally occurred in, may have been deleted. To solve this issue, we
add the ID of the head dataset of the original snapshot which the error
was detected in, to the stored error report. Then any time a filesystem
is deleted, the errors associated with it are deleted as well. Any time
a clone promote occurs, we modify reports associated with the original
head to refer to the new head. The stored error reports are identified
by this head ID, the birth time of the block which the error occurred
in, as well as some information about the error itself are also stored.
Once this information is stored, we can find the set of datasets
affected by an error by walking back the list of snapshots in the given
head until we find one with the appropriate birth txg, and then traverse
through the snapshots of the clone family, terminating a branch if the
block was replaced in a given snapshot. Then we report this information
back to libzfs, and to the zpool status command, where it is displayed
as follows:
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3
08:27:57 2021
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
sdb ONLINE 0 0 1.58K
errors: Permanent errors have been detected in the following files:
test@1:/test.0.0
/test/test.0.0
/test/1clone/test.0.0
A new feature flag is introduced to mark the presence of this change, as
well as promotion and backwards compatibility logic. This is an updated
version of #9175. Rebase required fixing the tests, updating the ABI of
libzfs, updating the man pages, fixing bugs, fixing the error returns,
and updating the old on-disk error logs to the new format when
activating the feature.
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Co-authored-by: TulsiJain <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #9175
Closes #12812
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originally it was thought it would be useful to split up the kmods
by functionality. This would allow external consumers to only load
what was needed. However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward. Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13274
|
|
|
|
|
|
|
|
|
|
|
|
| |
For legacy reasons, a couple of VOPs have to return error numbers that
don't come from the usual errno namespace. To handle the cases where
ZFS_ENTER or ZFS_VERIFY_ZP fail, we need to be able to override the
default error return value of EIO. Extend the macros to permit this.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes #13311
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Parts of the Linux kernel build system struggle with _Noreturn. This
results in the following warnings when building on RHEL 8.5, and likely
other environments. Switch to using the __attribute__((noreturn)).
warning: objtool: dbuf_free_range()+0x2b8:
return with modified stack frame
warning: objtool: dbuf_free_range()+0x0:
stack state mismatch: cfa1=7+40 cfa2=7+8
...
WARNING: EXPORT symbol "arc_buf_size" [zfs.ko] version generation
failed, symbol will not be versioned.
WARNING: EXPORT symbol "spa_open" [zfs.ko] version generation
failed, symbol will not be versioned.
...
Additionally, __thread_exit() has been renamed spl_thread_exit() and
made a static inline function. This was needed because the kernel
will generate a warning for symbols which are __attribute__((noreturn))
and then exported with EXPORT_SYMBOL.
While we could continue to use _Noreturn in user space I've also
switched it to __attribute__((noreturn)) purely for consistency
throughout the code base.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13238
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 3b52ccd introduced a flaw where FSR and FSAVE are not restored
when using a Linux 5.16 kernel. These instructions are only used when
XSAVE is not supported by the processor meaning only some systems will
encounter this issue.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Attila Fülöp <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13210
Closes #13236
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR changes ZFS ACL checks to evaluate
fsuid / fsgid rather than euid / egid to avoid
accidentally granting elevated permissions to
NFS clients.
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: Andrew Walker <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #13221
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #12996
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #12996
|
|
|
|
|
|
|
|
|
|
| |
bcopy() has a confusing argument order and is actually a move, not a
copy; they're all deprecated since POSIX.1-2001 and removed in -2008,
and we shim them out to mem*() on Linux anyway
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #12996
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #12996
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add physical device size/capacity only for physical devices in
'zpool list -v' instead of displaying "-" in the SIZE column.
This would make it easier to see the individual device capacity and
to determine which spares are large enough to replace which devices.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Dipak Ghosh <[email protected]>
Signed-off-by: Akash B <[email protected]>
Closes #12561
Closes #13106
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Cleanup the kernel SIMD code by removing kernel dependencies.
- Replace XSTATE_XSAVE with our own XSAVE implementation for all
kernels not exporting kernel_fpu{begin,end}(), see #13059
- Replace union fpregs_state by a uint8_t * buffer and get the size
of the buffer from the hardware via the CPUID instruction
- Replace kernels xgetbv() by our own implementation which was
already there for userspace.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Attila Fülöp <[email protected]>
Closes #13102
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
By introducing lzc_send_wrapper() and routing all ZFS_IOC_SEND*
users through it, we fix a Linux 5.10-introduced bug (see comment)
This is all /transparent/ to the users API, ABI, and usage-wise,
and disabled on FreeBSD and if the output is already a pipe,
and transparently nestable (i.e. zfs_send_one() is wrapped,
but so is lzc_send_redacted() it calls to ‒ this wouldn't be strictly
necessary if ZFS_IOC_SEND_PROGRESS wasn't strictly denominational w.r.t.
the descriptor the send is happening on)
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Co-authored-by: Rich Ercolani <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #11445
Closes #13133
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When unlinking multiple files from a pool at 100% capacity, it was
possible for ENOSPC to be returned after the first unlink. e.g.
rm -f /mnt/fs/test1.0.0 /mnt/fs/test1.1.0 /mnt/fs/test1.2.0
rm: cannot remove '/mnt/fs/test1.1.0': No space left on device
rm: cannot remove '/mnt/fs/test1.2.0': No space left on device
After waiting for the pending deferred frees from the first unlink to
be processed the remaining files can then be unlinked. This is caused
by the quota limit in dsl_dir_tempreserve_impl() being temporarily
decreased to the allocatable pool capacity less any deferred free
space.
This is resolved using the existing mechanism of returning ERESTART
when over quota as long as we know enough space will shortly be
available after processing the pending deferred frees.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13172
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZFS allows to update and retrieve additional file level attributes for
FreeBSD. This commit allows additional file level attributes to be
updated and retrieved for Linux. These include the flags stored in the
upper half of z_pflags only.
Two new IOCTLs have been added for this purpose. ZFS_IOC_GETDOSFLAGS
can be used to retrieve the attributes, while ZFS_IOC_SETDOSFLAGS can
be used to update the attributes.
Attributes that are allowed to be updated include ZFS_IMMUTABLE,
ZFS_APPENDONLY, ZFS_NOUNLINK, ZFS_ARCHIVE, ZFS_NODUMP, ZFS_SYSTEM,
ZFS_HIDDEN, ZFS_READONLY, ZFS_REPARSE, ZFS_OFFLINE and ZFS_SPARSE.
Flags can be or'd together while calling ZFS_IOC_SETDOSFLAGS.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Umer Saleem <[email protected]>
Closes #13118
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
aarch64 is a different architecture than arm. Some
compilers might choke when both __arm__ and __aarch64__
are defined.
This change separates the checks for arm and for
aarch64 in the isa_defs.h header files.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Windel Bouwman <[email protected]>
Closes #10335
Closes #13151
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A function that returns with no value is a different thing from a
function that doesn't return at all. Those are two orthogonal
concepts, commonly confused.
pthread_create(3) expects a pointer to a start routine that has a
very precise prototype:
void *(*start_routine)(void *);
However, other thread functions, such as kernel ones, expect:
void (*start_routine)(void *);
Providing a different one is incorrect, and has only been working
because the ABIs happen to produce a compatible function.
We should use '_Noreturn void', since it's the natural type, and
then provide a '_Noreturn void *' wrapper for pthread functions.
For consistency, replace most cases of __NORETURN or
__attribute__((noreturn)) by _Noreturn. _Noreturn is understood
by -std=gnu89, so it should be safe to use everywhere.
Ref: https://github.com/openzfs/zfs/pull/13110#discussion_r808450136
Ref: https://software.codidact.com/posts/285972
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: Ahelenia Ziemiańska <[email protected]>
Signed-off-by: Alejandro Colomar <[email protected]>
Closes #13120
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before:
nabijaczleweli@tarta:~/store/code/zfs$ /sbin/zpool list -Td -o name,size,alloc,free,ckpoint,expandsz,guid,load_guid,frag,cap,dedup,health,altroot,guid,dedupditto,load_guid,maxblocksize,maxdnodesize 2>/dev/null
Sun 20 Feb 03:57:44 CET 2022
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ GUID LOAD_GUID FRAG CAP DEDUP HEALTH ALTROOT GUID DEDUPDITTO LOAD_GUID MAXBLOCKSIZE MAXDNODESIZE
filling 25.5T 6.52T 18.9T - 64M 11512889483096932869 11656109927366648364 1% 25% 1.00x ONLINE - 11512889483096932869 0 11656109927366648364 1048576 16384
tarta-boot 240M 50.6M 189M - - 2372068846917849656 7752280792179633787 12% 21% 1.00x ONLINE - 2372068846917849656 0 7752280792179633787 1048576 512
tarta-zoot 55.5G 6.42G 49.1G - - 12971868889665384604 8622632123393589527 17% 11% 1.00x ONLINE - 12971868889665384604 0 8622632123393589527 1048576 16384
nabijaczleweli@tarta:~/store/code/zfs$ /sbin/zfs list -o name,guid,keyguid,ivsetguid,createtxg,objsetid,pbkdf2iters,refratio -r tarta-zoot
NAME GUID KEYGUID IVSETGUID CREATETXG OBJSETID PBKDF2ITERS REFRATIO
tarta-zoot 1110930838977259561 659P - 1 54 0 1.03x
tarta-zoot/PAGEFILE.SYS 2202570496672997800 3.20E - 2163 1539 0 1.07x
tarta-zoot/dupa 16941280502417785695 9.81E - 2274707 1322 1000000000000 1.00x
tarta-zoot/etc 17029963068508333530 12.9E - 3663 1087 0 1.52x
tarta-zoot/home 3508163802370032575 8.50E - 3664 294 0 1.00x
tarta-zoot/home/misio 7283672744014848555 13.0E - 3665 302 0 2.28x
tarta-zoot/home/nabijaczleweli 12286744508078616303 5.15E - 3666 200 0 2.05x
tarta-zoot/home/nabijaczleweli/tftp 13551632689932817643 5.16E - 3667 1095 0 1.00x
tarta-zoot/home/root 5203106193060067946 15.4E - 3668 698 0 2.86x
tarta-zoot/home/shared-config 8866040021005142194 14.5E - 3670 2069 0 1.20x
tarta-zoot/home/tymek 9472751824283011822 4.56E - 3671 1202 0 1.32x
tarta-zoot/oldboot 10460192444135730377 13.8E - 2268398 1232 0 1.01x
tarta-zoot/opt 9945621324983170410 5.84E - 3672 1210 0 1.00x
tarta-zoot/opt/icecc 13178238931846132425 9.04E - 3673 1103 0 2.83x
tarta-zoot/opt/swtpm 10172962421514870859 4.13E - 825669 145132 0 1.87x
tarta-zoot/srv 217179989022738337 3.90E - 3674 2469 0 1.00x
tarta-zoot/usr 12214213243060765090 15.0E - 3675 2477 0 2.58x
tarta-zoot/usr/local 7542700368693813134 941P - 3676 2484 0 2.33x
tarta-zoot/var 13414177124447929530 10.2E - 3677 2492 0 1.57x
tarta-zoot/var/lib 6969944550407159241 5.28E - 3678 2499 0 2.34x
tarta-zoot/var/tmp 6399468088048343912 1.34E - 3679 1218 0 3.95x
After:
nabijaczleweli@tarta:~/store/code/zfs$ cmd/zpool/zpool list -Td -o name,size,alloc,free,ckpoint,expandsz,guid,load_guid,frag,cap,dedup,health,altroot,guid,dedupditto,load_guid,maxblocksize,maxdnodesize 2>/dev/null
Sun 20 Feb 03:57:42 CET 2022
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ GUID LOAD_GUID FRAG CAP DEDUP HEALTH ALTROOT GUID DEDUPDITTO LOAD_GUID MAXBLOCKSIZE MAXDNODESIZE
filling 25.5T 6.52T 18.9T - 64M 11512889483096932869 11656109927366648364 1% 25% 1.00x ONLINE - 11512889483096932869 0 11656109927366648364 1M 16K
tarta-boot 240M 50.6M 189M - - 2372068846917849656 7752280792179633787 12% 21% 1.00x ONLINE - 2372068846917849656 0 7752280792179633787 1M 512
tarta-zoot 55.5G 6.42G 49.1G - - 12971868889665384604 8622632123393589527 17% 11% 1.00x ONLINE - 12971868889665384604 0 8622632123393589527 1M 16K
nabijaczleweli@tarta:~/store/code/zfs$ cmd/zfs/zfs list -o name,guid,keyguid,ivsetguid,createtxg,objsetid,pbkdf2iters,refratio -r tarta-zoot
NAME GUID KEYGUID IVSETGUID CREATETXG OBJSETID PBKDF2ITERS REFRATIO
tarta-zoot 1110930838977259561 741529699813639505 - 1 54 0 1.03x
tarta-zoot/PAGEFILE.SYS 2202570496672997800 3689529982640017884 - 2163 1539 0 1.07x
tarta-zoot/dupa 16941280502417785695 11312442953423259518 - 2274707 1322 1000000000000 1.00x
tarta-zoot/etc 17029963068508333530 14852574366795347233 - 3663 1087 0 1.52x
tarta-zoot/home 3508163802370032575 9802810070759776956 - 3664 294 0 1.00x
tarta-zoot/home/misio 7283672744014848555 14983161489316798151 - 3665 302 0 2.28x
tarta-zoot/home/nabijaczleweli 12286744508078616303 5937870537299886218 - 3666 200 0 2.05x
tarta-zoot/home/nabijaczleweli/tftp 13551632689932817643 5950522828900813054 - 3667 1095 0 1.00x
tarta-zoot/home/root 5203106193060067946 17718025091255443518 - 3668 698 0 2.86x
tarta-zoot/home/shared-config 8866040021005142194 16716354482778968577 - 3670 2069 0 1.20x
tarta-zoot/home/tymek 9472751824283011822 5251854710505749954 - 3671 1202 0 1.32x
tarta-zoot/oldboot 10460192444135730377 15894065034622168157 - 2268398 1232 0 1.01x
tarta-zoot/opt 9945621324983170410 6737735639539098405 - 3672 1210 0 1.00x
tarta-zoot/opt/icecc 13178238931846132425 10425145983015238428 - 3673 1103 0 2.83x
tarta-zoot/opt/swtpm 10172962421514870859 4764783754852521469 - 825669 145132 0 1.87x
tarta-zoot/srv 217179989022738337 4492810461439647259 - 3674 2469 0 1.00x
tarta-zoot/usr 12214213243060765090 17306702395865262834 - 3675 2477 0 2.58x
tarta-zoot/usr/local 7542700368693813134 1059954157997659784 - 3676 2484 0 2.33x
tarta-zoot/var 13414177124447929530 11764397504176937123 - 3677 2492 0 1.57x
tarta-zoot/var/lib 6969944550407159241 6084753728494937404 - 3678 2499 0 2.34x
tarta-zoot/var/tmp 6399468088048343912 1548692824635344277 - 3679 1218 0 3.95x
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #13122
Closes #13125
|
|
|
|
|
|
|
|
|
| |
A simple change, but so many tests break with it,
and those are the majority of this.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes #13078
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.
For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.
This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.
Implementation notes:
The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:
1. The feature must be enabled. Otherwise, we don't log the new
record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
the ZIL chain will activate the feature for the dataset. Likewise
for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.
We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.
We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.
One more piece on feature activation, since it's spread across
multiple functions:
zil_commit()
zil_process_commit_list()
if lwb == NULL // first zil_commit since zil_open
zil_create()
if no log block pointer in ZIL header:
if feature enabled and not active:
// CASE 1
enable, COALESCE txg wait with dmu_tx that allocated the
log block
else // log block was allocated earlier than this zil_open
if feature enabled and not active:
// CASE 2
enable, EXPLICIT txg wait
else // already have an in-DRAM LWB
if feature enabled and not active:
// this happens when we enable the feature after zil_create
// CASE 3
enable, EXPLICIT txg wait
[1] https://github.com/illumos/illumos-gate/commit/da6c28aaf62fa55f0fdb8004aa40f88f23bf53f0
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Christian Schwarz <[email protected]>
Reviewed-by: Ahelenia Ziemiańska <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jitendra Patidar <[email protected]>
Closes #8768
Closes #9078
|
|
|
|
|
|
|
|
|
| |
New `zfs_type_t` value `ZFS_TYPE_INVALID` is introduced.
Variable initialization is now possible to make GCC happy.
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: szubersk <[email protected]>
Closes #12167
Closes #13103
|