| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
Enable zio_dva_throttle_enabled=1 by default. Subsequent
testing has been unable to reproduce the suspected regression.
Tested-by: kernelOfTruth [email protected]
Reviewed-by: Olaf Faaland <[email protected]>
Signed-off-by: Brian Behlendorf [email protected]
Reverts #5335
Closes #5289
Closes #5457
|
|
|
|
|
|
|
|
| |
Save and reuse ddt dspace calculation when there have been no ddt changes.
This avoids unnecessary traversal of 168KiB of ddt histograms.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Closes #5425
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It was observed that even when the txg history is disabled by
setting `zfs_txg_history=0` the txg_sync thread still fetches
the vdev stats unnecessarily.
This patch refactors the code such that vdev_get_stats() is no
longer called when `zfs_txg_history=0`. And it further reduces
the differences between upstream and the ZoL txg_sync_thread()
function.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #5412
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dbuf_read() creates a zio_root() to track and wait for all the
zio's that may happen as part of this call. However, if the blkptr_t
for this buffer is NULL or a hole, we will not create any more zio's,
so this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.
The fix is to only create/destroy the zio_root() in dbuf_read() if
the blkptr is not NULL and not a hole.
Changes sponsored by Intel Corp.
Authored by: Matthew Ahrens <[email protected]>
Reviewed-by: Alex Zhuravlev <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Issue openzfs/openzfs#137
Closes #4803
Closes #5382
|
|
|
|
|
|
|
|
| |
This should be & and not | so is_metadata is set correctly.
Reviewed-by: Dan Kimmel <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: luozhengzheng <[email protected]>
Closes #5438
|
|
|
|
|
|
|
|
|
|
| |
It looks like this was functionality which was added in the
original SA implementation and then never needed. It can
be safely removed now and easily added back if we find a
use for it.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5440
|
|
|
|
|
|
|
|
|
| |
zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the
header files to contain each other. Get rid of the zio_impl.h include
in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5439
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In multiple cases zio_buf_alloc() was used instead of kmem_alloc()
or vmem_alloc(). This was often done because the allocations
could be large and it was easy to use zfs_buf_alloc() for them.
But this isn't ideal for allocations which are small or short
lived. In these cases it is better to use kmem_alloc() or
vmem_alloc(). If possible we want to avoid the case where
we have slabs allocated for kmem caches which are rarely used.
Note for small allocations vmem_alloc() will be internally
converted to kmem_alloc(). Therefore as long as large
allocations are infrequent and short lived the penalty for
using vmem_alloc() is small.
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #5409
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Convert ABD to use the Linux Kernel scatterlist implementation
instead of the hand rolled one from illumos.
* Scatter ABDs are preferentially populated with higher order
compound pages from a single zone. Allocation size is
progressively decreased until it can be satisfied without
performing reclaim or compaction.
* An alternate page allocator is provided for kernels older
than 3.6 and for CONFIG_HIGHMEM systems. This allocator
is designed as a fallback for maximum compatibility.
* Extended abdstats to provide visibility in the the allocator.
* Add cached value for PAGESIZE in userspace.
Contributions-by:
Chunwei Chen <[email protected]>
Gvozden Neskovic <[email protected]>
Jinshan Xiong <[email protected]>
Isaac Huang <[email protected]>
David Quigley <[email protected]>
Brian Behlendorf <[email protected]>
|
|
|
|
|
| |
Convert usage of kmap to kmap_atomic while correctly saving off
irq state.
|
|
|
|
|
|
| |
Port NEON implementation of RAID-Z functions to ABD.
Signed-off-by: Roomain Dolbeau <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement shift based multiplication for 512f. Higher IPC over lookup based
methods yields up to 40% better performance on the current hardware.
Results on Xeon Phi(TM) CPU 7210:
implementation gen_p gen_pq gen_pqr rec_p rec_q rec_r rec_pq rec_pr rec_qr rec_pqr
original 142232671 24411492 12948205 283053705 22348167 4215911 9171609 2265548 2378370 1648495
scalar 295711162 49851491 33253815 293198109 88179448 61866752 27941684 25764416 17384442 12138153
sse2 410055998 199642658 117973654 406240463 152688682 121092250 84968180 79291076 47473657 20779719
ssse3 411641595 199669571 117937647 406211024 137638508 117050346 81263322 76120405 46281559 32696722
avx2 616485806 311515332 188595628 605455115 260602390 230554476 148198817 138800254 92273356 62937819
avx512f 832191523 408509425 253599522 810094481 404325734 317590971 218235687 197204920 133101937 94001219
fastest avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f
Signed-off-by: Gvozden Neskovic <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Enable vectorized raidz code on ABD buffers. The avx512f,
avx512bw, neon and aarch64_neonx2 are disabled in this commit.
With the exception of avx512bw these implementations are
updated for ABD in the subsequent commits.
Signed-off-by: Gvozden Neskovic <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
* userspace: aligned buffers. Minimum of 32B alignment is
needed for AVX2. Kernel buffers are aligned 512B or more.
* add abd_get_offset_size() interface
* abd_iter_map(): fix calculation of iter_mapsize
* add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()
Signed-off-by: Gvozden Neskovic <[email protected]>
|
|
|
|
| |
Signed-off-by: Isaac Huang <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
Linux kernel commit 723c038475b78 removed this field.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: DHE <[email protected]>
Closes #5393
|
|
|
|
|
|
|
| |
CID 147503: Dereference after null check (FORWARD_NULL)
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: luozhengzheng <[email protected]>
Closes #5326
|
|
|
|
|
|
|
|
|
|
|
| |
CID 147540: unsigned_compare
- Cast nsec to a int32_t to properly detect the expected overflow.
CID 147542: unsigned_compare
- intval can never be less than ZIO_FAILURE_MODE_WAIT which is
defined to be zero. Remove this useless check.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5379
|
|
|
|
|
|
|
|
|
| |
It's used by Lustre to determine if the objset can be upgraded.
The inline version doesn't work because dmu_objset_is_snapshot()
is not exported.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jinshan Xiong <[email protected]>
Closes #5385
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come
from setxattr, which will handle by the acl xattr_handler, and we already
handles that well. However, nfsd will directly calls inode->set_acl or
return error if it doesn't exists.
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Massimo Maggi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #5371
Closes #5375
|
|
|
|
|
|
|
|
|
| |
CID 147626: Type:Dereference before null check
CID 147628: Type:Dereference before null check
Reviewed-by: Brian Behlendorf <[email protected]
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5304
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The phase 2 work primarily entails the Diagnosis Engine and
the Retire Agent modules. It also includes infrastructure
to support a crude FMD environment to host these modules.
The Diagnosis Engine consumes I/O and checksum ereports and
feeds them into a SERD engine which will generate a corres-
ponding fault diagnosis when the SERD engine fires. All the
diagnosis state data is collected into cases, one case per
vdev being tracked.
The Retire Agent responds to diagnosed faults by isolating
the faulty VDEV. It will notify the ZFS kernel module of
the new VDEV state (degraded or faulted). This agent is
also responsible for managing hot spares across pools.
When it encounters a device fault or a device removal it
replaces the device with an appropriate spare if available.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #5343
|
|
|
|
|
|
|
|
|
|
| |
CID 147575, Type:Unintentional integer overflow
CID 147577, Type:Unintentional integer overflow
CID 147578, Type:Unintentional integer overflow
CID 147579, Type:Unintentional integer overflow
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5365
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently every calls to zpl_posix_acl_release will schedule a delayed task,
and each delayed task will add a timer. This used to be fine except for
possibly bad performance impact.
However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In
this new implementation, the larger the delay, the less accuracy the timer is.
So when we have a flood of timer from zpl_posix_acl_release, they will expire
at the same time. Couple with the fact that task_expire will do linear search
with lock held. This causes an extreme amount of contention inside interrupt
and would actually lockup the system.
We fix this by doing batch free to prevent a flood of delayed task. Every call
to zpl_posix_acl_release will put the posix_acl to be freed on a lockless
list. Every batch window, 1 sec, the zpl_posix_acl_free will fire up and free
every posix_acl that passed the grace period on the list. This way, we only
have one delayed task every second.
[1] https://lwn.net/Articles/646950/
Signed-off-by: Chunwei Chen <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Only restrict the maximum zio alloc size to 32-bit kernel space.
The same virtual address space limitations don't apply to user
space. This resolves a memory allocation failure in raidz_test
where it expects to be able to exercises all valid zio sizes.
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on
supported filesystem. It's basically doing open(2) and unlink(2) atomically.
The filesystem support is added through i_op->tmpfile. We basically copy the
create operation except we get rid of the link and name related stuff and add
the new node to unlinked set.
We also add support for linkat(2) to link tmpfile. However, since all previous
file operation will skip ZIL, we force a txg_wait_synced to make sure we are
sync safe.
Signed-off-by: Chunwei Chen <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, doing things like fsetxattr(2) on an unlinked file will result in
ENODATA. There's two places that cause this: zfs_dirent_lock and zfs_zget.
The fix in zfs_dirent_lock is pretty straightforward. In zfs_zget though, we
need it to not return error when the zp is unlinked. This is a pretty big
change in behavior, but skimming through all the callers, I don't think this
change would cause any problem. Also there's nothing preventing z_unlinked
from being set after the z_lock mutex is dropped before but before zfs_zget
returns anyway.
The rest of the stuff is to make sure we don't log xattr stuff when owner is
unlinked.
Signed-off-by: Chunwei Chen <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.
avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Gvozden Neskovic <[email protected]>
Reviewed-by: Jinshan Xiong <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Romain Dolbeau <[email protected]>
Closes #5219
|
|
|
|
|
|
|
|
|
|
|
| |
On error dsl_prop_get_all_ds() does not free the nvlist it allocates.
This behavior may have been intentional when originally written
but is atypical and often confusing. Since no callers rely on this
behavior the function has been updated to always free the nvlist
on error.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: BearBabyLiu <[email protected]>
Closes #5320
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On 32-bit Linux systems use vmem_size() to correctly size the ARC
and better determine when IO should be throttle due to low memory.
On 64-bit systems this change has no effect since the virtual
address space available far exceeds the physical memory available.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #5347
|
|
|
|
|
|
|
|
|
|
|
| |
A limit of 1TB exists for zvols on 32-bit systems. Update the code
to correctly reflect this limitation in a similar manor as the
OpenZFS implementation.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #5347
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originally the .zfs/snapshot directory was disabled for 32-bit systems
because 64-bit inode numbers were not supported. This is no longer
the case and this functionality can be enabled by default.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #5347
Closes #2002
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add the TASKQID_INVALID macros and update callers to use the macro
instead of testing against 0. There is no functional change
even though the functions in zfs_ctldir.c incorrectly used -1
instead of 0.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #5347
|
|
|
|
|
|
|
|
| |
Replace magic value 16 with ARRAY_SIZE() to correctly handle
when the sa_legacy_attrs array size changes.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5354
|
|
|
|
|
|
|
| |
CID 147553: Type:Dereference null return value
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5305
|
|
|
|
|
|
|
| |
CID 152975: Type:Dereference null return value
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5322
|
|
|
|
|
|
|
|
| |
CID 147509: Explicit null dereferenced
- l2arc_sublist_lock is fragile as relied on caller too much.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: GeLiXin <[email protected]>
Closes #5319
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ubuntu added support for checking inode permissions to lookup_bdev() in kernel
commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21).
Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517
This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and
calls the function in the correct way.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Hajo Möller <[email protected]>
Closes #5336
|
|
|
|
|
|
|
|
|
|
| |
Until it can be determined definitively that a performance
regression wasn't introduced accidentally by 3dfb57a this
functionality is being disabled by default. It can be re-
enabled by setting zio_dva_throttle_enabled=1.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #5335
Issue #5289
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Fix autoreplace behaviour on statechange-led.sh script.
ZED sends the following events on an auto-replace:
1. statechange: Disk goes UNAVAIL->ONLINE
2. statechange: Disk goes ONLINE->UNAVAIL
3. vdev_attach: Disk goes ONLINE
Events 1-2 happen when ZED first attempts to do an auto-online. When that
fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.
In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE
transition to turn off the LED. It ignored the #2 ONLINE->UNAVAIL transition,
assuming it was just the "old" VDEV going offline. This is problematic, as
a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want
to ignore that.
This new patch correctly turns on the fault LED every time a drive becomes
UNAVAIL. It also monitors vdev_attach events to trigger turning off the LED
when an auto-replaced disk comes online.
- Remove unnecessary libdevmapper warning with --with-config=kernel
This fixes an unnecessary libdevmapper warning when building
--with-config=kernel. Kernel code does not use libdevmapper, so the warning
is not needed.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #2375
Closes #5312
Closes #5331
|
|
|
|
|
|
|
|
|
| |
This is caught by kmemleak when running compress_004_pos
Reviewed-by: Tim Chase <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #5244
Closes #5330
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When creating and destroying pools in tight loop it's possible to
exhaust the number of allowed threads on a system. This results
in taskq_create() failling and a NULL dereference.
Resolve the issue by falling back to opening the vdevs all
synchronously.
Reviewed-by: Denys Rtveliashvili <[email protected]>
Reviewed-by: Håkan Johansson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes zfsonlinux/spl#521
Closes #4637
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously when a drive faulted, the statechange-led.sh script would lookup
the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and
turn it on. During testing we noticed that if you pulled out a drive, or if
the drive was so badly broken that it no longer appeared to Linux, that the
/sys/block/sd* path would be removed, and the script could not lookup the
LED entry.
To fix this, this patch looks up the disks's more persistent
"/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then
passes that path to the statechange-led script to use, rather than having the
script look it up on the fly. This allows the script to turn on/off the slot
LEDs even when the drive is missing.
Closes #5309
Closes #2375
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The AVL tree compare function requires that either -1, 0, or 1 be
returned. However the strcmp() function only guarantees that a
negative, zero, or positive value is returned. Therefore, the
return value of strcmp() needs to be sanitized with AVL_ISIGN.
This was initially overlooked because the x86_64 implementation
of strcmp() happens to only returns the allowed values. This
was observed on an aarch64 platform which behaves correctly but
differently as described above.
Reviewed-by: Jinshan Xiong <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #5311
Closes #5313
|
|
|
|
|
|
|
|
| |
CID 153459: Null pointer dereferences (FORWARD_NULL)
Accidentally introduced by #5159.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: luozhengzheng <[email protected]>
Closes #5310
|
|
|
|
|
|
|
|
| |
CID 147551: Type:dereference null return value
CID 147552: Type:dereference null return value
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5279
|
|
|
|
|
|
|
| |
CID 147472: Type: 'Constant' variable guards dead code
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: cao.xuewen <[email protected]>
Closes #5288
|
|
|
|
|
|
|
|
|
|
|
|
| |
In torvalds/linux@31051c8 the inode_change_ok() function was
renamed setattr_prepare() and updated to take a dentry ratheri
than an inode. Update the code to call the setattr_prepare()
and add a wrapper function which call inode_change_ok() for
older kernels.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Requires-spl: refs/pull/581/head
|
|
|
|
|
|
|
|
| |
In Linux 4.9, torvalds/linux@fd50eca, iops->{set,get,remove}xattr and
generic_{set,get,remove}xattr are removed. xattr operations will directly
go through sb->s_xattr.
Signed-off-by: Chunwei Chen <[email protected]>
|