| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order to minimize any future disruption caused by the addition
and removal /dev/zfs ioctls this patch makes the following changes.
1) Sync ZoL's ioctl ordering such that it matches Illumos. For
historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID
ioctls were out of order.
2) Move Linux and FreeBSD specific ioctls in to their own reserved
ranges. This allows us to preserve the existing ordering when
new ioctls are added by either Illumos or FreeBSD. When an
ioctl is no longer needed it should be retired in place.
This change alters the ZFS user/kernel ABI so make sure you rebuild
both your user and kernel modules. However, it should allow for a
much stabler interface going forward.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #1973
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Early versions of ZFS coordinated the creation and destruction
of device minors from userspace. This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel. This significantly simplified the code.
However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux. This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel. Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes. This
will minimize, but not eliminate, the disruption to end users.
Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code. While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.
NOTES:
1) This patch introduces one subtle change in behavior which
could not be easily avoided. Prior to this change callers
of 'zfs create -V ...' were guaranteed that upon exit the
/dev/zvol/ block device link would be created or an error
returned. That's no longer the case. The utilities will no
longer block waiting for the symlink to be created. Callers
are now responsible for blocking, this is why a 'udev_wait'
call was added to the 'label' function in scripts/common.sh.
2) The read-only behavior of a ZVOL now solely depends on if
the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant
policy setting in the gendisk structure was removed. This
both simplifies the code and allows us to safely leverage
set_disk_ro() to issue a KOBJ_CHANGE uevent. See the
comment in the code for futher details on this.
3) Because __zvol_create_minor() and zvol_alloc() may now be
called in a sync task they must use KM_PUSHPAGE.
References:
illumos/illumos-gate@681d9761e8516a7dc5ab6589e2dfe717777e1123
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #1969
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4121 vdev_label_init should treat request as succeeded when pool
is read only
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Saso Kiselkov <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/4121
illumos/illumos-gate@973c78e94bf9634782164382c9e291bf81161fa5
Ported-by: Brian Behlendorf <[email protected]>
Closes #1863
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime. However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.
This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.
It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.
The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1949
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The DMU zfetch code organizes streams with lists not avl trees. A
avl_node_t was mistakenly used for a list_node_t in the zstream_t
type. This is incorrect (but harmless) and when unnoticed because:
1) The list functions explicitly cast the value preventing a warning,
2) sizeof(avl_node_t) >= sizeof(list_node_t) so no overrun occurs, and
3) The calculated offset is the same regardless of the type.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1946
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The MAX when initializing arc_c_max doesn't make any sense because
it hasn't been set anywhere before. Though, arc_c_max should be
implicitly set to zero when initializing arc_stats, so the MAX
doesn't make any difference.
The MAX was mistakenly left if place when the Illumos default
values were changed for Linux.
Signed-off-by: david.chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1941
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The bug is caused by multipath output like this:
35000c50056bd77a7 dm-15 HP,MB3000FCWDH
size=2.7T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:16:0 sdq 65:0 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 4:0:52:0 sdfp 130:176 active undef running
Note that the pipe symbols mean that the field numbering is different
between the sdq and sdfp lines. The fix edits out the pipe symbols.
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1692
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 6a7c0ccca44ad02c476a111d8f7911fc8b12fff7.
A proper fix for Issue #1648 was landed under Issue #1890, so this is no
longer needed.
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1648
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Under the right conditions sa_find_sizes() will compute an incorrect
size of the system attribute (SA) header. This causes a failed assertion
when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead
to corruption of SA data.
The bug presents itself when there are more than two variable-length SAs
of just the right size to fit in the bonus buffer of a dnode. The
existing logic fails to account for the SA header space needed to store
the sizes of all the variable-length SAs.
A reproducer was possible on Linux by setting the xattr=sa dataset
property and storing xattrs on symbolic links (Issue #1648). Note the
corrupt link target name:
$ zfs set xattr=sa tank/fish
$ cd /tank/fish
$ ln -fs 12345678901234567 link
$ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
$ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
$ ls -l link
lrwxrwxrwx 1 root root 17 Dec 6 15:40 link -> 90123456701234567
Commit 6a7c0ccca44ad02c476a111d8f7911fc8b12fff7 worked around this bug
by forcing xattr's on symlinks to be stored in directory format. This
change implements a proper fix, so the workaround can now be reverted.
The reference link below contains a reproducer for FreeBSD.
References:
http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html
Ported-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1890
|
|
|
|
|
|
|
|
|
|
| |
Allow verbose mounts to make is easier to monitor progress when
mounting a large number of filesystems.
This functionality is disabled by default.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1929
|
|
|
|
|
|
|
|
|
|
|
|
| |
Many people prefer to use by-id at import time instead of using
the cache file. This can be a much better solution than the cache
file in some environments so we're adding some infrastructure to
allow it.
This functionality is disabled by default.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1929
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When creating a dataset with ZoL a zsb->z_shares_dir ZAP object
will not be created because shares are unimplemented. Instead ZoL
just sets zsb->z_shares_dir to zero to indicate there are no shares.
However, if you import a pool which was created with a different
ZFS implementation then the shares ZAP object may exist. Code was
added to handle this case but it clearly wasn't sufficiently tested
with other ZFS pools.
There was a bug in the zpl_shares_getattr() function which passed
the wrong inode to zfs_getattr_fast() for the case where are shares
ZAP object does exist. This causes an EIO to be returned to stat64()
which in turn causes 'zfs diff' to fail.
This fix is the pass the correct inode after a sucessful zfs_zget().
Additionally, only put away the references if we were able to get one.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Graham Booker <https://github.com/gbooker>
Signed-off-by: timemaster67 <https://github.com/timemaster67>
Closes #1426
Closes #481
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use the standard Linux MODULE_VERSION macro to expose the installed
zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions.
This will also automatically add a checksum of the .c files and
headers in "srcversion". See:
/sys/module/zavl/version
/sys/module/zavl/srcversion
/sys/module/znvpair/version
/sys/module/znvpair/srcversion
/sys/module/zunicode/version
/sys/module/zunicode/srcversion
/sys/module/zcommon/version
/sys/module/zcommon/srcversion
/sys/module/zfs/version
/sys/module/zfs/srcversion
/sys/module/zpios/version
/sys/module/zpios/srcversion
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1923
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Ned Bass <[email protected]>
Reviewed by: Brendan Gregg <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1913
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.
Porting Notes:
zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.
Reviewed by: George Wilson <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Boris Protopopov <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/4347
illumos/illumos-gate@e722410c49fe67cbf0f639cbcc288bd6cbcf7dd1
Ported-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
| |
This broke compilation against Linux 3.13 and GCC 4.7.3.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1906
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As part of zfs_sb_teardown() there is an assertion that all inodes
which are part of the zsb->z_all_znodes list have at least one
reference on them. This is always true for the standard unmount
case but there are two other cases where it is not strictly true.
* zfs_ioc_rollback() - This is the most common case and it results
from the fact that we aren't unmounting the filesystem. During a
normal unmount the MS_ACTIVE flag will be cleared on the super block
causing iput_final() to evict the inode when its reference count
drops to zero. However, during a rollback MS_ACTIVE remains set
since we're rolling back a live filesystem and need to preserve the
existing super block. This allows inodes with a zero reference count
to stay in the cache thereby violating the assertion.
* destroy_inode() / zfs_sb_teardown() - There exists a small race
between dropping the last reference on an inode and removing it from
the zsb->z_all_znodes list. This is unlikely to occur but could also
trigger the assertion which is incorrect. The inode may safely have
a zero reference count in this case.
Since allowing a zero reference count on the inode is expected and
safe for both of these cases the simplest thing to do is remove the
ASSERT. This code is only enabled for default builds so removing
this entirely is a very safe change.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #1417
Closes #1536
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added:
Adata S396 (obtained from drive_id)
Apple MacBookAir3,1 SSD (obtained from drive_id)
Apple MacBookPro10,1 SSD (obtained from drive_id)
Intel 510 (obtained from drive_id)
Intel 710 (obtained from drive_id)
Intel DC S3500 (obtained from drive_id)
Netapp LUN (obtained from illumos user's sd.conf)
OCZ Agility 3 (obtained from drive_id)
OCZ Vertex (obtained from drive_id)
Samsung PM800 (obtained from drive_id)
Sandisk U100 (obtained from drive_id)
Sun Comstar (obtained from illumos user's sd.conf)
Notes:
1. The entries for the Intel DC S3500 were extrapolated from the 800GB
model's entry, which is "ATA INTEL SSDSC2BB80".
2. The entires for the Intel 710 were extrapolated from the 120GG
model's entry, which is "ATA INTEL SSDSA2BZ12".
3. The entires for the Intel 510 were extrapolated from the 250GB
model's entry, which is "ATA INTEL SSDSC2MH25".
4. The entires for the Apple MacBookPro10,1 SSD were extrapolated from
the 512GB model's entry, which is "ATA APPLE SSD SM512E". Google
searches suggest that this is a rebadged Samsung 830.
5. The entires for the Apple MacBookAir3,1 SSD were extrapolated from
the 128GB model's entry, which is "ATA APPLE SSD TS128C". Google
searches suggest that this is a rebadged Kingston SSDNow V+ 100 (based
on Toshiba).
6. Sun Comstar is an iSCSI Target, so we cannot tell what the correct
sector size is through this method. We list it only for reference
purposes, but it is commented out. Similarly, it is not clear what the
right thing to do for Netapp is, so we comment it out.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1907
|
|
|
|
|
|
|
|
|
|
| |
This should hopefully catch the rest of the allocations in the
user hold/release processing that were missed by commit
65c67ea86e9f112177f1ad32de8e780f10798a64.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1852
Closes #1855
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, using msync() results in the following code path:
sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage
In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.
This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.
The solution implemented here is composed of two parts:
- I added a new callback system to the ZIL, which allows the caller to
be notified when its ITX gets written to stable storage. One nice
thing is that the callback is called not only in zil_commit() but
in zil_sync() as well, which means that the caller doesn't have to
care whether the write ended up in the ZIL or the DMU: it will get
notified as soon as it's safe, period. This is an improvement over
dmu_tx_callback_register() that was used previously, which only
supports DMU writes. The rationale for this change is to allow
zpl_putpage() to be notified when a ZIL commit is completed without
having to block on zil_commit() itself.
- zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
from issuing ZIL commits. zpl_writepages() will issue the commit
itself instead of relying on zpl_putpage() to do it, thus nicely
batching the writes. Note, however, that we still have to call
write_cache_pages() again in SYNC mode because there is an edge case
documented in the implementation of write_cache_pages() whereas it
will not give us all dirty pages when running in non-SYNC mode. Thus
we need to run it at least once in SYNC mode to make sure we honor
persistency guarantees. This only happens when the pages are
modified at the same time msync() is running, which should be rare.
In most cases there won't be any additional pages and this second
call will do nothing.
Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1849
Closes #907
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Version 2.2.0.3-20 of dkms in the EPEL/Fedora repositories added the
necessary patches to support ZoL, Therefore, the zfs-dkms requirement
on dkms is set to match that version or higher. This allows us to
drop the custom dkms build in the ZoL EPEL/Fedora repositories.
References:
https://bugzilla.redhat.com/show_bug.cgi?id=1023598
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1873
|
|
|
|
|
|
|
|
|
|
|
|
| |
2583 Add -p (parsable) option to zfs list
References:
https://www.illumos.org/issues/2583
illumos/illumos-gate@43d68d68c1ce08fb35026bebfb141af422e7082e
Ported-by: Gregor Kopka <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes: #937
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Because ZFS bypasses the page cache we don't inherit per-task I/O
accounting for free. However, the Linux kernel does provide helper
functions allow us to perform our own accounting. These are most
commonly used for direct IO which also bypasses the page cache, but
they can be used for the common read/write call paths as well.
Signed-off-by: Pavel Snajdr <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #313
Closes #1275
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a first draft of a zfs-module-parameters(5) man page. I have
just extracted the parameter name and its description with modinfo,
then checked the source what type it is and its default value.
This will need more work, preferably someone that actually know these
values and what to use them for.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1856
|
|
|
|
|
|
|
|
|
|
|
|
| |
On some platforms symbols provided by libzfs_core and used by
libzfs were not available to the linker. To avoid this issue
libzfs_core has been added to the list of required libraries
when building utilities which depend on libzfs. This should
have been handled properly by libtool and it's still not
entirely clear why it wasn't on all platforms.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1841
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4322 ZFS deadlock on dp_config_rwlock
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Ilya Usvyatsky <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/4322
illumos/illumos-gate@c50d56f667f119d78fa3d94d6bef2c298ba556f6
Ported by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1886
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's a missing semicolon and equals sign in the first hunk of this
commit in config/kernel-bdi.m4. This results in the test always
failing. The effects were noticed when rrdtool, a tool which modifies
files by mmap() and msync(), would have data never get saved to disk
in spite of the files working while the mounted filesystem remains
mounted.
Signed-off-by: DHE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #1889
|
|
|
|
|
|
|
|
| |
Under Linux this restriction does not apply because we have access
to all the required devices.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1631
|
|
|
|
|
|
|
|
|
|
| |
Make zfs depend on the same version of zfs-kmod, rather than on same or
better. When yum repository contains a number of versions the dependency
resolution breaks on trying to install non-latest version.
Signed-off-by: Cyril Plisko <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1677
|
|
|
|
|
|
|
|
| |
Properly initialize SELinux xattrs for all inode types. The
initial implementation accidentally only did this for files.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1832
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During pool import stack overflows may still occur due to the
potentially deep recursion of traverse_visitbp(). This is most
likely to occur when additional layers are added to the block
device stack such as DM multipath. To minimize the stack usage
for this call path the following changes were made:
1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions
to prevent them from being inlined by gcc. This reduced the
stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and
vdev_mirror_io_start() from 144 to 128 bytes.
2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved
from the stack to the heap. This reduced the stack usage of
zfsdev_ioctl() from 368 to 112 bytes.
3) The major saving came from slimming down traverse_visitbp() from
from 224 to 144 bytes. Since this function is called recursively
the 80 bytes saved per invokation adds up. The following changes
were made:
a) The 'hard' local variable was replaced by a TD_HARD() macro.
b) The 'pd' local variable was replaced by 'td->td_pfd' references.
c) The zbookmark_t was moved to the heap. This does cost us an
additional memory allocation per recursion by that cost should
still be minimal. The cost could be further reduced by adding
a dedicated zbookmark_t slab cache.
d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were
restructured to use the minimum amount of stack. This includes
removing the 'cbp' local variable.
Overall for the offending use case roughly 1584 of total stack space
has been saved. This is enough to avoid overflowing the stack on
stock kernels with 8k stacks. See #1778 for additional details.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #1778
|
|
|
|
|
|
|
|
|
|
| |
Commit 95fd54a1c5b93bb2aa3e7dffc28c784b1e21a8bb restructured the
hold/release processing and moved some of the work into the sync task.
A number of nvlist allocations now need to use KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1852
Closes #1855
|
|
|
|
|
|
|
|
|
|
|
| |
The Illumos #3875 patch reverted a part of ZoL's 7b3e34b which added
special-case error handling for zfs_rezget(). The error handling dealt
with the case in which an all-ones object number ended up being passed
to dnode_hold() and causing an EINVAL to be returned from zfs_rezget().
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1859
Closes #1861
|
|
|
|
|
|
|
|
| |
Future proofing for compatibility with newer versions of Python.
Signed-off-by: Matthew Thode <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1838
|
|
|
|
|
|
|
|
|
|
|
| |
Update the code to follow the pep8 style guide.
References:
http://www.python.org/dev/peps/pep-0008/
Signed-off-by: Matthew Thode <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1838
|
|
|
|
|
|
|
|
|
|
| |
Commit 0cee240 from FreeBSD dramatically speeds up 'zfs list'
performance assuming you're only interested in the dataset
names. This optimization should be mentioned in the man page
to allow end users to take advantage of it.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1847
|
|
|
|
|
|
|
|
|
|
| |
The previous pattern could accidentally match on things like
'real_root=ZFS=node02-zp00/ROOT/rootfs' due to the 'ZFS=no'
substring.
Signed-off-by: Matthew Thode <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1837
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently. Only one of the mounts will
succeed and the other will fail. The failed mounts will cause an EREMOTE
to be propagated back to the application.
This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY. The zfs code detects this condition and
treats it as if the mount had succeeded.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1819
|
|
|
|
|
| |
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1839
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The required Posix ACL interfaces are only available for kernels
with CONFIG_FS_POSIX_ACL defined. Therefore, only enable Posix
ACL support for these kernels. All major distribution kernels
enable CONFIG_FS_POSIX_ACL by default.
If your kernel does not support Posix ACLs the following warning
will be printed at ZFS module load time.
"ZFS: Posix ACLs disabled by kernel"
Signed-off-by: Massimo Maggi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1825
|
|
|
|
|
|
|
|
|
| |
References:
delphix/delphix-os@931c8aaab74b6412933d299890894262e2ef8380
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1650
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A couple of kmem_alloc() allocations were using KM_SLEEP in
the sync thread context. These were accidentally introduced
by the recent set of Illumos patches. The solution is to
switch to KM_PUSHPAGE.
dsl_dataset_promote_sync() -> promote_hold() -> snaplist_make() ->
kmem_alloc(sizeof (*snap), KM_SLEEP);
dsl_dataset_user_hold_sync() -> dsl_onexit_hold_cleanup() ->
kmem_alloc(sizeof (*ca), KM_SLEEP)
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
| |
3995 Memory leak of compressed buffers in l2arc_write_done
References:
https://illumos.org/issues/3995
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1688
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/4168
https://www.illumos.org/issues/4169
https://www.illumos.org/issues/4170
illumos/illumos-gate@7fdd916c474ea52896c671bbe7b56ba34a1ca132
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/4082
illumos/illumos-gate@5253393b09789ec67bec153b866d7285a1cf1645
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/3954
https://www.illumos.org/issues/4080
https://www.illumos.org/issues/4081
illumos/illumos-gate@22e30981d82a0b6dc89253596ededafae8655e00
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4046 dsl_dataset_t ds_dir->dd_lock is highly contended
Reviewed by: Eric Schrock <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
References:
https://www.illumos.org/issues/4046
illumos/illumos-gate@b62969f868a827f0823a084bc0af9c7d8b76c659
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting notes:
1. This commit removed dsl_dataset_namelen in Illumos, but that
appears to have been removed from ZFSOnLinux in an earlier commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4061 libzfs: memory leak in iter_dependents_cb()
Reviewed by: Jeffry Molanus <[email protected]>
Reviewed by: Boris Protopopov <[email protected]>
Reviewed by: Andy Stormont <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/4061
illumos/illumos-gate@2fbdf8dbf01ec1c85fcd3827cdf9e9f5f46c4c8a
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Dan McDonald <[email protected]>
References:
https://www.illumos.org/issues/4047
illumos/illumos-gate@713d6c208802cfbb806329ec0d154b641b80c355
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
Porting notes:
1. The exported symbol dmu_free_object() was renamed to
dmu_free_long_object() in Illumos.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3996 want a libzfs_core API to rollback to latest snapshot
Reviewed by: Christopher Siden <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Andy Stormont <[email protected]>
Approved by: Richard Lowe <[email protected]>
References:
https://www.illumos.org/issues/3996
illumos/illumos-gate@a7027df17fad220a20367b9d1eb251bc6300d203
Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1775
|