summaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Illumos 6844 - dnode_next_offset can detect fictional holesAlex Reece2016-04-271-5/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 6844 dnode_next_offset can detect fictional holes Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> dnode_next_offset is used in a variety of places to iterate over the holes or allocated blocks in a dnode. It operates under the premise that it can iterate over the blockpointers of a dnode in open context while holding only the dn_struct_rwlock as reader. Unfortunately, this premise does not hold. When we create the zio for a dbuf, we pass in the actual block pointer in the indirect block above that dbuf. When we later zero the bp in zio_write_compress, we are directly modifying the bp. The state of the bp is now inconsistent from the perspective of dnode_next_offset: the bp will appear to be a hole until zio_dva_allocate finally finishes filling it in. In the meantime, dnode_next_offset can detect a hole in the dnode when none exists. I was able to experimentally demonstrate this behavior with the following setup: 1. Create a file with 1 million dbufs. 2. Create a thread that randomly dirties L2 blocks by writing to the first L0 block under them. 3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle of a file. 4. Do dnode_next_offset in a loop until we skip over such a non-existent hole. The fix is to ensure that it is valid to iterate over the indirect blocks in a dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP and updating the actual BP in dbuf_write_ready while holding the lock. References: https://www.illumos.org/issues/6844 https://github.com/openzfs/openzfs/pull/82 DLPX-35372 Ported-by: Brian Behlendorf <[email protected]> Closes #4548
* Illumos 6659 - nvlist_free(NULL) is a no-opJosef 'Jeff' Sipek2016-04-273-7/+3
| | | | | | | | | | | | | | | 6659 nvlist_free(NULL) is a no-op Reviewed by: Toomas Soome <[email protected]> Reviewed by: Marcel Telka <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/6659 https://github.com/illumos/illumos-gate/commit/aab83bb Ported-by: David Quigley <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4566
* Use udev for partition detectionBrian Behlendorf2016-04-251-5/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends on factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the udev rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more likely that all the device nodes will exist when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was updated to include a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked. However, if the rare event it is needed it will prevent a failure. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes #4523 Closes #3708 Closes #4077 Closes #4144 Closes #4214 Closes #4517
* Linux 4.5 compat: Use xattr_handler->name for aclChunwei Chen2016-04-251-20/+44
| | | | | | | | | | | | | | Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to whole name rather than prefix should use "name" instead of "prefix". Otherwise, kernel will return with EINVAL when it tries to resolve handlers. Also, we remove the strcmp checks when xattr_handler has name, because xattr_resolve_name will do the check for us. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4549 Closes #4537
* Add pn_alloc()/pn_free() functionsBrian Behlendorf2016-04-216-30/+104
| | | | | | | | | | | | | | In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and pn_free() functions must be implemented. The existing illumos implementation were used for this purpose. The `flags` argument which was used in places wrapped by the HAVE_PN_UTILS condition has beed added back to zfs_remove() and zfs_link() functions. This removes a small point of divergence between the ZoL code and upstream. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4522
* Fix ZPL miswrite of default POSIX ACLNed Bass2016-04-181-1/+1
| | | | | | | | | | | | Commit 4967a3e introduced a typo that caused the ZPL to store the intended default ACL as an access ACL. Due to caching this problem may not become visible until the filesystem is remounted or the inode is evicted from the cache. Fix the typo and add a regression test. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4520
* Fix inverted logic on none elevator comparisonColin Ian King2016-04-151-1/+1
| | | | | | | | | | Commit d1d7e2689db9e03f1 ("cstyle: Resolve C style issues") inverted the logic on the none elevator comparison. Fix this and make it cstyle warning clean. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4507
* Enable lazytime semantic for atimeChunwei Chen2016-04-053-72/+110
| | | | | | | | | | | | | | | | | | | | | | Linux 4.0 introduces lazytime. The idea is that when we update the atime, we delay writing it to disk for as long as it is reasonably possible. When lazytime is enabled, dirty_inode will be called with only I_DIRTY_TIME flag whenever i_atime is updated. So under such condition, we will set z_atime_dirty. We will only write it to disk if file is closed, inode is evicted or setattr is called. Ideally, we should also write it whenever SA is going to be updated, but it is left for future improvement. There's one thing that we should take care of now that we allow i_atime to be dirty. In original implementation, whenever SA is modified, zfs_inode_update will be called to overwrite every thing in inode. This will cause dirty i_atime to be discarded. We fix this by don't overwrite i_atime in zfs_inode_update. We only overwrite i_atime when allocating new inode or doing zfs_rezget with zfs_inode_update_new. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4482
* Fix atime handling and relatimeChunwei Chen2016-04-057-97/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem for atime: We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its handling is a mess. A huge part of mess regarding atime comes from zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave inconsistently with those three values. zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you don't pass ATTR_ATIME. Which means every write(2) operation which only updates ctime and mtime will cause atime changes to not be written to disk. Also zfs_inode_update from write(2) will replace inode->i_atime with what's inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2). You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0. Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new), SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll leave with a stale atime. The problem for relatime: We do have a relatime config inside ZFS dataset, but how it should interact with the mount flag MS_RELATIME is not well defined. It seems it wanted relatime mount option to override the dataset config by showing it as temporary in `zfs get`. But at the same time, `zfs set relatime=on|off` would also seems to want to override the mount option. Not to mention that MS_RELATIME flag is actually never passed into ZFS, so it never really worked. How Linux handles atime: The Linux kernel actually handles atime completely in VFS, except for writing it to disk. So if we remove the atime handling in ZFS, things would just work, no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever VFS updates the i_atime, it will notify the underlying filesystem via sb->dirty_inode(). And also there's one thing to note about atime flags like MS_RELATIME and other flags like MS_NODEV, etc. They are mount point flags rather than filesystem(sb) flags. Since native linux filesystem can be mounted at multiple places at the same time, they can all have different atime settings. So these flags are never passed down to filesystem drivers. What this patch tries to do: We remove znode->z_atime, since we won't gain anything from it. We remove most of the atime handling and leave it to VFS. The only thing we do with atime is to write it when dirty_inode() or setattr() is called. We also add file_accessed() in zpl_read() since it's not provided in vfs_read(). After this patch, only the MS_RELATIME flag will have effect. The setting in dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME set according to the setting in dataset in future patch. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4482
* Linux 4.6 compat: PAGE_CACHE_SIZE removalBrian Behlendorf2016-04-052-24/+23
| | | | | | | | | | | | | | | | | As described in torvalds/linux@4a2d057e the macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced to make it possible to add bigger chunks to the page cache. This never panned out and it has therefore been removed from the kernel. ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros and calls to page_cache_release() have been replaced with put_page(). There was no need to introduce a configure check for this because these interfaces have existed for a very long time. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4489
* Add support 32 bit FS_IOC32_{GET|SET}FLAGS compat ioctlsColin Ian King2016-03-311-1/+14
| | | | | | | | | | | We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS compat ioctls for systems such as powerpc64. We use the normal compat ioctl idiom as used by a variety of file systems to provide this support. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4477
* gcc build error: -Wbool-compare in metaslab.cDHE2016-03-301-1/+2
| | | | | | | | | | | When debugging is enabled on a very recent version of gcc (tested with 5.3.0), DVA_SET_GANG(dva, !!(flags)) fails because an assertion causes a comparison between what is technically a boolean and an integer. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4465
* Fix zpool_scrub_* test casesBrian Behlendorf2016-03-301-0/+4
| | | | | | | | | | | | | The zpool_scrub_002, zpool_scrub_003, zpool_scrub_004 test cases fail reliably when running against small pools or fast storage. This occurs because the scrub/resilver operation completes before subsequent commands can be run. A one second delay has been added to 10% of zio's in order to ensure the scrub/resilver operation will run for at least several seconds. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4450
* Illumos 6681 - zfs list burning lots of time in dodefault() via dsl_prop_*Alex Wilson2016-03-151-7/+6
| | | | | | | | | | | | | | | | 6681 zfs list burning lots of time in dodefault() via dsl_prop_* Reviewed by: Patrick Mooney <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Matthew Ahrens <[email protected]> References: https://www.illumos.org/issues/6681 https://github.com/illumos/illumos-gate/commit/d09e447 Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #4406
* Illumos 6370 - ZFS send fails to transmit some holesPaul Dagnelie2016-03-102-14/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 6370 ZFS send fails to transmit some holes Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Chris Williamson <[email protected]> Reviewed by: Stefan Ring <[email protected]> Reviewed by: Steven Burgess <[email protected]> Reviewed by: Arne Jansen <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/6370 https://github.com/illumos/illumos-gate/commit/286ef71 In certain circumstances, "zfs send -i" (incremental send) can produce a stream which will result in incorrect sparse file contents on the target. The problem manifests as regions of the received file that should be sparse (and read a zero-filled) actually contain data from a file that was deleted (and which happened to share this file's object ID). Note: this can happen only with filesystems (not zvols, because they do not free (and thus can not reuse) object IDs). Note: This can happen only if, since the incremental source (FromSnap), a file was deleted and then another file was created, and the new file is sparse (i.e. has areas that were never written to and should be implicitly zero-filled). We suspect that this was introduced by 4370 (applies only if hole_birth feature is enabled), and made worse by 5243 (applies if hole_birth feature is disabled, and we never send any holes). The bug is caused by the hole birth feature. When an object is deleted and replaced, all the holes in the object have birth time zero. However, zfs send cannot tell that the holes are new since the file was replaced, so it doesn't send them in an incremental. As a result, you can end up with invalid data when you receive incremental send streams. As a short-term fix, we can always send holes with birth time 0 (unless it's a zvol or a dataset where we can guarantee that no objects have been reused). Ported-by: Steven Burgess <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4369 Closes #4050
* Fix lock order inversion with zvol_open()Boris Protopopov2016-03-101-43/+20
| | | | | | | | | | | | | | | zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed when zvol minor operations are performed in a separate context with no prior locking state; the spa_namespace_lock is no longer held when bdev->bd_mutex or zfs_state_lock might be taken in the code paths originating from the zvol minor operation callbacks. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3681
* Add support for asynchronous zvol minor operationsBoris Protopopov2016-03-108-202/+436
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of. * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched. * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another. * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed. Support for asynchronous zvol minor operations address issue #3681. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2217 Closes #3678 Closes #3681
* Change KM_SLEEP to TQ_SLEEP in spa_deadman()Tim Chase2016-03-091-1/+1
| | | | | | | | | | Since they both evaluate to zero, this is a semi-cosmetic change but the latter is the proper value to use as an argument to taskq_dispatch_delay(). Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4393
* Make zvol update volsize operation synchronous.ab-oe2016-02-291-0/+4
| | | | | | | | | | | | | There is a race condition when new transaction group is added to dp->dp_dirty_datasets list by the zap_update in the zvol_update_volsize. Meanwhile, before these dirty data are synchronized, the receive process can cause that dmu_recv_end_sync is executed. Then finally dirty data are going to be synchronized but the synchronization ends with the NULL pointer dereference error. Signed-off-by: ab-oe <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4116
* FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and ↵smh2016-02-262-111/+240
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | locality information. The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. The new algorithm takes into the following additional factors: * Load of the vdevs (number outstanding I/O requests) * The locality of last queued I/O vs the new I/O request. Within the locality calculation additional knowledge about the underlying vdev is considered such as; is the device backing the vdev a rotating media device. This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices. The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parrallel dd's. With pre-fetch disabled (vfs.zfs.prefetch_disable=1): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s With pre-fetch enabled (vfs.zfs.prefetch_disable=0): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdevs loads are only calculated from the set of valid candidates. The following additional sysctls where added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc These changes where based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487 Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay References: https://github.com/freebsd/freebsd@5c7a6f5d https://github.com/freebsd/freebsd@31b7f68d https://github.com/freebsd/freebsd@e186f564 Performance Testing: https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141 Porting notes: - The tunables were adjusted to have ZoL-style names. - The code was modified to use ZoL's vd_nonrot. - Fixes were done to make cstyle.pl happy - Merge conflicts were handled manually - freebsd/freebsd@e186f564bc946f82c76e0b34c2f0370ed9aea022 by my collegue Andriy Gapon has been included. It applied perfectly, but added a cstyle regression. - This replaces 556011dbec2d10579819078559a77630fc559112 entirely. - A typo "IO'a" has been corrected to say "IO's" - Descriptions of new tunables were added to man/man5/zfs-module-parameters.5. Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4334
* Add l2arc_max_block_size tunableBrian Behlendorf2016-02-251-1/+33
| | | | | | | | | | | Set a limit for the largest compressed block which can be written to an L2ARC device. By default this limit is set to 16M so there is no change in behavior. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Elling <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #4323
* Make zvol minor functionality more robustBoris Protopopov2016-02-241-16/+77
| | | | | | | | | | | | | | | | | | | | | | | | Close the race window in zvol_open() to prevent removal of zvol_state in the 'first open' code path. Move the call to check_disk_change() under zvol_state_lock to make sure the zvol_media_changed() and zvol_revalidate_disk() called by check_disk_change() are invoked with positive zv_open_count. Skip opened zvols when removing minors and set private_data to NULL for zvols that are not in use whose minors are being removed, to indicate to zvol_open() that the state is gone. Skip opened zvols when renaming minors to avoid modifying zv_name that might be in use, e.g. in zvol_ioctl(). Drop zvol_state_lock before calling add_disk() when creating minors to avoid deadlocks with zvol_open(). Wrap dmu_objset_find() with spl_fstran_mark()/unmark(). Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #4344
* Call dmu_read_uio_dbuf() in zvol_read()Richard Yao2016-02-181-1/+1
| | | | | | | | | | | | | | | | | The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is that the former takes a hold while the latter uses an existing hold. `zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no apparent reason, we inherited a `zvol_read()` function from OpenSolaris that does `dmu_read_uio()`. illumos-gate also still uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`, which is more performant. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4316
* Clean up zvol request processing to pass uio and fix porting regressionsRichard Yao2016-02-181-76/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | In illumos-gate, `zvol_read` and `zvol_write` are both passed uio_t rather than bio_t. Since we are translating from bio to uio for both, we might as well unify the logic and have code more similar to its illumos counterpart. At the same time, we can fix some regressions that occurred versus the original code from illumos-gate. We refactor zvol_write to take uio and also correct the following problems: 1. We did `dnode_hold()` on each IO when we already had a hold. 2. We would attempt to send writes that exceeded `DMU_MAX_ACCESS` to the DMU. 3. We could call `zil_commit()` twice. In this case, this is because Linux uses the `->write` function to send flushes and can aggregate the flush with a write. If a synchronous write occurred with the flush, we effectively flushed twice when there is no need to do that. zvol_read also suffers from the first two problems. Other platforms suffer from the first, so we leave that for a second patch so that there is a discrete patch for them to cherry-pick. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4316
* Remove wrong ASSERT in annotate_ecksumChunwei Chen2016-02-171-2/+2
| | | | | | | | | | When using large blocks like 1M, there will be more than UINT16_MAX qwords in one block, so this ASSERT would go off. Also, it is possible for the histogram to overflow. We cap them to UINT16_MAX to prevent this. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4257
* Illumos 5809 - Blowaway full receive in v1 pool causes kernel panicPaul Dagnelie2016-02-081-1/+2
| | | | | | | | | | | | | | 5809 Blowaway full receive in v1 pool causes kernel panic Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Will Andrews <[email protected]> Approved by: Gordon Ross <[email protected]> References: https://www.illumos.org/issues/5809 https://github.com/illumos/illumos-gate/commit/f40b29c Ported-by: Brian Behlendorf <[email protected]>
* Illumos 6537 - Panic on zpool scrub with DEBUG kernelGary Mills2016-02-051-2/+4
| | | | | | | | | | | | | | | 6537 Panic on zpool scrub with DEBUG kernel Reviewed by: Steve Gonczi <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Matthew Ahrens <[email protected]> References: https://www.illumos.org/issues/6537 https://github.com/illumos/illumos-gate/commit/8c04a1f Ported-by: Brian Behlendorf <[email protected]>
* Illumos 6096 - ZFS_SMB_ACL_RENAME needs to cleanup betterDan McDonald2016-02-051-0/+1
| | | | | | | | | | | | | | 6096 ZFS_SMB_ACL_RENAME needs to cleanup better Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Gordon Ross <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/6096 https://github.com/illumos/illumos-gate/commit/8f5190a5 Ported-by: Brian Behlendorf <[email protected]>
* Illumos 6450 - scrub/resilver unnecessarily traverses snapshotsMatthew Ahrens2016-02-021-4/+50
| | | | | | | | | | | | | | | | 6450 scrub/resilver unnecessarily traverses snapshots created after the scrub started Reviewed by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/6450 https://github.com/illumos/illumos-gate/commit/38d6103 Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]>
* Handling negative dentries in a CI file system.Richard Sharpe2016-02-021-2/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a Case Insensitive file system we must avoid creating negative entries in the dentry cache. We must also pass the FIGNORECASE into zfs_lookup so that special files are handled correctly. We must also prevent negative dentries from being created when files are unlinked. Tested by running fsstress from LTP (10 loops, 10 processes, 10,000 ops.) Also tested with printks (now removed) to ensure that lookups come to zpl_lookup when negative should not exist. Tests: 1. ls Some-file.txt; touch some-file.txt; ls Some-file.txt and ensure no errors. 2. touch Some-file.txt; rm some-file.txt; ls Some-file.txt and ensure that the last ls shows log messages showing the lookup went all the way to zpl_lookup. Thanks to tuxoko for helping me get this correct. Signed-off-by: Richard Sharpe <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4243
* Illumos 6527 - Possible access beyond end of string in zpool commentJorgen Lundman2016-01-281-1/+0
| | | | | | | | | | | | | | | 6527 Possible access beyond end of string in zpool comment Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Gordon Ross <[email protected]> References: https://www.illumos.org/issues/6527 https://github.com/illumos/illumos-gate/commit/2bd7a8d Ported-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]>
* Illumos 6495 - Fix mutex leak in dmu_objset_find_dpBrian Behlendorf2016-01-281-0/+3
| | | | | | | | | | | | | | 6495 Fix mutex leak in dmu_objset_find_dp Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Albert Lee <[email protected]> References: https://www.illumos.org/issues/6495 https://github.com/illumos/illumos-gate/commit/2bad225 Ported-by: Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]>
* Illumos 6414 - vdev_config_sync could be simplerBrian Behlendorf2016-01-282-14/+15
| | | | | | | | | | | | | | 6414 vdev_config_sync could be simpler Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/6414 https://github.com/illumos/illumos-gate/commit/eb5bb58 Ported-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]>
* llumos 6334 - Cannot unlink files when over quotaSimon Klinkert2016-01-261-5/+2
| | | | | | | | | | | | | 6334 Cannot unlink files when over quota Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Toomas Soome <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/6334 https://github.com/illumos/illumos-gate/commit/6575bca Ported-by: Brian Behlendorf <[email protected]>
* Reintroduce zfs_remove() synchronous deleteskernelOfTruth2016-01-261-12/+66
| | | | | | | | | | | | | | | | | | | | | | | | Reintroduce a slightly adapted version of the Illumos logic for synchronous unlinks. The basic idea here is that only files smaller than zfs_delete_blocks (20480) blocks should be deleted synchronously. Unlinking larger files should be handled asynchronously to minimize impact to the caller. To accomplish this iput() which is responsible for calling zfs_znode_delete() on Linux is only called in the delete_now path. Otherwise zfs_async_iput() is used which allows the last reference to be dropped by a taskq thread effectively making the removal asynchronous. Porting notes: - Add zfs_delete_blocks module option for performance analysis. The default value is DMU_MAX_DELETEBLKCNT which is the same as upstream. Reducing this value means that smaller files will be unlinked asynchronously like large files. - All occurrences of zfsvfs changes to zsb. Ported-by: KernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]>
* Log zvol truncate/discard operationsDan McDonald2016-01-261-8/+59
| | | | | | | | | | | | | | As the comments in zvol_discard() suggested, the discard operation could be logged to the zil. This is a port of the relevant code from Nexenta as it was added in "701 UNMAP support for COMSTAR" and has been attributed to the author of that commit. References: https://github.com/Nexenta/illumos-nexenta/commit/b77b923 https://github.com/zfsonlinux/zfs/blob/089fa91b/module/zfs/zvol.c#L637 Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 6251 - add tunable to disable free_bpobj processingGeorge Wilson2016-01-251-2/+11
| | | | | | | | | | | | | | | | | | | | | 6251 - add tunable to disable free_bpobj processing Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Simon Klinkert <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Albert Lee <[email protected]> Reviewed by: Xin Li <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/6251 https://github.com/illumos/illumos-gate/commit/139510f Porting notes: - Added as module option declaration. - Added to zfs-module-parameters.5 man page. Ported-by: Signed-off-by: Brian Behlendorf <[email protected]>
* Set arc_c_min properly in userland buildsTim Chase2016-01-251-4/+4
| | | | | | | | | | Since it's set to arc_c_max / 2, it must be set after arc_c_max is set. Also added protection against it falling below 2 * maxblocksize in userland builds. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4268
* Prevent arc_c collapseTim Chase2016-01-251-8/+8
| | | | | | | | | | | | | | | Adjusting arc_c directly is racy because it can happen in the context of multiple threads. It should always be >= 2 * maxblocksize. Set it to a known valid value rather than adjusting it directly. In addition refactor arc_shrink() to a simpler structure, protect against underflow in the calculation of the new arc_c value. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Reverts: 935434ef Closes: #3904 Closes: #4161
* Illumos 4950 - files sometimes can't be removed from a full filesystemMatthew Ahrens2016-01-216-7/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4950 files sometimes can't be removed from a full filesystem Reviewed by: Adam Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Boris Protopopov <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4950 https://github.com/illumos/illumos-gate/commit/4bb7380 Porting notes: - ZoL currently does not log discards to zvols, so the portion of this patch that modifies the discard logging to mark it as freeing space has been discarded. 2. may_delete_now had been removed from zfs_remove() in ZoL. It has been reintroduced. 3. We do not try to emulate vnodes, so the following lines are not valid on Linux: mutex_enter(&vp->v_lock); may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp); mutex_exit(&vp->v_lock); This has been replaced with: mutex_enter(&zp->z_lock); may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped); mutex_exit(&zp->z_lock); Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Close possible zfs_znode_held() raceBrian Behlendorf2016-01-201-4/+3
| | | | | | | | | | | | Check if the lock is held while holding the z_hold_locks() lock. This prevents a possible use-after-free bug for callers which are not holding the lock. There currently are no such callers so this can't cause a problem today but it has been fixed regardless. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4244 Issue #4124
* Linux 4.5 compat: xattr list handlerBrian Behlendorf2016-01-201-197/+255
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The registered xattr .list handler was simplified in the 4.5 kernel to only perform a permission check. Given a dentry for the file it must return a boolean indicating if the name is visible. This differs slightly from the previous APIs which also required the function to copy the name in to the provided list and return its size. That is now all the responsibility of the caller. This should be straight forward change to make to ZoL since we've always required the caller to make the copy. However, this was slightly complicated by the need to support 3 older APIs. Yes, between 2.6.32 and 4.5 there are 4 versions of this interface! Therefore, while the functional change in this patch is small it includes significant cleanup to make the code understandable and maintainable. These changes include: - Improved configure checks for .list, .get, and .set interfaces. - Interfaces checked from newest to oldest. - Strict checking for each possible known interface. - Configure fails when no known interface is available. - HAVE_*_XATTR_LIST renamed HAVE_XATTR_LIST_* for consistency with similar iops and fops configure checks. - POSIX_ACL_XATTR_{DEFAULT|ACCESS} were removed forcing callers to move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT|ACCESS}. Compatibility wrapper were added for old kernels. - ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing ZPL_XATTR_{GET|SET} WRAPPERs. Only the inode is guaranteed to be a valid pointer, passing NULL for the 'list' and 'name' variables is allowed and must be checked for. All .list functions were updated to use the wrapper to aid readability. - zpl_xattr_filldir() updated to use the .list function for its permission check which is consistent with the updated Linux 4.5 interface. If a .list function is registered it should return 0 to indicate a name should be skipped, if there is no registered function the name will be added. - Additional documentation from xattr(7) describing the correct behavior for each namespace was added before the relevant handlers. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Issue #4228
* Linux 4.5 compat: get_link() / put_link()Brian Behlendorf2016-01-201-33/+92
| | | | | | | | | | | | | | | | | | | | The follow_link() interface was retired in favor of get_link(). In the process of phasing in get_link() the Linux kernel went through two different versions. The first of which depended on put_link() and the final version on a delayed done function. - Improved configure checks for .follow_link, .get_link, .put_link. - Interfaces checked from newest to oldest. - Strict checking for each possible known interface. - Configure fails when no known interface is available. - Both versions .get_link are detected and supported as well two previous versions of .follow_link. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Issue #4228
* Illumos 5045 - use atomic_{inc,dec}_* instead of atomic_add_*Josef 'Jeff' Sipek2016-01-156-71/+68
| | | | | | | | | | | | | | | | | | 5045 use atomic_{inc,dec}_* instead of atomic_add_* Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5045 https://github.com/illumos/illumos-gate/commit/1a5e258 Porting notes: - All changes to non-ZFS files dropped. - Changes to zfs_vfsops.c dropped because they were Illumos specific. Ported-by: Brian Behlendorf <[email protected]> Closes #4220
* Illumos 4039 - zfs_rename()/zfs_link() needs stronger test for XDEVMarcel Telka2016-01-151-5/+14
| | | | | | | | | | | | | | | | | | | | | 4039 zfs_rename()/zfs_link() needs stronger test for XDEV Reviewed by: Gordon Ross <[email protected]> Reviewed by: Kevin Crowe <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4039 https://github.com/illumos/illumos-gate/commit/18e6497 Porting notes: - This check was updated in Linux in a similar fashion early on in the port. Therefore, this patch just reorders the function and updates the comment so it flows the same way as the upstream code. Ported-by: Brian Behlendorf <[email protected]> Closes #4218
* Illumos 3557, 3558, 3559, 3560George Wilson2016-01-151-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | 3557 dumpvp_size is not updated correctly when a dump zvol's size is changed 3558 setting the volsize on a dump device does not return back ENOSPC 3559 setting a volsize larger than the space available sometimes succeeds 3560 dumpadm should be able to remove a dump device Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Albert Lee <[email protected]> References: https://www.illumos.org/issues/3559 https://github.com/illumos/illumos-gate/commit/c61ea56 Porting notes: - Internal zvol.c changes not applied due to implementation differences. The external interface and behavior was already consistent with the latest upstream code. - Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check. All supported kernels (2.6.32 and newer) provide this interface. Ported-by: Brian Behlendorf <[email protected]> Closes #4217
* Prevent duplicated xattr between SA and dirChunwei Chen2016-01-151-5/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When replacing an xattr would cause overflowing in SA, we would fallback to xattr dir. However, current implementation don't clear the one in SA, so we would end up with duplicated SA. For example, running the following script on an xattr=sa filesystem would cause duplicated "user.1". -- dup_xattr.sh begin -- randbase64() { dd if=/dev/urandom bs=1 count=$1 2>/dev/null | openssl enc -a -A } file=$1 touch $file setfattr -h -n user.1 -v `randbase64 5000` $file setfattr -h -n user.2 -v `randbase64 20000` $file setfattr -h -n user.3 -v `randbase64 20000` $file setfattr -h -n user.1 -v `randbase64 20000` $file getfattr -m. -d $file -- dup_xattr.sh end -- Also, when a filesystem is switch from xattr=sa to xattr=on, it will never modify those in SA. This would cause strange behavior like, you cannot delete an xattr, or setxattr would cause duplicate and the result would not match when you getxattr. For example, the following shell sequence. -- shell begin -- $ sudo zfs set xattr=sa pp/fs0 $ touch zzz $ setfattr -n user.test -v asdf zzz $ sudo zfs set xattr=on pp/fs0 $ setfattr -x user.test zzz setfattr: zzz: No such attribute $ getfattr -d zzz user.test="asdf" $ setfattr -n user.test -v zxcv zzz $ getfattr -d zzz user.test="asdf" user.test="asdf" -- shell end -- We fix this behavior, by first finding where the xattr resides before setxattr. Then, after we successfully updated the xattr in one location, we will clear the other location. Note that, because update and clear are not in single tx, we could still end up with duplicated xattr. But by doing setxattr again, it can be fixed. Signed-off-by: Chunwei Chen <[email protected]> Closes #3472 Closes #4153
* Remove fastwrite mutexRichard Yao2016-01-151-9/+0
| | | | | | | | | | | | | | | | The fast write mutex is intended to protect accounting, but it is redundant because all accounting is performed through atomic operations. It also serializes all metaslab IO behind a mutex, which introduces a theoretical scaling regression that the Illumos developers did not like when we showed this to them. Removing it makes the selection of the metaslab_group lock free as it is on Illumos. The selection is not quite the same without the lock because the loop races with IO completions, but any imbalances caused by this are likely to be corrected by subsequent metaslab group selections. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3643
* Fix zsb->z_hold_mtx deadlockBrian Behlendorf2016-01-152-47/+205
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to serialize access to a znode and its SA buffer while the object is being created or destroyed. This kind of locking would normally reside in the znode itself but in this case that's impossible because the znode and SA buffer may not yet exist. Therefore the locking is handled externally with an array of mutexs and AVLs trees which contain per-object locks. In zfs_znode_hold_enter() a per-object lock is created as needed, inserted in to the correct AVL tree and finally the per-object lock is held. In zfs_znode_hold_exit() the process is reversed. The per-object lock is released, removed from the AVL tree and destroyed if there are no waiters. This scheme has two important properties: 1) No memory allocations are performed while holding one of the z_hold_locks. This ensures evict(), which can be called from direct memory reclaim, will never block waiting on a z_hold_locks which just happens to have hashed to the same index. 2) All locks used to serialize access to an object are per-object and never shared. This minimizes lock contention without creating a large number of dedicated locks. On the downside it does require znode_lock_t structures to be frequently allocated and freed. However, because these are backed by a kmem cache and very short lived this cost is minimal. Signed-off-by: Brian Behlendorf <[email protected]> Issue #4106
* Add zfs_object_mutex_size module optionBrian Behlendorf2016-01-152-9/+17
| | | | | | | | | | | Add a zfs_object_mutex_size module option to facilitate resizing the the per-dataset znode mutex array. Increasing this value may help make the deadlock described in #4106 less common, but this is not a proper fix. This patch is primarily to aid debugging and analysis. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Issue #4106