aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Enable lazytime semantic for atimeChunwei Chen2016-04-053-72/+110
| | | | | | | | | | | | | | | | | | | | | | Linux 4.0 introduces lazytime. The idea is that when we update the atime, we delay writing it to disk for as long as it is reasonably possible. When lazytime is enabled, dirty_inode will be called with only I_DIRTY_TIME flag whenever i_atime is updated. So under such condition, we will set z_atime_dirty. We will only write it to disk if file is closed, inode is evicted or setattr is called. Ideally, we should also write it whenever SA is going to be updated, but it is left for future improvement. There's one thing that we should take care of now that we allow i_atime to be dirty. In original implementation, whenever SA is modified, zfs_inode_update will be called to overwrite every thing in inode. This will cause dirty i_atime to be discarded. We fix this by don't overwrite i_atime in zfs_inode_update. We only overwrite i_atime when allocating new inode or doing zfs_rezget with zfs_inode_update_new. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4482
* Fix atime handling and relatimeChunwei Chen2016-04-059-108/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem for atime: We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its handling is a mess. A huge part of mess regarding atime comes from zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave inconsistently with those three values. zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you don't pass ATTR_ATIME. Which means every write(2) operation which only updates ctime and mtime will cause atime changes to not be written to disk. Also zfs_inode_update from write(2) will replace inode->i_atime with what's inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2). You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0. Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new), SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll leave with a stale atime. The problem for relatime: We do have a relatime config inside ZFS dataset, but how it should interact with the mount flag MS_RELATIME is not well defined. It seems it wanted relatime mount option to override the dataset config by showing it as temporary in `zfs get`. But at the same time, `zfs set relatime=on|off` would also seems to want to override the mount option. Not to mention that MS_RELATIME flag is actually never passed into ZFS, so it never really worked. How Linux handles atime: The Linux kernel actually handles atime completely in VFS, except for writing it to disk. So if we remove the atime handling in ZFS, things would just work, no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever VFS updates the i_atime, it will notify the underlying filesystem via sb->dirty_inode(). And also there's one thing to note about atime flags like MS_RELATIME and other flags like MS_NODEV, etc. They are mount point flags rather than filesystem(sb) flags. Since native linux filesystem can be mounted at multiple places at the same time, they can all have different atime settings. So these flags are never passed down to filesystem drivers. What this patch tries to do: We remove znode->z_atime, since we won't gain anything from it. We remove most of the atime handling and leave it to VFS. The only thing we do with atime is to write it when dirty_inode() or setattr() is called. We also add file_accessed() in zpl_read() since it's not provided in vfs_read(). After this patch, only the MS_RELATIME flag will have effect. The setting in dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME set according to the setting in dataset in future patch. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4482
* Linux 4.6 compat: PAGE_CACHE_SIZE removalBrian Behlendorf2016-04-052-24/+23
| | | | | | | | | | | | | | | | | As described in torvalds/linux@4a2d057e the macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced to make it possible to add bigger chunks to the page cache. This never panned out and it has therefore been removed from the kernel. ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros and calls to page_cache_release() have been replaced with put_page(). There was no need to introduce a configure check for this because these interfaces have existed for a very long time. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4489
* Fix WANT_DEVNAME2DEVID configure errorBrian Behlendorf2016-04-012-6/+7
| | | | | | | | | | Accidentally introduced by commit e4023e4. The AM_CONDITIONAL cannot be located where it can be invoked conditionally, as in the `--with-config=user` case. Relocate it to the top level ZFS_AC_CONFIG macro along with the other AM_CONDITIONALs. Signed-off-by: Brian Behlendorf <[email protected]> Issue #4416
* Add support 32 bit FS_IOC32_{GET|SET}FLAGS compat ioctlsColin Ian King2016-03-311-1/+14
| | | | | | | | | | | We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS compat ioctls for systems such as powerpc64. We use the normal compat ioctl idiom as used by a variety of file systems to provide this support. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4477
* Only build devname2devid when libudev headers are availableBrian Behlendorf2016-03-312-3/+12
| | | | | | | | | Accidentally introduced by commit 39fc0cb. The devname2devid utility which depends on libudev must only be built when libudev headers are available. This is accomplished through an AM_CONDITIONAL. Signed-off-by: Brian Behlendorf <[email protected]> Issue #4416
* Add support for devid and phys_path keys in vdev disk labelsDon Brady2016-03-3111-163/+476
| | | | | | | | | | | | | | | | | | | | | | | | | | | This is foundational work for ZED. Updates a leaf vdev's persistent device strings on Linux platform * only applies for a dedicated leaf vdev (aka whole disk) * updated during pool create|add|attach|import * used for matching device matching during auto-{online,expand,replace} * stored in a leaf disk config label (i.e. alongside 'path' NVP) * can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES Some examples: path: '/dev/sdb1' devid: 'scsi-350000394a8ca4fbc-part1' phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0' path: '/dev/mapper/mpatha' devid: 'dm-uuid-mpath-35000c5006304de3f' Signed-off-by: Don Brady <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2856 Closes #3978 Closes #4416
* Expand EDQUOT variableAndriy Gapon2016-03-311-3/+3
| | | | | | | | | | Results in failures with ksh version 93v- 2014-06-25. This appears to not be an issue with ksh version 93u+ 2012-08-01. The expanded versions works correctly for both. Signed-off-by: Andriy Gapon <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4452
* zfs_main: fix `zfs userspace` squashing unresolved entriesPavel Boldin2016-03-301-4/+7
| | | | | | | | | | | | | | The `zfs userspace` squashes all entries with unresolved numeric values into a single output entry due to the comparsion always made by the string name which is empty in case of unresolved IDs. Fix this by falling to a numerical comparison when either one of string values is not found. This then compares any numerical values after all with a name resolved. Signed-off-by: Pavel Boldin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4440
* Remove complicated libspl assert wrappersBrian Behlendorf2016-03-302-51/+36
| | | | | | | | | | Effectively provide our own version of assert()/verify() for use in user space. This minimizes our dependencies and aligns the user space assertion handling with what's used in the kernel. Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4449
* gcc build error: -Wbool-compare in metaslab.cDHE2016-03-301-1/+2
| | | | | | | | | | | When debugging is enabled on a very recent version of gcc (tested with 5.3.0), DVA_SET_GANG(dva, !!(flags)) fails because an assertion causes a comparison between what is technically a boolean and an integer. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4465
* Fix zpool_scrub_* test casesBrian Behlendorf2016-03-306-5/+29
| | | | | | | | | | | | | The zpool_scrub_002, zpool_scrub_003, zpool_scrub_004 test cases fail reliably when running against small pools or fast storage. This occurs because the scrub/resilver operation completes before subsequent commands can be run. A one second delay has been added to 10% of zio's in order to ensure the scrub/resilver operation will run for at least several seconds. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4450
* Use the correct macro to include backtraceCarlo Landmeter2016-03-291-2/+2
| | | | | | | | | | | execinfo.h and backtrace() are GNU extensions provided by glibc and not by gcc, see: http://www.gnu.org/software/libc/manual/html_mono/libc.html#Backtraces Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4453
* Move hrtime_t timestruc_t and timespec_tCarlo Landmeter2016-03-292-4/+5
| | | | | | | | | | | hrtime_t timestruc_t and timespec_t should have originally been included in sys/time.h so lets move them. longlong_t is not defined by any standard so change it to long long Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4459
* Set _DATE_FMT to '%+' if not defined in libspl/timestamp.cCarlo Landmeter2016-03-291-0/+4
| | | | | | Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4458
* Ensure correct return value typeCarlo Landmeter2016-03-291-1/+1
| | | | | | | | When compiling with musl libc the return type will be incorrect. Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4454
* Add missing fcntl.h to includes in mount_zfs.cCarlo Landmeter2016-03-291-0/+1
| | | | | | | | This is needed for musl libc Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4456
* Include sys/types.h in devid.hCarlo Landmeter2016-03-291-0/+1
| | | | | | | | This is needed for musl libc Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4454
* Correct typo in spa_load_verify_metadata docsRichard Laager2016-03-291-1/+1
| | | | | | Signed-off-by: Richard Laager <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4471
* zloop.sh requires bashBrian Behlendorf2016-03-251-1/+1
| | | | | | | | | The zloop.sh script requires bash. It will require further improvements to be compatible with the alternatives such as dash. This resolves the ztest failures observed under Ubuntu in the automated tested. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4441
* write_dirs: set_partition expects zero-based partition indecesAndriy Gapon2016-03-251-1/+4
| | | | | | | | ... despite partition names based 1-based. Signed-off-by: Andriy Gapon <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4446
* zfs_copies: do_vol_test must wait for deviceBrian Behlendorf2016-03-251-0/+2
| | | | | | | Occasionally zfs_copies_* tests which rely on do_vol_test() will fail because udev hasn't yet created the minor device. Wait for it. Signed-off-by: Brian Behlendorf <[email protected]>
* Add zloop.sh test scriptBrian Behlendorf2016-03-234-0/+216
| | | | | | | | | | | | | | | | | | | Add Chris Williamson's "new" zloop script so that it may be intergated with ZoLs automated testing. The original script may be found in the openzfs-build repository on Github. Minor modifications were made to the script so it can be run directly from the ZoL source tree or from installed packages. Additionally it was updated to use gdb instead of mdb to extact debugging information from a core dump. References: https://github.com/openzfs/openzfs-build/commit/7fb5d8b https://github.com/openzfs/openzfs-build/blob/master/ansible/roles/openzfs-jenkins-slave/files/usr/local/zloop.sh Signed-off-by: Brian Behlendorf <[email protected]> Closes #4441
* Fix zdb -e and zhack thread_init()Brian Behlendorf2016-03-213-18/+23
| | | | | | | | | | | | | | | | | This issue was caused by calling `thread_init()` and `thread_fini()` multiple times resulting in `kthread_key` being invalid. To resolve the issue the explicit calls to `thread_init()` and `thread_fini()` required by the `zpool` command have been moved in to the command. Consumers such as `zdb` and `zhack` perform the same initialized through `kernel_init()` and `kernel_fini()`. Resolving this issue allows multiple additional test cases to be enabled. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #4331
* Support for vectorized algorithms on x86Gvozden Neskovic2016-03-216-2/+583
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is initial support for x86 vectorized implementations of ZFS parity and checksum algorithms. For the compilation phase, configure step checks if toolchain supports relevant instruction sets. Each implementation must ensure that the code is not passed to compiler if relevant instruction set is not supported. For this purpose, following new defines are provided if instruction set is supported: - HAVE_SSE, - HAVE_SSE2, - HAVE_SSE3, - HAVE_SSSE3, - HAVE_SSE4_1, - HAVE_SSE4_2, - HAVE_AVX, - HAVE_AVX2. For detecting if an instruction set can be used in runtime, following functions are provided in (include/linux/simd_x86.h): - zfs_sse_available() - zfs_sse2_available() - zfs_sse3_available() - zfs_ssse3_available() - zfs_sse4_1_available() - zfs_sse4_2_available() - zfs_avx_available() - zfs_avx2_available() - zfs_bmi1_available() - zfs_bmi2_available() These function should be called once, on module load, or initialization. They are safe to use from user and kernel space. If an implementation is using more than single instruction set, both compiler and runtime support for all relevant instruction sets should be checked. Kernel fpu methods: - kfpu_begin() - kfpu_end() Use __get_cpuid_max and __cpuid_count from <cpuid.h> Both gcc and clang have support for these. They also handle ebx register in case it is used for PIC code. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4381
* Cleanup linkingRichard Yao2016-03-189-15/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I noticed during code review of zfsonlinux/zfs#4385 that the author of a commit had peppered the various Makefile.am files with `$(TIRPC_LIBS)` when putting it into `lib/libspl/Makefile.am` should have sufficed. Upon further examination, it seems that he had copied what we do with `$(ZLIB)`. We also have a bit of that with `-ldl` too. Unfortunately, what we do is wrong, so lets fix it to set a good example for future contributors. In addition, we have multiple `-lz` and `-luuid` passed to the compiler because each `AC_CHECK_LIB` adds it to `$LIBS`. That is somewhat annoying to see, so we switch to `AC_SEARCH_LIBS` to avoid it. This is consistent with the recommendation to use `AC_SEARCH_LIBS` over `AC_CHECK_LIB` by autotools upstream: https://www.gnu.org/software/autoconf/manual/autoconf-2.66/html_node/Libraries.html In an ideal world, this would translate into improvements in ELF's `DT_NEEDED` entries, but that is not the case because of a couple of bugs in libtool. The first bug causes libtool to overlink by using static link dependencies for dynamic linking: https://wiki.mageia.org/en/Overlinking_issues_in_packaging#libtool_issues The workaround for this should be to pass `-Wl,--as-needed` in `LDFLAGS`. That leads us to the second bug, where libtool passes `LDFLAGS` after the libraries are specified and `ld` will only honor `--as-needed` on libraries specified before it: https://sigquit.wordpress.com/2011/02/16/why-asneeded-doesnt-work-as-expected-for-your-libraries-on-your-autotools-project/ There are a few possible workarounds for the second bug. One is to either patch the compiler spec file to specify `-Wl,--as-needed` or pass `-Wl,--as-needed` via `CC` like `CC='gcc -Wl,--as-needed'` so that it is specified early. Another is to patch ltmain.sh like Gentoo does: https://gitweb.gentoo.org/repo/gentoo.git/tree/eclass/ELT-patches/as-needed Without one of those workarounds, this cleanup provides no benefit in terms of `DT_NEEDED` entry generation. It should still be an improvement because it nicely simplifies the code while encouraging good habits when patching autotools scripts. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4426
* Add support for s390[x].Dimitri John Ledkov2016-03-172-2/+17
| | | | | | | Signed-off-by: Dimitri John Ledkov <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4425
* Disable zpool_add_004_pos test caseBrian Behlendorf2016-03-171-1/+1
| | | | | | | This test case add a zvol to as a vdev to an existing pool. This use case is currently known to be racy. Signed-off-by: Brian Behlendorf <[email protected]>
* Add the ZFS Test SuiteBrian Behlendorf2016-03-161243-1042/+89497
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the ZFS Test Suite and test-runner framework from illumos. This is a continuation of the work done by Turbo Fredriksson to port the ZFS Test Suite to Linux. While this work was originally conceived as a stand alone project integrating it directly with the ZoL source tree has several advantages: * Allows the ZFS Test Suite to be packaged in zfs-test package. * Facilitates easy integration with the CI testing. * Users can locally run the ZFS Test Suite to validate ZFS. This testing should ONLY be done on a dedicated test system because the ZFS Test Suite in its current form is destructive. * Allows the ZFS Test Suite to be run directly in the ZoL source tree enabled developers to iterate quickly during development. * Developers can easily add/modify tests in the framework as features are added or functionality is changed. The tests will then always be in sync with the implementation. Full documentation for how to run the ZFS Test Suite is available in the tests/README.md file. Warning: This test suite is designed to be run on a dedicated test system. It will make modifications to the system including, but not limited to, the following. * Adding new users * Adding new groups * Modifying the following /proc files: * /proc/sys/kernel/core_pattern * /proc/sys/kernel/core_uses_pid * Creating directories under / Notes: * Not all of the test cases are expected to pass and by default these test cases are disabled. The failures are primarily due to assumption made for illumos which are invalid under Linux. * When updating these test cases it should be done in as generic a way as possible so the patch can be submitted back upstream. Most existing library functions have been updated to be Linux aware, and the following functions and variables have been added. * Functions: * is_linux - Used to wrap a Linux specific section. * block_device_wait - Waits for block devices to be added to /dev/. * Variables: Linux Illumos * ZVOL_DEVDIR "/dev/zvol" "/dev/zvol/dsk" * ZVOL_RDEVDIR "/dev/zvol" "/dev/zvol/rdsk" * DEV_DSKDIR "/dev" "/dev/dsk" * DEV_RDSKDIR "/dev" "/dev/rdsk" * NEWFS_DEFAULT_FS "ext2" "ufs" * Many of the disabled test cases fail because 'zfs/zpool destroy' returns EBUSY. This is largely causes by the asynchronous nature of device handling on Linux and is expected, the impacted test cases will need to be updated to handle this. * There are several test cases which have been disabled because they can trigger a deadlock. A primary example of this is to recursively create zpools within zpools. These tests have been disabled until the root issue can be addressed. * Illumos specific utilities such as (mkfile) should be added to the tests/zfs-tests/cmd/ directory. Custom programs required by the test scripts can also be added here. * SELinux should be either is permissive mode or disabled when running the tests. The test cases should be updated to conform to a standard policy. * Redundant test functionality has been removed (zfault.sh). * Existing test scripts (zconfig.sh) should be migrated to use the framework for consistency and ease of testing. * The DISKS environment variable currently only supports loopback devices because of how the ZFS Test Suite expects partitions to be named (p1, p2, etc). Support must be added to generate the correct partition name based on the device location and name. * The ZFS Test Suite is part of the illumos code base at: https://github.com/illumos/illumos-gate/tree/master/usr/src/test Original-patch-by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #6 Closes #1534
* Illumos 6681 - zfs list burning lots of time in dodefault() via dsl_prop_*Alex Wilson2016-03-151-7/+6
| | | | | | | | | | | | | | | | 6681 zfs list burning lots of time in dodefault() via dsl_prop_* Reviewed by: Patrick Mooney <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Matthew Ahrens <[email protected]> References: https://www.illumos.org/issues/6681 https://github.com/illumos/illumos-gate/commit/d09e447 Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #4406
* Fix aarch64 compilationGordan Bobic2016-03-151-1/+2
| | | | | | | | | | sys/param.h depends on types defined in sys/types.h (hrtime_t & timestruc_t). Signed-off-by: Gordan Bobic <[email protected]> Signed-off-by: Christopher J. Morrone <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4420
* Illumos 6370 - ZFS send fails to transmit some holesPaul Dagnelie2016-03-102-14/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 6370 ZFS send fails to transmit some holes Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Chris Williamson <[email protected]> Reviewed by: Stefan Ring <[email protected]> Reviewed by: Steven Burgess <[email protected]> Reviewed by: Arne Jansen <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/6370 https://github.com/illumos/illumos-gate/commit/286ef71 In certain circumstances, "zfs send -i" (incremental send) can produce a stream which will result in incorrect sparse file contents on the target. The problem manifests as regions of the received file that should be sparse (and read a zero-filled) actually contain data from a file that was deleted (and which happened to share this file's object ID). Note: this can happen only with filesystems (not zvols, because they do not free (and thus can not reuse) object IDs). Note: This can happen only if, since the incremental source (FromSnap), a file was deleted and then another file was created, and the new file is sparse (i.e. has areas that were never written to and should be implicitly zero-filled). We suspect that this was introduced by 4370 (applies only if hole_birth feature is enabled), and made worse by 5243 (applies if hole_birth feature is disabled, and we never send any holes). The bug is caused by the hole birth feature. When an object is deleted and replaced, all the holes in the object have birth time zero. However, zfs send cannot tell that the holes are new since the file was replaced, so it doesn't send them in an incremental. As a result, you can end up with invalid data when you receive incremental send streams. As a short-term fix, we can always send holes with birth time 0 (unless it's a zvol or a dataset where we can guarantee that no objects have been reused). Ported-by: Steven Burgess <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4369 Closes #4050
* Relax MBR partition scanning requirementBrian Behlendorf2016-03-101-21/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When checking a whole disk to see if it can be safely added to the pool a variety of checks are done. One of those checks is to attempt to determine the partition information and scan all the partitions for existing filesystems. Since ZoL contains a EFI library this partition scanning is easy to do for GPT partitioned disks. However, for non-GPT partitioned disks (MBR/EBR) things are a bit harder. The lack of a convenient library means non-GPT partitioned disks will not have all their partitions checked. For this reason, the default behavior was to require the force option. For example: invalid vdev specification use '-f' to override the following errors: /dev/vdb does not contain an GPT label but it may contain partition information in the MBR. However in practice requiring the force option for this case is counter-intuitively less safe. The reason is because only the first error is returned. By passing the force option it will suppress this first warning and potentially others you were not aware of. Therefore this patch inverts the default behavior for non-GPT formated disks (unformatted, MBR/EBR, etc). If no GPT table is detected and there is no file system detected on the provided block device. Then it will be assumed that block device is safe to use. Longer term it would be nice to see MBR/EBR scanning added to the utilities. This should be fairly straight forward to do. However these days it's somewhat less critical because Linux defaults to GPT partition tables for devices 2TB or larger. Signed-off-by: Brian Behlendorf <[email protected]> Closes #2660 Closes #2274
* Fix lock order inversion with zvol_open()Boris Protopopov2016-03-101-43/+20
| | | | | | | | | | | | | | | zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed when zvol minor operations are performed in a separate context with no prior locking state; the spa_namespace_lock is no longer held when bdev->bd_mutex or zfs_state_lock might be taken in the code paths originating from the zvol minor operation callbacks. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3681
* Add support for asynchronous zvol minor operationsBoris Protopopov2016-03-1012-214/+482
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of. * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched. * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another. * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed. Support for asynchronous zvol minor operations address issue #3681. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2217 Closes #3678 Closes #3681
* Remove RPM package restrictionBrian Behlendorf2016-03-101-5/+0
| | | | | | | | ZFS on Linux is regularly tested on arm, ppc, ppc64, i686 and x86_64 architectures. Given this the artificial architecture restriction in the packaging has been removed. Signed-off-by: Brian Behlendorf <[email protected]>
* Change KM_SLEEP to TQ_SLEEP in spa_deadman()Tim Chase2016-03-091-1/+1
| | | | | | | | | | Since they both evaluate to zero, this is a semi-cosmetic change but the latter is the proper value to use as an argument to taskq_dispatch_delay(). Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4393
* Updated paths to scan when importing zpool(s)Thijs Cramer2016-03-092-2/+4
| | | | | | | | | Added by-partlabel and by-partuuid to the default device search path. Made made device names in by-label more preferable. Signed-off-by: Thijs Cramer <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3892
* Require libblkidBrian Behlendorf2016-03-097-147/+49
| | | | | | | | | | | | | | | | | | | | | | Historically libblkid support was detected as part of configure and optionally enabled. This was done because at the time support for detecting ZFS pool vdevs had just be added to libblkid and those updated packages were not yet part of many distributions. This is no longer the case and any reasonably current distribution will ship a version of libblkid which can detect ZFS pool vdevs. This patch makes libblkid mandatory at build time and libblkid the preferred method of scanning for ZFS pools. For distributions which include a modern version of libblkid there is no change in behavior. Explicitly scanning the default search paths is still supported and can be enabled with the '-s' command line option. Additionally making libblkid mandatory means that the 'zpool create' command can reliably detect if a specified device has an existing non-ZFS filesystem (ext4, xfs) and print a warning. Signed-off-by: Brian Behlendorf <[email protected]> Closes #2448
* Ensure zed _finish_daemonize() leaves fds 0-2 openChris Dunlap2016-03-081-1/+1
| | | | | | | | | | | | | | | In zed's _finish_daemonize(), /dev/null is open()d onto a temporary file descriptor which is then dup()d onto stdin, stdout, and stderr. But if file descriptors 0, 1, or 2 are not already open at the start of this function, then the temporary file descriptor will fall within this range and be inadvertently closed when the function cleans up. This commit adds a check to prevent inadvertently closing this (presumably temporary) file descriptor when it shouldn't. Signed-off-by: Chris Dunlap <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4384
* Fix zpool iostat bandwidth/ops calculationTony Hutter2016-03-081-13/+1
| | | | | | | | | | | | | | print_vdev_stats() subtracts the old bandwidth/ops stats from the new stats to calculate the bandwidth/ops numbers in "zpool iostat". However when the TXG numbers change between stats, zpool_refresh_stats() will incorrectly assign a NULL to the old stats. This causes print_vdev_stats() to use zeroes for the old bandwidth/ops numbers, resulting in an inaccurate calculation. This fix allows the calculation to happen even when TXGs change. Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4387
* Add support for alpine linuxCarlo Landmeter2016-03-086-6/+11
| | | | | | | | | Both Alpine Linux and Gentoo use OpenRC so we share its logic Signed-off-by: Carlo Landmeter <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4386
* Make zvol update volsize operation synchronous.ab-oe2016-02-291-0/+4
| | | | | | | | | | | | | There is a race condition when new transaction group is added to dp->dp_dirty_datasets list by the zap_update in the zvol_update_volsize. Meanwhile, before these dirty data are synchronized, the receive process can cause that dmu_recv_end_sync is executed. Then finally dirty data are going to be synchronized but the synchronization ends with the NULL pointer dereference error. Signed-off-by: ab-oe <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4116
* FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and ↵smh2016-02-265-114/+307
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | locality information. The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. The new algorithm takes into the following additional factors: * Load of the vdevs (number outstanding I/O requests) * The locality of last queued I/O vs the new I/O request. Within the locality calculation additional knowledge about the underlying vdev is considered such as; is the device backing the vdev a rotating media device. This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices. The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parrallel dd's. With pre-fetch disabled (vfs.zfs.prefetch_disable=1): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s With pre-fetch enabled (vfs.zfs.prefetch_disable=0): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdevs loads are only calculated from the set of valid candidates. The following additional sysctls where added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc These changes where based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487 Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay References: https://github.com/freebsd/freebsd@5c7a6f5d https://github.com/freebsd/freebsd@31b7f68d https://github.com/freebsd/freebsd@e186f564 Performance Testing: https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141 Porting notes: - The tunables were adjusted to have ZoL-style names. - The code was modified to use ZoL's vd_nonrot. - Fixes were done to make cstyle.pl happy - Merge conflicts were handled manually - freebsd/freebsd@e186f564bc946f82c76e0b34c2f0370ed9aea022 by my collegue Andriy Gapon has been included. It applied perfectly, but added a cstyle regression. - This replaces 556011dbec2d10579819078559a77630fc559112 entirely. - A typo "IO'a" has been corrected to say "IO's" - Descriptions of new tunables were added to man/man5/zfs-module-parameters.5. Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4334
* Change full path subcommand flag from -p to -PBrian Behlendorf2016-02-262-40/+40
| | | | | | | | | | | Commit d2f3e29 introduced the -p option which outputs full paths for vdevs to multiple zpool subcommands. When this was merged there was no conflict for this flag letter. However it's certain there will be a conflict with the -p (parsable) flag used by other subcommands. Therefore, -p is being changed to -P to avoid this. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4368
* Add -gLp to zpool subcommands for alt vdev namesRichard Yao2016-02-254-104/+407
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following options have been added to the zpool add, iostat, list, status, and split subcommands. The default behavior was not modified, from zfs(8). -g Display vdev GUIDs instead of the normal short device names. These GUIDs can be used in-place of device names for the zpool detach/off‐ line/remove/replace commands. -L Display real paths for vdevs resolving all symbolic links. This can be used to lookup the current block device name regardless of the /dev/disk/ path used to open it. -p Display full paths for vdevs instead of only the last component of the path. This can be used in conjunction with the -L flag. This behavior may also be enabled using the following environment variables. ZPOOL_VDEV_NAME_GUID ZPOOL_VDEV_NAME_FOLLOW_LINKS ZPOOL_VDEV_NAME_PATH This change is based on worked originally started by Richard Yao to add a -g option. Then extended by @ilovezfs to add a -L option for openzfsonosx. Those changes have been merged, re-factored, a -p option added and extended to all relevant zpool subcommands. Original-patch-by: Richard Yao <[email protected]> Extended-by: ilovezfs <[email protected]> Extended-by: Brian Behlendorf <[email protected]> Signed-off-by: ilovezfs <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2011 Closes #4341
* Add nfs-kernel-server for DebianGrischa Zengel2016-02-251-2/+2
| | | | | | | | | | Debian based systems use nfs-kernel-server as the service name. List both nfs-server.service and nfs-kernel-server.service so this service will work on multiple distributions. Signed-off-by: Grischa Zengel <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4350
* Add l2arc_max_block_size tunableBrian Behlendorf2016-02-252-1/+47
| | | | | | | | | | | Set a limit for the largest compressed block which can be written to an L2ARC device. By default this limit is set to 16M so there is no change in behavior. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Elling <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #4323
* Make zvol minor functionality more robustBoris Protopopov2016-02-241-16/+77
| | | | | | | | | | | | | | | | | | | | | | | | Close the race window in zvol_open() to prevent removal of zvol_state in the 'first open' code path. Move the call to check_disk_change() under zvol_state_lock to make sure the zvol_media_changed() and zvol_revalidate_disk() called by check_disk_change() are invoked with positive zv_open_count. Skip opened zvols when removing minors and set private_data to NULL for zvols that are not in use whose minors are being removed, to indicate to zvol_open() that the state is gone. Skip opened zvols when renaming minors to avoid modifying zv_name that might be in use, e.g. in zvol_ioctl(). Drop zvol_state_lock before calling add_disk() when creating minors to avoid deadlocks with zvol_open(). Wrap dmu_objset_find() with spl_fstran_mark()/unmark(). Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #4344
* Call dmu_read_uio_dbuf() in zvol_read()Richard Yao2016-02-181-1/+1
| | | | | | | | | | | | | | | | | The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is that the former takes a hold while the latter uses an existing hold. `zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no apparent reason, we inherited a `zvol_read()` function from OpenSolaris that does `dmu_read_uio()`. illumos-gate also still uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`, which is more performant. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4316