aboutsummaryrefslogtreecommitdiffstats
path: root/module/zfs
Commit message (Collapse)AuthorAgeFilesLines
* Fix incorrect assertions in ddt_phys_decref and ddt_sync_entryYing Zhu2013-05-061-2/+1
| | | | | | | | | | | The assertions in ddt_phys_decref and ddt_sync_entry cast ddp->ddp_refcnt from uint64_t to int64_t, with a reference count bigger than 2^63, e.g. the reference count of zero blocks commonly available in spare files, we may mistakenly hit these assertations, so drop the type conversions here. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1436
* Use taskq for dump_bytes()Brian Behlendorf2013-05-062-6/+59
| | | | | | | | | | | | | | | | | | | The vn_rdwr() function performs I/O by calling the vfs_write() or vfs_read() functions. These functions reside just below the system call layer and the expectation is they have almost the entire 8k of stack space to work with. In fact, certain layered configurations such as ext+lvm+md+multipath require the majority of this stack to avoid stack overflows. To avoid this posibility the vn_rdwr() call in dump_bytes() has been moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be performed with the majority of the stack space available. This ends up being very similiar to as if the I/O were issued via sys_write() or sys_read(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1399 Closes #1423
* Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contentionAdam Leventhal2013-05-063-75/+158
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Gordon Ross <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@ec94d32 https://illumos.org/issues/3581 Notes for Linux port: Earlier commit 08d08eb reduced contention on this taskq lock by simply reducing the number of z_fr_iss threads from 100 to one-per-CPU. We also optimized the taskq implementation in zfsonlinux/spl@3c6ed54. These changes significantly improved unlink performance to acceptable levels. This patch further reduces time spent spinning on this lock by randomly dispatching the work items over multiple independent task queues. The Illumos ZFS developers stated that this lock contention only arose after "3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()" was landed. It's not clear if 3329 affects the Linux port or not. I didn't see spa_free_sync_cb() show up in oprofile sessions while unlinking large files, but I may just not have used the right test case. I tested unlinking a 1 TB of data with and without the patch and didn't observe a meaningful difference in elapsed time. However, oprofile showed that the percent time spent in taskq_thread() was reduced from about 16% to about 5%. Aside from a possible slight performance benefit this may be worth landing if only for the sake of maintaining consistency with upstream. Ported-by: Ned Bass <[email protected]> Closes #1327
* Illumos #3329, #3330, #3331, #3335George Wilson2013-05-065-14/+59
| | | | | | | | | | | | | | | | | | | | | | | | 3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb() 3330 space_seg_t should have its own kmem_cache 3331 deferred frees should happen after sync_pass 1 3335 make SYNC_PASS_* constants tunable Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@01f55e48fb4d524eaf70687728aa51b7762e2e97 https://www.illumos.org/issues/3329 https://www.illumos.org/issues/3330 https://www.illumos.org/issues/3331 https://www.illumos.org/issues/3335 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #3306, #3321George Wilson2013-05-033-20/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3306 zdb should be able to issue reads in parallel 3321 'zpool reopen' command should be documented in the man page and help Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: illumos/illumos-gate@31d7e8fa33fae995f558673adb22641b5aa8b6e1 https://www.illumos.org/issues/3306 https://www.illumos.org/issues/3321 The vdev_file.c implementation in this patch diverges significantly from the upstream version. For consistenty with the vdev_disk.c code the upstream version leverages the Illumos bio interfaces. This makes sense for Illumos but not for ZoL for two reasons. 1) The vdev_disk.c code in ZoL has been rewritten to use the Linux block device interfaces which differ significantly from those in Illumos. Therefore, updating the vdev_file.c to use the Illumos interfaces doesn't get you consistency with vdev_disk.c. 2) Using the upstream patch as is would requiring implementing compatibility code for those Solaris block device interfaces in user and kernel space. That additional complexity could lead to confusion and doesn't buy us anything. For these reasons I've opted to simply move the existing vn_rdwr() as is in to the taskq function. This has the advantage of being low risk and easy to understand. Moving the vn_rdwr() function in to its own taskq thread also neatly avoids the possibility of a stack overflow. Finally, because of the additional work which is being handled by the free taskq the number of threads has been increased. The thread count under Illumos defaults to 100 but was decreased to 2 in commit 08d08e due to contention. We increase it to 8 until the contention can be address by porting Illumos #3581. Ported-by: Brian Behlendorf <[email protected]> Closes #1354
* 3246 ZFS I/O deadman threadGeorge.Wilson2013-05-018-35/+179
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> NOTES: This patch has been reworked from the original in the following ways to accomidate Linux ZFS implementation *) Usage of the cyclic interface was replaced by the delayed taskq interface. This avoids the need to implement new compatibility code and allows us to rely on the existing taskq implementation. *) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h because declaring externs in source files as was done in the original patch is just plain wrong. *) Instead of panicing the system when the deadman triggers a zevent describing the blocked vdev and the first pending I/O is posted. If the panic behavior is desired Linux provides other generic methods to panic the system when threads are observed to hang. *) For reference, to delay zios by 30 seconds for testing you can use zinject as follows: 'zinject -d <vdev> -D30 <pool>' References: illumos/illumos-gate@283b84606b6fc326692c03273de1774e8c122f9a https://www.illumos.org/issues/3246 Ported-by: Brian Behlendorf <[email protected]> Closes #1396
* Fix txg_quiesce thread deadlockBrian Behlendorf2013-04-262-8/+8
| | | | | | | | | | | | | | | | | | A deadlock was accidentally introduced by commit e95853a which can occur when the system is under memory pressure. What happens is that while the txg_quiesce thread is holding the tx->tx_cpu locks it enters memory reclaim. In the context of this memory reclaim it then issues synchronous I/O to a ZVOL swap device. Because the txg_quiesce thread is holding the tx->tx_cpu locks a new txg cannot be opened to handle the I/O. Deadlock. The fix is straight forward. Move the memory allocation outside the critical region where the tx->tx_cpu locks are held. And for good measure change the offending allocation to KM_PUSHPAGE to ensure it never attempts to issue I/O during reclaim. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1274
* Correctly return ERANGE in getxattr(2)Brian Behlendorf2013-04-241-2/+10
| | | | | | | | | | According to the getxattr(2) man page the ERANGE errno should be returned when the size of the value buffer is to small to hold the result. Prior to this patch the implementation would just truncate the value to size bytes. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1408
* Trivial spelling fixChris Dunlop2013-04-191-1/+1
| | | | | Signed-off-by: Brian Behlendorf <[email protected]> Closes #1411
* Remove .readdir from zpl_file_operations tableCaleb James DeLisle2013-04-191-1/+0
| | | | | | | | | | | The zpl_readdir() function shouldn't be registered as part of the zpl_file_operations table, it must only be part of the zpl_dir_file_operations table. By removing this callback the VFS will now correctly return ENOTDIR when calling getdents() on a file. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1404
* Allow setting a lower ashift with -o ashiftMartin Matuska2013-04-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous patches have allowed you to set an increased ashift to avoid doing 512b IO with 4k sector devices. However, it was not possible to set the ashift lower than the reported physical sector size even when a smaller logical size was supported. In practice, there are several cases where settong a lower ashift is useful: * Most modern drives now correctly report their physical sector size as 4k. This causes zfs to correctly default to using a 4k sector size (ashift=12). However, for some usage models this new default ashift value causes an unacceptable increase in space usage. Filesystems with many small files may see the total available space reduced to 30-40% which is unacceptable. * When replacing a drive in an existing pool which was created with ashift=9 a modern 4k sector drive cannot be used. The 'zpool replace' command will issue an error that the new drive has an 'incompatible sector alignment'. However, by allowing the ashift to be manual specified as smaller, non-optimal, value the device may still be safely used. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1381 Closes #1328 Issue #967 Issue #548
* Illumos #3422, #3425George Wilson2013-04-122-10/+5
| | | | | | | | | | | | | | | | | 3422 zpool create/syseventd race yield non-importable pool 3425 first write to a new zvol can fail with EFBIG Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: illumos/illumos-gate@bda8819455defbccd06981d9a13b240b682a3d50 https://www.illumos.org/issues/3422 https://www.illumos.org/issues/3425 Ported-by: Brian Behlendorf <[email protected]> Closes #1390
* build: resolve orthographic and other grammatical errorsJan Engelhardt2013-04-021-3/+3
| | | | | Signed-off-by: Jan Engelhardt <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Add zio_ddt_free()+ddt_phys_decref() error handlingBrian Behlendorf2013-03-192-4/+9
| | | | | | | | | | | | | | | | | The assumption in zio_ddt_free() is that ddt_phys_select() must always find a match. However, if that fails due to a damaged DDT or some other reason the code will NULL dereference in ddt_phys_decref(). While this should never happen it has been observed on various platforms. The result is that unless your willing to patch the ZFS code the pool is inaccessible. Therefore, we're choosing to more gracefully handle this case rather than leave it fatal. http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html Signed-off-by: Brian Behlendorf <[email protected]> Closes #1308
* Add metaslab_debug optionBrian Behlendorf2013-03-181-1/+6
| | | | | | | | | Enabling metaslab debugging will prevent space maps from being automatically unloaded. This can significantly increase the memory footprint but being able to dynamically control this is helpful for debugging and certain performance testing. Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.9 compat: Undefine GCC_VERSIONRichard Yao2013-03-061-0/+8
| | | | | | | | | | | | The mainline kernel started defining GCC_VERSION with commit torvalds/linux@3f3f8d2f48acfd8ed3b8e6b7377935da57b27b16. Unfortunately, LZ4 also defines this macro, but the two defintions are incompatible. We undefine GCC_VERSION in lz4.c to handle this. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1339
* Add snapdev=[hidden|visible] dataset propertyEric Dillmann2013-03-052-3/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new snapdev dataset property may be set to control the visibility of zvol snapshot devices. By default this value is set to 'hidden' which will prevent zvol snapshots from appearing under /dev/zvol/ and /dev/<dataset>/. When set to 'visible' all zvol snapshots for the dataset will be visible. This functionality was largely added because when automatic snapshoting is enabled large numbers of read-only zvol snapshots will be created. When creating these devices the kernel will attempt to read their partition tables, and blkid will attempt to identify any filesystems on those partitions. This leads to a variety of issues: 1) The zvol partition tables will be read in the context of the `modprobe zfs` for automatically imported pools. This is undesirable and should be done asynchronously, but for now reducing the number of visible devices helps. 2) Udev expects to be able to complete its work for a new block devices fairly quickly. When many zvol devices are added at the same time this is no longer be true. It can lead to udev timeouts and missing /dev/zvol links. 3) Simply having lots of devices in /dev/ can be aukward from a management standpoint. Hidding the devices your unlikely to ever use helps with this. Any snapshot device which is needed can be made visible by changing the snapdev property. NOTE: This patch changes the default behavior for zvols which was effectively 'snapdev=visible'. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1235 Closes #945 Issue #956 Issue #756
* Merge zvol.c changes from PSARC 2010/306 Read-only ZFS poolsGeorge Wilson2013-03-041-5/+8
| | | | | | | | | | | | | | | | | | The changes to zvol.c were never merged from the last onnv_147 bulk update. This was because zvol.c was largely rewritten for Linux making it fairly easy to miss these sorts of changes. This causes a regression when importing a zpool with zvols read-only. This does not impact pool which only contain filesystem datasets. References: illumos/illumos-gate@f9af39b Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1332 Closes #1333
* Constify structures containing function pointersRichard Yao2013-03-045-43/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PaX team modified the kernel's modpost to report writeable function pointers as section mismatches because they are potential exploit targets. We could ignore the warnings, but their presence can obscure actual issues. Proper const correctness can also catch programming mistakes. Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2 kernel reports 133 section mismatches prior to this patch. This patch eliminates 130 of them. The quantity of writeable function pointers eliminated by constifying each structure is as follows: vdev_opts_t 52 zil_replay_func_t 24 zio_compress_info_t 24 zio_checksum_info_t 9 space_map_ops_t 7 arc_byteswap_func_t 5 The remaining 3 writeable function pointers cannot be addressed by this patch. 2 of them are in zpl_fs_type. The kernel's sget function requires that this be non-const. The final writeable function pointer is created by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and remove_shrinker() functions also require that this be non-const. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1300
* Fix hot sparesBrian Behlendorf2013-03-011-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue with hot spares in ZoL is because it opens all leaf vdevs exclusively (O_EXCL). On Linux, exclusive opens cause subsequent exclusive opens to fail with EBUSY. This could be resolved by not opening any of the devices exclusively, which is what Illumos does, but the additional protection offered by exclusive opens is desirable. It cleanly prevents you from accidentally adding an in-use non-ZFS device to your pool. To fix this we very slightly relaxed the usage of O_EXCL in the following ways. 1) Functions which open the device but only read had the O_EXCL flag removed and were updated to use O_RDONLY. 2) A common holder was added to the vdev disk code. This allow the ZFS code to internally open the device multiple times but non-ZFS callers may not. 3) An exception was added to make_disks() for hot spare when creating partition tables. For hot spare devices which are already opened exclusively we skip creating the partition table because this must already have been done when the disk was originally added as a hot spare. Additional minor changes include fixing check_in_use() to use a partition instead of a slice suffix. And is_spare() was moved above make_disks() to avoid adding a forward reference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #250
* Remove wholedisk check from vdev_disk_open()Brian Behlendorf2013-02-281-14/+0
| | | | | | | | | As described by the comment and enforced the by assertion the v->vdev_wholedisk will never be -1. The wholedisk handling is performed by the user space utilities. To prevent confusion this dead code is being removed. Signed-off-by: Brian Behlendorf <[email protected]>
* Leaf vdevs should not be reopenedBrian Behlendorf2013-02-281-3/+16
| | | | | | | | | | | | | | | | | | | When vdev_disk.c was implemented for Linux we failed to handle the reopen case. According to the vdev_reopen() comment leaf vdevs should not be closed or opened when v->vdev_reopening is set. Under Linux we would always close and open the device. This issue was only noticed when a 'zpool scrub' command was run while the leaf vdev device names in /dev/disk/by-vdev were missing. The scrub command calls vdev_reopen() which caused the vdevs to be closed but they couldn't be reopened due to the missing links. The result was that all the vdevs were marked unavailable and the pool was halted due to failmode=wait. This patch adds the missing functionality in a similiar fashion to to the Illumos code. Signed-off-by: Brian Behlendorf <[email protected]>
* Remove the bio_empty_barrier() check.Etienne Dechamps2013-02-241-9/+0
| | | | | | | | | | | | | | | | | | | | | | | To determine whether the kernel is capable of handling empty barrier BIOs, we check for the presence of the bio_empty_barrier() macro, which was introduced in 2.6.24. If this macro is defined, then we can flush disk vdevs; if it isn't, then flushing is disabled. Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37, even though the kernel is still capable of handling empty barrier BIOs. As a result, flushing is effectively disabled on kernels >= 2.6.37, meaning that starting from this kernel version, zfs doesn't use barriers to guarantee on-disk data consistency. This is quite bad and can lead to potential data corruption on power failures. This patch fixes the issue by removing the configure check for bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore. Thanks to Richard Kojedzinszky for catching this nasty bug. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1318
* Enable zfs_arc_memory_throttle_disable by defaultBrian Behlendorf2013-02-211-1/+1
| | | | | | | | | | | | | | | | | | | The zfs_arc_memory_throttle_disable module option was introduced by commit 0c5493d47059f25ce9dbf20c9fe87655f55102a1 to resolve a memory miscalculation which could result in the txg_sync thread spinning. When this was first introduced the default behavior was left unchanged until enough real world usage confirmed there were no unexpected issues. We've now reached that point. Linux's direct reclaim is working as expected so we're enabling this behavior by default. This helps pave the way to retire the spl_kmem_availrmem() functionality in the SPL layer. This was the only caller. Signed-off-by: Brian Behlendorf <[email protected]> Issue #938
* Make spa.c assertions catch unsupported pre-feature flag pool versionsRichard Yao2013-02-121-2/+2
| | | | | | | | | | | | | A couple of assertions in spa.c were designed to prevent the use of invalid pool versions. They were written under the assumption that all valid pools are less than SPA_VERSION. Since feature flags jumped from 28 to 5000, any numbers in the range 28 to 5000 non-inclusive will fail to trigger them. We switch to the new SPA_VERSION_IS_SUPPORTED macro to correct this. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1282
* Add explicit MAXNAMELEN checkBrian Behlendorf2013-02-121-0/+3
| | | | | | | | | | | | | It turns out that the Linux VFS doesn't strictly handle all cases where a component path name exceeds MAXNAMELEN. It does however appear to correctly handle MAXPATHLEN for us. The right way to handle this appears to be to add an explicit check to the zpl_lookup() function. Several in-tree filesystems handle this case the same way. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1279
* Switch KM_SLEEP to KM_PUSHPAGENed Bass2013-02-062-3/+3
| | | | | | | | | | | | Two more locations where KM_SLEEP was used in a call which must use KM_PUSHPAGE were found while using the zpool upgrade command. See commit b8d06fc for additional details. Also make a small correction to the comment block above dsl_dir_open_spa(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1268
* Cast 'zfs bad bloc' to ULL for x86Brian Behlendorf2013-02-041-1/+1
| | | | | | | | | | | Explicitly case this value to an unsigned long long for 32-bit systems to inform the compiler that a long type should not be used. Otherwise we get the following compiler error: dmu_send.c:376: error: integer constant is too large for ‘long’ type Signed-off-by: Brian Behlendorf <[email protected]>
* Add zfs_arc_memory_throttle_disable module optionBrian Behlendorf2013-02-011-0/+7
| | | | | | | | | | | | | | | | | | | | | | The way in which virtual box ab(uses) memory can throw off the free memory calculation in arc_memory_throttle(). The result is the txg_sync thread will effectively spin waiting for memory to be released even though there's lots of memory on the system. To handle this case I'm adding a zfs_arc_memory_throttle_disable module option largely for virtual box users. Setting this option disables free memory checks which allows the txg_sync thread to make progress. By default this option is disabled to preserve the current behavior. However, because Linux supports direct memory reclaim it's doubtful throttling due to perceived memory pressure is ever a good idea. We should enable this option by default once we've done enough real world testing to convince ourselve there aren't any unexpected side effects. Signed-off-by: Brian Behlendorf <[email protected]> Closes #938
* Add zfs_disable_dup_eviction module optionBrian Behlendorf2013-02-011-0/+3
| | | | | | | | Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable. It should have been made available as a module option in the original patch but was overlooked. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix mismatch between SA header size and layoutNed Bass2013-01-311-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a system attribute layout is created an inconsistency may occur between the system attribute header (sa_hdr_phys_t) size and the variable-sized attribute count stored in the layout. The inconsistency results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT returns false: SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab()) ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr, tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) && hdr->sa_layout_info == 0)) failed The bug originates in this snippet from sa_find_sizes(). if (is_var_sz && var_size > 1) { if (P2ROUNDUP(hdrsize + sizeof (uint16_t), *total < full_space) { hdrsize += sizeof (uint16_t); This assumes that the current variable-sized attribute will be stored in the current buffer and accounts for the space needed to store its size in the sa_hdr_phys_t. However if the next attribute spills over we need to store a blkptr_t at the end of the bonus buffer to point to the spill block. If the current attribute is in the way of the blkptr_t then it too will be relocated into the spill block. But since we've already accounted for it in the header size we get the inconsistency described above. To avoid this, record the index of the last variable-sized attribute that prompted a hdrsize increase, and reverse the increase if we later determine that that attribute will be relocated to the spill block. Signed-off-by: Matthew Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1250
* Fix rounding discrepancy in sa_find_sizes()Ned Bass2013-01-311-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A rounding discrepancy exists between how sa_build_layouts() and sa_find_sizes() calculate when the spill block needs to be kicked in. This results in a narrow size range where sa_build_layouts() believes there must be a spill block allocated but due to the discrepancy there isn't. A panic then occurs when the hdl->sa_spill NULL pointer is dereferenced. The following reproducer for this bug was isolated: truncate -s 128m /tmp/tank zpool create tank /tmp/tank zfs create -o xattr=sa tank/fish ln -s `perl -e 'print "z" x 41'` /tank/fish/z setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z This test results in roughly the following system attribute (SA) layout: 176 bytes - "standard" SA's 41 bytes - name of symbolic link target 100 bytes - XDR encoded nvlist for xattr --- 317 bytes - total Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes() decides no spill block is needed. But sa_build_layouts() rounds 41 up to 48 when computing the space requirements so it tries to switch to the spill block. Note that we were only able to reproduce this bug using a combination of symbolic links and the Linux-specific xattr=sa dataset property. So while this issue is not technically Linux-specific, it may be difficult or impossible to hit the narrow size range needed to reproduce it on other platforms. To fix the discrepancy, round the running total in sa_find_sizes() up to an 8-byte boundary before accounting for each SA, since this is how they will be stored in the bonus and (possibly) spill buffers. To make the intent of the code more clear, explicitly assert key assumptions about expected alignment of data and whether spill-over will occur. Signed-off-by: Matthew Ahrens <[email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #1240
* Illumos #3447 improve the comment in txg.cAdam H. Leventhal2013-01-301-2/+71
| | | | | | | | | | | | | | | | 3447 improve the comment in txg.c Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: illumos/illumos-gate@adbbcfface63b3a71922d5a25d34a2018c0435de https://www.illumos.org/issues/3447 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #3035 LZ4 compression support in ZFS and GRUBEric Dillmann2013-01-296-0/+1126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3035 LZ4 compression support in ZFS and GRUB Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Christopher Siden <[email protected]> References: illumos/illumos-gate@a6f561b4aee75d0d028e7b36b151c8ed8a86bc76 https://www.illumos.org/issues/3035 http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS This patch has been slightly modified from the upstream Illumos version to be compatible with Linux. Due to the very limited stack space in the kernel a lz4 workspace kmem cache is used. Since we are using gcc we are also able to take advantage of the gcc optimized __builtin_ctz functions. Support for GRUB has been dropped from this patch. That code is available but those changes will need to made to the upstream GRUB package. Lastly, several hunks of dead code were dropped for clarity. They include the functions real_LZ4_uncompress(), LZ4_compressBound() and the Visual Studio specific hunks wrapped in _MSC_VER. Ported-by: Eric Dillmann <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1217
* Avoid gcc -Werror=maybe-uninitialized warningsChris Wedgwood2013-01-281-2/+2
| | | | | | | | | | | Explicitly set acl details to zero to silence gcc (zfs_acl_node_read can't be sure zfs_acl_znode_info will set acl_count and aclsize). Normally suppressing these warnings by setting this to zero at declaration time is a bad idea but in this instance it's hard to avoid and should be fairly safe. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1244
* Use dsl_dataset_snap_lookup()Brian Behlendorf2013-01-253-34/+5
| | | | | | | | | | | | Retire the dmu_snapshot_id() function which was introduced in the initial .zfs control directory implementation. There is already an existing dsl_dataset_snap_lookup() which does exactly what we need, and the dmu_snapshot_id() function as implemented is racy. https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879 Signed-off-by: Brian Behlendorf <[email protected]> Closes #1238
* Add d_clear_d_op() compatibilityBrian Behlendorf2013-01-231-0/+2
| | | | | | | | | | | | | Added d_clear_d_op() helper function which clears some flags and the registered dentry->d_op table. This is required because d_set_d_op() issues a warning when the dentry operations table is already set. For the .zfs control directory to work properly we must be able to override the default operations table and register custom .d_automount and .d_revalidate callbacks. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #1230
* fzap_cursor_move_to_key() should drop l_rwlockNed Bass2013-01-231-6/+6
| | | | | | | | | | | | | Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock since that function returns with the lock held on success. All other callers drop the lock correctly but it seems fzap_cursor_move_to_key() does not. This may block writers or cause VERIFY failures when the lock is freed. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1215 Closes zfsonlinux/spl#143 Closes zfsonlinux/spl#97
* Fix zpl_revalidate() NULL derefBrian Behlendorf2013-01-221-1/+1
| | | | | | | | | | | | In zpl_revalidate() it's possible for the nameidata to be NULL for kernels which still accept the parameter. In particular, lookup_one_len() calls d_revalidate() with a NULL nameidata. Resolve the issue by checking for a NULL nameidata in which case just set the flags to 0. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1226
* Use sb->s_d_op default dentry operationsBrian Behlendorf2013-01-182-11/+9
| | | | | | | | | | As of Linux 2.6.37 the right way to register custom dentry operations is to use the super block's ->s_d_op field. For older kernels they should be registered as part of the lookup operation. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1223
* Fix zpool on zvol deadlockMassimo Maggi2013-01-181-13/+13
| | | | | | | | | | | | | | | Commit 65d56083b4617a4cade0cff68cbbaf68114169d6 fixes the lock inversion between spa_namespace_lock and bdev->bd_mutex but only for the first user of spa_namespace_lock: dmu_objset_own(). Later spa_namespace_lock gets acquired by dsl_prop_get_integer() though dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()-> spa_open()->spa_open_common() without this "protection". By moving the mutex release after this second use, even this acquisition of the lock is "protected" by the ERESTARTSYS trick. Signed-off-by: Massimo Maggi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1220
* Revert "Revert "Fix unlink/xattr deadlock""Brian Behlendorf2013-01-172-55/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 53c7411919a64d6f0889aa0d6974610f6cd35744 effectively reinstating the asynchronous xattr cleanup code. These Linux changes were reverted because after testing and careful contemplation I was convinced that due to the 89260a1c8851ce05ea04b23606ba438b271d890 commit they were no longer required. Unfortunately, the deadlock described in #1176 was a case which wasn't considered. At mount zfs_unlinked_drain() can occur which will unlink a list of znodes in effectively a random order which isn't safe. The only reason it was safe to originally revert this change was the we could guarantee that the VFS would always prune the xattr leaves before the parents. Therefore, until we can cleanly resolve this deadlock for all cases we need to keep this change in spite of the xattr unlink performance penalty associated with it. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1176 Issue #457
* Fix 'zfs rollback' on mounted file systemsBrian Behlendorf2013-01-175-41/+131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rolling back a mounted filesystem with open file handles and cached dentries+inodes never worked properly in ZoL. The major issue was that Linux provides no easy mechanism for modules to invalidate the inode cache for a file system. Because of this it was possible that an inode from the previous filesystem would not get properly dropped from the cache during rolling back. Then a new inode with the same inode number would be create and collide with the existing cached inode. Ideally this would trigger an VERIFY() but in practice the error wasn't handled and it would just NULL reference. Luckily, this issue can be resolved by sprucing up the existing Solaris zfs_rezget() functionality for the Linux VFS. The way it works now is that when a file system is rolled back all the cached inodes will be traversed and refetched from disk. If a version of the cached inode exists on disk the in-core copy will be updated accordingly. If there is no match for that object on disk it will be unhashed from the inode cache and marked as stale. This will effectively make the inode unfindable for lookups allowing the inode number to be immediately recycled. The inode will then only be accessible from the cached dentries. Subsequent dentry lookups which reference a stale inode will result in the dentry being invalidated. Once invalidated the dentry will drop its reference on the inode allowing it to be safely pruned from the cache. Special care is taken for negative dentries since they do not reference any inode. These dentires will be invalidate based on when they were added to the dentry cache. Entries added before the last rollback will be invalidate to prevent them from masking real files in the dataset. Two nice side effects of this fix are: * Removes the dependency on spl_invalidate_inodes(), it can now be safely removed from the SPL when we choose to do so. * zfs_znode_alloc() no longer requires a dentry to be passed. This effectively reverts this portition of the code to its upstream counterpart. The dentry is not instantiated more correctly in the Linux ZPL layer. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #795
* Fix false ENOENT on snapshot control dentriesNed Bass2013-01-161-58/+66
| | | | | | | | | | | | | | | | | | | | | | Lookups in the snapshot control directory for an existing snapshot fail with ENOENT if an earlier lookup failed before the snapshot was created. This is because the earlier lookup causes a negative dentry to be cached which is never invalidated. The bug can be reproduced as follows (the second ls should succeed): $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory $ zfs snap tank@s $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory To remedy this, always invalidate cached dentries in the snapshot control directory. Since these entries never exist on disk there is no significant performance penalty for the extra lookups. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1192
* Fix quoting error in unmount commandNed Bass2013-01-161-1/+1
| | | | | | | | | A misplaced single quote caused the umount command to fail with a syntax error when unmounting snapshots under the .zfs/snapshot control directory. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1210
* Illumos #3189 kernel panic in test hotspare_onoffline_004_negChristopher Siden2013-01-141-1/+1
| | | | | | | | | | | | | | | 3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Arne Jansen <[email protected]> Approved by: Dan McDonald <[email protected]> References: illumos/illumos-gate@8f0b538d1dc99df23a6a89cfd9ffddc1b9804a00 changeset: 13818:e9ad0a945d45 https://www.illumos.org/issues/3189 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #1862 incremental zfs receive fails for sparse file > 8PBArne Jansen2013-01-141-15/+27
| | | | | | | | | | | | | | | 1862 incremental zfs receive fails for sparse file > 8PB Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Simon Klinkert <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@31495a1e56860f4575614774a592fe33fc9c71f2 illumos changeset: 13789:f0c17d471b7a https://www.illumos.org/issues/1862 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #3208 cross-endian incorrect user/group accountingMatthew Ahrens2013-01-141-9/+25
| | | | | | | | | | | | | | | | | | 3208 moving zpool cross-endian results in incorrect user/group accounting Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@e828a46d29ad418487f50d56b5c19e2a1f9033a7 illumos changeset: 13835:eea81edc4f14 https://www.illumos.org/issues/3208 Ported-by: Brian Behlendorf <[email protected]> Closes #627 Closes #1136
* Illumos #2618 arc.c mistypes in the commentsBart Coddens2013-01-111-4/+4
| | | | | | | | | | | | | | | 2618 arc.c mistypes in the comments Reviewed by: Jason King <[email protected]> Reviewed by: Josef Sipek <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@fc98fea58e89224f6f13d7fae246d6cb5dfa35ea illumos changeset: 13721:5b51a16a186f https://www.illumos.org/issues/2618 Ported-by: Brian Behlendorf <[email protected]>
* call_usermodehelper() should wait for processNed Bass2013-01-092-3/+3
| | | | | | | | | | | | | | | | As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the process). A number of call sites used the number 1 instead of the constant name, so the behavior was not as expected on kernels with this change. One visible consequence of this change was that processes accessing automounted snapshots received an ELOOP error because they failed to wait for zfs.mount to complete. Signed-off-by: Brian Behlendorf <[email protected]> Closes #816