summaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Illumos #3122 zfs destroy filesystem should prefetch blocksMatthew Ahrens2013-07-022-15/+85
| | | | | | | | | | | | | | | 3122 zfs destroy filesystem should prefetch blocks Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: illumos/illumos-gate@b4709335aa83dcbfd0dba33c9be21fcabebd28e4 https://www.illumos.org/issues/3122 Ported-by: Brian Behlendorf <[email protected]> Closes #1565
* Add zfs_sync_pass_* tunable parametersCyril Plisko2013-07-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | Commit 55d85d5a8c45c4559a4a0e675c37b0c3afb19c2f (backport of the upstream changes) replaced three hardcoded constants: #define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass */ #define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */ #define SYNC_PASS_REWRITE 1 /* rewrite new bps after this pass */ with a tunable parameters: int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */ int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */ int zfs_sync_pass_rewrite = 2; /* rewrite new bps starting in this pass */ This commit makes these tunables available as module parameters in Linux. They should only be used for performance analysis because changing them can result in subtle and pathological performance problems. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1562
* Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()Li Dongyang2013-07-022-11/+50
| | | | | | | | | | | | | | | | | | | | | | | | The approach taken was the rework zfs_holey() as little as possible and then just wrap the code as needed to ensure correct locking and error handling. Tested with xfstests 285 and 286. All tests pass except for 7-9 of 285 which try to reserve blocks first via fallocate(2) and fail because fallocate(2) is not yet supported. Note that the filp->f_lock spinlock did not exist prior to Linux 2.6.30, but we avoid the need for autotools check by virtue of the fact that SEEK_DATA/SEEK_HOLE support was not added until Linux 3.1. An autoconf check was added for lseek_execute() which is currently a private function but the expectation is that it will be exported perhaps as early as Linux 3.11. Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1384
* Readd zfs_holey() from OpenSolarisMatthew Ahrens2013-07-021-0/+45
| | | | | | | | | | | This patch restores the zfs_holey() function from OpenSolaris. This was removed by commit 3558fd7 because it wasn't clear we had a use for it in ZoL. However, this functionality is a prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #1384
* kmem_zalloc(..., KM_SLEEP) will never failshenyan12013-07-012-5/+1
| | | | | | | | | By definitition these allocations will never fail. For consistency with the rest of the code remove this dead error handling code. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1558
* Fix zfs_sb_teardown/zfs_resume_fs NULL dereferenceTim Chase2013-07-011-5/+8
| | | | | | | | | | | | | | | Fix a pair of conditions in which a concurrent umount can cause NULL pointer dereferences: * zfs_sb_teardown - prevent a NULL dereference by not calling dmu_objset_pool with a null z_os. * zfs_resume_fs - don't try to unmount with a null z_os. This change makes the ZoL code more consistent with both Illumos and FreeBSD. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1543
* Fix module probe failure on 32-bit systemsYing Zhu2013-06-271-2/+2
| | | | | | | | | | | | | | | | | Previous commit 7ef5e54e2e28884a04dc800657967b891239e933 caused module probe failure on 32-bit systems, dmesg showed Unknown symbol __moddi3 This was caused by the modulo operation 'gethrtime() % tqs->stqs_count' in the committed code. Instead of implementing __moddi3 for all 32-bit systems, Behlendorf advised we can just cast the return value of gethrtime() into a uint64_t, since gethrtime does not return negative value on all circumstances we need not care about the potential overflow. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1551
* Return -EOPNOTSUPP for ZFS_IOC_{GET|SET}FLAGSBrian Behlendorf2013-06-261-0/+29
| | | | | | | | | | Until these hooks are fully implemented return the expected -EOPNOTSUPP error to indicate they are not functional. This allows test suites such as xfstests to cleanly skip testing this functionality until it's implemented. Signed-off-by: Brian Behlendorf <[email protected]> Issue #229
* Register correct handlers in nvlist_alloc()Brian Behlendorf2013-06-202-4/+37
| | | | | | | | | | | | | | | | | | The non-blocking allocation handlers in nvlist_alloc() would be mistakenly assigned if any flags other than KM_SLEEP were passed. This meant that nvlists allocated with KM_PUSHPUSH or other KM_* debug flags were effectively always using atomic allocations. While these failures were unlikely it could lead to assertions because KM_PUSHPAGE allocations in particular are guaranteed to succeed or block. They must never fail. Since the existing API does not allow us to pass allocation flags to the private allocators the cleanest thing to do is to add a KM_PUSHPAGE allocator. Signed-off-by: Brian Behlendorf <[email protected]> Closes zfsonlinux/spl#249
* Illumos #3805 arc shouldn't cache freed blocksMatthew Ahrens2013-06-202-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3805 arc shouldn't cache freed blocks Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Will Andrews <[email protected]> Approved by: Dan McDonald <[email protected]> References: illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2 https://www.illumos.org/issues/3805 ZFS should proactively evict freed blocks from the cache. On dcenter, we saw that we were caching ~256GB of metadata, while the pool only had <4GB of metadata on disk. We were wasting about half the system's RAM (252GB) on blocks that have been freed. Even though these freed blocks will never be used again, and thus will eventually be evicted, this causes us to use memory inefficiently for 2 reasons: 1. A block that is freed has no chance of being accessed again, but will be kept in memory preferentially to a block that was accessed before it (and is thus older) but has not been freed and thus has at least some chance of being accessed again. 2. We partition the ARC into several buckets: user data that has been accessed only once (MRU) metadata that has been accessed only once (MRU) user data that has been accessed more than once (MFU) metadata that has been accessed more than once (MFU) The user data vs metadata split is somewhat arbitrary, and the primary control on how much memory is used to cache data vs metadata is to simply try to keep the proportion the same as it has been in the past (each bucket "evicts against" itself). The secondary control is to evict data before evicting metadata. Because of this bucketing, we may end up with one bucket mostly containing freed blocks that are very old, while another bucket has more recently accessed, still-allocated blocks. Data in the useful bucket (with still-allocated blocks) may be evicted in preference to data in the useless bucket (with old, freed blocks). On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU data bucket was 27GB and the MRU metadata bucket was 256GB. However, the vast majority of data in the MRU metadata bucket (256GB) was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB) was constantly evicting useful blocks that will be soon needed. The problem of cache segmentation is a larger problem that needs more investigation. However, if we stop caching freed blocks, it should reduce the impact of this more fundamental issue. Ported-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1503
* Illumos #3552, #3564George Wilson2013-06-193-94/+290
| | | | | | | | | | | | | | | | | | 3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread 3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing) Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@16a4a8074274d2d7cc408589cf6359f4a378c861 https://www.illumos.org/issues/3552 https://www.illumos.org/issues/3564 Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1513
* Illumos #3006Madhav Suresh2013-06-1929-173/+182
| | | | | | | | | | | | | | | | | | | | 3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument is zero Reviewed by Matt Ahrens <[email protected]> Reviewed by George Wilson <[email protected]> Approved by Eric Schrock <[email protected]> References: illumos/illumos-gate@fb09f5aad449c97fe309678f3f604982b563a96f https://illumos.org/issues/3006 Requires: zfsonlinux/spl@1c6d149feb4033e4a56fb987004edc5d45288bcb Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1509
* Only check directory xattr on ENOENTBrian Behlendorf2013-05-101-1/+1
| | | | | | | | | | | | | | | When SA xattrs are enabled only fallback to checking the directory xattrs when the name is not found as a SA xattr. Otherwise, the SA error which should be returned to the caller is overwritten by the directory xattr errors. Positive return values indicating success will also be immediately returned. In the case of #1437 the ERANGE error was being correctly returned by zpl_xattr_get_sa() only to be overridden with ENOENT which was returned by the subsequent unnessisary call to zpl_xattr_get_dir(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1437
* zfs_scrub_limit tunable is not used anywhereCyril Plisko2013-05-061-6/+0
| | | | | | | | | | | | As a part of scrub/resilver tuning zfs_scrub_limit fell out of use, but the definition of the variable remained in place. Moreover various guides still (misleadingly) mention it as a way to influence resilver/scrub behavior. This commit removes its finally. Signed-off-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1444
* Fix incorrect assertions in ddt_phys_decref and ddt_sync_entryYing Zhu2013-05-061-2/+1
| | | | | | | | | | | The assertions in ddt_phys_decref and ddt_sync_entry cast ddp->ddp_refcnt from uint64_t to int64_t, with a reference count bigger than 2^63, e.g. the reference count of zero blocks commonly available in spare files, we may mistakenly hit these assertations, so drop the type conversions here. Signed-off-by: Ying Zhu <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1436
* Use taskq for dump_bytes()Brian Behlendorf2013-05-062-6/+59
| | | | | | | | | | | | | | | | | | | The vn_rdwr() function performs I/O by calling the vfs_write() or vfs_read() functions. These functions reside just below the system call layer and the expectation is they have almost the entire 8k of stack space to work with. In fact, certain layered configurations such as ext+lvm+md+multipath require the majority of this stack to avoid stack overflows. To avoid this posibility the vn_rdwr() call in dump_bytes() has been moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be performed with the majority of the stack space available. This ends up being very similiar to as if the I/O were issued via sys_write() or sys_read(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1399 Closes #1423
* Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contentionAdam Leventhal2013-05-063-75/+158
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Gordon Ross <[email protected]> Approved by: Richard Lowe <[email protected]> References: illumos/illumos-gate@ec94d32 https://illumos.org/issues/3581 Notes for Linux port: Earlier commit 08d08eb reduced contention on this taskq lock by simply reducing the number of z_fr_iss threads from 100 to one-per-CPU. We also optimized the taskq implementation in zfsonlinux/spl@3c6ed54. These changes significantly improved unlink performance to acceptable levels. This patch further reduces time spent spinning on this lock by randomly dispatching the work items over multiple independent task queues. The Illumos ZFS developers stated that this lock contention only arose after "3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()" was landed. It's not clear if 3329 affects the Linux port or not. I didn't see spa_free_sync_cb() show up in oprofile sessions while unlinking large files, but I may just not have used the right test case. I tested unlinking a 1 TB of data with and without the patch and didn't observe a meaningful difference in elapsed time. However, oprofile showed that the percent time spent in taskq_thread() was reduced from about 16% to about 5%. Aside from a possible slight performance benefit this may be worth landing if only for the sake of maintaining consistency with upstream. Ported-by: Ned Bass <[email protected]> Closes #1327
* Illumos #3329, #3330, #3331, #3335George Wilson2013-05-065-14/+59
| | | | | | | | | | | | | | | | | | | | | | | | 3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb() 3330 space_seg_t should have its own kmem_cache 3331 deferred frees should happen after sync_pass 1 3335 make SYNC_PASS_* constants tunable Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@01f55e48fb4d524eaf70687728aa51b7762e2e97 https://www.illumos.org/issues/3329 https://www.illumos.org/issues/3330 https://www.illumos.org/issues/3331 https://www.illumos.org/issues/3335 Ported-by: Brian Behlendorf <[email protected]>
* Illumos #3306, #3321George Wilson2013-05-033-20/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 3306 zdb should be able to issue reads in parallel 3321 'zpool reopen' command should be documented in the man page and help Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: illumos/illumos-gate@31d7e8fa33fae995f558673adb22641b5aa8b6e1 https://www.illumos.org/issues/3306 https://www.illumos.org/issues/3321 The vdev_file.c implementation in this patch diverges significantly from the upstream version. For consistenty with the vdev_disk.c code the upstream version leverages the Illumos bio interfaces. This makes sense for Illumos but not for ZoL for two reasons. 1) The vdev_disk.c code in ZoL has been rewritten to use the Linux block device interfaces which differ significantly from those in Illumos. Therefore, updating the vdev_file.c to use the Illumos interfaces doesn't get you consistency with vdev_disk.c. 2) Using the upstream patch as is would requiring implementing compatibility code for those Solaris block device interfaces in user and kernel space. That additional complexity could lead to confusion and doesn't buy us anything. For these reasons I've opted to simply move the existing vn_rdwr() as is in to the taskq function. This has the advantage of being low risk and easy to understand. Moving the vn_rdwr() function in to its own taskq thread also neatly avoids the possibility of a stack overflow. Finally, because of the additional work which is being handled by the free taskq the number of threads has been increased. The thread count under Illumos defaults to 100 but was decreased to 2 in commit 08d08e due to contention. We increase it to 8 until the contention can be address by porting Illumos #3581. Ported-by: Brian Behlendorf <[email protected]> Closes #1354
* 3246 ZFS I/O deadman threadGeorge.Wilson2013-05-018-35/+179
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Eric Schrock <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> NOTES: This patch has been reworked from the original in the following ways to accomidate Linux ZFS implementation *) Usage of the cyclic interface was replaced by the delayed taskq interface. This avoids the need to implement new compatibility code and allows us to rely on the existing taskq implementation. *) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h because declaring externs in source files as was done in the original patch is just plain wrong. *) Instead of panicing the system when the deadman triggers a zevent describing the blocked vdev and the first pending I/O is posted. If the panic behavior is desired Linux provides other generic methods to panic the system when threads are observed to hang. *) For reference, to delay zios by 30 seconds for testing you can use zinject as follows: 'zinject -d <vdev> -D30 <pool>' References: illumos/illumos-gate@283b84606b6fc326692c03273de1774e8c122f9a https://www.illumos.org/issues/3246 Ported-by: Brian Behlendorf <[email protected]> Closes #1396
* Fix txg_quiesce thread deadlockBrian Behlendorf2013-04-262-8/+8
| | | | | | | | | | | | | | | | | | A deadlock was accidentally introduced by commit e95853a which can occur when the system is under memory pressure. What happens is that while the txg_quiesce thread is holding the tx->tx_cpu locks it enters memory reclaim. In the context of this memory reclaim it then issues synchronous I/O to a ZVOL swap device. Because the txg_quiesce thread is holding the tx->tx_cpu locks a new txg cannot be opened to handle the I/O. Deadlock. The fix is straight forward. Move the memory allocation outside the critical region where the tx->tx_cpu locks are held. And for good measure change the offending allocation to KM_PUSHPAGE to ensure it never attempts to issue I/O during reclaim. Signed-off-by: Brian Behlendorf <[email protected]> Issue #1274
* Correctly return ERANGE in getxattr(2)Brian Behlendorf2013-04-241-2/+10
| | | | | | | | | | According to the getxattr(2) man page the ERANGE errno should be returned when the size of the value buffer is to small to hold the result. Prior to this patch the implementation would just truncate the value to size bytes. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1408
* Trivial spelling fixChris Dunlop2013-04-191-1/+1
| | | | | Signed-off-by: Brian Behlendorf <[email protected]> Closes #1411
* Remove .readdir from zpl_file_operations tableCaleb James DeLisle2013-04-191-1/+0
| | | | | | | | | | | The zpl_readdir() function shouldn't be registered as part of the zpl_file_operations table, it must only be part of the zpl_dir_file_operations table. By removing this callback the VFS will now correctly return ENOTDIR when calling getdents() on a file. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1404
* Allow setting a lower ashift with -o ashiftMartin Matuska2013-04-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous patches have allowed you to set an increased ashift to avoid doing 512b IO with 4k sector devices. However, it was not possible to set the ashift lower than the reported physical sector size even when a smaller logical size was supported. In practice, there are several cases where settong a lower ashift is useful: * Most modern drives now correctly report their physical sector size as 4k. This causes zfs to correctly default to using a 4k sector size (ashift=12). However, for some usage models this new default ashift value causes an unacceptable increase in space usage. Filesystems with many small files may see the total available space reduced to 30-40% which is unacceptable. * When replacing a drive in an existing pool which was created with ashift=9 a modern 4k sector drive cannot be used. The 'zpool replace' command will issue an error that the new drive has an 'incompatible sector alignment'. However, by allowing the ashift to be manual specified as smaller, non-optimal, value the device may still be safely used. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1381 Closes #1328 Issue #967 Issue #548
* Illumos #3422, #3425George Wilson2013-04-122-10/+5
| | | | | | | | | | | | | | | | | 3422 zpool create/syseventd race yield non-importable pool 3425 first write to a new zvol can fail with EFBIG Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: illumos/illumos-gate@bda8819455defbccd06981d9a13b240b682a3d50 https://www.illumos.org/issues/3422 https://www.illumos.org/issues/3425 Ported-by: Brian Behlendorf <[email protected]> Closes #1390
* gitignore: anchor entries at their respective directoryJan Engelhardt2013-04-021-0/+6
| | | | | | | .ko is specific to module, .m4 to config, etc. Signed-off-by: Jan Engelhardt <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* build: resolve orthographic and other grammatical errorsJan Engelhardt2013-04-021-3/+3
| | | | | Signed-off-by: Jan Engelhardt <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Add zio_ddt_free()+ddt_phys_decref() error handlingBrian Behlendorf2013-03-192-4/+9
| | | | | | | | | | | | | | | | | The assumption in zio_ddt_free() is that ddt_phys_select() must always find a match. However, if that fails due to a damaged DDT or some other reason the code will NULL dereference in ddt_phys_decref(). While this should never happen it has been observed on various platforms. The result is that unless your willing to patch the ZFS code the pool is inaccessible. Therefore, we're choosing to more gracefully handle this case rather than leave it fatal. http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html Signed-off-by: Brian Behlendorf <[email protected]> Closes #1308
* Add metaslab_debug optionBrian Behlendorf2013-03-181-1/+6
| | | | | | | | | Enabling metaslab debugging will prevent space maps from being automatically unloaded. This can significantly increase the memory footprint but being able to dynamically control this is helpful for debugging and certain performance testing. Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.9 compat: Undefine GCC_VERSIONRichard Yao2013-03-061-0/+8
| | | | | | | | | | | | The mainline kernel started defining GCC_VERSION with commit torvalds/linux@3f3f8d2f48acfd8ed3b8e6b7377935da57b27b16. Unfortunately, LZ4 also defines this macro, but the two defintions are incompatible. We undefine GCC_VERSION in lz4.c to handle this. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1339
* Refresh links to web siteNed Bass2013-03-061-1/+1
| | | | | | | | A few files still refer to @behlendorf's private fork on github. Use the primary web site URL instead. Two typos are also corrected. Signed-off-by: Brian Behlendorf <[email protected]>
* Add KMODDIR to install targetBrian Behlendorf2013-03-061-8/+13
| | | | | | | | | | | | | Provide a mechanism to control the directory name the modules are installed in. The kernel privdes INSTALL_MOD_DIR for this but it was hardcoded to be 'addon/zfs'. Add a KMODDIR variable which can be passed to 'make install' to override the default directory name. While we're here change the default from 'addon/zfs' to 'extra' which is the kernel.org default. Signed-off-by: Brian Behlendorf <[email protected]>
* Add snapdev=[hidden|visible] dataset propertyEric Dillmann2013-03-053-3/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new snapdev dataset property may be set to control the visibility of zvol snapshot devices. By default this value is set to 'hidden' which will prevent zvol snapshots from appearing under /dev/zvol/ and /dev/<dataset>/. When set to 'visible' all zvol snapshots for the dataset will be visible. This functionality was largely added because when automatic snapshoting is enabled large numbers of read-only zvol snapshots will be created. When creating these devices the kernel will attempt to read their partition tables, and blkid will attempt to identify any filesystems on those partitions. This leads to a variety of issues: 1) The zvol partition tables will be read in the context of the `modprobe zfs` for automatically imported pools. This is undesirable and should be done asynchronously, but for now reducing the number of visible devices helps. 2) Udev expects to be able to complete its work for a new block devices fairly quickly. When many zvol devices are added at the same time this is no longer be true. It can lead to udev timeouts and missing /dev/zvol links. 3) Simply having lots of devices in /dev/ can be aukward from a management standpoint. Hidding the devices your unlikely to ever use helps with this. Any snapshot device which is needed can be made visible by changing the snapdev property. NOTE: This patch changes the default behavior for zvols which was effectively 'snapdev=visible'. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1235 Closes #945 Issue #956 Issue #756
* Merge zvol.c changes from PSARC 2010/306 Read-only ZFS poolsGeorge Wilson2013-03-041-5/+8
| | | | | | | | | | | | | | | | | | The changes to zvol.c were never merged from the last onnv_147 bulk update. This was because zvol.c was largely rewritten for Linux making it fairly easy to miss these sorts of changes. This causes a regression when importing a zpool with zvols read-only. This does not impact pool which only contain filesystem datasets. References: illumos/illumos-gate@f9af39b Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1332 Closes #1333
* Constify structures containing function pointersRichard Yao2013-03-045-43/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PaX team modified the kernel's modpost to report writeable function pointers as section mismatches because they are potential exploit targets. We could ignore the warnings, but their presence can obscure actual issues. Proper const correctness can also catch programming mistakes. Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2 kernel reports 133 section mismatches prior to this patch. This patch eliminates 130 of them. The quantity of writeable function pointers eliminated by constifying each structure is as follows: vdev_opts_t 52 zil_replay_func_t 24 zio_compress_info_t 24 zio_checksum_info_t 9 space_map_ops_t 7 arc_byteswap_func_t 5 The remaining 3 writeable function pointers cannot be addressed by this patch. 2 of them are in zpl_fs_type. The kernel's sget function requires that this be non-const. The final writeable function pointer is created by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and remove_shrinker() functions also require that this be non-const. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1300
* Fix hot sparesBrian Behlendorf2013-03-011-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue with hot spares in ZoL is because it opens all leaf vdevs exclusively (O_EXCL). On Linux, exclusive opens cause subsequent exclusive opens to fail with EBUSY. This could be resolved by not opening any of the devices exclusively, which is what Illumos does, but the additional protection offered by exclusive opens is desirable. It cleanly prevents you from accidentally adding an in-use non-ZFS device to your pool. To fix this we very slightly relaxed the usage of O_EXCL in the following ways. 1) Functions which open the device but only read had the O_EXCL flag removed and were updated to use O_RDONLY. 2) A common holder was added to the vdev disk code. This allow the ZFS code to internally open the device multiple times but non-ZFS callers may not. 3) An exception was added to make_disks() for hot spare when creating partition tables. For hot spare devices which are already opened exclusively we skip creating the partition table because this must already have been done when the disk was originally added as a hot spare. Additional minor changes include fixing check_in_use() to use a partition instead of a slice suffix. And is_spare() was moved above make_disks() to avoid adding a forward reference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #250
* Remove wholedisk check from vdev_disk_open()Brian Behlendorf2013-02-281-14/+0
| | | | | | | | | As described by the comment and enforced the by assertion the v->vdev_wholedisk will never be -1. The wholedisk handling is performed by the user space utilities. To prevent confusion this dead code is being removed. Signed-off-by: Brian Behlendorf <[email protected]>
* Leaf vdevs should not be reopenedBrian Behlendorf2013-02-281-3/+16
| | | | | | | | | | | | | | | | | | | When vdev_disk.c was implemented for Linux we failed to handle the reopen case. According to the vdev_reopen() comment leaf vdevs should not be closed or opened when v->vdev_reopening is set. Under Linux we would always close and open the device. This issue was only noticed when a 'zpool scrub' command was run while the leaf vdev device names in /dev/disk/by-vdev were missing. The scrub command calls vdev_reopen() which caused the vdevs to be closed but they couldn't be reopened due to the missing links. The result was that all the vdevs were marked unavailable and the pool was halted due to failmode=wait. This patch adds the missing functionality in a similiar fashion to to the Illumos code. Signed-off-by: Brian Behlendorf <[email protected]>
* Remove the bio_empty_barrier() check.Etienne Dechamps2013-02-241-9/+0
| | | | | | | | | | | | | | | | | | | | | | | To determine whether the kernel is capable of handling empty barrier BIOs, we check for the presence of the bio_empty_barrier() macro, which was introduced in 2.6.24. If this macro is defined, then we can flush disk vdevs; if it isn't, then flushing is disabled. Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37, even though the kernel is still capable of handling empty barrier BIOs. As a result, flushing is effectively disabled on kernels >= 2.6.37, meaning that starting from this kernel version, zfs doesn't use barriers to guarantee on-disk data consistency. This is quite bad and can lead to potential data corruption on power failures. This patch fixes the issue by removing the configure check for bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore. Thanks to Richard Kojedzinszky for catching this nasty bug. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1318
* Enable zfs_arc_memory_throttle_disable by defaultBrian Behlendorf2013-02-211-1/+1
| | | | | | | | | | | | | | | | | | | The zfs_arc_memory_throttle_disable module option was introduced by commit 0c5493d47059f25ce9dbf20c9fe87655f55102a1 to resolve a memory miscalculation which could result in the txg_sync thread spinning. When this was first introduced the default behavior was left unchanged until enough real world usage confirmed there were no unexpected issues. We've now reached that point. Linux's direct reclaim is working as expected so we're enabling this behavior by default. This helps pave the way to retire the spl_kmem_availrmem() functionality in the SPL layer. This was the only caller. Signed-off-by: Brian Behlendorf <[email protected]> Issue #938
* Make spa.c assertions catch unsupported pre-feature flag pool versionsRichard Yao2013-02-121-2/+2
| | | | | | | | | | | | | A couple of assertions in spa.c were designed to prevent the use of invalid pool versions. They were written under the assumption that all valid pools are less than SPA_VERSION. Since feature flags jumped from 28 to 5000, any numbers in the range 28 to 5000 non-inclusive will fail to trigger them. We switch to the new SPA_VERSION_IS_SUPPORTED macro to correct this. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1282
* Add explicit MAXNAMELEN checkBrian Behlendorf2013-02-121-0/+3
| | | | | | | | | | | | | It turns out that the Linux VFS doesn't strictly handle all cases where a component path name exceeds MAXNAMELEN. It does however appear to correctly handle MAXPATHLEN for us. The right way to handle this appears to be to add an explicit check to the zpl_lookup() function. Several in-tree filesystems handle this case the same way. Signed-off-by: Brian Behlendorf <[email protected]> Closes #1279
* Switch KM_SLEEP to KM_PUSHPAGENed Bass2013-02-062-3/+3
| | | | | | | | | | | | Two more locations where KM_SLEEP was used in a call which must use KM_PUSHPAGE were found while using the zpool upgrade command. See commit b8d06fc for additional details. Also make a small correction to the comment block above dsl_dir_open_spa(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #1268
* Cast 'zfs bad bloc' to ULL for x86Brian Behlendorf2013-02-041-1/+1
| | | | | | | | | | | Explicitly case this value to an unsigned long long for 32-bit systems to inform the compiler that a long type should not be used. Otherwise we get the following compiler error: dmu_send.c:376: error: integer constant is too large for ‘long’ type Signed-off-by: Brian Behlendorf <[email protected]>
* Add zfs_arc_memory_throttle_disable module optionBrian Behlendorf2013-02-011-0/+7
| | | | | | | | | | | | | | | | | | | | | | The way in which virtual box ab(uses) memory can throw off the free memory calculation in arc_memory_throttle(). The result is the txg_sync thread will effectively spin waiting for memory to be released even though there's lots of memory on the system. To handle this case I'm adding a zfs_arc_memory_throttle_disable module option largely for virtual box users. Setting this option disables free memory checks which allows the txg_sync thread to make progress. By default this option is disabled to preserve the current behavior. However, because Linux supports direct memory reclaim it's doubtful throttling due to perceived memory pressure is ever a good idea. We should enable this option by default once we've done enough real world testing to convince ourselve there aren't any unexpected side effects. Signed-off-by: Brian Behlendorf <[email protected]> Closes #938
* Add zfs_disable_dup_eviction module optionBrian Behlendorf2013-02-011-0/+3
| | | | | | | | Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable. It should have been made available as a module option in the original patch but was overlooked. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix mismatch between SA header size and layoutNed Bass2013-01-311-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a system attribute layout is created an inconsistency may occur between the system attribute header (sa_hdr_phys_t) size and the variable-sized attribute count stored in the layout. The inconsistency results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT returns false: SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab()) ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr, tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) && hdr->sa_layout_info == 0)) failed The bug originates in this snippet from sa_find_sizes(). if (is_var_sz && var_size > 1) { if (P2ROUNDUP(hdrsize + sizeof (uint16_t), *total < full_space) { hdrsize += sizeof (uint16_t); This assumes that the current variable-sized attribute will be stored in the current buffer and accounts for the space needed to store its size in the sa_hdr_phys_t. However if the next attribute spills over we need to store a blkptr_t at the end of the bonus buffer to point to the spill block. If the current attribute is in the way of the blkptr_t then it too will be relocated into the spill block. But since we've already accounted for it in the header size we get the inconsistency described above. To avoid this, record the index of the last variable-sized attribute that prompted a hdrsize increase, and reverse the increase if we later determine that that attribute will be relocated to the spill block. Signed-off-by: Matthew Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1250
* Fix rounding discrepancy in sa_find_sizes()Ned Bass2013-01-311-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A rounding discrepancy exists between how sa_build_layouts() and sa_find_sizes() calculate when the spill block needs to be kicked in. This results in a narrow size range where sa_build_layouts() believes there must be a spill block allocated but due to the discrepancy there isn't. A panic then occurs when the hdl->sa_spill NULL pointer is dereferenced. The following reproducer for this bug was isolated: truncate -s 128m /tmp/tank zpool create tank /tmp/tank zfs create -o xattr=sa tank/fish ln -s `perl -e 'print "z" x 41'` /tank/fish/z setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z This test results in roughly the following system attribute (SA) layout: 176 bytes - "standard" SA's 41 bytes - name of symbolic link target 100 bytes - XDR encoded nvlist for xattr --- 317 bytes - total Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes() decides no spill block is needed. But sa_build_layouts() rounds 41 up to 48 when computing the space requirements so it tries to switch to the spill block. Note that we were only able to reproduce this bug using a combination of symbolic links and the Linux-specific xattr=sa dataset property. So while this issue is not technically Linux-specific, it may be difficult or impossible to hit the narrow size range needed to reproduce it on other platforms. To fix the discrepancy, round the running total in sa_find_sizes() up to an 8-byte boundary before accounting for each SA, since this is how they will be stored in the bonus and (possibly) spill buffers. To make the intent of the code more clear, explicitly assert key assumptions about expected alignment of data and whether spill-over will occur. Signed-off-by: Matthew Ahrens <[email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #1240
* Illumos #3447 improve the comment in txg.cAdam H. Leventhal2013-01-301-2/+71
| | | | | | | | | | | | | | | | 3447 improve the comment in txg.c Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: illumos/illumos-gate@adbbcfface63b3a71922d5a25d34a2018c0435de https://www.illumos.org/issues/3447 Ported-by: Brian Behlendorf <[email protected]>