summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Move udev rules from /etc/udev to /lib/udevKyle Fuller2011-08-0818-30/+45
| | | | | | | | | | | | | | | | | | | This change moves the default install location for the zfs udev rules from /etc/udev/ to /lib/udev/. The correct convention is for rules provided by a package to be installed in /lib/udev/. The /etc/udev/ directory is reserved for custom rules or local overrides. Additionally, this patch cleans up some abuse of the bindir install location by adding a udevdir and udevruledir install directories. This allows us to revert to the default bin install location. The udev install directories can be set with the following new options. --with-udevdir=DIR install udev helpers [EPREFIX/lib/udev] --with-udevruledir=DIR install udev rules [UDEVDIR/rules.d] Signed-off-by: Brian Behlendorf <[email protected]> Closes #356
* Correctly lock pages for .readpages()Brian Behlendorf2011-08-081-54/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike the .readpage() callback which is passed a single locked page to be populated. The .readpages() callback is passed a list of unlocked pages which are all marked for read-ahead (PG_readahead set). It is the responsibly of .readpages() to ensure to pages are properly locked before being populated. Prior to this change the requested read-ahead pages would be updated outside of the page lock which is unsafe. The unlocked pages would then be unlocked again which is harmless but should have been immediately detected as bug. Unfortunately, newer kernels failed detect this issue because the check is done with a VM_BUG_ON which is disabled by default. Luckily, the old Debian Lenny 2.6.26 kernel caught this because it simply uses a BUG_ON. The straight forward fix for this is to update the .readpages() callback to use the read_cache_pages() helper function. The helper function will ensure that each page in the list is properly locked before it is passed to the .readpage() callback. In addition resolving the bug, this results in a nice simplification of the existing code. The downside to this change is that instead of passing one large read request to the dmu multiple smaller ones are submitted. All of these requests however are marked for readahead so the lower layers should issue a large I/O regardless. Thus most of the request should hit the ARC cache. Futher optimization of this code can be done in the future is a perform analysis determines it to be worthwhile. But for the moment, it is preferable that code be correct and understandable. Signed-off-by: Brian Behlendorf <[email protected]> Closes #355
* Add backing_device_info per-filesystemBrian Behlendorf2011-08-0461-0/+255
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a long time now the kernel has been moving away from using the pdflush daemon to write 'old' dirty pages to disk. The primary reason for this is because the pdflush daemon is single threaded and can be a limiting factor for performance. Since pdflush sequentially walks the dirty inode list for each super block any delay in processing can slow down dirty page writeback for all filesystems. The replacement for pdflush is called bdi (backing device info). The bdi system involves creating a per-filesystem control structure each with its own private sets of queues to manage writeback. The advantage is greater parallelism which improves performance and prevents a single filesystem from slowing writeback to the others. For a long time both systems co-existed in the kernel so it wasn't strictly required to implement the bdi scheme. However, as of Linux 2.6.36 kernels the pdflush functionality has been retired. Since ZFS already bypasses the page cache for most I/O this is only an issue for mmap(2) writes which must go through the page cache. Even then adding this missing support for newer kernels was overlooked because there are other mechanisms which can trigger writeback. However, there is one critical case where not implementing the bdi functionality can cause problems. If an application handles a page fault it can enter the balance_dirty_pages() callpath. This will result in the application hanging until the number of dirty pages in the system drops below the dirty ratio. Without a registered backing_device_info for the filesystem the dirty pages will not get written out. Thus the application will hang. As mentioned above this was less of an issue with older kernels because pdflush would eventually write out the dirty pages. This change adds a backing_device_info structure to the zfs_sb_t which is already allocated per-super block. It is then registered when the filesystem mounted and unregistered on unmount. It will not be registered for mounted snapshots which are read-only. This change will result in flush-<pool> thread being dynamically created and destroyed per-mounted filesystem for writeback. Signed-off-by: Brian Behlendorf <[email protected]> Closes #174
* Cleanup mmap(2) writesBrian Behlendorf2011-08-023-111/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While the existing implementation of .writepage()/zpl_putpage() was functional it was not entirely correct. In particular, it would move dirty pages in to a clean state simply after copying them in to the ARC cache. This would result in the pages being lost if the system were to crash enough though the Linux VFS believed them to be safe on stable storage. Since at the moment virtually all I/O, except mmap(2), bypasses the page cache this isn't as bad as it sounds. However, as hopefully start using the page cache more getting this right becomes more important so it's good to improve this now. This patch takes a big step in that direction by updating the code to correctly move dirty pages through a writeback phase before they are marked clean. When a dirty page is copied in to the ARC it will now be set in writeback and a completion callback is registered with the transaction. The page will stay in writeback until the dmu runs the completion callback indicating the page is on stable storage. At this point the page can be safely marked clean. This process is normally entirely asynchronous and will be repeated for every dirty page. This may initially sound inefficient but most of these pages will end up in a few txgs. That means when they are eventually written to disk they should be nicely batched. However, there is room for improvement. It may still be desirable to batch up the pages in to larger writes for the dmu. This would reduce the number of callbacks and small 4k buffer required by the ARC. Finally, if the caller requires that the I/O be done synchronously by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set. Then the I/O will trigger a zil_commit() to flush the data to stable storage. At which point the registered callbacks will be run leaving the date safe of disk and marked clean before returning from .writepage. Signed-off-by: Brian Behlendorf <[email protected]>
* Use libzfs_run_process() in libshare.Gunnar Beutner2011-08-011-59/+29
| | | | | | | | This should simplify the code a bit by re-using existing code to fork and exec a process. Signed-off-by: Brian Behlendorf <[email protected]> Issue #190
* Use /dev/null for stdout/stderr in libzfs_run_process().Gunnar Beutner2011-08-011-3/+10
| | | | | | | | | | | | Simply closing the stdout and/or stderr file descriptors for the child process can have bad side effects if for example the child writes to stdout/stderr after open()ing a file. The open() call might have returned the same file descriptor one would usually expect for stdout/stderr (1 and 2), thereby causing mis-directed writes. Signed-off-by: Brian Behlendorf <[email protected]> Issue #190
* Call exportfs -v once for NFS shares.James H2011-08-011-90/+74
| | | | | | | | | | | | | | | | | | At the moment we call exportfs -v every time we check whether an NFS share is active. This happens every time you run a zfs or zpool command, making them extremely slow when you have a lot of exports. The time taken is approx O(n2) of the number of shares. This commit stores the output from exportfs -v in a temporary file and use this to speed up subsequent accesses. This mechanism is still too slow - if you have tens of thousands of NFS shares it will still be painful running ANY zfs/zpool command. Signed-off-by: Gunnar Beutner <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #341
* Merge branch 'illumos'Brian Behlendorf2011-08-0123-106/+2700
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge in ten upstream fixes which have already been made to both the Illumos and FreeBSD ZFS implementations. This brings us up to date with the latest ZFS changes in Illumos. Credit goes to Martin Matuska of the FreeBSD project for posting an excellent summary of the upstream patches we were missing. Illumos #1313: Integer overflow in txg_delay() Illumos #278: get rid zfs of python and pyzfs dependencies Illumos #1043: Recursive zfs snapshot destroy fails Illumos #883: ZIL reuse during remount corruption Illumos #1092: zfs refratio property Illumos #1051: zfs should handle Illumos #510: 'zfs get' enhancement - mountpoint as an argument Illumos #175: zfs vdev cache consumes excessive memory Illumos #764: panic in zfs:dbuf_sync_list Illumos #xxx: zdb -vvv broken after zfs diff integration Signed-off-by: Brian Behlendorf <[email protected]> Closes #340
| * Illumos #1313: Integer overflow in txg_delay()Martin Matuska2011-08-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function txg_delay() is used to delay txg (transaction group) threads in ZFS. The timeout value for this function is calculated using: int timeout = ddi_get_lbolt() + ticks; Later, the actual wait is performed: while (ddi_get_lbolt() < timeout && tx->tx_syncing_txg < txg-1 && !txg_stalled(dp)) (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock, timeout - ddi_get_lbolt()); The ddi_get_lbolt() function returns current uptime in clock ticks and is typed as clock_t. The clock_t type on 64-bit architectures is int64_t. The "timeout" variable will overflow depending on the tick frequency (e.g. for 1000 it will overflow in 28.855 days). This will make the expression "ddi_get_lbolt() < timeout" always false - txg threads will not be delayed anymore at all. This leads to a slowdown in ZFS writes. The attached patch initializes timeout as clock_t to match the return value of ddi_get_lbolt(). Signed-off-by: Brian Behlendorf <[email protected]> Issue #352
| * Illumos #278: get rid zfs of python and pyzfs dependenciesAlexander Stetsenko2011-08-015-34/+2440
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove all python and pyzfs dependencies for consistency and to ensure full functionality even in a mimimalist environment. Reviewed by: [email protected] Reviewed by: [email protected] Reviewed by: [email protected] Reviewed by: [email protected] Approved by: [email protected] References to Illumos issue and patch: - https://www.illumos.org/issues/278 - https://github.com/illumos/illumos-gate/commit/1af68beac3 Signed-off-by: Brian Behlendorf <[email protected]> Issue #340 Issue #160 Signed-off-by: Brian Behlendorf <[email protected]>
| * Illumos #1043: Recursive zfs snapshot destroy failsMartin Matuska2011-08-011-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to revision 11314 if a user was recursively destroying snapshots of a dataset the target dataset was not required to exist. The zfs_secpolicy_destroy_snaps() function introduced the security check on the target dataset, so since then if the target dataset does not exist, the recursive destroy is not performed. Before 11314, only a delete permission check on the snapshot's master dataset was performed. Steps to reproduce: zfs create pool/a zfs snapshot pool/a@s1 zfs destroy -r pool@s1 Therefore I suggest to fallback to the old security check, if the target snapshot does not exist and continue with the destroy. References to Illumos issue and patch: - https://www.illumos.org/issues/1043 - https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
| * Illumos #883: ZIL reuse during remount corruptionEric Schrock2011-08-012-16/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Moving the zil_free() cleanup to zil_close() prevents this problem from occurring in the first place. There is a very good description of the issue and fix in Illumus #883. Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Albert Lee <[email protected]> Reviewed by: Gordon Ross <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Reivewed by: Dan McDonald <[email protected]> Approved by: Gordon Ross <[email protected]> References to Illumos issue and patch: - https://www.illumos.org/issues/883 - https://github.com/illumos/illumos-gate/commit/c9ba2a43cb Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
| * Illumos #1092: zfs refratio propertyMatt Ahrens2011-08-015-7/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a "REFRATIO" property, which is the compression ratio based on data referenced. For snapshots, this is the same as COMPRESSRATIO, but for filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie, includes blocks in children, but not blocks shared with the origin). This is needed to figure out how much space a filesystem would use if it were not compressed (ignoring snapshots). Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Mark Musante <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Garrett D'Amore <[email protected]> References to Illumos issue and patch: - https://www.illumos.org/issues/1092 - https://github.com/illumos/illumos-gate/commit/187d6ac08a Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
| * Illumos #1051: zfs should handle imbalanced lunsGeorge Wilson2011-08-018-28/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today zfs tries to allocate blocks evenly across all devices. This means when devices are imbalanced zfs will use lots of CPU searching for space on devices which tend to be pretty full. It should instead fail quickly on the full LUNs and move onto devices which have more availability. Reviewed by: Eric Schrock <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Albert Lee <[email protected]> Reviewed by: Gordon Ross <[email protected]> Approved by: Garrett D'Amore <[email protected]> References to Illumos issue and patch: - https://www.illumos.org/issues/510 - https://github.com/illumos/illumos-gate/commit/5ead3ed965 Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
| * Illumos #510: 'zfs get' enhancement - mountpoint as an argumentShampavman2011-08-011-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'zfs get' command should be able to deal with mountpoint as an argument. It already works with 'zfs list' command: # zfs list /export/home/estibi NAME USED AVAIL REFER MOUNTPOINT rpool/export/home/estibi 1.14G 3.86G 1.14G /export/home/estibi but it fails with 'zfs get': # zfs get all /export/home/estibi cannot open '/export/home/estibi': invalid dataset name Reviewed by: Eric Schrock <[email protected]> Reviewed by: Deano <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Garrett D'Amore <[email protected]> References to Illumos issue and patch: - https://www.illumos.org/issues/510 - https://github.com/illumos/illumos-gate/commit/5ead3ed965 Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
| * Illumos #175: zfs vdev cache consumes excessive memoryGarrett D'Amore2011-08-011-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Note that with the current ZFS code, it turns out that the vdev cache is not helpful, and in some cases actually harmful. It is better if we disable this. Once some time has passed, we should actually remove this to simplify the code. For now we just disable it by setting the zfs_vdev_cache_size to zero. Note that Solaris 11 has made these same changes. References to Illumos issue and patch: - https://www.illumos.org/issues/175 - https://github.com/illumos/illumos-gate/commit/b68a40a845 Reviewed by: George Wilson <[email protected]> Reviewed by: Eric Schrock <[email protected]> Approved by: Richard Lowe <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
| * Illumos #764: panic in zfs:dbuf_sync_listGordon Ross2011-08-011-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hypothesis about what's going on here. At some time in the past, something, i.e. dnode_reallocate() calls one of: dbuf_rm_spill(dn, tx); These will do: dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx) dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx) dbuf_undirty(db, tx) Currently dbuf_undirty can leave a spill block in dn_dirty_records[], (it having been put there previously by dbuf_dirty) and free it. Sometime later, dbuf_sync_list trips over this reference to free'd (and typically reused) memory. Also, dbuf_undirty can call dnode_clear_range with a bogus block ID. It needs to test for DMU_SPILL_BLKID, similar to how dnode_clear_range is called in dbuf_dirty(). References to Illumos issue and patch: - https://www.illumos.org/issues/764 - https://github.com/illumos/illumos-gate/commit/3f2366c2bb Reviewed by: George Wilson <[email protected]> Reviewed by: [email protected] Reviewed by: Albert Lee <[email protected] Approved by: Garrett D'Amore <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
| * Illumos #xxx: zdb -vvv broken after zfs diff integrationTim Haley2011-08-011-14/+14
|/ | | | | | | | References to Illumos issue and patch: - https://github.com/illumos/illumos-gate/commit/163eb7ff Signed-off-by: Brian Behlendorf <[email protected]> Issue #340
* Add .gitignore for zfs.<distro> init scriptsBrian Behlendorf2011-08-011-0/+6
| | | | | | Treat the automatically generated zfs.<distro> init scripts as build products by adding them to a directory specific .gitignore file.
* Turn the init.d scripts into autoconf config filesKyle Fuller2011-08-019-35/+73
| | | | | | | | | | This change ensures the paths used by the provided init scripts always reference the prefixes provided at configure time. The @sbindir@ and @sysconfdir@ prefixes will be correctly replaced at build time. Signed-off-by: Brian Behlendorf <[email protected]> Closes #336
* Wrap dracut scripts to 79 charsZachary Bedell2011-07-313-10/+14
| | | | | Signed-off-by: Brian Behlendorf <[email protected]> Closes #347
* Make autogen.sh executableKyle Fuller2011-07-261-0/+0
| | | | Signed-off-by: Brian Behlendorf <[email protected]>
* Catch return errors from zpool commandsZachary Bedell2011-07-251-2/+3
| | | | | | | | | | | | | | | | This fixes a bug that can effect first reboot after install using Dracut. The Dracut module didn't check the return value from several calls to z* functions. This resulted in "Using no pools available as root" on boot if the ZFS module didn't auto-import the pools. It's most likely to happen on initial restart after a fresh install & requires juggling in the Dracut emergency holographic shell to fix. This patch checks return codes & output from zpool list and related functions and correctly falls into the explicit zpool import code branch if the module didn't import the pool at load. Signed-off-by: Brian Behlendorf <[email protected]>
* Soft to hard tabsZachary Bedell2011-07-253-114/+114
| | | | | | | For consistency with the upstream sources and the rest of the project use hard instead of soft tabs. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix txg_sync_thread deadlockBrian Behlendorf2011-07-221-2/+2
| | | | | | | | | | | | Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE. Because these functions are called from txg_sync_thread we must ensure they don't reenter the zfs filesystem code via the .writepage callback. This would result in a deadlock. This deadlock is rare and has only been observed once under an abusive mmap() write workload. Signed-off-by: Brian Behlendorf <[email protected]>
* Add missing <pool> optionBrian Behlendorf2011-07-221-1/+1
| | | | | | | The bootfs example in the dracut documentation was sightly incorrect because it lacked the trailing required pool argument. Add it. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix the configure CONFIG_* option detectionBrian Behlendorf2011-07-222-15/+5
| | | | | | | | | | | | | | | The latest kernels no longer define AUTOCONF_INCLUDED which was being used to detect the new style autoconf.h kernel configure options. This results in the CONFIG_* checks always failing incorrectly for newer kernels. The fix for this is a simplification of the testing method. Rather than attempting to explicitly include to renamed config header. It is simpler to unconditionally include <linux/module.h> which must pick up the correctly named header. Signed-off-by: Brian Behlendorf <[email protected]> Closes #320
* Use zfs_mknode() to create dataset rootBrian Behlendorf2011-07-201-31/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | Long, long, long ago when the effort to port ZFS was begun the zfs_create_fs() function was heavily modified to remove all of its VFS dependencies. This allowed Lustre to use the dataset without us having to spend the time porting all the required VFS code. Fast-forward several years and we now have all the VFS code in place but are still relying on the modified zfs_create_fs(). This isn't required anymore and we can now use zfs_mknode() to create the root znode for the filesystem. This commit reverts the contents of zfs_create_fs() to largely match the upstream OpenSolaris code. There have been minor modifications to accomidate the Linux VFS but that is all. This code fixes issue #116 by bootstraping enough of the VFS data structures so we can rely on zfs_mknode() to create the root directory. This ensures it is created properly with support for system attributes. Previously it wasn't which is why it behaved differently that all other directories when modified. Signed-off-by: Brian Behlendorf <[email protected]> Closes #116
* Honor setgit bit on directoriesBrian Behlendorf2011-07-201-20/+22
| | | | | | | | | | | | | | | Newly created files were always being created with the fsuid/fsgid in the current users credentials. This is correct except in the case when the parent directory sets the 'setgit' bit. In this case according to posix the newly created file/directory should inherit the gid of the parent directory. Additionally, in the case of a subdirectory it should also inherit the 'setgit' bit. Finally, this commit performs a little cleanup of the vattr_t initialization by moving it to a common helper function. Signed-off-by: Brian Behlendorf <[email protected]> Closes #262
* Fix 'make install' overly broad 'rm'Brian Behlendorf2011-07-201-1/+5
| | | | | | | | | | | | | | | | | | | When running 'make install' without DESTDIR set the module install rules would mistakenly destroy the 'modules.*' files for ALL of your installed kernels. This could lead to a non-functional system for the alternate kernels because 'depmod -a' will only be run for the kernel which was compiled against. This issue would not impact anyone using the 'make <deb|rpm|pkg>' build targets to build and install packages. The fix for this issue is to only remove extraneous build products when DESTDIR is set. This almost exclusively indicates we are building packages and installed the build products in to a temporary staging location. Additionally, limit the removal the unneeded build products to the target kernel version. Signed-off-by: Brian Behlendorf <[email protected]> Closes #328
* Fix zpl_writepage() deadlockBrian Behlendorf2011-07-191-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disable the normal reclaim path for zpl_putpage(). This ensures that all memory allocations under this call path will never enter direct reclaim. If this were to happen the VM might try to write out additional pages by calling zpl_putpage() again resulting in a deadlock. This sitution is typically handled in Linux by marking each offending allocation GFP_NOFS. However, since much of the code used is common it makes more sense to use PF_MEMALLOC to flag the entire call tree. Alternately, the code could be updated to pass the needed allocation flags but that's a more invasive change. The following example of the above described deadlock was triggered by test 074 in the xfstest suite. Call Trace: [<ffffffff814dcdb2>] down_write+0x32/0x40 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs] [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs] [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs] [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs] [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl] [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl] [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs] [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs] [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs] [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs] [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs] [<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs] [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs] [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs] [<ffffffff81122521>] do_writepages+0x21/0x40 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310 [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <[email protected]> Closes #327
* Fix zio_execute() deadlockBrian Behlendorf2011-07-191-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To avoid deadlocking the system it is crucial that all memory allocations performed in the zio_execute() call path are marked KM_PUSHPAGE (GFP_NOFS). This ensures that while a z_wr_iss thread is processing the syncing transaction group it does not re-enter the filesystem code and deadlock on itself. Call Trace: [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl] [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs] [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs] [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs] [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff81159932>] kmem_getpages+0x62/0x170 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160 [<ffffffff8115b059>] __kmalloc+0x189/0x220 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl] [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs] [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs] [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs] [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs] [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs] [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs] [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs] [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs] [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl] [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <[email protected]> Issue #291
* Fix mmap(2)/write(2)/read(2) deadlockBrian Behlendorf2011-07-191-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When modifing overlapping regions of a file using mmap(2) and write(2)/read(2) it is possible to deadlock due to a lock inversion. The zfs_write() and zfs_read() hooks first take the zfs range lock and then lock the individual pages. Conversely, when using mmap'ed I/O the zpl_writepage() hook is called with the individual page locks already taken and then zfs_putpage() takes the zfs range lock. The most straight forward fix is to simply not take the zfs range lock in the mmap(2) case. The individual pages will still be locked thus serializing access. Updating the same region of a file with write(2) and mmap(2) has always been a dodgy thing to do. This change at a minimum ensures we don't deadlock and is consistent with the existing Linux semantics enforced by the VFS. This isn't an issue under Solaris because the only range locking performed will be with the zfs range locks. It's up to each filesystem to perform its own file locking. Under Linux the VFS provides many of these services. It may be possible/desirable at a latter date to entirely dump the existing zfs range locking and rely on the Linux VFS page locks. However, for now its safest to perform both layers of locking until zfs is more tightly integrated with the page cache. Signed-off-by: Brian Behlendorf <[email protected]> Issue #302
* Update 'zpool import' man pageBrian Behlendorf2011-07-191-5/+78
| | | | | | | | | | | | | | The following supported options were missing from the zpool.8 man page. The OpenSolaris man pages originally used were simply out of date with the code. zpool import -F Recovery mode -m Allow missing log devices -N Import but don't mount -n Determine if recoverable but don't do it Signed-off-by: Brian Behlendorf <[email protected]>
* Fix send/recv 'dataset is busy' errorsBrian Behlendorf2011-07-151-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | This commit fixes a regression which was accidentally introduced by the Linux 2.6.39 compatibility chanages. As part of these changes instead of holding an active reference on the namepsace (which is no longer posible) a reference is taken on the super block. This reference ensures the super block remains valid while it is in use. To handle the unlikely race condition of the filesystem being unmounted concurrently with the start of a 'zfs send/recv' the code was updated to only take the super block reference when there was an existing reference. This indicates that the filesystem is active and in use. Unfortunately, in the 'zfs recv' case this is not the case. The newly created dataset will not have a super block without an active reference which results in the 'dataset is busy' error. The most straight forward fix for this is to simply update the code to always take the reference even when it's zero. This may expose us to very very unlikely concurrent umount/send/recv case but the consequences of that are minor. Closes #319
* Provide a rc.d script for archlinuxzfs-0.6.0-rc5Kyle Fuller2011-07-1158-18/+157
| | | | | | | | | | | Unlike most other Linux distributions archlinux installs its init scripts in /etc/rc.d insead of /etc/init.d. This commit provides an archlinux rc.d script for zfs and extends the build infrastructure to ensure it get's installed in the correct place. Signed-off-by: Brian Behlendorf <[email protected]> Closes #322
* Improve fstat(2) performanceBrian Behlendorf2011-07-113-27/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is at most a factor of 3x performance improvement to be had by using the Linux generic_fillattr() helper. However, to use it safely we need to ensure the values in a cached inode are kept rigerously up to date. Unfortunately, this isn't the case for the blksize, blocks, and atime fields. At the moment the authoritative values are still stored in the znode. This patch introduces an optimized zfs_getattr_fast() call. The idea is to use the up to date values from the inode and the blksize, block, and atime fields from the znode. At some latter date we should be able to strictly use the inode values and further improve performance. The remaining overhead in the zfs_getattr_fast() call can be attributed to having to take the znode mutex. This overhead is unavoidable until the inode is kept strictly up to date. The the careful reader will notice the we do not use the customary ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to ensure the filesystem is not torn down in the middle of an operation. However, in this case the VFS is holding a reference on the active inode so we know this is impossible. =================== Performance Tests ======================== This test calls the fstat(2) system call 10,000,000 times on an open file description in a tight loop. The test results show the zfs stat(2) performance is now only 22% slower than ext4. This is a 2.5x improvement and there is a clear long term plan to get to parity with ext4. filesystem | test-1 test-2 test-3 | average | times-ext4 --------------+-------------------------+---------+----------- ext4 | 7.785s 7.899s 7.284s | 7.656s | 1.000x zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x zfs-faststat | 9.224s 9.398s 9.485s | 9.369s | 1.223x The second test is to run 'du' of a copy of the /usr tree which contains 110514 files. The test is run multiple times both using both a cold cache (/proc/sys/vm/drop_caches) and a hot cache. As expected this change signigicantly improved the zfs hot cache performance and doesn't quite bring zfs to parity with ext4. A little surprisingly the zfs cold cache performance is better than ext4. This can probably be attributed to the zfs allocation policy of co-locating all the meta data on disk which minimizes seek times. By default the ext4 allocator will spread the data over the entire disk only co-locating each directory. filesystem | cold | hot --------------+---------+-------- ext4 | 13.318s | 1.040s zfs-0.6.0-rc4 | 4.982s | 1.762s zfs-faststat | 4.933s | 1.345s
* Add L2ARC tunablesBrian Behlendorf2011-07-081-8/+32
| | | | | | | | | | | | | | | | | The performance of the L2ARC can be tweaked by a number of tunables, which may be necessary for different workloads: l2arc_write_max max write bytes per interval l2arc_write_boost extra write bytes during device warmup l2arc_noprefetch skip caching prefetched buffers l2arc_headroom number of max device writes to precache l2arc_feed_secs seconds between L2ARC writing l2arc_feed_min_ms min feed interval in milliseconds l2arc_feed_again turbo L2ARC warmup l2arc_norw no reads during writes Signed-off-by: Brian Behlendorf <[email protected]> Closes #316
* Update 'zfs send' documentationBrian Behlendorf2011-07-082-5/+29
| | | | | | | The -D and -p options were missing from the manpage. This commit adds documentation for these features. Closes #311
* Remove zfs service only on uninstall, not on upgradeFajar A. Nugraha2011-07-081-1/+1
| | | | | | | | | | | | | | | | | | | This caused problems on upgrade using RPM: * The new version will run chkconfig --add, which has no effect since the service was already added. * The old version will run chkconfig --del, which caused zfs service removal. Only run "chkconfig --del" on complete uninstall, by checking the value of "$1" to %preun, which will be "0" on uninstall, and "1" on upgrade. http://www.rpm.org/max-rpm/s1-rpm-inside-scripts.html Signed-off-by: Brian Behlendorf <[email protected]> Closes #314
* Check for "udevadm settle" vs "udevsettle"Fajar A. Nugraha2011-07-081-1/+5
| | | | | | | RHEL5 does not have udevadm, so fix initscript accordingly Signed-off-by: Brian Behlendorf <[email protected]> Closes #315
* Update ztest pathsBrian Behlendorf2011-07-061-1/+3
| | | | | | | Unfortunately, ztest is hard coded to export the zdb utility to be installed in a certain location. When the packaging was updated to install zdb in /sbin/ ztest was broken. To fix this I'm updating ztest to check both common install paths.
* Add proper library versioningBrian Behlendorf2011-07-0616-13/+43
| | | | | | | | The zfs libraries were never properly versioned. Since the API has remained static for quite some time this we never an issue. However, going forward they should be versioned. This commit versions all of the libraries to 1.0.0. From here on out this version must be updated to reflect changes to the library.
* Updated init scripts to enable automatic sharing of ZFS datasets.Gunnar Beutner2011-07-065-0/+25
| | | | | | | The relevant init scripts were updated so as to automatically share ZFS datasets using "zfs share -a" at boot time. Signed-off-by: Brian Behlendorf <[email protected]>
* Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE.Gunnar Beutner2011-07-062-12/+12
| | | | | | | | | | The remaining code that is guarded by HAVE_SHARE ifdefs is related to the .zfs/shares functionality which is currently not available on Linux. On Solaris the .zfs/shares directory can be used to set permissions for SMB shares. Signed-off-by: Brian Behlendorf <[email protected]>
* Link libshare directly to libzfsGunnar Beutner2011-07-063-157/+18
| | | | | | | | | | | | | Drop usage of dlopen/dlsym for libshare. There is no need to do this because the zfs packages provide libshare. Unlike on Solaris we are guaranteed it will be available. This avoids possible problems with hardcoding the libshare path in the code (e.g. when users specify a different install path via configure options). It additionally simplifies the code which is good for maintainability. Signed-off-by: Brian Behlendorf <[email protected]>
* Implemented sharing datasets via NFS using libshare.Gunnar Beutner2011-07-0613-167/+2380
| | | | | | | | The sharenfs and sharesmb properties depend on the libshare library to export datasets via NFS and SMB. This commit implements the base libshare functionality as well as support for managing NFS shares. Signed-off-by: Brian Behlendorf <[email protected]>
* Document initramfs processZachary Bedell2011-07-061-0/+137
| | | | | | | Add documentation for Dracut and the initramfs process. This includes detailing the basic boot process and options available. Signed-off-by: Brian Behlendorf <[email protected]>
* Update for Dracut-010Zachary Bedell2011-07-0610-87/+159
| | | | | | | | Update Dracut module for Dracut-010 and fix race conditions that caused boot to fail on MP systems. Add support for zfs_force flag and parsing of spl_hostid from kernel command line. Signed-off-by: Brian Behlendorf <[email protected]>
* Update zfs.gentoo/zfs.lsb init scriptZachary Bedell2011-07-062-11/+11
| | | | | | | | | * Update paths to zpool/zfs tools, * Log less for non-error conditions, * Don't be fatal if umount fails at shutdown -- final init remount will take care of it if /usr or / are in use Signed-off-by: Brian Behlendorf <[email protected]>