Commit log (newest first); each entry shows the author, commit date, and files/lines changed.

* Fix duplicate words in zpool.8  (Dominik Honnef, 2013-01-07; 1 file, -1/+1)

    Remove the duplicate words 'cannot be' from the zpool.8 man page.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1177

* Allow fake mounts to succeed on non-legacy filesystems.  (Will Rouesnel, 2013-01-07; 1 file, -1/+2)

    mountall in Debian depends on being able to pass the -f parameter to
    mount, which specifies a fake mount and just updates the mtab.
    Currently mount.zfs will fail such a request if it is not passed with
    -o zfsutil. This patch allows a fake mount on a non-legacy filesystem
    to succeed in the same manner as a -o remount does, thus enabling
    mountall to work correctly.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1167

* Fix gcc array subscript above bounds warning  (Ned Bass, 2013-01-07; 1 file, -3/+4)

    In a debug build, certain GCC versions flag an array bounds warning in
    the below code from dnode_sync.c:

        } else {
                int i;
                ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
                /* the blkptrs we are losing better be unallocated */
                for (i = dn->dn_next_nblkptr[txgoff];
                    i < dnp->dn_nblkptr; i++)
                        ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));

    This usage is in fact safe, since the ASSERT ensures the index does
    not exceed the maximum possible number of block pointers. However gcc
    can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
    falls within the array bounds, so it issues a warning. To avoid this,
    initialize i to zero to make gcc happy but skip the elements before
    dn->dn_next_nblkptr[txgoff] in the loop body. Since a dnode contains
    at most 3 block pointers this overhead should be negligible.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #950

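    For reference, a minimal sketch of the reworked loop using the same
    variable names as the excerpt above (illustrative only, not the exact
    committed diff):

        } else {
                int i;
                ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
                /* start at zero so gcc can prove i stays in bounds */
                for (i = 0; i < dnp->dn_nblkptr; i++) {
                        if (i < dn->dn_next_nblkptr[txgoff])
                                continue;       /* still-allocated blkptrs */
                        /* the blkptrs we are losing better be unallocated */
                        ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
                }
        }
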
* Merge branch 'io_schedule'  (Brian Behlendorf, 2013-01-07; 2 files, -22/+11)

    Currently ZFS doesn't show any I/O time in e.g. "top" wait% or in
    /proc/$pid/stat's blkio_ticks. Using io_schedule() instead of
    schedule() in zio_wait()'s cv_wait() is the correct way to fix this.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1158
    Closes #1175

| * Use cv_wait_io() which will account for iowait  (Matt Johnston, 2013-01-07; 2 files, -1/+2)

    Update zio_wait() to use cv_wait_io() to ensure the iowait time is
    properly accounted for.

    Signed-off-by: Brian Behlendorf <[email protected]>

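    The shape of the change in zio_wait() is roughly the following
    (simplified sketch; the surrounding setup and error handling are
    omitted):

        mutex_enter(&zio->io_lock);
        while (zio->io_executor != NULL)
                cv_wait_io(&zio->io_cv, &zio->io_lock);  /* was cv_wait() */
        mutex_exit(&zio->io_lock);
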
| * Revert part of "Log I/Os longer than zio_delay_max (30s default)"  (Matt Johnston, 2013-01-07; 1 file, -22/+10)

    This reverts commit 9dcb97198338ba2d8764dd5604b278118612f74 which was
    originally introduced to debug occasional slow I/Os. These I/Os would
    complete eventually but were observed to take several hundred seconds.

    The root cause of this issue was the CFQ scheduler which can, under
    certain conditions, excessively delay an I/O from being issued to the
    device. This issue was mitigated somewhat by commit
    84daaddedbfc9cf4bd1490d8a6f4b2967051e308 which ensures the I/O
    elevator gets changed even for DM style devices.

    This change isn't in any way harmful, but it does conflict with a
    required change to properly account for I/O wait time. Because Linux
    does not export the io_schedule_timeout() function we must instead
    rely on io_schedule() via cv_wait_io(). The additional debugging
    information which was added to the delay event has been intentionally
    left in place.

    Signed-off-by: Brian Behlendorf <[email protected]>

* ZFS 0.6.0-rc13  [tag: zfs-0.6.0-rc13]  (Brian Behlendorf, 2012-12-20; 1 file, -1/+1)

* Fix zpool on zvol lock inversion deadlock  (Brian Behlendorf, 2012-12-20; 1 file, -0/+28)

    In all but one case the spa_namespace_lock is taken before the
    bdev->bd_mutex lock. But the Linux __blkdev_get() function calls
    fops->open() with the bdev->bd_mutex lock held and we must somehow
    still safely acquire the spa_namespace_lock.

    To avoid a potential lock inversion deadlock we preemptively try to
    take the spa_namespace_lock. Normally it will not be contended and
    this is safe because spa_open_common() handles the case where the
    caller already holds the spa_namespace_lock.

    When it is contended we risk a lock inversion if we were to block
    waiting for the lock. Luckily, the __blkdev_get() function allows us
    to return -ERESTARTSYS which will result in bdev->bd_mutex being
    dropped, reacquired, and fops->open() being called again. This
    process can be repeated safely until both locks are acquired.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Jorgen Lundman <[email protected]>
    Closes #612

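    A simplified sketch of that retry pattern in the zvol open path (the
    zvol_first_open() call stands in for the fuller open logic):

        static int
        zvol_open(struct block_device *bdev, fmode_t flag)
        {
                zvol_state_t *zv = bdev->bd_disk->private_data;
                int error = 0, drop_namespace = 0;

                /*
                 * __blkdev_get() calls fops->open() with bdev->bd_mutex
                 * held.  Never block on the spa_namespace_lock here; if
                 * it is contended return -ERESTARTSYS so bd_mutex is
                 * dropped and the open is retried.
                 */
                if (!mutex_owned(&spa_namespace_lock)) {
                        if (!mutex_tryenter(&spa_namespace_lock))
                                return (-ERESTARTSYS);
                        drop_namespace = 1;
                }

                error = zvol_first_open(zv);

                if (drop_namespace)
                        mutex_exit(&spa_namespace_lock);

                return (error);
        }
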
* Revert "Remove TSD zfs_fsyncer_key"Brian Behlendorf2012-12-204-1/+16
| | | | | | | | This reverts commit 31f2b5abdf95d8426d8bfd66ca7f62ec70215e3c back to the original code until the fsync(2) performance regression can be addressed. Signed-off-by: Brian Behlendorf <[email protected]>
* Refresh AUTHORS  (Brian Behlendorf, 2012-12-19; 1 file, -10/+66)

    The AUTHORS file was getting stale. Refresh its contents using the
    authors listed in the git commit logs.

    Signed-off-by: Brian Behlendorf <[email protected]>

* Remove the ChangeLog  (Brian Behlendorf, 2012-12-19; 2 files, -577/+1)

    The ChangeLog was retired long ago; the git commit logs are
    authoritative. To avoid any confusion remove the ChangeLog.

    Signed-off-by: Brian Behlendorf <[email protected]>

* Remove TSD zfs_fsyncer_key  (Brian Behlendorf, 2012-12-19; 4 files, -16/+1)

    It's my understanding that the zfs_fsyncer_key TSD was added as a
    performance optimization to reduce contention on the zl_lock from
    zil_commit(). This issue manifested itself as very long (100+ ms)
    fsync() system call times for fsync() heavy workloads. However, under
    Linux I'm not seeing the same contention that was originally
    described. Therefore, I'm removing this code in order to wean
    ourselves off any dependence on TSD. If the original performance
    issue reappears on Linux we can revisit fixing it without resorting
    to TSD.

    This just leaves one small ZFS TSD consumer. If it can be cleanly
    removed from the code we'll be able to shed the SPL TSD
    implementation entirely.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes zfsonlinux/spl#174

* Set elevator for DM devices despite vdev_wholedisk  (Prakash Surya, 2012-12-18; 1 file, -2/+9)

    The current state of udev and device-mapper devices makes it
    difficult to construct a mapping of DM partitions and their
    underlying DM device. For example, with a /dev directory with the
    following contents:

        $ ls -d /dev/dm-*
        /dev/dm-0  /dev/dm-1  /dev/dm-2  /dev/dm-3

    it is not immediately apparent if these are completely separate
    devices, or partitions and real devices intermixed. In contrast, SCSI
    devices would appear as so:

        $ ls -d /dev/sd*
        /dev/sda  /dev/sda1  /dev/sdb  /dev/sdb1

    Here, one can immediately determine that there are two devices (sda
    and sdb), each containing a single partition. The lack of a
    predictable and consistent mapping from DM devices to DM device
    partitions makes it difficult for user space to process these devices
    the same way it does SCSI devices.

    As a result, the ZFS utilities do not partition DM devices, and
    instead set the "vdev_wholedisk" label to 0 and treat them as
    partitions. This has the side effect that, even if ZFS has sole
    ownership of the device, the IO scheduler will not be modified
    because it is treated as a partition.

    This change adds an exception for DM devices in vdev_elevator_switch,
    allowing the elevator to be modified even though the "vdev_wholedisk"
    property is not set.

    Signed-off-by: Prakash Surya <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1149

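    The exception boils down to a name check along these lines (a sketch,
    assuming the underlying device name such as "dm-0" is available at
    that point in vdev_elevator_switch):

        /*
         * Device-mapper devices are named dm-[0-9]+ and are never
         * partitioned by ZFS, so treat them as whole disks for the
         * purpose of switching the elevator even when vdev_wholedisk
         * is not set.
         */
        if (!v->vdev_wholedisk && strncmp(device, "dm-", 3) != 0)
                return (0);
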
* Fix using zvol as slog device  (Jorgen Lundman, 2012-12-18; 4 files, -14/+31)

    During the original ZoL port the vdev_uses_zvols() function was
    disabled until it could be properly implemented. This prevented a
    zpool from using a zvol for its slog device.

    This patch implements that missing functionality by adding a
    zvol_is_zvol() function to zvol.c. Given the full path to a device it
    will look up the device and verify its major number against the
    registered zvol major number for the system. If they match we know
    the device is a zvol.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1131

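    A sketch of what such a major-number check looks like (error handling
    abbreviated; zvol_major is the module's registered major number):

        boolean_t
        zvol_is_zvol(const char *device)
        {
                struct block_device *bdev;
                unsigned int major;

                bdev = lookup_bdev(device);     /* resolve path to a bdev */
                if (IS_ERR(bdev))
                        return (B_FALSE);

                major = MAJOR(bdev->bd_dev);
                bdput(bdev);

                return (major == zvol_major);
        }
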
* Fix get/set users/groups in quota props via numeric id  (Massimo Maggi, 2012-12-17; 1 file, -7/+7)

    Fix setting/getting users/groups in quota properties through a
    numeric identifier. This support was accidentally disabled in the
    original port by applying the HAVE_IDMAP wrapper macro too broadly.
    The fix is obtained by moving #ifdef HAVE_IDMAP to exclude only the
    part of the code that really needs IDMAP. Now zfs (get|set)
    (user|group)quota@1000 works as expected.

    Signed-off-by: Massimo Maggi <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1147

* Do not use KERNEL_DIR env var in Makefile.am  (Richard Yao, 2012-12-17; 1 file, -3/+3)

    A Gentoo user reported an issue where the build system would attempt
    to recurse into the kernel source tree if KERNEL_DIR is set in the
    environment. KERNEL_DIR is an environment variable that is used when
    the kernel sources are in a non-standard location, so it is necessary
    to stop relying on it to prevent this issue.

    https://bugs.gentoo.org/show_bug.cgi?id=433946

    Signed-off-by: Richard Yao <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>

* Update SAs when an inode is dirtied  (Brian Behlendorf, 2012-12-14; 5 files, -23/+96)

    Revert the portion of commit d3aa3ea which always resulted in the SAs
    being updated when an mmap()'ed file was closed. That change
    accidentally resulted in unexpected ctime updates which upset tools
    like git. That was always a horrible hack and I'm happy it will never
    make it in to a tagged release.

    The right fix is something I initially resisted doing because I was
    worried about the additional overhead. However, in hindsight the
    overhead isn't as bad as I feared.

    This patch implements the sops->dirty_inode() callback which is
    unsurprisingly called when an inode is dirtied. We leverage this
    callback to keep the znode SAs strictly in sync with the inode.
    However, for now we're going to go slowly to avoid introducing any
    new unexpected issues by only updating the atime, mtime, and ctime.
    This will cover the callpath of most concern to us.

        ->filemap_page_mkwrite->file_update_time->update_time->
        mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #764
    Closes #1140

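    A condensed sketch of the dirty_inode path described above (names
    follow the ZoL code of that era, e.g. zfs_sb_t and the SA_ZPL_*
    handles; transaction error handling is abbreviated):

        static void
        zfs_dirty_inode(struct inode *ip)
        {
                znode_t *zp = ITOZ(ip);
                zfs_sb_t *zsb = ITOZSB(ip);
                dmu_tx_t *tx;
                uint64_t atime[2], mtime[2], ctime[2];
                sa_bulk_attr_t bulk[3];
                int cnt = 0;

                tx = dmu_tx_create(zsb->z_os);
                dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
                if (dmu_tx_assign(tx, TXG_WAIT) != 0) {
                        dmu_tx_abort(tx);
                        return;
                }

                /* push only atime, mtime, and ctime for now */
                ZFS_TIME_ENCODE(&ip->i_atime, atime);
                ZFS_TIME_ENCODE(&ip->i_mtime, mtime);
                ZFS_TIME_ENCODE(&ip->i_ctime, ctime);
                SA_ADD_BULK_ATTR(bulk, cnt, SA_ZPL_ATIME(zsb), NULL, &atime, 16);
                SA_ADD_BULK_ATTR(bulk, cnt, SA_ZPL_MTIME(zsb), NULL, &mtime, 16);
                SA_ADD_BULK_ATTR(bulk, cnt, SA_ZPL_CTIME(zsb), NULL, &ctime, 16);
                (void) sa_bulk_update(zp->z_sa_hdl, bulk, cnt, tx);

                dmu_tx_commit(tx);
        }
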
* Update 69-vdev.rules .gitignore  (Brian Behlendorf, 2012-12-14; 1 file, -1/+1)

    Commit 2957f38 renamed 60-vdev.rules to 69-vdev.rules but failed to
    update the .gitignore file to reflect this change.

    Signed-off-by: Brian Behlendorf <[email protected]>

* Avoid ELOOP on auto-mounted snapshots  (Ned Bass, 2012-12-13; 1 file, -0/+7)

    Ensure that the path member pointers are associated with the
    newly-mounted snapshot when zpl_snapdir_automount() returns.
    Otherwise the follow_automount() function may be called repeatedly,
    leading to an incorrect ELOOP error return. This problem was observed
    as a 'Too many levels of symbolic links' error from user-space
    commands accessing an unmounted snapshot in the .zfs/snapshot
    directory.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #816

* Linux 3.7 compat, schedule_delayed_work()  (Brian Behlendorf, 2012-12-12; 2 files, -12/+21)

    Linux kernel commit d8e794d accidentally broke the delayed work APIs
    for non-GPL callers. While the APIs to schedule a delayed work item
    are still available to all callers, it is no longer possible to
    initialize the delayed work item.

    I'm cautiously optimistic we could get the delayed_work_timer_fn
    exported for all callers in the upstream kernel. But frankly the
    compatibility code to use this kernel interface has always been
    problematic.

    Therefore, this patch abandons direct use of the Linux kernel
    interface in favor of the new delayed taskq interface. It provides
    roughly the same functionality as delayed work queues but it's a
    stable interface under our control.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1053

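    With the SPL's delayed taskq interface, scheduling work some ticks in
    the future looks roughly like this (my_func, my_arg, and delay_ticks
    are placeholders, not names from the commit):

        taskqid_t id;

        /* dispatch my_func(my_arg) no earlier than delay_ticks from now */
        id = taskq_dispatch_delay(system_taskq, my_func, my_arg, TQ_SLEEP,
            ddi_get_lbolt() + delay_ticks);
        if (id == 0) {
                /* dispatch failed; fall back or retry as appropriate */
        }
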
* Switch KM_SLEEP to KM_PUSHPAGE  (Richard Yao, 2012-12-10; 1 file, -1/+1)

    When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
    which allocates memory using KM_SLEEP, triggering a warning. Switch
    to KM_PUSHPAGE to silence that warning. See commit
    b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details.

    Signed-off-by: Richard Yao <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1138

* Revert "Fix unlink/xattr deadlock"Brian Behlendorf2012-12-052-90/+55
| | | | | | | | | | | | | | | | | | This reverts commit b00131d43ca344d4b205a03ab3eb771a060e5087 which is no longer needed due to e89260a1c8851ce05ea04b23606ba438b271d890. This change forces all xattr znodes to hold a reference on their parent which ensures prune_icache() will never attempt to evict both the parent and child concurrently. This effectively prevents the deadlock condition from ever occuring. Therefore we can safely revert back to the upstream synchronous cleanup code. This is nice because it keeps our code base closer to upstream and resolves the performance issues introduced by the original deadlock fix. Signed-off-by: Brian Behlendorf <[email protected]> Closes #457
* Preserve inode mtime/ctime in .writepage()  (Brian Behlendorf, 2012-12-05; 1 file, -4/+32)

    When updating a file via mmap()'ed I/O preserve the mtime/ctime which
    were updated when the page was made writable by the generic callback
    filemap_page_mkwrite(). But more importantly than preserving the
    exact time, add the missing call to sa_bulk_update(). This ensures
    that the znode modifications are written to disk as part of the
    transaction. Without this the inode may mistakenly roll back to the
    previous on-disk znode state.

    Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
    and ctime on close using the up to date values in the inode. This is
    critical because writepage() may occur after close and on close we
    need to ensure the values are correct.

    Original-patch-by: Richard Yao <[email protected]>
    Signed-off-by: Richard Yao <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #764

* Fix 'zpool create' segfault due to bad syntax  (Jorgen Lundman, 2012-12-04; 1 file, -4/+6)

    Incorrect syntax should never cause a segfault. In this case listing
    multiple comma delimited options after '-o' triggered the problem.
    For example:

        zpool create -o ashift=12,listsnaps=on

    This patch resolves the issue by wrapping the calls which use hdr
    with a NULL test.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1118

* vdev_id support for device link aliases  (Ned Bass, 2012-12-03; 7 files, -150/+261)

    Add a vdev_id feature to map device names based on already defined
    udev device links. To increase the odds that vdev_id will run after
    the rules it depends on, increase the vdev.rules rule number from 60
    to 69. With this change, vdev_id now provides functionality analogous
    to zpool_id and zpool_layout, paving the way to retire those tools.

    A defined alias takes precedence over a topology-derived name, but
    the two naming methods can otherwise coexist. For example, one might
    name drives in a JBOD with the sas_direct topology while naming an
    internal L2ARC device with an alias.

    For example, the following lines in vdev_id.conf will result in the
    creation of links /dev/disk/by-vdev/{d1,d2}, each pointing to the
    same target as the device link specified in the third field.

        # by-vdev
        # name   fully qualified or base name of device link
        alias d1 /dev/disk/by-id/wwn-0x5000c5002de3b9ca
        alias d2 wwn-0x5000c5002def789e

    Also perform some minor vdev_id cleanup, such as removal of the
    unused -s command line option.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #981

* Directory xattr znodes hold a reference on their parent  (Brian Behlendorf, 2012-12-03; 4 files, -29/+58)

    Unlike normal file or directory znodes, an xattr znode is guaranteed
    to only have a single parent. Therefore, we can take a reference on
    that parent if it is provided at create time and cache it.
    Additionally, we take care to cache it on any subsequent
    zfs_zaccess() where the parent is provided as an optimization.

    This allows us to avoid needing to do a zfs_zget() when setting up
    the SELinux security xattr in the create path. This is critical
    because a hash lookup on the directory will deadlock since it is
    locked.

    The zpl_xattr_security_init() call has also been moved up to the zpl
    layer to ensure TXs to create the required xattrs are performed after
    the create TX. Otherwise we run the risk of deadlocking on the open
    create TX.

    Ideally the security xattr should be fully constructed before the new
    inode is unlocked. However, doing so would require far more extensive
    changes to ZFS.

    This change may also have the beneficial side effect of ensuring
    xattr directory znodes are evicted from the cache before normal file
    or directory znodes due to the extra reference.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #671

* Implemented sharing datasets via SMB using libshare  (Turbo Fredriksson, 2012-12-03; 9 files, -13/+731)

    Add the initial support for the 'smbshare' option using the existing
    libshare infrastructure. Because this implementation relies on
    usershares, Samba version 3.0.23 is required.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #493

* Make zpool attach -o ashift=... actually work  (Cyril Plisko, 2012-11-30; 1 file, -1/+1)

    Commit df83110856950c8e7b16a7e94cdf42b8531b9cc8 missed an update to
    the getopt() call, while delivering all the rest. This commit adds
    "o" to getopt().

    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #566

* Add load_nvlist() error handling  (Brian Behlendorf, 2012-11-30; 1 file, -1/+4)

    Add the missing error handling to load_nvlist(). There's no good
    reason this needs to be fatal. All callers of load_nvlist() do
    correctly handle an error condition and it is preferable that an
    error be returned. This will allow 'zpool import -FX' to safely
    attempt to roll back through previous txgs looking for a good one.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1120

* Allow GPT+EFI vdevs for root pools  (Brian Behlendorf, 2012-11-30; 1 file, -0/+10)

    Commit 57a4edd allows the bootfs property to be set on any pool.
    However, many of the zpool commands still prevent you from using EFI
    labeled devices for the root pool. For example:

        # zpool attach rpool /dev/sda /dev/sdb
        cannot label 'sdb': EFI labeled devices are not supported
          on root pools.

    For non-Solaris builds such as Linux disable this error.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1077

* Disable page allocation warnings for super block  (Brian Behlendorf, 2012-11-30; 1 file, -1/+1)

    Due to the slightly increased size of the ZFS super block caused by
    30315d2 there are now allocation warnings. The allocation size is
    still small (just over 8k) and super blocks are rarely allocated so
    we suppress the warning.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #1101

* Verify --with-linux source directory exists  (Brian Behlendorf, 2012-11-29; 1 file, -5/+8)

    Previously this check was only performed when ./configure was
    attempting to autodetect your kernel source directory. But we should
    also handle the case where --with-linux was provided and is obviously
    wrong. This way we catch the error before invoking make and compiling
    the source with incorrect autoconf results.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes zfsonlinux/spl#162

* vdev_id fails to handle complex device topologies  (Cyril Plisko, 2012-11-29; 1 file, -4/+4)

    While expanding positional parameters the shell requires parameters
    beyond the ninth to be enclosed in braces (e.g. ${10} rather than
    $10). When the SAS topology is non-trivial the number of positional
    parameters generated internally by the vdev_id script (using
    set -- ...) easily crosses the single digit limit and vdev_id fails
    to generate links.

    Signed-off-by: Ned Bass <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1119

* Make vdev_id POSIX sh compatible  (Ned Bass, 2012-11-27; 1 file, -16/+23)

    Full bash may not be available in all environments where udev helpers
    run, such as in an initial ramdisk. To avoid breakage in this case,
    remove use of bash-specific features such as variable arrays and the
    `declare' keyword from the vdev_id script.

    Signed-off-by: Ned Bass <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #870

* Fix NULL deref when zvol_alloc() fails  (Brian Behlendorf, 2012-11-27; 1 file, -1/+1)

    If zvol_alloc() fails zv will be set to NULL and dereferenced in
    out_dmu_objset_disown. To avoid this entirely the zv->objset line is
    moved up into the success block.

    Original-patch-by: Jorgen Lundman <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1109

* Increase ZFS_OBJ_MTX_SZ to 256  (Brian Behlendorf, 2012-11-27; 1 file, -1/+1)

    Increasing this limit costs us 6144 bytes of memory per mounted
    filesystem, but this is a small price to pay for accomplishing the
    following:

    * Allows for up to 256-way concurrency when performing lookups which
      helps performance when there are a large number of processes.

    * Minimizes the likelihood of encountering the deadlock described in
      issue #1101. Because vmalloc() won't strictly honor __GFP_FS there
      is still a very remote chance of a deadlock. See the
      zfsonlinux/spl@043f9b57 commit.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1101

* Recreate minors when renaming zvols  (Brian Behlendorf, 2012-11-19; 1 file, -5/+13)

    When a zvol with snapshots is renamed the device files under
    /dev/zvol/ are not renamed. This patch resolves the problem by
    destroying and recreating the minors with the new name so the links
    can be recreated by udev.

    Original-patch-by: Suman Chakravartula <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #408

* mount.zfs: canonicalize mount point for mtab  (nordaux, 2012-11-15; 1 file, -2/+9)

    Canonicalize the mount point passed to the mount.zfs helper. This way
    a clean path is always added to mtab which ensures the umount can
    properly locate and remove the entry. Test case:

        $ mkdir /mnt/foo
        $ mount -t zfs zpool/foo /mnt/../mnt/foo////
        $ umount /mnt/foo
        $ cat /etc/mtab | grep zpool/foo
        zpool/foo /mnt/../mnt/foo//// zfs rw 0 0

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #573

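    In C, this kind of canonicalization can be done with realpath(3); a
    minimal sketch (the helper name and error handling are illustrative,
    not the exact code used by mount.zfs):

        #include <limits.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Resolve "/mnt/../mnt/foo////" to "/mnt/foo" before writing mtab. */
        static int
        canonicalize_mntpoint(const char *arg, char *out, size_t len)
        {
                char buf[PATH_MAX];

                if (realpath(arg, buf) == NULL)
                        return (-1);    /* nonexistent path, too long, etc. */

                (void) snprintf(out, len, "%s", buf);
                return (0);
        }
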
* Merge branch 'ashift'  (Brian Behlendorf, 2012-11-15; 8 files, -23/+144)

    This branch adds some overdue ashift improvements.

    * Add '-o ashift' to 'zpool add' and 'zpool attach'
    * Improve AF hard disk detection
    * Allow 'zpool import' to handle increases in ashift

    Signed-off-by: Brian Behlendorf <[email protected]>

| * Add "-o ashift" to zpool add and zpool attachCyril Plisko2012-11-152-12/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When adding devices to an existing pool "ashift" property is auto-detected. However, if this property was overridden at the pool creation time (i.e. zpool create -o ashift=12 tank ...) this may not be what the user wants. This commit lets the user specify the value of "ashift" property to be used with newly added drives. For example, zpool add -o ashift=12 tank disk1 zpool attach -o ashift=12 tank disk1 disk2 Signed-off-by: Cyril Plisko <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #566
| * Improve AF hard disk detection  (Brian Behlendorf, 2012-11-15; 3 files, -5/+59)

    Use the bdev_physical_block_size() interface to determine the minimum
    write size which can be issued without incurring a read-modify-write
    operation. This is used to set the ashift correctly to prevent a
    performance penalty when using AF hard disks.

    Unfortunately, this interface isn't entirely reliable because it's
    not uncommon for disks to misreport this value. For this reason you
    may still need to manually set your ashift with:

        zpool create -o ashift=12 ...

    The solution to this in the upstream Illumos source was to add a
    white list of known offending drives. Maintaining such a list will be
    a burden, but it still may be worth doing if we can detect a large
    number of these drives. This should be considered as future work.

    Reported-by: Richard Yao <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #916

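    The detection reduces to something like the following in the vdev
    open path (a simplified sketch; highbit() returns the position of the
    highest set bit, so a 4096-byte physical sector yields ashift 12):

        unsigned int block_size = bdev_physical_block_size(bdev);

        /* never drop below the 512-byte SPA minimum block size */
        *ashift = highbit(MAX(block_size, SPA_MINBLOCKSIZE)) - 1;
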
| * Illumos #2671: zpool import should not fail if vdev ashift has increased  (George Wilson, 2012-11-15; 4 files, -6/+15)

    Reviewed by: Adam Leventhal <[email protected]>
    Reviewed by: Eric Schrock <[email protected]>
    Reviewed by: Richard Elling <[email protected]>
    Reviewed by: Gordon Ross <[email protected]>
    Reviewed by: Garrett D'Amore <[email protected]>
    Approved by: Richard Lowe <[email protected]>

    References to Illumos issue:
        https://www.illumos.org/issues/2671

    This patch has been slightly modified from the upstream Illumos
    version. In the upstream implementation a warning message is logged
    to the console. To prevent pointless console noise this notification
    is now posted as an "ereport.fs.zfs.vdev.bad_ashift" event.

    The event indicates a non-optimal (but entirely safe) ashift value
    was used to create the pool. Depending on your workload this may
    impact pool performance. Unfortunately, the only way to correct the
    issue is to recreate the pool with a new ashift.

    NOTE: The unrelated fix to the comment in zpool_main.c appears in the
    upstream commit and was preserved for consistency.

    Ported-by: Cyril Plisko <[email protected]>
    Reworked-by: Brian Behlendorf <[email protected]>
    Closes #955

* zfs-0.6.0-rc12  [tag: zfs-0.6.0-rc12]  (Brian Behlendorf, 2012-11-13; 1 file, -1/+1)

* Fix hard coded path in 60-vdev.rules.in  (Richard Yao, 2012-11-13; 2 files, -3/+3)

    The udev data directory was hard coded in 60-vdev.rules.in. That
    causes a problem when a distribution changes the location of the
    directory. This was not an issue in the past because virtually all
    distributions used the same path, but that is beginning to change
    following a decision by the systemd developers to change the
    directory location to reflect their take-over of udev maintainership.
    The testing branch of Gentoo Linux adopted this change, which enabled
    the hardcoded directory location to trigger a regression.

    Signed-off-by: Richard Yao <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #1085

* Fix "allocating allocated segment" panicBrian Behlendorf2012-11-091-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | Gunnar Beutner did all the hard work on this one by correctly identifying that this issue is a race between dmu_sync() and dbuf_dirty(). Now in all cases the caller is responsible for preventing this race by making sure the zfs_range_lock() is held when dirtying a buffer which may be referenced in a log record. The mmap case which relies on zfs_putpage() was not taking the range lock. This code was accidentally dropped when the function was rewritten for the Linux VFS. This patch adds the required range locking to zfs_putpage(). It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which aren't strictly required due to the VFS holding a reference. However, this makes the code more consistent with the upsteam code and there's no harm in being extra careful here. Original-patch-by: Gunnar Beutner <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #541
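    The added locking in zfs_putpage() follows the usual range-lock
    pattern (a sketch; pgoff and pglen stand for the page's offset and
    length and are placeholders, not names taken from the commit):

        rl_t *rl;

        ZFS_ENTER(zsb);
        /* prevent the dmu_sync()/dbuf_dirty() race for this range */
        rl = zfs_range_lock(zp, pgoff, pglen, RL_WRITER);

        /* ... dirty the buffer, assign the TX, log the write ... */

        zfs_range_unlock(rl);
        ZFS_EXIT(zsb);
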
* Fix zvol+btrfs hang  (Brian Behlendorf, 2012-11-09; 1 file, -0/+77)

    When using a zvol to back a btrfs filesystem the btrfs mount would
    hang. This was due to the bio completion callback used in btrfs
    assuming that lower level drivers would never modify the
    bio->bi_io_vecs after they were submitted via bio_submit(). If they
    are modified btrfs will miscalculate which pages need to be unlocked
    resulting in a hang.

    It's worth mentioning that other file systems such as ext[234] and
    xfs work fine because they do not make the same assumption in the bio
    completion callback.

    The most straight forward way to fix the issue is to present the
    semantics expected by btrfs. This is done by cloning the bios
    attached to each request and then using the clones' bvecs to perform
    the required accounting. The clones are freed after each read/write
    and the original unmodified bios are linked back in to the request.

    Signed-off-by: Chris Wedgwood <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #469

* Merge remote branch 'eris/stats'  (Brian Behlendorf, 2012-11-06; 7 files, -11/+259)

    Bring in support for the new KSTAT_TYPE_TXG type. This allows for
    additional visibility in to the txg handling.

    Signed-off-by: Brian Behlendorf <[email protected]>

| * Log I/Os longer than zio_delay_max (30s default)  (Brian Behlendorf, 2012-11-02; 3 files, -10/+28)

    There have been reports of ZFS deadlocking due to what appears to be
    a lost IO. This patch adds some debugging to determine the exact
    state of the IO which neither 1) completed, 2) failed, or 3) timed
    out after zio_delay_max (30) seconds.

    This information will be logged using the ZFS FMA infrastructure as a
    'delay' event and posted to the internal zevent log. By default the
    last 64 events will be kept in the log but the limit is configurable
    via the zfs_zevent_len_max module option.

    To dump the contents of the log use the 'zpool events -v' command and
    look for the resource.fs.zfs.delay event. It will include various
    information about the pool, vdev, and zio which may shed some light
    on the issue.

    In the context of this change the 120 second kernel blocked thread
    watchdog has been disabled for synchronous IOs.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #930

| * Add txgs-<pool> kstat file  (Brian Behlendorf, 2012-11-02; 4 files, -1/+231)

    Create a kstat file which contains useful statistics about the last N
    txgs processed. This can be helpful when analyzing pool performance.
    The new KSTAT_TYPE_TXG type was added for this purpose and it tracks
    the following statistics per-txg.

        txg          - Unique txg number
        state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
        birth        - Creation time
        nread        - Bytes read
        nwritten     - Bytes written
        reads        - IOPs read
        writes       - IOPs write
        open_time    - Length in nanoseconds the txg was open
        quiesce_time - Length in nanoseconds the txg was quiescing
        sync_time    - Length in nanoseconds the txg was syncing

    Signed-off-by: Brian Behlendorf <[email protected]>

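    The per-txg record behind the new kstat type looks roughly like this
    (field types are illustrative; the names mirror the list above):

        typedef struct kstat_txg {
                u_longlong_t    txg;            /* unique txg number */
                char            state;          /* O, Q, S or C */
                hrtime_t        birth;          /* creation time */
                u_longlong_t    nread;          /* bytes read */
                u_longlong_t    nwritten;       /* bytes written */
                uint_t          reads;          /* IOPs read */
                uint_t          writes;         /* IOPs write */
                hrtime_t        open_time;      /* ns the txg was open */
                hrtime_t        quiesce_time;   /* ns quiescing */
                hrtime_t        sync_time;      /* ns syncing */
        } kstat_txg_t;
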
* Add ddt_object_count() error handling  (Brian Behlendorf, 2012-10-29; 4 files, -19/+28)

    The interface for the ddt_zap_count() function assumes it can never
    fail. However, internally ddt_zap_count() is implemented with
    zap_count() which can potentially fail. Now because there was no way
    to return the error to the caller a VERIFY was used to ensure this
    case never happens.

    Unfortunately, it has been observed that pools can be damaged in such
    a way that zap_count() fails. The result is that the pool can not be
    imported without hitting the VERIFY and crashing the system. This
    patch reworks ddt_object_count() so the error can be safely caught
    and returned to the caller. This allows a pool which has been damaged
    in this way to be safely rewound for import.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #910

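    After the rework the count helper returns the ZAP error instead of
    asserting on it; roughly (a sketch based on the description above):

        static int
        ddt_object_count(ddt_t *ddt, enum ddt_type type,
            enum ddt_class class, uint64_t *count)
        {
                ASSERT(ddt_object_exists(ddt, type, class));

                /* propagate any zap_count() failure to the caller */
                return (ddt_ops[type]->ddt_op_count(ddt->ddt_os,
                    ddt->ddt_object[type][class], count));
        }
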