path: root/cmd/zdb
Commit message | Author | Age | Files | Lines
* Illumos #4101, #4102, #4103, #4105, #4106 | George Wilson | 2014-07-22 | 1 | -77/+156

  4101 metaslab_debug should allow for fine-grained control
  4102 space_maps should store more information about themselves
  4103 space map object blocksize should be increased
  4105 removing a mirrored log device results in a leaked object
  4106 asynchronously load metaslab

  Reviewed by: Matthew Ahrens
  Reviewed by: Adam Leventhal
  Reviewed by: Sebastien Roy
  Approved by: Garrett D'Amore

  Prior to this patch, space_maps were preferred solely based on the amount of free space left in each. Unfortunately, this heuristic didn't contain any information about the make-up of that free space, which meant we could keep preferring and loading a highly fragmented space map that wouldn't actually have enough contiguous space to satisfy the allocation, then unloading that space_map and repeating the process.

  This change modifies the space_maps to store additional information about the contiguous space in the space_map, so that we can use this information to make a better decision about which space_map to load. This requires reallocating all space_map objects to increase their bonus buffer sizes enough to fit the new metadata.

  The above feature can be enabled via a new feature flag introduced by this change: com.delphix:spacemap_histogram

  In addition to the above, this patch allows the space_map block size to be increased. Currently the block size is set to be 4K in size, which has certain implications including the following:

  * 4K sector devices will not see any compression benefit
  * large space_maps require more metadata on-disk
  * large space_maps require more time to load (typically random reads)

  Now the space_map block size can adjust as needed up to the maximum size set via the space_map_max_blksz variable.

  A bug was fixed which resulted in potentially leaking an object when removing a mirrored log device. The previous logic for vdev_remove() did not correctly handle removing top-level vdevs that are interior vdevs (i.e. mirror). The problem would occur when removing a mirrored log device, and result in the DTL space map object being leaked, because top-level vdevs don't have DTL space map objects associated with them.

  References:
    https://www.illumos.org/issues/4101
    https://www.illumos.org/issues/4102
    https://www.illumos.org/issues/4103
    https://www.illumos.org/issues/4105
    https://www.illumos.org/issues/4106
    https://github.com/illumos/illumos-gate/commit/0713e23

  Porting notes:
  A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also, the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.

  Ported-by: Tim Chase
  Signed-off-by: Prakash Surya
  Signed-off-by: Brian Behlendorf
  Closes #2488
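The space map histogram idea described in this commit can be illustrated with a small sketch. This is not the ZFS implementation: the `sk_` names, the 32-bucket power-of-two layout, and the "most free space among maps that can satisfy the request" policy are all assumptions made for illustration. It only shows why a per-map histogram of contiguous free segments lets an allocator skip a fragmented map that has plenty of free space in total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: bucket i counts free segments of 2^i..2^(i+1)-1 bytes. */
#define	SK_HIST_BUCKETS	32

typedef struct sk_space_map {
	uint64_t free_bytes;			/* total free space in the map */
	uint64_t hist[SK_HIST_BUCKETS];		/* free segments per size bucket */
} sk_space_map_t;

/* Index of the highest set bit; v must be nonzero. */
static int
sk_highbit(uint64_t v)
{
	int b = -1;

	while (v != 0) {
		v >>= 1;
		b++;
	}
	return (b);
}

/* Account for one contiguous free segment of `size` bytes. */
static void
sk_record_free(sk_space_map_t *sm, uint64_t size)
{
	int b;

	assert(sm != NULL && size > 0);
	b = sk_highbit(size);
	if (b >= SK_HIST_BUCKETS)
		b = SK_HIST_BUCKETS - 1;	/* clamp very large segments */
	sm->hist[b]++;
	sm->free_bytes += size;
}

/* Nonzero if some bucket guarantees a segment of at least `size` bytes. */
static int
sk_can_satisfy(const sk_space_map_t *sm, uint64_t size)
{
	int b;

	if (size == 0)
		return (1);
	b = sk_highbit(size);
	if ((UINT64_C(1) << b) < size)
		b++;	/* only buckets whose lower bound is >= size count */
	for (; b < SK_HIST_BUCKETS; b++) {
		if (sm->hist[b] != 0)
			return (1);
	}
	return (0);
}

/*
 * Pick the map with the most free space among those whose histogram
 * shows a large enough contiguous segment; -1 if none qualifies.
 */
static int
sk_pick(const sk_space_map_t *maps, size_t n, uint64_t size)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!sk_can_satisfy(&maps[i], size))
			continue;
		if (best < 0 || maps[i].free_bytes > maps[best].free_bytes)
			best = (int)i;
	}
	return (best);
}
```

With one map holding 1 MiB of free space in 4K fragments and another holding a single 128K segment, a 64K request selects the second map even though the first has more free space overall, which is the situation the free-space-only heuristic handled poorly.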
* zdb: Introduce -V for verbatim import | Richard Yao | 2014-07-17 | 1 | -6/+7

  When given a pool name via -e, zdb would attempt an import. If it failed, then it would attempt a verbatim import. This behavior is not always desirable, so a -V switch is added to zdb to control the behavior. When specified, a verbatim import is done. Otherwise, the behavior is as it was previously, except no verbatim import is done on failure.

  Signed-off-by: Richard Yao
  Signed-off-by: Brian Behlendorf
  Closes #2372
* Illumos #3641 compressed block histograms with zdb | Matthew Ahrens | 2014-07-16 | 1 | -24/+65

  This patch is a zdb extension of the '-b' option, producing a histogram of the physical compressed block sizes per DMU object type on disk. The '-bbbb' option to zdb will uncover this new feature; here's an example usage on a new pool and snippet of the output it generates:

      # zpool create tank /dev/vd{b,c,d}
      # dd bs=1k if=/dev/urandom of=/tank/1kfile count=1
      # dd bs=3k if=/dev/urandom of=/tank/3kfile count=1
      # dd bs=64k if=/dev/urandom of=/tank/64kfile count=1
      # zdb -bbbb tank
      ...
           3  68.0K  68.0K  68.0K  22.7K   1.00  34.26  ZFS plain file
      psize (in 512-byte sectors): number of blocks
                2:      1 *
                3:      0
                4:      0
                5:      0
                6:      1 *
                7:      0
              ...
              127:      0
              128:      1 *
      ...

  The blocks are also broken down by their indirection level. Expanding on the above example:

      # zfs set recordsize=1k tank
      # dd bs=1k if=/dev/urandom of=/tank/2x1kfile count=2
      # zdb -bbbb tank
      ...
           1    16K     1K     2K     2K  16.00   1.02  L1 ZFS plain file
      psize (in 512-byte sectors): number of blocks
                2:      1 *
           5  70.0K  70.0K  70.0K  14.0K   1.00  35.71  L0 ZFS plain file
      psize (in 512-byte sectors): number of blocks
                2:      3 ***
                3:      0
                4:      0
                5:      0
                6:      1 *
                7:      0
              ...
              127:      0
              128:      1 *
           6  86.0K  71.0K  72.0K  12.0K   1.21  36.73  ZFS plain file
      psize (in 512-byte sectors): number of blocks
                2:      4 ****
                3:      0
                4:      0
                5:      0
                6:      1 *
                7:      0
              ...
              127:      0
              128:      1 *
      ...

  There's now a single 1K L1 block which is the indirect block needed for the '2x1kfile' file just created, as well as two more 1K L0 blocks from the same file.

  This can be used to get a distribution of the block sizes used within the pool, on a per object type basis.

  References:
    https://illumos.org/issues/3641
    https://github.com/illumos/illumos-gate/commit/490d05b

  Ported by: Tim Chase
  Signed-off-by: Prakash Surya
  Signed-off-by: Brian Behlendorf
  Signed-off-by: Boris Protopopov
  Closes #2456
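The bucketing behind the histogram in this commit message can be sketched as follows. This is a minimal illustration, not the zdb code: the `sk_` names and the 256-sector cap are assumptions. It shows how a physical block size in bytes maps to the "512-byte sectors" row it is counted under, so a 1K block lands in row 2, a 3K block in row 6, and a 64K block in row 128, matching the example output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define	SK_SECTOR_SHIFT	9	/* 512-byte sectors, as in the zdb output */
#define	SK_MAX_SECTORS	256	/* hypothetical cap: 128K / 512 */

typedef struct sk_psize_hist {
	uint64_t count[SK_MAX_SECTORS + 1];
} sk_psize_hist_t;

/* Count one block whose physical (compressed) size is `psize` bytes. */
static void
sk_psize_hist_add(sk_psize_hist_t *h, uint64_t psize)
{
	uint64_t sectors = psize >> SK_SECTOR_SHIFT;

	assert(h != NULL);
	if (sectors > SK_MAX_SECTORS)
		sectors = SK_MAX_SECTORS;	/* fold oversized blocks into the last row */
	h->count[sectors]++;
}

/* Print the rows between the first and last non-empty bucket. */
static void
sk_psize_hist_print(const sk_psize_hist_t *h, FILE *out)
{
	int lo = 0, hi = SK_MAX_SECTORS, i;
	uint64_t s;

	while (lo < SK_MAX_SECTORS && h->count[lo] == 0)
		lo++;
	while (hi > lo && h->count[hi] == 0)
		hi--;
	fprintf(out, "psize (in 512-byte sectors): number of blocks\n");
	for (i = lo; i <= hi; i++) {
		fprintf(out, "%3d: %6llu ", i, (unsigned long long)h->count[i]);
		for (s = 0; s < h->count[i] && s < 40; s++)
			fputc('*', out);	/* one star per block, capped at 40 */
		fputc('\n', out);
	}
}
```

Calling `sk_psize_hist_add()` with 1024, 3072 and 65536 and then `sk_psize_hist_print(&h, stdout)` produces rows 2 through 128 with a single block in rows 2, 6 and 128; the column spacing differs from real zdb output.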
* cstyle: Resolve C style issues | Michael Kjorling | 2013-12-18 | 1 | -4/+6

  The vast majority of these changes are in Linux specific code. They are the result of not having an automated style checker to validate the code when it was originally written. Others were caused when the common code was slightly adjusted for Linux.

  This patch contains no functional changes. It only refreshes the code to conform to the style guide.

  Everyone submitting patches for inclusion upstream should now run 'make checkstyle' and resolve any warnings prior to opening a pull request. The automated builders have been updated to fail a build when 'make checkstyle' detects an issue.

  Signed-off-by: Brian Behlendorf
  Closes #1821
* Add missing libzfs_core to Makefiles | Maximilian Mehnert | 2013-11-20 | 1 | -1/+2

  On some platforms symbols provided by libzfs_core and used by libzfs were not available to the linker. To avoid this issue, libzfs_core has been added to the list of required libraries when building utilities which depend on libzfs.

  This should have been handled properly by libtool and it's still not entirely clear why it wasn't on all platforms.

  Signed-off-by: Brian Behlendorf
  Closes #1841
* Illumos #3603, #3604: bobj improvements | Matthew Ahrens | 2013-10-31 | 1 | -17/+54

  3603 panic from bpobj_enqueue_subobj()
  3604 zdb should print bpobjs more verbosely
  3871 GCC 4.5.3 does not like issue 3604 patch

  Reviewed by: Henrik Mattson
  Reviewed by: Adam Leventhal
  Reviewed by: Christopher Siden
  Reviewed by: George Wilson
  Reviewed by: Garrett D'Amore
  Approved by: Dan McDonald

  References:
    https://www.illumos.org/issues/3603
    https://www.illumos.org/issues/3604
    https://www.illumos.org/issues/3871
    illumos/illumos-gate@d04756377ddd1cf28ebcf652541094e17b03c889

  Ported-by: Richard Yao
  Signed-off-by: Brian Behlendorf
  Issue #1775

  Note that the patch from Illumos issue 3871 is not accepted into Illumos at the time of this writing. It is something that I wrote when porting this. Documentation is in the Illumos issue.
* Generate libraries with correct DT_NEEDED entries | Richard Yao | 2013-10-10 | 1 | -1/+1

  Libraries that depend on other libraries should list them in ELF's DT_NEEDED field so that programs linking to them do not need to specify those libraries unless they depend on them as well. This is not the case in the current code and the consequence is that anything that needs a library must know its dependencies. This is fragile and caused GRUB2's configure script to break when a dependency was added on libblkid in libzfs.

  This resolves that problem by using LIBADD/LDADD to specify libraries in Makefile.am instead of LDFLAGS. This ensures that proper DT_NEEDED entries are generated and prevents GRUB2's configure script from breaking in the presence of a libblkid dependency. This also removes unneeded dependencies from various files.

  Signed-off-by: Richard Yao
  Signed-off-by: Brian Behlendorf
  Issue #1751
* Illumos #3464 | Matthew Ahrens | 2013-09-04 | 1 | -3/+6

  3464 zfs synctask code needs restructuring

  Reviewed by: Dan Kimmel
  Reviewed by: Adam Leventhal
  Reviewed by: George Wilson
  Reviewed by: Christopher Siden
  Approved by: Garrett D'Amore

  References:
    https://www.illumos.org/issues/3464
    illumos/illumos-gate@3b2aab18808792cbd248a12f1edf139b89833c13

  Ported-by: Tim Chase
  Signed-off-by: Brian Behlendorf
  Closes #1495
* Illumos #2882, #2883, #2900 | Matthew Ahrens | 2013-09-04 | 1 | -5/+36

  2882 implement libzfs_core
  2883 changing "canmount" property to "on" should not always remount dataset
  2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once

  Reviewed by: George Wilson
  Reviewed by: Chris Siden
  Reviewed by: Garrett D'Amore
  Reviewed by: Bill Pijewski
  Reviewed by: Dan Kruchinin
  Approved by: Eric Schrock

  References:
    https://www.illumos.org/issues/2882
    https://www.illumos.org/issues/2883
    https://www.illumos.org/issues/2900
    illumos/illumos-gate@4445fffbbb1ea25fd0e9ea68b9380dd7a6709025

  Ported-by: Tim Chase
  Signed-off-by: Brian Behlendorf
  Closes #1293

  Porting notes:

  WARNING: This patch changes the user/kernel ABI. That means that the zfs/zpool utilities built from master are NOT compatible with the 0.6.2 kernel modules. Ensure you load the matching kernel modules from master after updating the utilities. Otherwise the zfs/zpool commands will be unable to interact with your pool and you will see errors similar to the following:

      $ zpool list
      failed to read pool configuration: bad address
      no pools available

      $ zfs list
      no datasets available

  Add zvol minor device creation to the new zfs_snapshot_nvl function.

  Remove the logging of the "release" operation in dsl_dataset_user_release_sync(). The logging caused a null dereference because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the logging functions try to get the ds name via the dsl_dataset_name() function. I've got no idea why this particular code would have worked in Illumos. This code has subsequently been completely reworked in Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).

  Squash some "may be used uninitialized" warnings/errors.

  Fix some printf format warnings for %lld and %llu.

  Apply a few spa_writeable() changes that were made to Illumos in illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and 3115 fixes.

  Add a missing call to fnvlist_free(nvl) in log_internal() that was added in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time (zfsonlinux/zfs@9e11c73) because it depended on future work.
* zdb: enhancement - Display SA xattrs. | Tim Chase | 2013-07-09 | 1 | -0/+56

  If the znode has SA xattrs, display them following the other standard attributes. The format used is similar to that used when listing the contents of a ZAP. It is as follows:

      $ zdb -vvv <pool>/<dataset> <object>
      ...
      SA xattrs: <size> bytes, <number> entries
          <name1> = <value1>
          <name2> = <value2>
          ...

  Signed-off-by: Brian Behlendorf
  Closes #1581
* Avoid abort() in vn_rdwr(): libzpool/kernel.c | Mike Leddy | 2013-07-09 | 1 | -1/+1

  Make sure that the buffer is aligned to 512 bytes on Linux so that the pread call combined with O_DIRECT does not return EINVAL.

  Signed-off-by: Brian Behlendorf
  Closes #1570
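A buffer aligned to 512 bytes, as this commit requires, can be obtained in userspace with posix_memalign(). The sketch below is an assumption about one way to do it and is not taken from libzpool/kernel.c; the helper names are hypothetical. Note that O_DIRECT generally also constrains the transfer length and file offset, which this sketch does not check.

```c
#define	_POSIX_C_SOURCE	200112L	/* for posix_memalign() */

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define	SK_SECTOR_SIZE	512

/* Allocate `size` bytes aligned to a 512-byte boundary; NULL on failure. */
static void *
sk_alloc_sector_aligned(size_t size)
{
	void *buf = NULL;

	assert(size > 0);
	if (posix_memalign(&buf, SK_SECTOR_SIZE, size) != 0)
		return (NULL);
	return (buf);	/* release with free() */
}

/* Nonzero if `p` is suitably aligned to pass to pread() on an O_DIRECT fd. */
static int
sk_is_sector_aligned(const void *p)
{
	return (((uintptr_t)p & (SK_SECTOR_SIZE - 1)) == 0);
}
```

A caller would allocate the read buffer with `sk_alloc_sector_aligned()` instead of a plain malloc(), whose result is only guaranteed to be aligned for ordinary object types.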
* Illumos #3498 panic in arc_read() | George Wilson | 2013-07-02 | 1 | -4/+3

  3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

  Reviewed by: Adam Leventhal
  Reviewed by: Matthew Ahrens
  Approved by: Richard Lowe

  References:
    illumos/illumos-gate@1b912ec7100c10e7243bf0879af0fe580e08c73d
    https://www.illumos.org/issues/3498

  Ported-by: Brian Behlendorf
  Closes #1249
* Override default SPA config location via environment | Cyril Plisko | 2013-07-01 | 1 | -0/+10

  When using zdb with a non-default SPA config file, it is not convenient to add -U <non-default-config-file-path> all the time. This commit introduces support for setting/overriding the SPA config location via the environment variable 'SPA_CONFIG_PATH'. If the -U flag is specified on the command line it will override any other value as usual.

  Signed-off-by: Brian Behlendorf
  Closes #1545
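The precedence described in this commit (the -U flag first, then SPA_CONFIG_PATH, then the built-in default) can be sketched as a small helper. This is an illustration only: the function name is hypothetical, the default path is supplied by the caller, and treating an empty variable as unset is an assumption the commit message does not state.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Resolve the SPA config path.
 *   u_flag_arg      - argument of -U, or NULL if the flag was not given
 *   builtin_default - path used when nothing overrides it
 */
static const char *
sk_choose_spa_config_path(const char *u_flag_arg, const char *builtin_default)
{
	const char *env;

	assert(builtin_default != NULL);
	if (u_flag_arg != NULL)
		return (u_flag_arg);		/* -U always wins */
	env = getenv("SPA_CONFIG_PATH");
	if (env != NULL && env[0] != '\0')	/* assumption: empty means unset */
		return (env);
	return (builtin_default);
}
```

With this ordering, exporting SPA_CONFIG_PATH once in a shell session replaces repeated -U arguments, while an explicit -U on a single invocation still takes effect.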
* Add absent \n at the end of the help text line | Cyril Plisko | 2013-06-28 | 1 | -1/+1

  Signed-off-by: Brian Behlendorf
  Issue #1545
* Illumos #3552, #3564 | George Wilson | 2013-06-19 | 1 | -6/+6

  3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
  3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing)

  Reviewed by: Adam Leventhal
  Reviewed by: Dan Kimmel
  Reviewed by: Matthew Ahrens
  Approved by: Richard Lowe

  References:
    illumos/illumos-gate@16a4a8074274d2d7cc408589cf6359f4a378c861
    https://www.illumos.org/issues/3552
    https://www.illumos.org/issues/3564

  Ported-by: Tim Chase
  Signed-off-by: Brian Behlendorf
  Closes #1513
* Illumos #3306, #3321 | George Wilson | 2013-05-03 | 1 | -29/+77

  3306 zdb should be able to issue reads in parallel
  3321 'zpool reopen' command should be documented in the man page and help

  Reviewed by: Adam Leventhal
  Reviewed by: Matt Ahrens
  Reviewed by: Christopher Siden
  Approved by: Garrett D'Amore

  References:
    illumos/illumos-gate@31d7e8fa33fae995f558673adb22641b5aa8b6e1
    https://www.illumos.org/issues/3306
    https://www.illumos.org/issues/3321

  The vdev_file.c implementation in this patch diverges significantly from the upstream version. For consistency with the vdev_disk.c code the upstream version leverages the Illumos bio interfaces. This makes sense for Illumos but not for ZoL for two reasons.

  1) The vdev_disk.c code in ZoL has been rewritten to use the Linux block device interfaces which differ significantly from those in Illumos. Therefore, updating the vdev_file.c to use the Illumos interfaces doesn't get you consistency with vdev_disk.c.

  2) Using the upstream patch as is would require implementing compatibility code for those Solaris block device interfaces in user and kernel space. That additional complexity could lead to confusion and doesn't buy us anything.

  For these reasons I've opted to simply move the existing vn_rdwr() as is into the taskq function. This has the advantage of being low risk and easy to understand. Moving the vn_rdwr() function into its own taskq thread also neatly avoids the possibility of a stack overflow.

  Finally, because of the additional work which is being handled by the free taskq the number of threads has been increased. The thread count under Illumos defaults to 100 but was decreased to 2 in commit 08d08e due to contention. We increase it to 8 until the contention can be addressed by porting Illumos #3581.

  Ported-by: Brian Behlendorf
  Closes #1354
* Illumos #3397, #3398 | Christopher Siden | 2013-01-11 | 1 | -9/+17

  3397 zdb <pool> <objnum> output is too verbose
  3398 zdb can't dump feature flags zap objects

  Reviewed by: Matt Ahrens
  Reviewed by: George Wilson
  Reviewed by: Dan Kimmel
  Reviewed by: Eric Schrock
  Reviewed by: Richard Lowe
  Approved by: Dan McDonald

  References:
    illumos/illumos-gate@e690fb27a7d1483f052505e1ff373d205f9dee99
    https://www.illumos.org/issues/3397
    https://www.illumos.org/issues/3398

  Ported-by: Brian Behlendorf
* Illumos #2619 and #2747 | Christopher Siden | 2013-01-08 | 1 | -7/+65

  2619 asynchronous destruction of ZFS file systems
  2747 SPA versioning with zfs feature flags

  Reviewed by: Matt Ahrens
  Reviewed by: George Wilson
  Reviewed by: Richard Lowe
  Reviewed by: Dan Kruchinin
  Approved by: Eric Schrock

  References:
    illumos/illumos-gate@53089ab7c84db6fb76c16ca50076c147cda11757
    illumos/illumos-gate@ad135b5d644628e791c3188a6ecbd9c257961ef8
    illumos changeset: 13700:2889e2596bd6
    https://www.illumos.org/issues/2619
    https://www.illumos.org/issues/2747

  NOTE: The grub specific changes were not ported. This change must be made to the Linux grub packages.

  Ported-by: Brian Behlendorf
* Add ddt_object_count() error handling | Brian Behlendorf | 2012-10-29 | 1 | -1/+3

  The interface for the ddt_zap_count() function assumes it can never fail. However, internally ddt_zap_count() is implemented with zap_count() which can potentially fail. Because there was no way to return the error to the caller, a VERIFY was used to ensure this case never happens.

  Unfortunately, it has been observed that pools can be damaged in such a way that zap_count() fails. The result is that the pool cannot be imported without hitting the VERIFY and crashing the system.

  This patch reworks ddt_object_count() so the error can be safely caught and returned to the caller. This allows a pool which has been damaged in this way to be safely rewound for import.

  Signed-off-by: Brian Behlendorf
  Closes #910
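The pattern this commit describes, returning an error code and passing the count back through an out-parameter instead of asserting success, can be sketched as follows. The `sk_` types and functions are hypothetical stand-ins and do not reproduce the real ddt_object_count() or zap_count() signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an on-disk object that may be damaged. */
typedef struct sk_object {
	int damaged;		/* nonzero simulates a failing lookup */
	uint64_t entries;
} sk_object_t;

/* Return 0 and store the count, or return an errno value on failure. */
static int
sk_object_count(const sk_object_t *obj, uint64_t *count)
{
	assert(obj != NULL && count != NULL);
	if (obj->damaged)
		return (EIO);	/* report the failure instead of aborting */
	*count = obj->entries;
	return (0);
}

/* Sum the counts of n objects, stopping at the first error. */
static int
sk_total_count(const sk_object_t *objs, size_t n, uint64_t *total)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t c;
		int err = sk_object_count(&objs[i], &c);

		if (err != 0)
			return (err);	/* caller decides how to recover */
		sum += c;
	}
	*total = sum;
	return (0);
}
```

Because the error reaches the caller, higher-level code such as an import path can choose to rewind or skip the damaged object rather than crash.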
* Illumos #2088 zdb could use a reasonable manual page | Richard Lowe | 2012-09-18 | 1 | -8/+11

  Reviewed by: Yuri Pankov
  Reviewed by: Garrett D'Amore
  Reviewed by: George Wilson
  Reviewed by: Steve Gonczi
  Reviewed by: Richard Elling
  Approved by: Garrett D'Amore

  References:
    https://www.illumos.org/issues/2088

  Ported by: Cyril Plisko
  Signed-off-by: Brian Behlendorf
  Closes #682
* Fix zdb printf format string for ZIL data blocks | Cyril Plisko | 2012-09-13 | 1 | -1/+2

  Without this fix the zdb printouts of ZIL data blocks look full of FF due to printf() handling its arguments as int by default.

  Here is the output before the fix:

      TX_WRITE len 4136, txg 1093817, seq 149231
          foid 4242, offset 0, length f68
          G FFFFFF8EFFFFFF87FFFFFF91FFFFFFCC 1c FFFFFFAFFFFFFFC9FFFFFFBAZ FFFFFFC3

  And the same after the fix:

      TX_WRITE len 4136, txg 1093817, seq 149231
          foid 4242, offset 0, length f68
          G 8E8791CC 1cAFC9BAZ C3

  Signed-off-by: Brian Behlendorf
  Closes #962
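The FF padding shown above comes from C's integer promotions: a negative signed char passed to printf() is widened to int, so a hex conversion prints all four bytes. The sketch below reproduces the effect and shows one common remedy, converting to unsigned char first. Whether the actual zdb fix used a cast or a length modifier is not stated in the commit message, so treat the remedy as an assumption; the function names are hypothetical and a 32-bit int is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a byte the way the unfixed code effectively did where char is
 * signed: the value is sign-extended to int before conversion.
 * Returns the number of characters written (excluding the terminator).
 */
static int
sk_hex_sign_extended(char *out, size_t n, signed char c)
{
	assert(out != NULL && n > 0);
	return (snprintf(out, n, "%02X", (unsigned int)(int)c));
}

/* Format only the low 8 bits by converting to unsigned char first. */
static int
sk_hex_byte(char *out, size_t n, signed char c)
{
	assert(out != NULL && n > 0);
	return (snprintf(out, n, "%02X", (unsigned int)(unsigned char)c));
}
```

For the byte 0x8E (the signed value -114), the first helper writes eight characters ending in 8E and the second writes just the two characters 8E, mirroring the before and after output in the commit message.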
* Remove autotools products | Brian Behlendorf | 2012-08-27 | 1 | -711/+0

  Remove all of the generated autotools products from the repository and update the .gitignore files accordingly.

  Signed-off-by: Brian Behlendorf
  Closes #718
* Set zvol discard_granularity to the volblocksize. | Etienne Dechamps | 2012-08-07 | 1 | -0/+1

  Currently, zvols have a discard granularity set to 0, which suggests to the upper layer that discard requests of arbitrarily small size and alignment can be made efficiently.

  In practice, however, ZFS does not handle unaligned discard requests efficiently: indeed, it is unable to free a part of a block. It will write zeros to the specified range instead, which is both useless and inefficient (see dnode_free_range).

  With this patch, zvol block devices expose volblocksize as their discard granularity, so the upper layer is aware that it's not supposed to send discard requests smaller than volblocksize.

  Signed-off-by: Brian Behlendorf
  Closes #862
* Linux 3.5 compat, end_writeback() changed to clear_inode() | Richard Yao | 2012-07-23 | 1 | -0/+1

  The end_writeback() function was changed by moving the call to inode_sync_wait() earlier into evict(). This effectively changes the ordering of the sync but it does not impact the details of the zfs implementation.

  However, as part of this change end_writeback() was renamed to clear_inode() to reflect the new semantics. This change does impact us and clear_inode() now maps to end_writeback() for kernels prior to 3.5.

  Signed-off-by: Richard Yao
  Signed-off-by: Brian Behlendorf
  Closes #784
* Linux 3.5 compat, iops->truncate_range() removed | Richard Yao | 2012-07-23 | 1 | -0/+1

  The vmtruncate_range() support has been removed from the kernel in favor of using the fallocate method in the file_operations table.

  Signed-off-by: Richard Yao
  Signed-off-by: Brian Behlendorf
  Issue #784
* Linux 3.5 compat, eops->encode_fh() takes inodes | Richard Yao | 2012-07-23 | 1 | -0/+1

  The export_operations member ->encode_fh() has been updated to take both the child and parent inodes. This interface used to take the child dentry and a bool describing if the parent is needed.

  NOTE: While updating this code I noticed that we do not currently cleanly handle the case where we're passed a connectable parent. This code should be audited to make sure we're doing the right thing.

  Signed-off-by: Richard Yao
  Signed-off-by: Brian Behlendorf
  Issue #784
* Move partition scanning from userspace to module. | Etienne Dechamps | 2012-07-17 | 1 | -0/+2

  Currently, zpool online -e (dynamic vdev expansion) doesn't work on whole disks because we're invoking ioctl(BLKRRPART) from userspace while ZFS still has a partition open on the disk, which results in EBUSY.

  This patch moves the BLKRRPART invocation from the zpool utility to the module. Specifically, this is done just before opening the device in vdev_disk_open() which is called inside vdev_reopen(). This requires jumping through some hoops to get to the disk device from the partition device, and to make sure we can still open the partition after the BLKRRPART call.

  Note that this new code path is triggered on dynamic vdev expansion only; other actions, like creating a new pool, are unchanged and still call BLKRRPART from userspace.

  This change also depends on API changes which are available in 2.6.37 and later kernels. The build system has been updated to detect this, but there is no compatibility mode for older kernels. This means that online expansion will NOT be available in older kernels. However, it will still be possible to expand the vdev offline.

  Signed-off-by: Brian Behlendorf
  Closes #808
* Linux 3.4 compat, d_make_root() replaces d_alloc_root() | Richard Yao | 2012-06-11 | 1 | -0/+1

  torvalds/linux@adc0e91ab142abe93f5b0d7980ada8a7676231fe introduced d_make_root() as a replacement for d_alloc_root(). Further commits appear to have removed d_alloc_root() from the Linux source tree. This causes the following failure:

      error: implicit declaration of function 'd_alloc_root' [-Werror=implicit-function-declaration]

  To correct this we update the code to use the current d_make_root() interface for readability. Then we introduce an autotools check to determine if d_make_root() is available. If it isn't then we define some compatibility logic which uses the older d_alloc_root() interface.

  Signed-off-by: Richard Yao
  Signed-off-by: Brian Behlendorf
  Closes #776
* Linux 3.3 compat, iops->create()/mkdir()/mknod() | Brian Behlendorf | 2012-04-30 | 1 | -0/+1

  The mode argument of iops->create()/mkdir()/mknod() was changed from an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf check was added to detect the API change and then correctly set a zpl_umode_t typedef. There is no functional change.

  Signed-off-by: Brian Behlendorf
  Closes #701
* Add --enable-debug-dmu-tx configure option | Brian Behlendorf | 2012-03-23 | 1 | -0/+1

  Allow rigorous (and expensive) tx validation to be enabled/disabled independently from the standard zfs debugging. When enabled these checks ensure that all txs are constructed properly and that a dbuf is never dirtied without taking the correct tx hold.

  This checking is particularly helpful when adding new dmu consumers like Lustre. However, for established consumers such as the zpl with no known outstanding tx construction problems this is just overhead.

      --enable-debug-dmu-tx   - Enable/disable validation of each tx as
      --disable-debug-dmu-tx    it is constructed. By default validation
                                is disabled due to performance concerns.

  Signed-off-by: Brian Behlendorf
* Add .zfs control directory | Brian Behlendorf | 2012-03-22 | 1 | -0/+1

  Add support for the .zfs control directory. This was accomplished by leveraging as much of the existing ZFS infrastructure as possible and updating it for Linux as required. The bulk of the core functionality is now all there with the following limitations.

  *) The .zfs/snapshot directory automount support requires a 2.6.37 or newer kernel. The exception is RHEL6.2 which has backported the d_automount patches.

  *) Creating/destroying/renaming snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory works as expected. However, this functionality is only available to root until zfs delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

  The following issues are known deficiencies, but we expect them to be addressed by future commits.

  *) Add automount support for kernels older than 2.6.37. This should be possible using follow_link() which is what Linux did before.

  *) Accessing the .zfs/snapshot directory via NFS is not yet possible. The majority of the ground work for this is complete. However, finishing this work will require resolving some lingering integration issues with the Linux NFS kernel server.

  *) The .zfs/shares directory exists but no further smb functionality has yet been implemented.

  Contributions-by: Rohan Puri
  Contributions-by: Andrew Barnes
  Signed-off-by: Brian Behlendorf
  Closes #173
* Cleanly support debug packages | Brian Behlendorf | 2012-02-27 | 1 | -0/+1

  Allow a source rpm to be rebuilt with debugging enabled. This avoids the need to manually modify the spec file. By default debugging is still largely disabled. To enable specific debugging features use the following options with rpmbuild.

      '--with debug'    - Enables ASSERTs

      # For example:
      $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm

  Additionally, ZFS_CONFIG has been added to zfs_config.h for packages which build against these headers. This is critical to ensure both zfs and the dependent package are using the same prototype and structure definitions.

  Signed-off-by: Brian Behlendorf
* Add support for DISCARD to ZVOLs. | Etienne Dechamps | 2012-02-09 | 1 | -0/+1

  DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, thus optimizing disk space usage by allowing a ZVOL to shrink instead of just grow.

  We can't use zfs_space() or zfs_freesp() here, since these functions only work on regular files, not volumes. Fortunately we can use the low-level function dmu_free_long_range() which does exactly what we want.

  Currently the discard operation is not added to the log. That's not a big deal since losing discard requests cannot result in data corruption. It would however result in disk space usage higher than it should be. Thus adding log support to zvol_discard() is probably a good idea for a future improvement.

  Signed-off-by: Brian Behlendorf
* Support the fallocate() file operation. | Etienne Dechamps | 2012-02-09 | 1 | -0/+1

  Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is supported, since it's the only one that matches the behavior of zfs_space(). This makes it pretty much useless in its current form, but it's a start.

  To support other flag combinations we would need to modify zfs_space() to make it more flexible, or emulate the desired functionality in zpl_fallocate().

  Signed-off-by: Brian Behlendorf
  Issue #334
* Improve ZVOL queue behavior. | Etienne Dechamps | 2012-02-07 | 1 | -0/+5

  The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to make better decisions about write merging and ordering.

  Detailed rationale:

  - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide.

  - max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide.

  - physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue #361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow.

  - The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL.

  Signed-off-by: Brian Behlendorf
* Fix synchronicity for ZVOLs. | Etienne Dechamps | 2012-02-07 | 1 | -0/+1

  zvol_write() assumes that the write request must be written to stable storage if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does *not* mean what we think it means in the context of the Linux block layer. This is well explained in linux/fs.h:

      WRITE:        A normal async write. Device will be plugged.
      WRITE_SYNC:   Synchronous write. Identical to WRITE, but passes down
                    the hint that someone will be waiting on this IO
                    shortly.
      WRITE_FLUSH:  Like WRITE_SYNC but with preceding cache flush.
      WRITE_FUA:    Like WRITE_SYNC but data is guaranteed to be on
                    non-volatile media on completion.

  In other words, SYNC does not *mean* that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs.

  The request for stable storage is expressed by the FUA flag, so we must commit the ZIL after the write if the FUA flag is set. In addition, we must commit the ZIL before the write if the FLUSH flag is set.

  Also, we must inform the block layer that we actually support FLUSH and FUA.

  Signed-off-by: Brian Behlendorf
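The rule stated in this commit (commit the ZIL before the write when FLUSH is set, after the write when FUA is set, and not at all for a plain SYNC hint) can be written down as a small decision function. The flag values and names below are hypothetical stand-ins, not the kernel's request flags, and the sketch only models the decision rather than performing any I/O.

```c
#include <assert.h>

/* Hypothetical request flags for illustration only. */
#define	SK_REQ_SYNC	(1u << 0)	/* someone is waiting on this IO */
#define	SK_REQ_FLUSH	(1u << 1)	/* flush the cache before the write */
#define	SK_REQ_FUA	(1u << 2)	/* data must be stable on completion */
#define	SK_REQ_ALL	(SK_REQ_SYNC | SK_REQ_FLUSH | SK_REQ_FUA)

typedef struct sk_write_plan {
	int commit_before;	/* commit the log before issuing the write */
	int commit_after;	/* commit the log after the write completes */
} sk_write_plan_t;

/* Decide when the intent log needs committing for a write request. */
static sk_write_plan_t
sk_plan_write(unsigned int flags)
{
	sk_write_plan_t plan = { 0, 0 };

	assert((flags & ~SK_REQ_ALL) == 0);	/* reject unknown flags */
	if (flags & SK_REQ_FLUSH)
		plan.commit_before = 1;
	if (flags & SK_REQ_FUA)
		plan.commit_after = 1;
	/* SK_REQ_SYNC alone is only a latency hint: no commit needed. */
	return (plan);
}
```

Keeping the SYNC hint out of the decision is the performance half of the fix; honoring FLUSH and FUA is the correctness half.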
* Linux 3.3 compat, sops->show_options() | Brian Behlendorf | 2012-02-03 | 1 | -0/+1
The second argument of sops->show_options() was changed from a 'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check to detect the API change and then conditionally define the expected interface. In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #549
* Combine libraries: spl, avl, efi, share, unicode. | Darik Horn | 2012-01-17 | 2 | -13/+1
These libraries, which are an artifact of the ZoL development process, conflict with packages that are already in distribution:

 * libspl: SPL Programming Language
 * libavl: AVL for Linux
 * libefi: GRUB

And these libraries are potential conflicts:

 * libshare: the Linux Mount Manager
 * libunicode: Perl and Python

Recompose these five ZoL components into the four libraries that are conventionally provided by Solaris and FreeBSD systems:

 + libnvpair
 + libuutil
 + libzpool
 + libzfs

This change resolves the name conflict, makes ZoL more compatible with existing software that uses autotools to detect ZFS, and allows pkg-zfs to better reflect the official Debian kFreeBSD packaging.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes: #430
* Linux 3.1 compat, super_block->s_shrink | Brian Behlendorf | 2012-01-11 | 1 | -0/+1
The Linux 3.1 kernel has introduced the concept of per-filesystem shrinkers which are directly associated with a super block. Prior to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when the arc_meta_limit was exceeded. This would cause the VFS to drop references on a fraction of the dentries in the dcache. The ARC could then safely reclaim the memory used by these entries and honor the arc_meta_limit. Unfortunately, when per-filesystem shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker interface so we can continue to honor the arc_meta_limit. The major benefit of the new interface is that we can now target only the zfs filesystem for dentry and inode pruning. Thus we can minimize any impact on the caching of other filesystems.

In the context of making this change several other important issues related to managing the ARC were addressed. They include:

* The dnlc_reduce_cache() function which was called by the ARC to drop dentries for the POSIX layer was replaced with a generic zfs_prune_t callback. The ZPL layer now registers a callback to drop these dentries, removing a layering violation which dates back to the Solaris code. This callback can also be used by other ARC consumers such as Lustre.

    arc_add_prune_callback()
    arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to arc_meta_prune for clarity. The dnlc functions are specific to Solaris's VFS and have already been largely eliminated. The replacement tunable now represents the number of bytes the prune callback will request when invoked.

* Less aggressively invoke the prune callback. We used to call this whenever we exceeded the arc_meta_limit, however that's not strictly correct since it results in overzealous reclaim of dentries and inodes. It is now only called once the arc_meta_limit is exceeded and every effort has been made to evict other data from the ARC cache.

* More promptly manage exceeding the arc_meta_limit. When reading meta data into the cache, if a buffer cannot be recycled, notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC is forced to request that a consumer prune its cache. Remember this will only occur when the ARC has no other choice. If it can evict buffers safely without invoking the prune callback it will.

* This change is also expected to resolve the unexpected collapses of the ARC cache. These would occur because, when just the arc_meta_limit was exceeded, reclaim pressure would be exerted on the arc_c value via arc_shrink(). This effectively shrunk the entire cache when really we just needed to reclaim meta data.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #466
Closes #292
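The prune callback scheme described above can be illustrated with a small userland sketch. The commit message names arc_add_prune_callback() and arc_remove_prune_callback() but does not show their signatures, so everything below is an assumption: the `sketch_` functions, the callback type, the list layout, and the counter are hypothetical and only model the idea of consumers registering a function that the cache invokes when it has to ask them to release memory.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback type: asked to release roughly `bytes` of memory. */
typedef void sketch_prune_func_t(int64_t bytes, void *priv);

/* One registered consumer, kept on a singly linked list. */
typedef struct sketch_prune {
    sketch_prune_func_t *p_func;
    void *p_private;
    struct sketch_prune *p_next;
} sketch_prune_t;

static sketch_prune_t *sketch_prune_list; /* registered consumers */
static uint64_t sketch_arcstat_prune;     /* models the arcstat_prune counter */

/* Register a consumer; returns a handle used later for removal. */
sketch_prune_t *sketch_add_prune_callback(sketch_prune_func_t *func, void *priv)
{
    sketch_prune_t *p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->p_func = func;
    p->p_private = priv;
    p->p_next = sketch_prune_list;
    sketch_prune_list = p;
    return p;
}

/* Unregister a consumer previously returned by sketch_add_prune_callback(). */
void sketch_remove_prune_callback(sketch_prune_t *p)
{
    sketch_prune_t **pp;
    for (pp = &sketch_prune_list; *pp != NULL; pp = &(*pp)->p_next) {
        if (*pp == p) {
            *pp = p->p_next;
            free(p);
            return;
        }
    }
}

/* Last resort: ask every consumer to prune, and count the event. */
void sketch_do_prune(int64_t bytes)
{
    sketch_prune_t *p;
    for (p = sketch_prune_list; p != NULL; p = p->p_next)
        p->p_func(bytes, p->p_private);
    sketch_arcstat_prune++;
}

/* Example consumer: accumulates the number of bytes it was asked to free. */
void sketch_count_bytes(int64_t bytes, void *priv)
{
    *(int64_t *)priv += bytes;
}
```

A filesystem layer would register once at mount time and remove its callback at unmount; a real implementation would also need locking around the list, which this sketch omits.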
* Linux 3.2 compat: set_nlink() | Darik Horn | 2011-12-16 | 1 | -0/+1
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes: #462
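A compatibility shim for this kind of change typically looks like the sketch below. It is a userland illustration under stated assumptions: `struct sketch_inode` stands in for the kernel's inode, and `HAVE_SET_NLINK` is an assumed name for the macro a configure check would define when the kernel already provides set_nlink().

```c
/* Hypothetical userland stand-in for the kernel's struct inode. */
struct sketch_inode {
    unsigned int i_nlink; /* link count */
};

#ifndef HAVE_SET_NLINK
/*
 * Fallback for kernels that lack set_nlink(): write the field directly,
 * which is what callers did before the helper existed.
 */
static inline void set_nlink(struct sketch_inode *inode, unsigned int nlink)
{
    inode->i_nlink = nlink;
}
#endif

/* Callers always go through set_nlink(), whichever definition is in use. */
void sketch_inode_update_links(struct sketch_inode *ip, unsigned int links)
{
    set_nlink(ip, links);
}
```

The point of the pattern is that call sites never touch i_nlink directly, so only the shim needs to know which kernel version is being built against.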
* Add make rule for building Arch Linux packages | Prakash Surya | 2011-12-14 | 1 | -0/+6
Added the necessary build infrastructure for building packages compatible with the Arch Linux distribution. As such, one can now run:

    $ ./configure
    $ make pkg     # Alternatively, one can run 'make arch' as well

on the Arch Linux machine to create two binary packages compatible with the pacman package manager, one for the zfs userland utilities and another for the zfs kernel modules. The new packages can then be installed by running:

    # pacman -U $package.pkg.tar.xz

In addition, source-only packages suitable for an Arch Linux chroot environment or remote builder can also be built using the 'sarch' make rule.

NOTE: Since the source dist tarball is created on the fly from the head of the build tree, its MD5 hash signature will be continually in flux. As a result, the md5sum variable was intentionally omitted from the PKGBUILD files, and the '--skipinteg' makepkg option is used. This may or may not have any serious security implications, as the source tarball is not being downloaded from an outside source.

Signed-off-by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #491
* Simplify BDI integration | Brian Behlendorf | 2011-11-08 | 1 | -0/+1
Update the code to use the bdi_setup_and_register() helper to simplify the bdi integration code. The updated code now just registers the bdi during mount and destroys it during unmount.

The only complication is that for 2.6.32 - 2.6.33 kernels the helper wasn't available so in these cases the zfs code must provide it. Luckily the bdi_setup_and_register() function is trivial.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #367
* Autogen refresh for udev changes | Brian Behlendorf | 2011-08-08 | 1 | -0/+3
Run autogen.sh using the same autotools versions as upstream:

 * autoconf-2.63
 * automake-1.11.1
 * libtool-2.2.6b
* Add backing_device_info per-filesystem | Brian Behlendorf | 2011-08-04 | 1 | -0/+1
For a long time now the kernel has been moving away from using the pdflush daemon to write 'old' dirty pages to disk. The primary reason for this is because the pdflush daemon is single threaded and can be a limiting factor for performance. Since pdflush sequentially walks the dirty inode list for each super block any delay in processing can slow down dirty page writeback for all filesystems.

The replacement for pdflush is called bdi (backing device info). The bdi system involves creating a per-filesystem control structure each with its own private sets of queues to manage writeback. The advantage is greater parallelism which improves performance and prevents a single filesystem from slowing writeback to the others.

For a long time both systems co-existed in the kernel so it wasn't strictly required to implement the bdi scheme. However, as of Linux 2.6.36 kernels the pdflush functionality has been retired.

Since ZFS already bypasses the page cache for most I/O this is only an issue for mmap(2) writes which must go through the page cache. Even then adding this missing support for newer kernels was overlooked because there are other mechanisms which can trigger writeback.

However, there is one critical case where not implementing the bdi functionality can cause problems. If an application handles a page fault it can enter the balance_dirty_pages() callpath. This will result in the application hanging until the number of dirty pages in the system drops below the dirty ratio.

Without a registered backing_device_info for the filesystem the dirty pages will not get written out. Thus the application will hang. As mentioned above this was less of an issue with older kernels because pdflush would eventually write out the dirty pages.

This change adds a backing_device_info structure to the zfs_sb_t which is already allocated per-super block. It is then registered when the filesystem is mounted and unregistered on unmount. It will not be registered for mounted snapshots which are read-only. This change will result in a flush-<pool> thread being dynamically created and destroyed per-mounted filesystem for writeback.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #174
* Provide a rc.d script for archlinux (tag: zfs-0.6.0-rc5) | Kyle Fuller | 2011-07-11 | 1 | -0/+1
Unlike most other Linux distributions archlinux installs its init scripts in /etc/rc.d instead of /etc/init.d. This commit provides an archlinux rc.d script for zfs and extends the build infrastructure to ensure it gets installed in the correct place.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #322
* Linux compat 2.6.39: mount_nodev() | Brian Behlendorf | 2011-07-01 | 1 | -0/+1
The .get_sb callback has been replaced by a .mount callback in the file_system_type structure. When using the new interface the caller must now use the mount_nodev() helper.

Unfortunately, the new interface no longer passes the vfsmount down to the zfs layers. This poses a problem for the existing implementation because we currently save this pointer in the super block for later use. It provides our only entry point into the namespace layer for manipulating certain mount options.

This needed to be done originally to allow commands like 'zfs set atime=off tank' to work properly. It also allowed me to keep more of the original Solaris code unmodified. Under Solaris there is a 1-to-1 mapping between a mount point and a file system so this is a fairly natural thing to do. However, under Linux there may be multiple entries in the namespace which reference the same filesystem. Thus keeping a back reference from the filesystem to the namespace is complicated.

Rather than introduce some ugly hack to get the vfsmount and continue as before, I'm leveraging this API change to update the ZFS code to do things in a more natural way for Linux. This has the upside that it resolves the compatibility issue for the long term and fixes several other minor bugs which have been reported.

This commit updates the code to remove this vfsmount back reference entirely. All modifications to filesystem mount options are now passed in to the kernel via a '-o remount'. This is the expected Linux mechanism and allows the namespace to properly handle any options which apply to it before passing them on to the file system itself.

Aside from fixing the compatibility issue, removing the vfsmount has had the benefit of simplifying the code. This change, while fairly involved, has turned out nicely.

Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
* Linux compat 2.6.39: security_inode_init_security() | Brian Behlendorf | 2011-07-01 | 1 | -0/+1
The security_inode_init_security() function now takes an additional qstr argument which must be passed in from the dentry if available. Passing a NULL is safe when no qstr is available; the relevant security checks will just be skipped.

Closes #246
Closes #217
Closes #187
* Tear down and flush the mmap region | Prasad Joshi | 2011-06-27 | 1 | -0/+1
The inode eviction should unmap the pages associated with the inode. These pages should also be flushed to disk to avoid data loss. Therefore, use truncate_setsize() in evict_inode() to release the pagecache.

The API truncate_setsize() was added in the 2.6.35 kernel. To ensure compatibility with older kernels, the patch defines its own truncate_setsize function.

Signed-off-by: Prasad Joshi <[email protected]>
Closes #255
* Always check -Wno-unused-but-set-variable gcc support | Brian Behlendorf | 2011-06-14 | 1 | -1/+1
The previous commit 8a7e1ceefa430988c8f888ca708ab307333b4464 wasn't quite right. This check applies to both the user and kernel space build and as such we must make sure it runs regardless of what the --with-config option is set to.

For example, if --with-config=kernel then the autoconf test does not run and we generate build warnings when compiling the kernel packages.
* Check for -Wno-unused-but-set-variable gcc support | Brian Behlendorf | 2011-06-14 | 1 | -1/+3
Gcc versions 4.3.2 and earlier do not support the compiler flag -Wno-unused-but-set-variable. This can lead to build failures on older Linux platforms such as Debian Lenny. Since this is an optional build argument this change adds a new autoconf check for the option. If it is supported by the installed version of gcc then it is used, otherwise it is omitted.

See commits 12c1acde76683108441827ae9affba1872f3afe5 and 79713039a2b6e0ed223d141b4a8a8455f282d2f2 for the reason the -Wno-unused-but-set-variable option was originally added.