summaryrefslogtreecommitdiffstats
path: root/include/sys/Makefile.am
Commit message (Collapse)AuthorAgeFilesLines
* Illumos 5960, 5925Paul Dagnelie2016-01-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5960 zfs recv should prefetch indirect blocks 5925 zfs receive -o origin= Reviewed by: Prakash Surya <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> References: https://www.illumos.org/issues/5960 https://www.illumos.org/issues/5925 https://github.com/illumos/illumos-gate/commit/a2cdcdd Porting notes: - [lib/libzfs/libzfs_sendrecv.c] - b8864a2 Fix gcc cast warnings - 325f023 Add linux kernel device support - 5c3f61e Increase Linux pipe buffer size on 'zfs receive' - [module/zfs/zfs_vnops.c] - 3558fd7 Prototype/structure update for Linux - c12e3a5 Restructure zfs_readdir() to fix regressions - [module/zfs/zvol.c] - Function @zvol_map_block() isn't needed in ZoL - 9965059 Prefetch start and end of volumes - [module/zfs/dmu.c] - Fixed ISO C90 - mixed declarations and code - Function dmu_prefetch() 'int i' is initialized before the following code block (c90 vs. c99) - [module/zfs/dbuf.c] - fc5bb51 Fix stack dbuf_hold_impl() - 9b67f60 Illumos 4757, 4913 - 34229a2 Reduce stack usage for recursive traverse_visitbp() - [module/zfs/dmu_send.c] - Fixed ISO C90 - mixed declarations and code - b58986e Use large stacks when available - 241b541 Illumos 5959 - clean up per-dataset feature count code - 77aef6f Use vmem_alloc() for nvlists - 00b4602 Add linux kernel memory support Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]>
* Add temporary mount optionsBrian Behlendorf2015-09-031-0/+1
| | | | | | | | | | | | | Add the required kernel side infrastructure to parse arbitrary mount options. This enables us to support temporary mount options in largely the same way it is handled on other platforms. See the 'Temporary Mount Point Properties' section of zfs(8) for complete details. Signed-off-by: Brian Behlendorf <[email protected]> Closes #985 Closes #3351
* Illumos 5497 - lock contention on arcs_mtxPrakash Surya2015-06-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> Porting notes and other significant code changes: The illumos 5368 patch (ARC should cache more metadata), which was never picked up by ZoL, is mostly reverted by this patch. Since ZoL relies on the kernel asynchronously calling the shrinker to actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every time it runs. The arc_adapt_thread() function no longer calls arc_do_user_evicts() since the newly-added arc_user_evicts_thread() calls it periodically. Notable conflicting ZoL commits which conflicted with this patch or whose effects are either duplicated or un-done by this patch: 302f753 - Integrate ARC more tightly with Linux 39e055c - Adjust arc_p based on "bytes" in arc_shrink f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c" 77765b5 - Remove "arc_meta_used" from arc_adjust calculation 94520ca - Prune metadata from ghost lists in arc_adjust_meta Trace support for multilist_insert() and multilist_remove() has been added and produces the following output: fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 } fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 } The following arcstats have been removed: recycle_miss - Used by arcstat.py and arc_summary.py, both of which have been updated appropriately. l2_writes_hdr_miss The following arcstats have been added: evict_not_enough - Number of times arc_evict_state() was unable to evict enough buffers to reach its target amount. evict_l2_skip - Number of times arc_evict_hdr() skipped eviction because it was being written to the l2arc. l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times l2arc_write_done() failed to acquire hash_lock (and re-tries). arc_meta_min - Shows the value of the zfs_arc_meta_min module parameter (see below). The "index" column of the "dbuf" kstat has been removed since it doesn't have a direct analog in the new multilist scheme. Additional multilist- related stats could be added in the future but would likely require extensions to the mulilist API. The following module parameters have been added: zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list before moving on to the next sub-list. zfs_arc_meta_min - Enforce a floor on the amount of metadata in the ARC. zfs_arc_num_sublists_per_state - Number of multilist sub-lists per ARC state. zfs_arc_overflow_shift - Controls amount by which the ARC must exceed the target size to be considered "overflowing". Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]
* Remove duplicate typedefs from trace.hNed Bass2015-01-061-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate typedef declarations with the same type. The trace.h header contains some typedefs to avoid 'unknown type' errors for C files that haven't declared the type in question. But this causes build failures for C files that have already declared the type. Newer versions of GCC (e.g. v4.6) allow duplicate typedefs with the same type unless pedantic error checking is in force. To support the older versions we need to remove the duplicate typedefs. Removal of the typedefs means we can't built tracepoints code using those types unless the required headers have been included. To facilitate this, all tracepoint event declarations have been moved out of trace.h into separate headers. Each new header is explicitly included from the C file that uses the events defined therein. The trace.h header is still indirectly included form zfs_context.h and provides the implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces. This makes those interfaces readily available throughout the code base. The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also still provided by trace.h, so it is a prerequisite for the other trace_*.h headers. These new Linux implementation-specific headers do introduce a small divergence from upstream ZFS in several core C files, but this should not present a significant maintenance burden. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2953
* Swap DTRACE_PROBE* with Linux tracepointsPrakash Surya2014-11-171-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch leverages Linux tracepoints from within the ZFS on Linux code base. It also refactors the debug code to bring it back in sync with Illumos. The information exported via tracepoints can be used for a variety of reasons (e.g. debugging, tuning, general exploration/understanding, etc). It is advantageous to use Linux tracepoints as the mechanism to export this kind of information (as opposed to something else) for a number of reasons: * A number of external tools can make use of our tracepoints "automatically" (e.g. perf, systemtap) * Tracepoints are designed to be extremely cheap when disabled * It's one of the "accepted" ways to export this kind of information; many other kernel subsystems use tracepoints too. Unfortunately, though, there are a few caveats as well: * Linux tracepoints appear to only be available to GPL licensed modules due to the way certain kernel functions are exported. Thus, to actually make use of the tracepoints introduced by this patch, one might have to patch and re-compile the kernel; exporting the necessary functions to non-GPL modules. * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux tracepoints are not available for unsigned kernel modules (tracepoints will get disabled due to the module's 'F' taint). Thus, one either has to sign the zfs kernel module prior to loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or newer. Assuming the above two requirements are satisfied, lets look at an example of how this patch can be used and what information it exposes (all commands run as 'root'): # list all zfs tracepoints available $ ls /sys/kernel/debug/tracing/events/zfs enable filter zfs_arc__delete zfs_arc__evict zfs_arc__hit zfs_arc__miss zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write zfs_new_state__mfu zfs_new_state__mru # enable all zfs tracepoints, clear the tracepoint ring buffer $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable $ echo 0 > /sys/kernel/debug/tracing/trace # import zpool called 'tank', inspect tracepoint data (each line was # truncated, they're too long for a commit message otherwise) $ zpool import tank $ cat /sys/kernel/debug/tracing/trace | head -n35 # tracer: nop # # entries-in-buffer/entries-written: 1219/1219 #P:8 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr... z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr... z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr... z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr... z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr... z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr... z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr... z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ... To highlight the kind of detailed information that is being exported using this infrastructure, I've taken the first tracepoint line from the output above and reformatted it such that it fits in 80 columns: lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr { dva 0x1:0x40082 birth 15491 cksum0 0x163edbff3a flags 0x640 datacnt 1 type 1 size 2048 spa 3133524293419867460 state_type 0 access 0 mru_hits 0 mru_ghost_hits 0 mfu_hits 0 mfu_ghost_hits 0 l2_hits 0 refcount 1 } bp { dva0 0x1:0x40082 dva1 0x1:0x3000e5 dva2 0x1:0x5a006e cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00 lsize 2048 } zb { objset 0 object 0 level -1 blkid 0 } For the specific tracepoint shown here, 'zfs_arc__miss', data is exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!). This kind of precise and detailed information can be extremely valuable when trying to answer certain kinds of questions. For anybody unfamiliar but looking to build on this, I found the XFS source code along with the following three web links to be extremely helpful: * http://lwn.net/Articles/379903/ * http://lwn.net/Articles/381064/ * http://lwn.net/Articles/383362/ I should also node the more "boring" aspects of this patch: * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to support a sixth paramter. This parameter is used to populate the contents of the new conftest.h file. If no sixth parameter is provided, conftest.h will be empty. * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced. This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro, except it has support for a fifth option that is then passed as the sixth parameter to ZFS_LINUX_COMPILE_IFELSE. These autoconf changes were needed to test the availability of the Linux tracepoint macros. Due to the odd nature of the Linux tracepoint macro API, a separate ".h" must be created (the path and filename is used internally by the kernel's define_trace.h file). * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This is to determine if we can safely enable the Linux tracepoint functionality. We need to selectively disable the tracepoint code due to the kernel exporting certain functions as GPL only. Without this check, the build process will fail at link time. In addition, the SET_ERROR macro was modified into a tracepoint as well. To do this, the 'sdt.h' file was moved into the 'include/sys' directory and now contains a userspace portion and a kernel space portion. The dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as well. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Move a few internal ARC strucutres to arc_impl.hNed Bass2014-11-171-0/+1
| | | | | | | | | | | Add a new file named arc_impl.h and move a few internal ARC structure definitions into this file. This is needed in order to allow the Linux tracepoint functions to grub around in the internals of these structures. Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Kernel header installation should respect --prefixRichard Yao2014-10-281-1/+1
| | | | | | | | | | This is the upstream component of work that enables preliminary support for building Gentoo's ZFS packaging on other Linux systems via Gentoo Prefix. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #2641
* Illumos 4757, 4913Matthew Ahrens2014-08-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Max Grossman <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/4757 https://www.illumos.org/issues/4913 https://github.com/illumos/illumos-gate/commit/5d7b4d4 Porting notes: For compatibility with the fastpath code the zio_done() function needed to be updated. Because embedded-data block pointers do not require DVAs to be allocated the associated vdevs will not be marked and therefore should not be unmarked. Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2544
* Illumos 4368, 4369.Matthew Ahrens2014-07-291-0/+1
| | | | | | | | | | | | | | | | | 4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/4369 https://www.illumos.org/issues/4368 https://github.com/illumos/illumos-gate/commit/78f1710 Ported by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2530
* Illumos #4101, #4102, #4103, #4105, #4106George Wilson2014-07-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Garrett D'Amore <[email protected]> Prior to this patch, space_maps were preferred solely based on the amount of free space left in each. Unfortunately, this heuristic didn't contain any information about the make-up of that free space, which meant we could keep preferring and loading a highly fragmented space map that wouldn't actually have enough contiguous space to satisfy the allocation; then unloading that space_map and repeating the process. This change modifies the space_map's to store additional information about the contiguous space in the space_map, so that we can use this information to make a better decision about which space_map to load. This requires reallocating all space_map objects to increase their bonus buffer size sizes enough to fit the new metadata. The above feature can be enabled via a new feature flag introduced by this change: com.delphix:spacemap_histogram In addition to the above, this patch allows the space_map block size to be increase. Currently the block size is set to be 4K in size, which has certain implications including the following: * 4K sector devices will not see any compression benefit * large space_maps require more metadata on-disk * large space_maps require more time to load (typically random reads) Now the space_map block size can adjust as needed up to the maximum size set via the space_map_max_blksz variable. A bug was fixed which resulted in potentially leaking an object when removing a mirrored log device. The previous logic for vdev_remove() did not deal with removing top-level vdevs that are interior vdevs (i.e. mirror) correctly. The problem would occur when removing a mirrored log device, and result in the DTL space map object being leaked; because top-level vdevs don't have DTL space map objects associated with them. References: https://www.illumos.org/issues/4101 https://www.illumos.org/issues/4102 https://www.illumos.org/issues/4103 https://www.illumos.org/issues/4105 https://www.illumos.org/issues/4106 https://github.com/illumos/illumos-gate/commit/0713e23 Porting notes: A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also, the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary. Ported-by: Tim Chase <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #2488
* Illumos #4045 write throttle & i/o scheduler performance workMatthew Ahrens2013-12-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4045 zfs write throttle & i/o scheduler performance work 1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithem (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details. 2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details. This diff has several other effects, including: * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead. * the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this. * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up). --matt APPENDIX: problems with the current i/o scheduler The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds). If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes). Notes on porting to ZFS on Linux: - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields. - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose. - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above. - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic. - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page. spa_asize_inflation zfs_deadman_synctime_ms zfs_vdev_max_active zfs_vdev_async_write_active_min_dirty_percent zfs_vdev_async_write_active_max_dirty_percent zfs_vdev_async_read_max_active zfs_vdev_async_read_min_active zfs_vdev_async_write_max_active zfs_vdev_async_write_min_active zfs_vdev_scrub_max_active zfs_vdev_scrub_min_active zfs_vdev_sync_read_max_active zfs_vdev_sync_read_min_active zfs_vdev_sync_write_max_active zfs_vdev_sync_write_min_active zfs_dirty_data_max_percent zfs_delay_min_dirty_percent zfs_dirty_data_max_max_percent zfs_dirty_data_max zfs_dirty_data_max_max zfs_dirty_data_sync zfs_delay_scale The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures. The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults. - Fixed reversed logic in comment above zfs_delay_scale declaration. - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect. - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen). - The original patch would have regressed the bug fixed in zfsonlinux/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate(). - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack. Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Ned Bass <[email protected]> Reviewed by: Brendan Gregg <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: http://www.illumos.org/issues/4045 illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e Ported-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1913
* Illumos #3464Matthew Ahrens2013-09-041-1/+4
| | | | | | | | | | | | | | | | | 3464 zfs synctask code needs restructuring Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Christopher Siden <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/3464 illumos/illumos-gate@3b2aab18808792cbd248a12f1edf139b89833c13 Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1495
* Change zfs-kmod-devel install pathBrian Behlendorf2013-03-131-1/+1
| | | | | | | | | | | | | | | Install the common zfs kernel development headers under /usr/src/zfs-<version>/ rather than in a kernel specific directory. The kernel specific build products such as zfs_config.h and Modules.symvers are left installed under /usr/src/zfs-<version>/<kernel>. This was done to be consistent with where dkms expects kernel module source to be packaged. It also allows for a common zfs-kmod-devel package which includes the headers, and per-kernel zfs-kmod-devel-<kernel> packages. Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #2619 and #2747Christopher Siden2013-01-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | 2619 asynchronous destruction of ZFS file systems 2747 SPA versioning with zfs feature flags Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Lowe <[email protected]> Reviewed by: Dan Kruchinin <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@53089ab7c84db6fb76c16ca50076c147cda11757 illumos/illumos-gate@ad135b5d644628e791c3188a6ecbd9c257961ef8 illumos changeset: 13700:2889e2596bd6 https://www.illumos.org/issues/2619 https://www.illumos.org/issues/2747 NOTE: The grub specific changes were not ported. This change must be made to the Linux grub packages. Ported-by: Brian Behlendorf <[email protected]>
* Add .zfs control directoryBrian Behlendorf2012-03-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for the .zfs control directory. This was accomplished by leveraging as much of the existing ZFS infrastructure as posible and updating it for Linux as required. The bulk of the core functionality is now all there with the following limitations. *) The .zfs/snapshot directory automount support requires a 2.6.37 or newer kernel. The exception is RHEL6.2 which has backported the d_automount patches. *) Creating/destroying/renaming snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory works as expected. However, this functionality is only available to root until zfs delegations are finished. * mkdir - create a snapshot * rmdir - destroy a snapshot * mv - rename a snapshot The following issues are known defeciences, but we expect them to be addressed by future commits. *) Add automount support for kernels older the 2.6.37. This should be possible using follow_link() which is what Linux did before. *) Accessing the .zfs/snapshot directory via NFS is not yet possible. The majority of the ground work for this is complete. However, finishing this work will require resolving some lingering integration issues with the Linux NFS kernel server. *) The .zfs/shares directory exists but no futher smb functionality has yet been implemented. Contributions-by: Rohan Puri <[email protected]> Contributiobs-by: Andrew Barnes <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #173
* Add the release component to headersBrian Behlendorf2012-01-181-1/+1
| | | | | | | | | When the original build system code was added the release component was accidentally omited from the development header install path. This patch adds the missing path component so it's always clear exactly what release your compiling against. Signed-off-by: Brian Behlendorf <[email protected]>
* Add xvattr supportBrian Behlendorf2011-03-021-0/+1
| | | | | | | | | | | | | | | With the removal of the minimal xvattr support from the spl this support needs to be replaced in the zfs package. This is fairly easily accomplished by directly adding portions of the sys/vnode.h header from OpenSolaris. These xvattr additions have been placed in the sys/xvattr.h header file and included as needed where simply a sys/vnode.h was included before. In additon to the xvattr types and helper macros two functions were also included. The xva_init() and xva_getxoptattr() functions were included as static inline functions in xvattr.h. They are simple enough and it was simpler to place them here rather than in their own .c file.
* Linux 2.6.x compat, blkdev_compat.hBrian Behlendorf2011-02-231-1/+0
| | | | | | | | | | For legacy reasons the zvol.c and vdev_disk.c Linux compatibility code ended up in sys/blkdev.h and sys/vdev_disk.h headers. While there are worse places for this code to live it should be in a linux/blkdev_compat.h header. This change moves this block device Linux compatibility code in to the linux/blkdev_compat.h header and updates all the correct #include locations. This is not a functional change or bug fix, it is just code cleanup.
* Add Hooks for Linux File OperationsBrian Behlendorf2011-02-101-0/+1
| | | | | | | | | | | | | The Linux specific file operations have all been located in the file zpl_file.c. These functions primarily rely on the reworked zfs_* functions to do their job. They are also responsible for converting the possible Solaris style error codes to negative Linux errors. This first zpl_* commit also includes a common zpl.h header with minimal entries to register the Linux specific hooks. In also adds all the new zpl_* file to the Makefile.in. This is not a standalone commit, you required the following zpl_* commits.
* Remove zfs_ctldir.[ch]Brian Behlendorf2011-02-101-1/+0
| | | | | | | This code is used for snapshot and heavily leverages Solaris functionality we do not want to reimplement. These files have been removed, including references to them, and will be replaced by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
* Add basic uio supportBrian Behlendorf2011-02-101-0/+1
| | | | | | | | This code originates in OpenSolaris and was modified by KQ Infotech to be compatible with Linux. While supporting uios in the short term is useful to get something working this is not an abstraction we want to keep. This code is expected to be short lived and removed as soon as all the remaining uio based APIs and updated.
* Export required vfs/vn symbolsBrian Behlendorf2011-02-101-0/+1
|
* Support custom build directories and move includesBrian Behlendorf2010-09-081-0/+91
One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.