aboutsummaryrefslogtreecommitdiffstats
path: root/src/mesa/drivers/dri/i965/brw_compute.c
Commit message (Collapse)AuthorAgeFilesLines
* i965: Implement ARB_compute_variable_group_sizePlamena Manolova2020-04-091-0/+18
| | | | | | | | | | | | | This patch adds the implementation of ARB_compute_variable_group_size for i965. We do this by storing the local group size in a push constant. Additional changes made by Caio Marcelo de Oliveira Filho. Signed-off-by: Plamena Manolova <[email protected]> Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]> Reviewed-by: Jordan Justen <[email protected]> Reviewed-by: Paulo Zanoni <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
* i965/compute: Emit GPGPU_WALKER in genX_state_uploadJordan Justen2018-12-121-130/+1
| | | | | Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Jason Ekstrand <[email protected]>
* i965/batch: avoid reverting batch buffer if saved state is an emptyAndrii Simiklit2018-11-201-1/+2
| | | | | | | | | | | | | | | | | | | | | There's no point reverting to the last saved point if that save point is the empty batch, we will just repeat ourselves. v2: Merge with new commits, changes was minimized, added the 'fixes' tag v3: Added in to patch series v4: Fixed the regression which was introduced by this patch Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108630 Reported-by: Mark Janes <[email protected]> The solution provided by: Jordan Justen <[email protected]> CC: Chris Wilson <[email protected]> Fixes: 3faf56ffbdeb "intel: Add an interface for saving/restoring the batchbuffer state." Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107626 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108630 (fixed in v4) Signed-off-by: Andrii Simiklit <[email protected]> Reviewed-by: Jordan Justen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* Revert "i965/batch: avoid reverting batch buffer if saved state is an empty"Mark Janes2018-11-011-2/+1
| | | | | | This reverts commit a9031bf9b55602d93cccef6c926e2179c23205b4. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108630
* i965/batch: avoid reverting batch buffer if saved state is an emptyAndrii Simiklit2018-10-301-1/+2
| | | | | | | | | | | | There's no point reverting to the last saved point if that save point is the empty batch, we will just repeat ourselves. CC: Chris Wilson <[email protected]> Fixes: 3faf56ffbdeb "intel: Add an interface for saving/restoring the batchbuffer state." Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107626 Reviewed-by: Jordan Justen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Remove ring switching entirelyJason Ekstrand2018-05-221-1/+1
| | | | | Reviewed-by: Topi Pohjolainen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Replace draw_aux_buffer_disabled with draw_aux_usageJason Ekstrand2018-01-241-1/+1
| | | | | | | | | | | Instead of keeping an array of booleans, we now hang onto an array of isl_aux_usage enums. This means that the thing we are passing from brw_draw.c to surface state setup is the thing that surface state setup actually needs instead of an input to compute what it needs. Cc: [email protected] Reviewed-by: Topi Pohjolainen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Don't disable CCS for RT dependencies when dispatching compute.Kenneth Graunke2017-10-241-1/+1
| | | | | | | | | | | Compute shaders don't have access to the framebuffer, so there's no point in worrying whether a texture is bound as a render target. This saves a bunch of resolves in GFXBench4 Manhattan 3.1, but doesn't seem to impact performance at all, at least on Apollolake. Reviewed-by: Jason Ekstrand <[email protected]> Reviewed-by: Iago Toral Quiroga <[email protected]>
* i965: Rename brw->no_batch_wrap to intel_batchbuffer::no_wrapKenneth Graunke2017-10-131-2/+2
| | | | | | This really makes more sense in the intel_batchbuffer struct. Reviewed-by: Chris Wilson <[email protected]>
* i965: Disentangle batch and state buffer flushing.Kenneth Graunke2017-09-141-14/+4
| | | | | | | | | | | | | | | | | | | | | | We now flush the batch when either the batchbuffer or statebuffer reaches the original intended batch size, instead of when the sum of the two reaches a certain size (which makes no sense now that they're separate buffers). With this change, we also need to update our "are we near the end?" estimate to require separate batch and state buffer space. I obtained these estimates by looking at the size of draw calls in the Unreal 4 Elemental Demo (using INTEL_DEBUG=flush and always_flush_batch=true). This will significantly impact the size of our batches. I've adjusted both down to try and be roughly similar to what we had been doing. On various benchmarks, a 20kB batch and 16kB statebuffer seemed to about right, but we may need to adjust this further. I tried a 16kB batch, but that regressed Synmark OglMultithread performance by a fair bit. 32kB for both would have significantly increased our batch sizes. Reviewed-by: Matt Turner <[email protected]> Reviewed-by: Chris Wilson <[email protected]>
* i965: drop brw->gen in favor of devinfo->genLionel Landwerlin2017-08-301-6/+8
| | | | | | Signed-off-by: Lionel Landwerlin <[email protected]> Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]> Reviewed-by: Emil Velikov <[email protected]>
* i965: Reduce passing 2x32b of reloc_domains to 2 bitsChris Wilson2017-08-041-18/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel only cares about whether the object is to be written to or not, only reduces (reloc.read_domains, reloc.write_domain) down to just !!reloc.write_domain. When we use NO_RELOC, the kernel doesn't even read those relocs and instead userspace has to pass that information in the execobject.flags. We can simplify our reloc api by also removing the unused read/write domains and only pass the resultant flags. The caveat to the above are when we need to make the kernel aware that certain objects need to take into account different work arounds. Previously, this was done using the magic (INSTRUCTION, INSTRUCTION) reloc domains. NO_RELOC requires this to be passed in the execobject flags as well, and now we push that up the callstack. The API is more compact, more expressive of what happens underneath, but unfortunately requires more knowledge of the system at the point of use. Conversely it also means that knowledge is specific and not generally applied and so not overused. text data bss dec hex filename 8502991 356912 424944 9284847 8dacef lib/i965_dri.so (before) 8500455 356912 424944 9282311 8da307 lib/i965_dri.so (after) v2: (by Ken) Rebase. Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Add a "write" parameter to intel_bufferobj_buffer.Kenneth Graunke2017-07-131-1/+1
| | | | | | | This doesn't do anything yet, but soon we'll want to know whether an access to a buffer section may write that data, or simply reads it. Reviewed-by: Jason Ekstrand <[email protected]>
* i965: Move surface resolves back to draw/dispatch timeJason Ekstrand2017-07-051-0/+2
| | | | | | | | | | | | | | | | | This is effectively a revert of 388f02729bbf88ba104f4f8ee1fdf005a240969c though much code has been added since. Kristian initially moved it to try and avoid locking problems with meta-based resolves. Now that meta is gone from the resolve path (for good this time, we hope), we can move it back. The problem with having it in intel_update_state was that the UpdateState hook gets called by core mesa directly and all sorts of things will cause a UpdateState to get called which may trigger resolves at inopportune times. In particular, it gets called by _mesa_Clear and, if we have a HiZ buffer in the INVALID_AUX state, causes a HiZ resolve right before the clear which is pointless. By moving it back to try_draw_prims time, we know it will only get called right before a draw which is where we want it. Reviewed-by: Kenneth Graunke <[email protected]>
* i965/drm: Rename drm_bacon_bo to brw_bo.Kenneth Graunke2017-04-101-2/+2
| | | | | | | | | | The bacon is all gone. This renames both the class and the related functions. We're about to run indent on the bufmgr code, so no need to worry about fixing bad indentation. Acked-by: Jason Ekstrand <[email protected]>
* i965/drm: Rewrite relocation handling.Kenneth Graunke2017-04-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The execbuf2 kernel API requires us to construct two kinds of lists. First is a "validation list" (struct drm_i915_gem_exec_object2[]) containing each BO referenced by the batch. (The batch buffer itself must be the last entry in this list.) Each validation list entry contains a pointer to the second kind of list: a relocation list. The relocation list contains information about pointers to BOs that the kernel may need to patch up if it relocates objects within the VMA. This is a very general mechanism, allowing every BO to contain pointers to other BOs. libdrm_intel models this by giving each drm_intel_bo a list of relocations to other BOs. Together, these form "reloc trees". Processing relocations involves a depth-first-search of the relocation trees, starting from the batch buffer. Care has to be taken not to double-visit buffers. Creating the validation list has to be deferred until the last minute, after all relocations are emitted, so we have the full tree present. Calculating the amount of aperture space required to pin those BOs also involves tree walking, which is expensive, so libdrm has hacks to try and perform less expensive estimates. For some reason, it also stored the validation list in the global (per-screen) bufmgr structure, rather than as an local variable in the execbuffer function, requiring locking for no good reason. It also assumed that the batch would probably contain a relocation every 2 DWords - which is absurdly high - and simply aborted if there were more relocations than the max. This meant the first relocation from a BO would allocate 180kB of data structures! This is way too complicated for our needs. i965 only emits relocations from the batchbuffer - all GPU commands and state such as SURFACE_STATE live in the batch BO. No other buffer uses relocations. This means we can have a single relocation list for the batchbuffer. We can add a BO to the validation list (set) the first time we emit a relocation to it. We can easily keep a running tally of the aperture space required for that list by adding the BO size when we add it to the validation list. This patch overhauls the relocation system to do exactly that. There are many nice benefits: - We have a flat relocation list instead of trees. - We can produce the validation list up front. - We can allocate smaller arrays and dynamically grow them. - Aperture space checks are now (a + b <= c) instead of a tree walk. - brw_batch_references() is a trivial validation list walk. It should be straightforward to make it O(1) in the future. - We don't need to bloat each drm_bacon_bo with 32B of reloc data. - We don't need to lock in execbuffer, as the data structures are context-local, and not per-screen. - Significantly less code and a better match for what we're doing. - The simpler system should make it easier to take advantage of I915_EXEC_NO_RELOC in a future patch. Improves performance in Synmark 7.0's OglBatch7: - Skylake GT4e: 12.1499% +/- 2.29531% (n=130) - Apollolake: 3.89245% +/- 0.598945% (n=35) Improves performance in GFXBench4's gl_driver2 test: - Skylake GT4e: 3.18616% +/- 0.867791% (n=229) - Apollolake: 4.1776% +/- 0.240847% (n=120) v2: Feedback from Chris Wilson: - Omit explicit zero initializers for garbage execbuf fields. - Use .rsvd1 = ctx_id rather than i915_execbuffer2_set_context_id - Drop unnecessary fencing assertions. - Only use _WR variant of execbuf ioctl when necessary. - Shrink the arrays to be smaller by default. Reviewed-by: Jason Ekstrand <[email protected]>
* i965/drm: Use our internal libdrm (drm_bacon) rather than the real one.Kenneth Graunke2017-04-101-3/+3
| | | | | | Now we can actually test our changes. Acked-by: Jason Ekstrand <[email protected]>
* i965: Stop using legacy dri_bufmgr_* and intel_* names.Kenneth Graunke2017-03-301-1/+1
| | | | | | | | Eric renamed these from dri_bufmgr_* and intel_bufmgr_* to drm_intel_* in libdrm commit 4b9826408f65976a1a13387beda748b65e03ec52, circa 2008, but we've been using the legacy names this whole time. Reviewed-by: Jason Ekstrand <[email protected]>
* i965: Use WARN_ONCE instead of open coding it.Kenneth Graunke2017-03-301-9/+4
| | | | Reviewed-by: Jason Ekstrand <[email protected]>
* i965: rename brw_state_cache_check_size() to brw_program_cache_check_size()Timothy Arceri2016-11-111-1/+1
| | | | Acked-by: Kenneth Graunke <[email protected]>
* i965: Eliminate brw->cs.prog_data pointer.Kenneth Graunke2016-10-051-1/+2
| | | | | | | | | | | | Just say no to: - brw->cs.base.prog_data = &brw->cs.prog_data->base.base; We'll just use the brw_stage_prog_data pointer in brw_stage_state and downcast it to brw_cs_prog_data as needed. Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Timothy Arceri <[email protected]>
* i965: fix unused variable warning in brw_emit_gpgpu_walker()Timothy Arceri2016-10-051-2/+1
| | | | Reviewed-by: Iago Toral Quiroga <[email protected]>
* i965: get rid of duplicated values from gen_device_infoLionel Landwerlin2016-09-231-1/+2
| | | | | | | | Now that we have gen_device_info mutable, we can update its values and drop all copies we had in brw_context. Signed-off-by: Lionel Landwerlin <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965/gen7: Use predicated rendering for indirect computeJordan Justen2016-02-171-14/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | On gen7 (Ivy Bridge, Haswell), we will get a GPU hang if an indirect dispatch is used, but one of the dimensions is 0. Therefore we use predicated rendering on the GPGPU_WALKER command to handle this case. Fixes piglit test: spec/arb_compute_shader/zero-dispatch-size From the ARB_compute_shader spec, under DispatchCompute: "If the work group count in any dimension is zero, no work groups are dispatched." And then for DispatchComputeIndirect: ... "is equivalent (assuming no errors are generated) to calling DispatchCompute with <num_groups_x>, <num_groups_y> and <num_groups_z>" ... Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94100 Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Ben Widawsky <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]> Tested-by: Ilia Mirkin <[email protected]>
* i965: Drop #include of main/glheader.h.Matt Turner2015-11-241-1/+0
| | | | | | It's never used. Reviewed-by: Ian Romanick <[email protected]>
* i965/cs: Setup surface binding for gl_NumWorkGroupsJordan Justen2015-09-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | This will only be setup when the prog_data uses_num_work_groups boolean is set. At this point nothing will set uses_num_work_groups, but soon code will set it when emitting code for the intrinsic that loads gl_NumWorkGroups. We can't emit this surface information earlier at the start of the DispatchCompute* call because we may not have generated the program yet. Until we generate the program, we don't know if the gl_NumWorkGroups variable is accessed. We also can't emit the surface as part of the brw_cs_state atom, because we might not need the surface if gl_NumWorkGroups is not used by the program. Lastly, we cannot emit the surface later (after state upload) in the DispatchCompute* call, because it needs to be run before the brw_cs_state atom is emitted, since it changes the surface state. Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Kristian Høgsberg <[email protected]>
* i965/cs: Store compute invocation information in brw contextJordan Justen2015-09-291-24/+26
| | | | | | | | We will need this in an atom to setup a surface to read the gl_NumWorkGroups values from. Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Kristian Høgsberg <[email protected]>
* i965/cs: Implement DispatchComputeIndirect supportJordan Justen2015-09-241-4/+53
| | | | | Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Kristian Høgsberg <[email protected]>
* i965/compute: Fix undefined code with right_mask for SIMD32Jordan Justen2015-06-181-1/+1
| | | | | | | | | Although we don't support SIMD32, krh pointed out that the left shift by 32 is undefined by C/C++ for 32-bit integers. Suggested-by: Kristian Høgsberg <[email protected]> Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965/cs: Emit MEDIA_STATE_FLUSH after WALKERJordan Justen2015-05-021-0/+5
| | | | | Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965/cs: Implement brw_emit_gpgpu_walkerJordan Justen2015-05-021-1/+38
| | | | | | | | | | | | | Tested on Ivybridge, Haswell and Broadwell. v2: * Use SET_FIELD. (Ken) * Use simd_size / 16 to support SIMD8/16/32. Ken suggested that we might be able to do it arithmetically rather than just supporting SIMD8 and SIMD16 with a conditional. Signed-off-by: Jordan Justen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Implement DispatchCompute() back-endPaul Berry2015-05-021-0/+121
brw_emit_gpgpu_walker will be implemented in a subsequent patch. Reviewed-by: Jordan Justen <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>