path: root/src/intel
Commit message (Author, Date, Files, Lines changed)
* Revert "i965/fs: Merge CMP and SEL into CSEL on Gen8+"Jason Ekstrand2019-11-202-108/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 52c7df1643ec9af119fd66f916f7fbdbcc798d2d. The pass, while clearly useful for some shaders, has at least three bugs that I was able to find fairly quickly: 1. It doesn't work for type-converting MOVs because f > 0 is not the same as f2i(f) > 0 2. CSEL is a 3src instruction and only supports one source type; it doesn't take this into account and tries to create instructions which do a F compare and a D select. This is especially nasty to debug because you don't see that in the dumped assembly because we don't properly assert that types are the same in codegen. 3. While you can handle 2, in theory, by reinterpreting types, you can't do that in the presence of source modifiers. This pass doesn't even attempt to detect that. Those are just the ones I found with the one almost trival shader I was debugging. There very likely may be more and. Best thing to do for now is just shut it off until someone has the time to figure out how to do this properly and write tests to ensure it's correct. Fixes: 3cb085e6d61a "i965/fs: Merge CMP and SEL into CSEL on Gen8+" Reviewed-by: Brian Paul <[email protected]>
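Bug 1 in the list above is easy to see with a one-line example. A minimal standalone C illustration (not driver code) of why comparing the float source is not equivalent to comparing the type-converted result:

    #include <stdio.h>

    int main(void)
    {
       float f = 0.5f;

       int cmp_float = (f > 0.0f);   /* 1: 0.5 is greater than 0.0 */
       int cmp_f2i   = ((int)f > 0); /* 0: f2i(0.5) truncates to 0 */

       printf("f > 0: %d, f2i(f) > 0: %d\n", cmp_float, cmp_f2i);
       return 0;
    }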
* nir: move data.image.access to data.access (Marek Olšák, 2019-11-19, 2 files, -4/+4)

  The size of the data structure doesn't change.

  Reviewed-by: Connor Abbott <[email protected]>
* anv: add missing "fall-through" annotation (Eric Engestrom, 2019-11-19, 1 file, -0/+1)

  CoverityID: 1455884
  Fixes: c1c346f1667375e9330a ("anv: implement VK_KHR_separate_depth_stencil_layouts")
  Signed-off-by: Eric Engestrom <[email protected]>
  Acked-by: Jason Ekstrand <[email protected]>
* intel: Add workaround for stencil state. (Rafael Antognolli, 2019-11-19, 2 files, -0/+28)

  Reviewed-by: Jordan Justen <[email protected]>
  Reviewed-by: Sagar Ghuge <[email protected]>
* intel/compiler: Don't change hstride if not needed (Iván Briano, 2019-11-18, 1 file, -5/+6)

  Alignment requirements may have changed the horizontal stride already, so
  don't set it if not required to avoid breaking said requirements.

  Fixes several tests such as
  dEQP-VK.subgroups.vote.graphics.subgroupallequal_int8_t

  Signed-off-by: Iván Briano <[email protected]>
  Reviewed-by: Paulo Zanoni <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: Emit a NULL vertex for zero base_vertex/instance (Jason Ekstrand, 2019-11-18, 1 file, -11/+16)

  If both are zero (the common case), we can emit a null vertex buffer
  rather than emitting a vertex buffer with zeros in it. The packing of the
  VERTEX_BUFFER_STATE is faster because no relocation is emitted and we can
  avoid creating the vertex buffer which means one less
  anv_state_stream_alloc.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Use an anv_state for the next binding table (Jason Ekstrand, 2019-11-18, 2 files, -12/+15)

  This is a bit more natural because we're already getting an anv_state
  most places in the pipeline. The important part here, however, is that
  we're no longer calling anv_block_pool_map on every alloc_binding_table
  call. While it's probably pretty cheap, it is potentially a linear walk
  over the list of BOs and it was showing up in profiles.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: More carefully dirty state in BindPipeline (Jason Ekstrand, 2019-11-18, 7 files, -25/+101)

  Instead of blindly dirtying descriptors and push constants the moment we
  see a pipeline change, check to see if it actually changes the bind
  layout or push constant layout. This doubles the runtime performance of
  one CPU-limited example running with the Dawn WebGPU implementation when
  running on my laptop.

  NOTE: This effectively reverts beca63c6c07. While it was a nice
  optimization, it was based on prog_data and we can't do that anymore once
  we start allowing the same binding table to be used with multiple
  different pipelines.

  Reviewed-by: Lionel Landwerlin <[email protected]>
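A minimal sketch of the dirtying strategy described in the commit above; the struct and field names here are hypothetical, not the actual anv code. The idea is that binding a new pipeline only dirties descriptors or push constants when the layout the shaders see actually changes:

    #include <stdint.h>

    enum dirty_bits {
       DIRTY_PIPELINE       = 1 << 0,
       DIRTY_DESCRIPTORS    = 1 << 1,
       DIRTY_PUSH_CONSTANTS = 1 << 2,
    };

    struct pipeline {
       uint64_t bind_layout_key;   /* hash of the descriptor layout     */
       uint64_t push_layout_key;   /* hash of the push constant layout  */
    };

    struct cmd_state {
       const struct pipeline *pipeline;
       uint64_t bind_layout_key;
       uint64_t push_layout_key;
       uint32_t dirty;
    };

    static void
    bind_pipeline(struct cmd_state *state, const struct pipeline *pipeline)
    {
       if (state->pipeline == pipeline)
          return;

       state->pipeline = pipeline;
       state->dirty |= DIRTY_PIPELINE;

       /* Re-emit descriptors only if the binding layout really differs. */
       if (state->bind_layout_key != pipeline->bind_layout_key) {
          state->bind_layout_key = pipeline->bind_layout_key;
          state->dirty |= DIRTY_DESCRIPTORS;
       }

       /* Same test for the push constant layout. */
       if (state->push_layout_key != pipeline->push_layout_key) {
          state->push_layout_key = pipeline->push_layout_key;
          state->dirty |= DIRTY_PUSH_CONSTANTS;
       }
    }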
* anv: More carefully dirty state in BindDescriptorSets (Jason Ekstrand, 2019-11-18, 4 files, -22/+51)

  Instead of dirtying all graphics or all compute based on binding point,
  we're now much more careful. We first check to see if the actual
  descriptor set changed and then only dirty the stages used by that
  descriptor set. For dynamic offsets, we keep a bitfield per-stage of
  which offsets are actually used in that stage and we only dirty push
  constants and descriptors if that stage has dynamic offsets AND those
  offsets actually change.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Use a switch statement for binding table setup (Jason Ekstrand, 2019-11-18, 1 file, -117/+127)

  It theoretically could be more efficient, but the real point here is that
  it's no longer really a matter of dealing with special cases and then the
  "real" thing. The way we're handling binding tables, it's more of a
  multi-step process and a switch is more natural.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Rework push constant handling (Jason Ekstrand, 2019-11-18, 11 files, -228/+176)

  This substantially reworks both the state setup side of push constant
  handling and the pipeline compile side. The fundamental change here is
  that we're no longer respecting the prog_data::param array and instead
  are just instructing the back-end compiler to leave the array alone.
  This makes the state setup side substantially simpler because we can now
  just memcpy the whole block of push constants and don't have to upload
  one DWORD at a time.

  This also means that we can compute the full push constant layout
  up-front and just trust the back-end compiler to not mess with it. Maybe
  one day we'll decide that the back-end compiler can do useful things
  there again but for now, this is functionally no different from what we
  had before this commit and makes the NIR handling cleaner.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Re-arrange push constant data a bit (Jason Ekstrand, 2019-11-18, 3 files, -23/+46)

  This moves the compute stuff into an anv_push_constants::cs sub-struct.
  It also moves dynamic offsets into the push constants. This means we have
  to duplicate the data per-stage, but that doesn't seem like the end of
  the world and one day we may wish to make dynamic offsets per-stage
  anyway.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* intel/compiler: Add a flag to avoid compacting push constants (Jason Ekstrand, 2019-11-18, 4 files, -145/+168)

  In vec4, we can just not run the pass. In fs, things are a bit more
  deeply intertwined.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Pre-compute push ranges for graphics pipelines (Jason Ekstrand, 2019-11-18, 8 files, -64/+137)

  It turns out that emitting push constants is one of the hottest paths in
  the driver and ANY work we do there costs us. By pre-computing things a
  bit ahead of time, we shave 5% off the runtime of a CPU-limited example
  running with the Dawn WebGPU implementation.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Stop bounds-checking pushed UBOs (Jason Ekstrand, 2019-11-18, 1 file, -28/+10)

  The bounds checking is actually less safe than just pushing the data. If
  the bounds checking actually ever kicks in and it's not on the last UBO
  push range, then the shrinking will cause all subsequent ranges to be
  pushed to the wrong place in the GRF. One of the behaviors we definitely
  don't want is for OOB UBO access to result in completely unrelated UBOs
  returning garbage values. It's safer to just push the UBOs as-requested.

  If we're really concerned about robustness, we can emit shader code to do
  bounds checking which should be stupid cheap (a CMP followed by SEL).

  Cc: [email protected]
  Reviewed-by: Lionel Landwerlin <[email protected]>
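The "CMP followed by SEL" idea mentioned above, sketched in plain C rather than shader code (illustrative only; all names are made up): the full requested range is pushed, possibly including garbage past the UBO's real end, and out-of-bounds reads are masked at use time with a single compare and select.

    #include <stdint.h>

    #define PUSH_RANGE_DWORDS 8

    static uint32_t
    robust_read(const uint32_t push[PUSH_RANGE_DWORDS],
                uint32_t offset_dw, uint32_t ubo_size_dw)
    {
       /* Keep the load inside the pushed range (on the GPU this data is
        * already in registers, so reading the garbage dwords is harmless)... */
       uint32_t loaded = push[offset_dw % PUSH_RANGE_DWORDS];

       /* ...and mask off anything past the UBO's actual end: one CMP plus
        * one SEL. */
       return offset_dw < ubo_size_dw ? loaded : 0;
    }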
* anv: Delete dead shader constant pushing code (Jason Ekstrand, 2019-11-18, 2 files, -13/+7)

  As of 2d78e55a8c5481, nir_intrinsic_load_constant with a constant offset
  is constant-folded so we should never end up with any that trigger
  brw_nir_analyze_ubo_ranges.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Flatten descriptor bindings in anv_nir_apply_pipeline_layout (Jason Ekstrand, 2019-11-18, 6 files, -76/+54)

  This lets us stop tracking the pipeline layout. It also means less
  indirection on a very hot path. As an extra bonus, we can make some of
  our data structures smaller. No measurable CPU overhead improvement.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: Input attachments are always single-plane (Jason Ekstrand, 2019-11-18, 1 file, -2/+3)

  Reviewed-by: Lionel Landwerlin <[email protected]>
* genxml: Mark everything in genX_pack.h always_inline (Jason Ekstrand, 2019-11-18, 1 file, -8/+8)

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv/pipeline: Assume layout != NULL (Jason Ekstrand, 2019-11-18, 1 file, -21/+19)

  In the early days of the driver we allowed layout to be VK_NULL_HANDLE
  and used that for some internal pipelines when we wanted to be lazy.
  Vulkan doesn't actually allow NULL layouts, however, so there's no reason
  to have this check.

  Reviewed-by: Lionel Landwerlin <[email protected]>
* intel/compiler: remove old comment (Italo Nicola, 2019-11-18, 1 file, -3/+0)

  This comment was correct some time ago, but since commit
  d3c10ad42729c1fe74a7f7c67465bd2, it isn't true anymore.

  Reviewed-by: Paulo Zanoni <[email protected]>
* intel/perf: add EHL performance query support (Lionel Landwerlin, 2019-11-15, 4 files, -2/+11808)

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Acked-by: Rafael Antognolli <[email protected]>
* intel/dev: flag the Elkhart Lake platform (Lionel Landwerlin, 2019-11-15, 2 files, -0/+5)

  We'll use this for performance metrics which are different from ICL.

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Rafael Antognolli <[email protected]>
* intel/fs: Do not lower large local arrays to scratch on gen7 (Danylo Piliaiev, 2019-11-14, 1 file, -1/+5)

  On gen7 and earlier the scratch space size is limited to 12kB. By
  enabling this optimization we may easily exceed this limit without having
  any fallback. arb_compute_shader/linker/bug-93840.shader_test crashes
  with this lowering on IVB due to exceeding scratch size limit.

  Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2092
  Fixes: 69244fc7
  Signed-off-by: Danylo Piliaiev <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: fix nir_op_{i,u}*32 on ICL (Paulo Zanoni, 2019-11-13, 1 file, -1/+1)

  On ICL we have the src1 restriction which is applied through
  fix_byte_src() and potentially changes the type of the operands from 8 to
  32 bits. When this change happens, we fall into the "else if (bit_size <
  32)" case and miscompute src_type because it takes into consideration
  bit_size (8) instead of the adjusted size of temp_op (32). This results
  in the shader reading unused memory, giving us mostly failures, but
  occasional passes due to whatever was already in the registers we were
  reading.

  This commit fixes a lot of dEQP subgroup i8vec2 tests on ICL, such as:
  dEQP-VK.subgroups.arithmetic.compute.subgroupadd_i8vec2

  This can also be verified by simply changing fix_byte_src() to apply on
  all platforms.

  Fixes: 5847de6e9afe ("intel/compiler: don't use byte operands for src1 on ICL")
  Reviewed-by: Ivan Briano <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
  Signed-off-by: Paulo Zanoni <[email protected]>
* anv: Initialize depth_bounds_test_enable when not explicitly set (Caio Marcelo de Oliveira Filho, 2019-11-13, 1 file, -2/+1)

  This was causing an uninitialized value to end up propagated to the
  3DSTATE_DEPTH_BOUNDS packet, leading to asserts on packet building due to
  the value being greater than 1.

  Fixes: 939ddccb7a5 ("anv: Add support for depth bounds testing.")
  Reviewed-by: Plamena Manolova <[email protected]>
* anv: Use mocs settings from isl_dev. (Rafael Antognolli, 2019-11-12, 6 files, -74/+15)

  v2: Remove device->default_mocs and external_mocs (Jason).

  Reviewed-by: Jordan Justen <[email protected]>
  Acked-by: Lionel Landwerlin <[email protected]>
* intel/isl: Add MOCS settings to isl_device. (Rafael Antognolli, 2019-11-12, 2 files, -0/+57)

  Centralize mocs settings into isl.

  Reviewed-by: Jordan Justen <[email protected]>
  Acked-by: Lionel Landwerlin <[email protected]>
* intel/blorp: Fix usage of uninitialized memory in key hashing (Danylo Piliaiev, 2019-11-12, 1 file, -1/+6)

  The automatically generated padding in structs contains undefined values,
  so force-pack the structs to eliminate the padding. Otherwise structs
  with the same values may generate different hashes.

  Valgrind output:

    Conditional jump or move depends on uninitialised value(s)
      util_fast_urem32 (fast_urem_by_const.h:71)
      hash_table_search (hash_table.c:262)
      _mesa_hash_table_search (hash_table.c:296)
      anv_pipeline_cache_search_locked (anv_pipeline_cache.c:318)
      anv_pipeline_cache_search (anv_pipeline_cache.c:335)
      lookup_blorp_shader (anv_blorp.c:38)
      blorp_params_get_mcs_partial_resolve_kernel (blorp_clear.c:1112)
      blorp_mcs_partial_resolve (blorp_clear.c:1205)
      anv_image_mcs_op (anv_blorp.c:1742)
      anv_cmd_predicated_mcs_resolve (genX_cmd_buffer.c:774)
      transition_color_buffer (genX_cmd_buffer.c:1159)
      cmd_buffer_end_subpass (genX_cmd_buffer.c:4840)
    Uninitialised value was created by a stack allocation
      blorp_params_get_mcs_partial_resolve_kernel (blorp_clear.c:1103)

  Signed-off-by: Danylo Piliaiev <[email protected]>
  Reviewed-by: Lionel Landwerlin <[email protected]>
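A standalone C demonstration of the padding problem described above (demo code, not blorp's): two keys that are equal field-by-field can hash differently because the compiler-inserted padding bytes are never initialized. Packing the struct, or zeroing it before filling it in, removes the undefined bytes from the hashed range.

    #include <stdint.h>
    #include <stdio.h>

    struct key {          /* 3 padding bytes after 'a' on typical ABIs */
       uint8_t  a;
       uint32_t b;
    };

    static uint32_t hash_bytes(const void *p, size_t n)
    {
       const uint8_t *b = p;
       uint32_t h = 2166136261u;            /* FNV-1a, just for the demo */
       for (size_t i = 0; i < n; i++)
          h = (h ^ b[i]) * 16777619u;
       return h;
    }

    int main(void)
    {
       struct key k1, k2;                   /* deliberately not memset */
       k1.a = 1; k1.b = 2;
       k2.a = 1; k2.b = 2;

       /* Logically identical keys, but the hashes may differ because the
        * padding bytes hold whatever happened to be on the stack. */
       printf("%08x vs %08x\n",
              hash_bytes(&k1, sizeof(k1)), hash_bytes(&k2, sizeof(k2)));
       return 0;
    }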
* anv: implement VK_KHR_timeline_semaphore (Lionel Landwerlin, 2019-11-11, 5 files, -72/+734)

  v2: Fix inverted condition in vkGetPhysicalDeviceExternalSemaphoreProperties()

  v3: Add anv_timeline_* helpers (Jason)

  v4: Avoid variable shadowing (Jason)
      Split timeline wait/signal device operations (Jason/Lionel)

  v5: s/point/signal_value/ (Jason)
      Drop piece of drm-syncobj timeline code (Jason)

  v6: Add missing sync_fd semaphore signaling (Jason)

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: Plumb timeline semaphore signal/wait values through from the API (Jason Ekstrand, 2019-11-11, 2 files, -3/+22)

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv/wsi: signal the semaphore in the acquireNextImage (Lionel Landwerlin, 2019-11-11, 1 file, -4/+20)

  We seem to have forgotten about the semaphore in the
  acquireNextImageInfo.

  v2: Signal semaphore/fence regardless of presentation status (Jason)

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Cc: <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: Lock around fetching sync file FDs from semaphores (Jason Ekstrand, 2019-11-11, 1 file, -13/+26)

  Reviewed-by: Lionel Landwerlin <[email protected]>
* anv: prepare the driver for delayed submissions (Lionel Landwerlin, 2019-11-11, 4 files, -376/+616)

  Timeline semaphores introduce support for wait-before-signal behavior,
  which means that it is now allowed to call vkQueueSubmit() with wait
  semaphores not yet submitted for execution. Our kernel driver requires
  all of the wait primitives to be created before calling the execbuf
  ioctl. As a result, we must delay submissions in the userspace driver.

  This change stores the necessary information to be able to delay a
  VkSubmitInfo submission to the kernel driver.

  v2: Fold count++ into array access (Jason)
      Move queue list to another patch (Jason)

  v3: Document cleanup of temporary semaphores (Jason)

  v4: Track semaphores of SYNC_FD type that need updating after delayed
      submission

  v5: Don't forget to update sync_fd in signaled semaphores after
      submission (Jason)

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: refcount semaphores (Lionel Landwerlin, 2019-11-11, 2 files, -6/+26)

  Delayed submissions required by timeline semaphores mean we need to be
  able to update the sync-fd-backed semaphores in a delayed fashion. This
  could mean a race between the application destroying the semaphore and
  the submission code trying to update it with the new sync fd.

  This change prepares semaphores to be refcounted; we'll most likely only
  take a reference for cases where we signal a sync fd semaphore.

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: prepare driver to report submission error through queues (Lionel Landwerlin, 2019-11-11, 5 files, -24/+60)

  Once we submit to i915 from a submission thread, we won't be able to
  directly report the error to the user (in particular through the debug
  report callbacks). So prepare 2 paths to report errors:

    device -> notify the user immediately
    queue  -> notify the user the next time an entry point is called

  In this change we still report directly for both paths; this will change
  in the next commit.

  v2: Split NULL batch parameter handling in
      anv_queue_submit_simple_batch() into a different commit

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: allow NULL batch parameter to anv_queue_submit_simple_batch (Lionel Landwerlin, 2019-11-11, 2 files, -19/+17)

  We can reuse device->trivial_batch_bo.

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: move queue init/finish to anv_queue.c (Lionel Landwerlin, 2019-11-11, 3 files, -22/+30)

  Prepare the queue initialization to take on more responsibilities and
  possibly fail.

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: expose timeout helpers outside of anv_queue.c (Lionel Landwerlin, 2019-11-11, 2 files, -50/+51)

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: detach batch emission allocation from device (Lionel Landwerlin, 2019-11-11, 1 file, -56/+40)

  In the future we'll have 2 different allocations depending on whether
  we're using threaded submission or not.

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* anv: remove list items on batch fini (Lionel Landwerlin, 2019-11-11, 1 file, -1/+4)

  This doesn't seem to fix anything because those destroy() calls happen
  right before the command buffer object & its list of batch_bo is also
  destroyed. Still, it looks a bit cleaner.

  v2: Found a second occurrence

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]> (v2)
  Fixes: 26ba0ad54d ("vk: Re-name command buffer implementation files")
  Cc: <[email protected]>
* anv: invalidate file descriptor of semaphore sync fd at vkQueueSubmit (Lionel Landwerlin, 2019-11-11, 1 file, -2/+4)

  We always close the in_fence at the end of anv_cmd_buffer_execbuf(), so
  when we take it from the semaphore, let's not forget to invalidate it.

  Note that the code leaks the fence_in if we get any error before reaching
  the close(). Let's fix that in another patch or, better, rewrite the
  whole thing!

  v2: drop redundant fd = -1 (Jason)
  v3: Update commit message (Jason)

  Signed-off-by: Lionel Landwerlin <[email protected]>
  Cc: <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
* intel/fs: Lower large local arrays to scratch (Jason Ekstrand, 2019-11-11, 1 file, -0/+19)

  Shader-db results on Kaby Lake:

    total instructions in shared programs: 14929212 -> 14880028 (-0.33%)
    instructions in affected programs: 72428 -> 23244 (-67.91%)
    helped: 6
    HURT: 2
    helped stats (abs) min: 2165 max: 15981 x̄: 8590.00 x̃: 7624
    helped stats (rel) min: 56.06% max: 74.52% x̄: 67.55% x̃: 72.08%
    HURT stats (abs)   min: 1178 max: 1178 x̄: 1178.00 x̃: 1178
    HURT stats (rel)   min: 350.60% max: 361.35% x̄: 355.97% x̃: 355.97%
    95% mean confidence interval for instructions value: -11947.03 -348.97
    95% mean confidence interval for instructions %-change: -125.72% 202.37%
    Inconclusive result (%-change mean confidence interval includes 0).

    total cycles in shared programs: 368585300 -> 342557344 (-7.06%)
    cycles in affected programs: 28144921 -> 2116965 (-92.48%)
    helped: 6
    HURT: 2
    helped stats (abs) min: 1404978 max: 7766106 x̄: 4353922.00 x̃: 3890682
    helped stats (rel) min: 82.01% max: 95.57% x̄: 89.95% x̃: 92.28%
    HURT stats (abs)   min: 47778 max: 47798 x̄: 47788.00 x̃: 47788
    HURT stats (rel)   min: 278.20% max: 282.98% x̄: 280.59% x̃: 280.59%
    95% mean confidence interval for cycles value: -5900438.73 -606550.27
    95% mean confidence interval for cycles %-change: -140.79% 146.16%
    Inconclusive result (%-change mean confidence interval includes 0).

    total spills in shared programs: 9243 -> 8901 (-3.70%)
    spills in affected programs: 2718 -> 2376 (-12.58%)
    helped: 4
    HURT: 4

    total fills in shared programs: 21831 -> 10141 (-53.55%)
    fills in affected programs: 11804 -> 114 (-99.03%)
    helped: 6
    HURT: 2

    total sends in shared programs: 815912 -> 815912 (0.00%)
    sends in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    LOST:   1
    GAINED: 3

  The helped shaders are all compute shaders in Aztec Ruins. There is also
  a compute shader in synmark2 OglCSDof that's helped but it doesn't show
  up in the above shader-db results because it went from SIMD8 to SIMD16.
  That shader improves enough to yield a 15-20% performance boost to the
  benchmark as a whole on my KBL laptop.

  The hurt shaders are a couple shaders in Kerbal Space Program and a
  couple in Aztec Ruins.

  Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* intel/fs: Implement the new load/store_scratch intrinsics (Jason Ekstrand, 2019-11-11, 5 files, -17/+241)

  This commit fills in a number of different pieces:

  1. We add support to brw_nir_lower_mem_access_bit_sizes to handle the new
     intrinsics. This involves simple plumbing work as well as a tiny bit
     of extra logic to always scalarize scratch intrinsics.

  2. Add code to brw_fs_nir.cpp to turn nir_load/store_scratch intrinsics
     into byte/dword scattered read/write messages which use the A32
     stateless model.

  3. Add code to lower_surface_logical_send to handle dword scattered
     messages and the A32 stateless model.

  Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* intel/nir: Plumb devinfo through lower_mem_access_bit_sizes (Jason Ekstrand, 2019-11-11, 3 files, -9/+14)

  Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* intel/fs: refactor surface header setup (Jason Ekstrand, 2019-11-11, 1 file, -23/+16)

  Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* intel/fs: Add DWord scattered read/write opcodes (Jason Ekstrand, 2019-11-11, 5 files, -0/+66)

  Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* intel/nir: Use nir_extract_bits in lower_mem_access_bit_sizes (Jason Ekstrand, 2019-11-11, 1 file, -37/+15)

  The new helper solves most of the annoying problems with data wrangling
  in brw_nir_lower_mem_access_bit_sizes.

  Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* anv: Unify GetDeviceQueue and GetDeviceQueue2 (Ricardo Garcia, 2019-11-11, 1 file, -4/+8)

  Avoid duplicating some checks and code by making anv_GetDeviceQueue a
  subcase of anv_GetDeviceQueue2, like radv does.

  Signed-off-by: Ricardo Garcia <[email protected]>
  Reviewed-by: Lionel Landwerlin <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
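The unification described above follows the usual pattern for this pair of entry points: the Vulkan 1.0 function just builds a VkDeviceQueueInfo2 and calls the 1.1 one. A simplified sketch (the real anv code differs in detail):

    #include <vulkan/vulkan.h>

    void anv_GetDeviceQueue2(VkDevice device,
                             const VkDeviceQueueInfo2 *pQueueInfo,
                             VkQueue *pQueue);

    void anv_GetDeviceQueue(VkDevice device, uint32_t queueFamilyIndex,
                            uint32_t queueIndex, VkQueue *pQueue)
    {
       const VkDeviceQueueInfo2 info = {
          .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_INFO_2,
          .pNext = NULL,
          .flags = 0,
          .queueFamilyIndex = queueFamilyIndex,
          .queueIndex = queueIndex,
       };

       anv_GetDeviceQueue2(device, &info, pQueue);
    }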
* Revert "intel/blorp: Fix usage of uninitialized memory in key hashing"Kenneth Graunke2019-11-071-6/+1
| | | | | | | This reverts commit 4432a2d14d80081d062f7939a950d65ea3a16eed. Pretty much every SKQP test dies with this assertion: skqp: ../src/mesa/drivers/dri/i965/brw_program_cache.c:102: hash_key: Assertion `item->key_size % 4 == 0' failed.