summaryrefslogtreecommitdiffstats
path: root/src/broadcom
Commit message (Collapse)AuthorAgeFilesLines
* nir: add nir_shader_compiler_options::lower_to_scalarMarek Olšák2019-10-101-0/+1
| | | | | | | | This will replace PIPE_SHADER_CAP_SCALAR_ISA. Reviewed-by: Timothy Arceri <[email protected]> Reviewed-by: Christian Gmeiner <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* v3d: Enable the late algebraic optimizations to get real subs.Eric Anholt2019-09-301-0/+16
| | | | | | | | | | | | | | | | | This worked better than my original v3d-local pass for just subs, and is a huge win over not producing subs. total instructions in shared programs: 6408469 -> 6167932 (-3.75%) total threads in shared programs: 153784 -> 154104 (0.21%) total uniforms in shared programs: 2157078 -> 1905823 (-11.65%) total max-temps in shared programs: 904546 -> 895796 (-0.97%) total spills in shared programs: 4959 -> 4993 (0.69%) total fills in shared programs: 6558 -> 6670 (1.71%) total sfu-stalls in shared programs: 25845 -> 25175 (-2.59%) total inst-and-stalls in shared programs: 6434314 -> 6193107 (-3.75%) Reviewed-by: Daniel Schürmann <[email protected]> Reviewed-by: Connor Abbott <[email protected]>
* broadcom/genxml: Stop manually scrubbing 'α' -> "alpha"Kenneth Graunke2019-09-231-1/+0
| | | | | | | 'α' has never appeared in any genxml files, so there's no need to replace it with the word "alpha". Reviewed-by: Eric Anholt <[email protected]>
* nir: allow specifying filter callback in lower_alu_to_scalarVasily Khoruzhick2019-09-061-1/+1
| | | | | | | | | | | | | Set of opcodes doesn't have enough flexibility in certain cases. E.g. Utgard PP has vector conditional select operation, but condition is always scalar. Lowering all the vector selects to scalar increases instruction number, so we need a way to filter only those ops that can't be handled in hardware. Reviewed-by: Qiang Yu <[email protected]> Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Jason Ekstrand <[email protected]> Signed-off-by: Vasily Khoruzhick <[email protected]>
* v3d: writes to magic registers aren't RF writes after THRENDJose Maria Casanova Crespo2019-09-051-1/+3
| | | | | | | | | | | | | | Shaders must not attempt to write to the register files in the last three instructions, but that doesn't include the magic registers: nop ; nop ; thrsw; ldtmu.- *** ERROR *** nop ; nop nop ; nop v2: Simplify validation rules. (Eric Anholt) v3: Adjust validation even more. (Eric Anholt) Reviewed-by: Eric Anholt <[email protected]>
* nir: Fix num_ssbos when lowering atomic countersConnor Abbott2019-09-031-2/+1
| | | | | | | | | | | | Otherwise it's impossible to know the maximum SSBO index for both internal TGSI shaders from TTN (which don't have any notion of atomic counters and no offset) as well as shaders from GLSL. I fixed everything I could find while grepping for num_ssbos and num_abos, which hopefully is everything (iris was the only user I could find that uses it in a meaningful way). Reviewed-by: Marek Olšák <[email protected]>
* v3d: Use the correct opcodes for signed image min/maxJason Ekstrand2019-08-211-0/+2
| | | | Reviewed-by: Eric Anholt <[email protected]>
* nir: Add explicit signs to image min/max intrinsicsJason Ekstrand2019-08-212-4/+8
| | | | | | | | | | | This better matches all the other atomic intrinsics such as those for SSBOs and shared variables where the sign is part of the intrinsic opcode. Both generators (GLSL and SPIR-V) know the sign from the type of the image variable or handle. In SPIR-V, signed min/max are separate opcodes from unsigned. Reviewed-by: Kenneth Graunke <[email protected]> Reviewed-by: Eric Anholt <[email protected]>
* v3d: clamp gl_PointSize to a minimum of 1.0Iago Toral Quiroga2019-08-131-0/+5
| | | | | | | | | | | | | | | The OpenGL ES spec requires that the value of gl_PointSize is clamped to an implementation-dependent range matching what is advertised by GL_ALIASED_POINT_SIZE_RANGE. For V3D this is [1.0, 512.0], but the hardware won't clamp to the minimum side of the range and won't render points with a size strictly smaller than 1.0 either, so we need to clamp manually. For points larger than the maximum size of the range the hardware clamps automatically. Fixes piglit test: spec/!opengl 2.0/vs-point_size-zero Reviewed-by: Eric Anholt <[email protected]>
* v3d: line length style fixesIago Toral Quiroga2019-08-131-26/+33
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: honor the write mask on store operationsIago Toral Quiroga2019-08-131-85/+120
| | | | | | | | | | | | | | | | | | | | | | | | | v2: - Fix incremental update of the const offset when we need to emit a sequence with more than one write because of the writemask. - Do not move the tmu write emission to a separate helper. v3: - Get the store writemask before the loop, use ffs to get the first component to write and clear writemask bits as we process the components (Eric). - Simplified the code that figured out the number of components for the TMU config based on the number of tmu writes for stores and atomics. v4: - Code clean-ups (Eric). Fixes: KHR-GLES31.core.shader_image_load_store.advanced-cast-cs KHR-GLES31.core.shader_image_load_store.advanced-cast-fs KHR-GLES31.core.shader_storage_buffer_object.advanced-switchBuffers-cs KHR-GLES31.core.shader_storage_buffer_object.advanced-switchPrograms-cs KHR-GLES31.core.shader_storage_buffer_object.basic-operations-case1-cs Reviewed-by: Eric Anholt <[email protected]>
* v3d: refactor ntq_emit_tmu_general() slightlyIago Toral Quiroga2019-08-131-24/+36
| | | | | | | | | | When we implement write masks on store operations we might need to emit multiple write sequences for a given store intrinsic. To make that easier, let's split the emission of the tmud instructions to their own block after we are done with the code that only needs to run once no matter how many write sequences we need to emit. Reviewed-by: Eric Anholt <[email protected]>
* nir: merge and extend nir_opt_move_comparisons and nir_opt_move_load_uboRhys Perry2019-08-121-1/+1
| | | | | | | | | | v2: add to series v3: update Makefile.sources v4: don't remove a comment and break statement v4: use nir_can_move_instr Signed-off-by: Rhys Perry <[email protected]> Reviewed-by: Eric Anholt <[email protected]>
* v3d: use the GPU to record primitives written to transform feedbackIago Toral Quiroga2019-08-081-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use the PRIMITIVE_COUNTS_FEEDBACK packet to write various primitive counts to a buffer, including the number of primives written to transform feedback buffers, which will handle buffer overflow correctly. There are a couple of caveats with this: Primitive counters are reset when we emit a 'Tile Binning Mode Configuration' packet, which can happen in the middle of a primitives query, so we need to read the buffer when we submit a job and accumulate the counts in the context so we don't lose them. We also need to do the same when we switch primitive type during transform feedback so we can compute the correct number of recorded vertices from the number of primitives. This is necessary so we can provide an accurate vertex count for draw from transform feedback. v2: - When computing the number of vertices for a primitive, pass in the base primitive, since that is what the hardware will count. - No need to update primitive counts when switching primitive types if the base primitives are the same. - Log perf warning when mapping the primitive counts BO for readback (Eric). - Only emit the primitive counts packet once at job end (Eric). - Use u_upload mechanism for the primitive counts buffer (Eric). - Use the XML to generate indices into the primitive counters buffer (Eric). Fixes piglit tests: spec/ext_transform_feedback/overflow-edge-cases spec/ext_transform_feedback/query-primitives_written-bufferrange spec/ext_transform_feedback/query-primitives_written-bufferrange-discard spec/ext_transform_feedback/change-size base-shrink spec/ext_transform_feedback/change-size base-grow spec/ext_transform_feedback/change-size offset-shrink spec/ext_transform_feedback/change-size offset-grow spec/ext_transform_feedback/change-size range-shrink spec/ext_transform_feedback/change-size range-grow spec/ext_transform_feedback/intervening-read prims-written Reviewed-by: Eric Anholt <[email protected]>
* v3d: add header guards in v3d_packet_helpers.hIago Toral Quiroga2019-08-081-0/+4
| | | | Reviewed-by: Eric Anholt <[email protected]>
* meson: replace libmesa_util with idep_mesautilEric Engestrom2019-08-032-3/+4
| | | | | | | | | | | This automates the include_directories and dependencies tracking so that all users of libmesa_util don't need to add them manually. Next commit will remove the ones that were only added for that reason. Signed-off-by: Eric Engestrom <[email protected]> Acked-by: Eric Anholt <[email protected]> Tested-by: Vinson Lee <[email protected]>
* tree-wide: replace MAYBE_UNUSED with ASSERTEDEric Engestrom2019-07-314-5/+5
| | | | | | Suggested-by: Jason Ekstrand <[email protected]> Signed-off-by: Eric Engestrom <[email protected]> Reviewed-by: Matt Turner <[email protected]>
* v3d: Introduce a DRM shim for calling out to the simulator.Eric Anholt2019-07-257-0/+779
| | | | | | | | | | | | The goal is to enable testing of parts of drivers without depending on any particular kernel version or hardware being present. Simply set LD_PRELOAD=$PREFIX/lib/libv3d_drm_shim.so in your environment, and we'll fake a /dev/dri/renderD128 (or whatever the next available node is) using v3dv3. That node can then be used with the surfaceless or gbm EGL platforms. Acked-by: Iago Toral Quiroga <[email protected]>
* v3d: Avoid scheduling an instruction that stalls waiting for SFU retvalJose Maria Casanova Crespo2019-07-221-4/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we detect that a scheduling candidate will stall because having a register source that is the written by the SFU unit in the previous instruction we reduce its priority so any non stalling operation would be chosen. The latency of SFU operations is defined as 2. So they would be scheduled earlier if other candidates have the same priority. Finally we won't merge instructions that stall to a previously chosen one. As the result of the previous one would be waiting for an extra cycle. Although shader-db result show that instruction are hurt with an increase of 0.35% the sum of instructions + stalls is reduced a 0.52%. And the total of sfu-stalls is reduced a 63.51%. It implies also a small increase in the max-temps metric because of scheduling earlier SFU operations. total instructions in shared programs: 9102719 -> 9117851 (0.17%) instructions in affected programs: 4324628 -> 4339760 (0.35%) helped: 4162 HURT: 12128 helped stats (abs) min: 1 max: 10 x̄: 1.28 x̃: 1 helped stats (rel) min: 0.09% max: 4.76% x̄: 0.66% x̃: 0.51% HURT stats (abs) min: 1 max: 27 x̄: 1.69 x̃: 1 HURT stats (rel) min: 0.05% max: 7.69% x̄: 0.87% x̃: 0.68% 95% mean confidence interval for instructions value: 0.90 0.96 95% mean confidence interval for instructions %-change: 0.47% 0.50% Instructions are HURT. total max-temps in shared programs: 1327728 -> 1327812 (<.01%) max-temps in affected programs: 4730 -> 4814 (1.78%) helped: 61 HURT: 134 helped stats (abs) min: 1 max: 2 x̄: 1.08 x̃: 1 helped stats (rel) min: 2.70% max: 13.33% x̄: 4.89% x̃: 4.17% HURT stats (abs) min: 1 max: 3 x̄: 1.12 x̃: 1 HURT stats (rel) min: 1.54% max: 20.00% x̄: 6.10% x̃: 5.26% 95% mean confidence interval for max-temps value: 0.28 0.58 95% mean confidence interval for max-temps %-change: 1.80% 3.52% Max-temps are HURT. total sfu-stalls in shared programs: 99551 -> 36324 (-63.51%) sfu-stalls in affected programs: 95029 -> 31802 (-66.53%) helped: 25882 HURT: 0 helped stats (abs) min: 1 max: 27 x̄: 2.44 x̃: 2 helped stats (rel) min: 5.26% max: 100.00% x̄: 79.86% x̃: 100.00% 95% mean confidence interval for sfu-stalls value: -2.47 -2.42 95% mean confidence interval for sfu-stalls %-change: -80.18% -79.54% Sfu-stalls are helped. total inst-and-stalls in shared programs: 9202270 -> 9154175 (-0.52%) inst-and-stalls in affected programs: 5618516 -> 5570421 (-0.86%) helped: 22728 HURT: 855 helped stats (abs) min: 1 max: 31 x̄: 2.16 x̃: 1 helped stats (rel) min: 0.07% max: 16.67% x̄: 1.14% x̃: 0.92% HURT stats (abs) min: 1 max: 5 x̄: 1.25 x̃: 1 HURT stats (rel) min: 0.12% max: 5.26% x̄: 1.24% x̃: 0.86% 95% mean confidence interval for inst-and-stalls value: -2.07 -2.01 95% mean confidence interval for inst-and-stalls %-change: -1.07% -1.05% Inst-and-stalls are helped. v2: Rename v3d_qpu_generates_sfu_stalls to v3d_qpu_instr_is_sfu (Eric) Reviewed-by: Eric Anholt <[email protected]>
* v3d: add shader-db stat to count SFU stallsJose Maria Casanova Crespo2019-07-225-14/+74
| | | | | | | | | | | | | | SFU operations have a latency of 2 cicles, so if their results are used in the following cycle to a SFU instruction, the GPU stalls for an extra cycle until the result is available. This adds the number of stalls to the shader-db debug mode and sum of instruction + stalls to evaluate optimizations to schedule instructions that avoid generating sfu-stalls. v2: Rename v3d_qpu_generates_sfu_stalls to v3d_qpu_instr_is_sfu (Eric) Reviewed-by: Eric Anholt <[email protected]>
* v3d: Use nir_shader_lower_instructions() for txf_ms lowering.Eric Anholt2019-07-181-26/+16
| | | | | | Cuts out a bunch of boilerplate. Reviewed-by: Iago Toral Quiroga <[email protected]>
* v3d: Fix assertion failures in debug builds.Eric Anholt2019-07-181-0/+2
| | | | | | | | | nir_lower_io leaves around deref_var instructions after lowering away deref intrinsics. This ends up breaking validation after v3d_nir_lower_io removes variables not actually being stored by the shader's store_output()s. Reviewed-by: Iago Toral Quiroga <[email protected]>
* v3d: emit correct lowering for logic operations with MSAA render targetsIago Toral Quiroga2019-07-181-5/+54
| | | | | | | v2: - Drop the writemask from the per-sample color intrinsic (Eric) Reviewed-by: Eric Anholt <[email protected]>
* v3d: handle nir_intrinsic_store_tlb_sample_color_v3dIago Toral Quiroga2019-07-181-20/+44
| | | | | | | v2: - Move handling of output intrinsics to ntq_emit_intrinsic() (Eric). Reviewed-by: Eric Anholt <[email protected]>
* v3d: implement per-sample tlb color writesIago Toral Quiroga2019-07-181-30/+44
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: refactor the tlb color write codeIago Toral Quiroga2019-07-181-49/+39
| | | | | | | | We want to split the tlb specifier setup from the color writes, because when we implement per-sample color writes we want to do the latter for all the samples, but the former only once. Reviewed-by: Eric Anholt <[email protected]>
* v3d: move tlb color write emission to a helper functionIago Toral Quiroga2019-07-181-95/+99
| | | | | | | | | We will soon be adding per-sample color writes which means additional complexity and more indentation (we will need another loop to emit the writes for each individual sample), so this will help keeping things simple and a bit more readable. Reviewed-by: Eric Anholt <[email protected]>
* v3d: implement per-sample tlb color readsIago Toral Quiroga2019-07-181-39/+52
| | | | Reviewed-by: Eric Anholt <[email protected]>
* broadcom: Move v3d_get_device_info to commonAndreas Bergmeier2019-07-173-1/+86
| | | | In common we can use implementation for Vulkan.
* v3d: use inc/dec tmu operation with image atomic sub/add of 1Alejandro Piñeiro2019-07-121-5/+11
| | | | | | | | | | | | | | | | | | This allows to remove a mov of 1/-1, as it is implicit with the operation. As with atomic inc/dec/add, usual shader-db set doesn't include any GLES shader using it. So using as workaround vk-gl-cts shaders, we get this: total instructions in shared programs: 1217013 -> 1217006 (<.01%) instructions in affected programs: 53 -> 46 (-13.21%) helped: 2 HURT: 0 One of the helped shader went from 40 to 34 instructions. Reviewed-by: Eric Anholt <[email protected]>
* v3d: refactor some code from v3d40_vir_emit_image_load_storeAlejandro Piñeiro2019-07-121-33/+29
| | | | | | | And moved to new auxiliar method v3d40_image_load_store_tmu_op, equivalent to the nir_to_nir v3d_general_tmu_op, to clean-up a little. Reviewed-by: Eric Anholt <[email protected]>
* v3d: use inc/dec tmu operation with atomic sub/add of 1Alejandro Piñeiro2019-07-122-6/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Among other things, this avoid the need of loading 1/-1 constants (so one less operation). The removed comment suggest the option of adding support on NIR for inc/dec. Intel just uses an auxiliar method to get which hw operation is needed, so no lowering is needed. And at the same time, being so small, seems unreasonable to try to add a general one on NIR itself. It is more easy to just adapt the method here (that is what the patch does right now). It is worth to note that we are not getting any change on shader-db stats because all those methods are used on the usual shader-db set with shaders needing GLSL > 4.2. In general there aren't too many GLSL ES 3.1 tests. As an alternative, we captured the GLES3/GLSL31/GLS32 used on vk-gl-cts, even if that is not a real life usage of shaders. With those we get the following: total instructions in shared programs: 1217022 -> 1217013 (<.01%) instructions in affected programs: 117 -> 108 (-7.69%) helped: 6 HURT: 0 helped stats (abs) min: 1 max: 2 x̄: 1.50 x̃: 1 helped stats (rel) min: 3.57% max: 10.00% x̄: 8.09% x̃: 9.09% 95% mean confidence interval for instructions value: -2.07 -0.93 95% mean confidence interval for instructions %-change: -10.54% -5.64% Instructions are helped. Note that the shaders helped are really low because most of the vk-gl-cts tests using AtomicInc/Dec/Add are mostly used on compute shaders. Although right now there is a branch around with CS support, the usual is doing the stats against master. Reviewed-by: Eric Anholt <[email protected]>
* v3d: remove redefinition of tmu operations on nir_to_virAlejandro Piñeiro2019-07-121-38/+21
| | | | | | | | | | | | They are already defined, although is a slightly different format on the generated packet headers, so it was needed to change how it is used on nir_to_vir. In addition to allow to remove some duplicated headers, it will allow to define just one get_op_for_atomic_add aux method later to support using inc/dec instead of add of 1/-1. Reviewed-by: Eric Anholt <[email protected]>
* v3d: tweak initial comment on pack generator scriptAlejandro Piñeiro2019-07-121-1/+1
| | | | | | | As the files it mentions to use as reference has slightly different names. Reviewed-by: Eric Anholt <[email protected]>
* v3d: remove unused definitionsIago Toral Quiroga2019-07-121-7/+0
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: move implementation of some intrinsics to separate helpersIago Toral Quiroga2019-07-121-78/+90
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: emit correct lowering for logic ops with RGB10A2 render targetsIago Toral Quiroga2019-07-121-12/+64
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: emit correct lowering for logic ops with integer render targetsIago Toral Quiroga2019-07-122-9/+47
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: add lowering for OpenGL logic operationsIago Toral Quiroga2019-07-124-0/+279
| | | | | | | | | | | | | | | | | | | | | This implements support for OpenGL logic operations by emitting code to read from the TLB if needed and blending the fragment output accordingly. It is similar to VC4's blend lowering pass, but exclusive to logic operations, since blending is otherwise supported in hardware. The pass doesn't handle MSAA targets yet. Fixes the following piglit tests: spec/!opengl 1.0/gl-1.0-logicop/* spec/!opengl 1.1/gl-1.1-xor spec/!opengl 1.1/gl-1.1-xor-copypixels It also fixes text cursor rendering in Libreoffice with the GTK+2 theme, which is rendered via glamor using the XOR logic operation. v2: fix checks for allowed variable location and maximum render target (Eric) Reviewed-by: Eric Anholt <[email protected]>
* v3d: acquire scoreboard lock before first tlb readIago Toral Quiroga2019-07-123-0/+34
| | | | | | | | | | | | | | | | | | | Until now we have always been emitting our scoreboard locks on the last thread switch to improve parallelism. We did this by emitting our last thread switch right before our tlb writes at the very end of the program, where we know that we are outside control flow. Unfortunately, this strategy is not valid when we have tlb color reads too, as these will happen before this point in the program and can happen inside control flow. To fix this we always emit a thread switch before the first tlb load and if we see additional thread switches after that point, we change the strategy to lock on the first thread switch. v2: change the solution so it is expected to work in more scenarios (Eric). Reviewed-by: Eric Anholt <[email protected]>
* v3d: implement tile buffer color read intrinsicIago Toral Quiroga2019-07-121-0/+100
| | | | | | | | | | | | We will be emitting this intrinsic to signal TLB color loads when we implement OpenGL logic operations, where we need to blend the fragment shader color output with the existing color in the render target. Per-sample TLB reads are not supported yet. v2: fix the offset into the color_reads array (Eric). Reviewed-by: Eric Anholt <[email protected]>
* v3d: fix size of color_reads and sample_colors arraysIago Toral Quiroga2019-07-121-2/+2
| | | | | | | | | We need to scale the size of these arrays to consider up to V3D_MAX_DRAW_BUFFERS render targets and 4 components per color. v2: we want to store each color component separately, so scale by 4 too. Reviewed-by: Eric Anholt <[email protected]>
* v3d: add color formats and swizzles to the fragment shader keyIago Toral Quiroga2019-07-121-0/+9
| | | | | | | We are going to need these very soon to emit correct reads from the tlb to implement logic operations. Reviewed-by: Eric Anholt <[email protected]>
* v3d: add helpers to emit ldtlb and ldtlbu signalsIago Toral Quiroga2019-07-121-0/+24
| | | | | | | | | | The ldtlbu version will read an implicit uniform with the TLB read specifier and should be used for the first read in a sequence of TLB reads (unless the default configuration is valid, in which case we can use ldtlb). The ldtlb version is used for any subsequent TLB read in the sequence. Reviewed-by: Eric Anholt <[email protected]>
* v3d: handle tlb read dependency tracking as if they were writesIago Toral Quiroga2019-07-121-1/+1
| | | | | | Tile buffer reads are emitted as ordered sequences and cannot be reordered. Reviewed-by: Eric Anholt <[email protected]>
* v3d: instructions with the ldtlb and ldtlbu signals are tlb instructionsIago Toral Quiroga2019-07-121-0/+3
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: tlb loads cannot be removedIago Toral Quiroga2019-07-121-0/+2
| | | | | | | Loads from the tile buffer are emitted in ordered sequences so we cannot eliminate or reorder any of them. Reviewed-by: Eric Anholt <[email protected]>
* v3d: the ldtlbu signal reads an implicit uniformIago Toral Quiroga2019-07-121-0/+1
| | | | Reviewed-by: Eric Anholt <[email protected]>
* v3d: handle ldtlb and ldtlbu signals during disassemblyIago Toral Quiroga2019-07-121-0/+2
| | | | | | | | We already have code to print these signals but the early return in the code that checks if any signals are present present was missing the checks for them, so it would skip printing them unless they were paired with other signals. Reviewed-by: Eric Anholt <[email protected]>
* nir: Add lower_rotate flag and set to true in all driversSagar Ghuge2019-07-011-0/+1
| | | | | | Signed-off-by: Sagar Ghuge <[email protected]> Suggested-by: Matt Turner <[email protected]> Reviewed-by: Matt Turner <[email protected]>