path: root/src/gallium/drivers/radeonsi/si_state_shaders.c
Commit message (Author, Age; Files, Lines)
* radeonsi/gfx9: prevent shader-db crashes (Marek Olšák, 2017-08-22; 1 file, -1/+11)
    - don't precompile LS and ES (they don't exist on GFX9), compile as VS instead
    - don't precompile HS and GS (we don't have LS and ES parts)

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: make si_shader_selector_reference globally visible (Nicolai Hähnle, 2017-08-22; 1 file, -15/+2)
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: add a separate dirty mask for prefetches (Marek Olšák, 2017-08-07; 1 file, -2/+31)
    so that we don't rely on si_pm4_state_enabled_and_changed, allowing us
    to move prefetches after draw calls.

    v2: clear the dirty mask after unbinding shaders

    Tested-by: Dieter Nützel <[email protected]> (v1)
    Reviewed-by: Nicolai Hähnle <[email protected]> (v1)
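The idea of a dedicated prefetch dirty mask can be illustrated with a small sketch. All names below (`struct demo_ctx`, `PREFETCH_VS`, `demo_bind_shader`, `demo_emit_prefetches`) are hypothetical and are not the driver's actual identifiers; the sketch only assumes one bit per shader stage, set on bind, cleared on unbind (the v2 note), and consumed whenever prefetches are emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-stage bits; the real driver's names and layout may differ. */
enum {
    PREFETCH_VS = 1u << 0,
    PREFETCH_PS = 1u << 1,
};

/* Hypothetical context holding bound shaders and the prefetch dirty mask. */
struct demo_ctx {
    const void *shader[2];          /* [0] = VS, [1] = PS */
    uint32_t prefetch_dirty_mask;   /* stages whose code still needs a prefetch */
    unsigned prefetches_emitted;    /* counter used only for illustration */
};

/* Binding marks the stage dirty; unbinding (NULL) clears its bit. */
static void demo_bind_shader(struct demo_ctx *ctx, unsigned stage, const void *shader)
{
    assert(stage < 2);
    ctx->shader[stage] = shader;
    if (shader)
        ctx->prefetch_dirty_mask |= 1u << stage;
    else
        ctx->prefetch_dirty_mask &= ~(1u << stage);
}

/* Can be called before or after a draw: it depends only on the mask. */
static void demo_emit_prefetches(struct demo_ctx *ctx)
{
    for (unsigned stage = 0; stage < 2; stage++) {
        if (ctx->prefetch_dirty_mask & (1u << stage))
            ctx->prefetches_emitted++;
    }
    ctx->prefetch_dirty_mask = 0;
}
```

Because the emit step reads only its own mask, the call site can be moved relative to the draw without consulting any other state tracking.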
* radeonsi: add and use si_pm4_state_enabled_and_changed (Marek Olšák, 2017-08-07; 1 file, -6/+6)
    Tested-by: Dieter Nützel <[email protected]>
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: de-atomize L2 prefetch (Marek Olšák, 2017-08-07; 1 file, -1/+1)
    I'd like to be able to move the prefetch call site around.

    Tested-by: Dieter Nützel <[email protected]>
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: tweak next-shader assumptions when streamout is used (Nicolai Hähnle, 2017-07-31; 1 file, -5/+11)
    VS with streamout is always a HW VS.

    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi/nir: perform lowering of input/output driver locations (Nicolai Hähnle, 2017-07-31; 1 file, -0/+2)
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: bypass the shader cache for NIR shaders (Nicolai Hähnle, 2017-07-31; 1 file, -2/+3)
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: scan NIR shaders to obtain required info (Nicolai Hähnle, 2017-07-31; 1 file, -6/+17)
    v2: set num_instruction to 2, i.e. 1 + END (Marek)

    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: simplify computation of tessellation offchip buffers (Marek Olšák, 2017-07-17; 1 file, -15/+4)
    This is overly cautious, but better safe than sorry.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium: use "ull" number suffix to keep the QtCreator parser happy (Marek Olšák, 2017-07-10; 1 file, -5/+5)
    It can't parse "llu".

    Reviewed-by: Thomas Helland <[email protected]>
    Reviewed-by: Eric Engestrom <[email protected]>
* radeonsi: move instance divisors into a constant buffer (Marek Olšák, 2017-06-27; 1 file, -2/+10)
    Shader key size: 107 -> 47

    Divisors of 0 and 1 are encoded in the shader key. Greater instance
    divisors are loaded from a constant buffer.

    The shader code doing the division is huge. Is it something we need to
    worry about? Does any app use instance divisors >= 2?

    VS prolog disassembly:
      s_load_dwordx4 s[12:15], s[0:1], 0x80 ; C00A0300 00000080
      s_nop 0 ; BF800000
      s_waitcnt lgkmcnt(0) ; BF8C007F
      s_buffer_load_dword s14, s[12:15], 0x4 ; C0220386 00000004
      s_waitcnt lgkmcnt(0) ; BF8C007F
      v_cvt_f32_u32_e32 v4, s14 ; 7E080C0E
      v_rcp_iflag_f32_e32 v4, v4 ; 7E084704
      v_mul_f32_e32 v4, 0x4f800000, v4 ; 0A0808FF 4F800000
      v_cvt_u32_f32_e32 v4, v4 ; 7E080F04
      v_mul_hi_u32 v5, v4, s14 ; D2860005 00001D04
      v_mul_lo_i32 v6, v4, s14 ; D2850006 00001D04
      v_cmp_eq_u32_e64 s[12:13], 0, v5 ; D0CA000C 00020A80
      v_sub_i32_e32 v5, vcc, 0, v6 ; 340A0C80
      v_cndmask_b32_e64 v5, v6, v5, s[12:13] ; D1000005 00320B06
      v_mul_hi_u32 v5, v5, v4 ; D2860005 00020905
      v_add_i32_e32 v6, vcc, v5, v4 ; 320C0905
      v_subrev_i32_e32 v4, vcc, v5, v4 ; 36080905
      v_cndmask_b32_e64 v4, v4, v6, s[12:13] ; D1000004 00320D04
      v_mul_hi_u32 v5, v4, v1 ; D2860005 00020304
      v_add_i32_e32 v4, vcc, s8, v0 ; 32080008
      v_mul_lo_i32 v6, v5, s14 ; D2850006 00001D05
      v_add_i32_e32 v7, vcc, 1, v5 ; 320E0A81
      v_cmp_ge_u32_e64 s[12:13], v1, v6 ; D0CE000C 00020D01
      v_sub_i32_e32 v6, vcc, v1, v6 ; 340C0D01
      v_cmp_le_u32_e32 vcc, s14, v6 ; 7D960C0E
      v_cndmask_b32_e64 v8, 0, -1, s[12:13] ; D1000008 00318280
      v_cndmask_b32_e64 v6, 0, -1, vcc ; D1000006 01A98280
      v_and_b32_e32 v6, v8, v6 ; 260C0D08
      v_cmp_eq_u32_e32 vcc, 0, v6 ; 7D940C80
      v_cndmask_b32_e32 v6, v7, v5, vcc ; 000C0B07
      v_add_i32_e32 v5, vcc, -1, v5 ; 320A0AC1
      v_cmp_eq_u32_e32 vcc, 0, v8 ; 7D941080
      v_cndmask_b32_e32 v5, v6, v5, vcc ; 000A0B06
      v_add_i32_e32 v5, vcc, s9, v5 ; 320A0A09

    v2: set prefer_mono for fetched instance divisors

    Reviewed-by: Nicolai Hähnle <[email protected]>
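What the prolog computes can be sketched in plain C. This is a reading of the commit message and disassembly, not driver code: the function name is hypothetical, and the mapping of registers to base vertex (`s8`), start instance (`s9`), vertex ID (`v0`) and instance ID (`v1`) is an assumption. Divisors 0 and 1 need no division at all, which is why they can live in the shader key; only larger divisors go through the long unsigned-division sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the index used to fetch one vertex attribute.
 * The parameter meanings are assumptions drawn from the disassembly. */
static uint32_t demo_attrib_fetch_index(uint32_t divisor,
                                        uint32_t vertex_id, uint32_t base_vertex,
                                        uint32_t instance_id, uint32_t start_instance)
{
    /* Divisor 0: a per-vertex attribute, no instancing involved. */
    if (divisor == 0)
        return base_vertex + vertex_id;

    /* Divisor 1: advances once per instance, still no division needed. */
    if (divisor == 1)
        return start_instance + instance_id;

    /* Divisor >= 2: the value is fetched at run time, so the shader has to
     * perform a real unsigned division (the long sequence in the prolog). */
    return start_instance + instance_id / divisor;
}
```

For example, with a divisor of 3 and a start instance of 10, instances 6, 7 and 8 all fetch element 12.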
* Revert "radeonsi: use uint32_t to declare si_shader_key.opt.kill_outputs" (Marek Olšák, 2017-06-27; 1 file, -3/+2)
    This reverts commit 7b2240ac9ce3ba9bd86f4ae8aac53af8878c0b10.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* Revert "radeonsi: remove 8 bytes from si_shader_key with uint32_t ff_tcs_inputs_to_copy" (Marek Olšák, 2017-06-27; 1 file, -6/+2)
    This reverts commit 6b6fed3a3c81c2b0d319ef121df20a0dc914705f.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: use the correct LLVMTargetMachineRef in si_build_shader_variant (Nicolai Hähnle, 2017-06-22; 1 file, -6/+22)
    si_build_shader_variant can actually be called directly from one of
    normal-priority compiler threads. In that case, the thread_index is
    only valid for the normal tm array.

    v2:
    - use the correct sel/shader->compiler_ctx_state

    Fixes: 86cc8097266c ("radeonsi: use a compiler queue with a low priority for optimized shaders")
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: track use of bindless samplers/images from tgsi_shader_info (Samuel Pitoiset, 2017-06-14; 1 file, -5/+26)
    This adds some new helper functions to know if the current draw call
    (or dispatch compute) is using bindless samplers/images, based on TGSI
    analysis.

    Signed-off-by: Samuel Pitoiset <[email protected]>
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: replace si_vertex_elements::elements with separate fields (Marek Olšák, 2017-06-12; 1 file, -5/+2)
    It makes si_vertex_elements a little smaller.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: remove 8 bytes from si_shader_key with uint32_t ff_tcs_inputs_to_copy (Marek Olšák, 2017-06-12; 1 file, -2/+6)
    The previous patch helps with this.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: use uint32_t to declare si_shader_key.opt.kill_outputs (Marek Olšák, 2017-06-12; 1 file, -2/+3)
    the next patch will benefit from this

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: remove 8 bytes from si_shader_key by flattening opt.hw_vs (Marek Olšák, 2017-06-12; 1 file, -6/+6)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: don't update dependent states if it has no effect (v2) (Marek Olšák, 2017-06-08; 1 file, -5/+19)
    This and the previous clip_regs commit decrease IB sizes and the number
    of si_update_shaders invocations as follows:

                         IB size   si_update_shaders calls
      Borderlands 2       -10%      -27%
      Deus Ex: MD          -5%      -11%
      Talos Principle      -8%      -30%

    v2: always dirty cb_render_state in set_framebuffer_state

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: update clip_regs on shader state changes only when it's needed (Marek Olšák, 2017-06-07; 1 file, -3/+32)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: precompute some fields for PA_CL_VS_OUT_CNTL in si_shader_selector (Marek Olšák, 2017-06-07; 1 file, -0/+16)
    Reviewed-by: Samuel Pitoiset <[email protected]>
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: add a new helper si_get_vs (Marek Olšák, 2017-06-07; 1 file, -4/+2)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: remove 8 bytes from si_shader_key (Marek Olšák, 2017-06-07; 1 file, -4/+4)
    We can use a union in si_shader_key::mono.

    Reviewed-by: Samuel Pitoiset <[email protected]>
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: move PSIZE and CLIPDIST unique IO indices after GENERIC (Marek Olšák, 2017-06-07; 1 file, -1/+3)
    Heaven LDS usage for LS+HS is below. The masks are "outputs_written"
    for LS and HS. Note that 32K is the maximum size.

    Before:
      heaven_x64: ls=1f1 tcs=1f1, lds=32K
      heaven_x64: ls=31 tcs=31, lds=24K
      heaven_x64: ls=71 tcs=71, lds=28K

    After:
      heaven_x64: ls=3f tcs=3f, lds=24K
      heaven_x64: ls=7 tcs=7, lds=13K
      heaven_x64: ls=f tcs=f, lds=17K

    All other apps have a similar decrease in LDS usage, because
    the "outputs_written" masks are similar. Also, most apps don't write
    POSITION in these shader stages, so there is room for improvement.
    (tight per-component input/output packing might help even more)

    It's unknown whether this improves performance.

    Tested-by: Edmondo Tommasina <[email protected]>
    Tested-by: Dieter Nützel <[email protected]>
    Reviewed-by: Nicolai Hähnle <[email protected]>
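The masks quoted in this commit message suggest why renumbering helps. The figures are consistent with LDS being sized by the position of the highest set bit in "outputs_written" rather than by the number of set bits: 0x31 and 0x3f have different bit counts but the same highest bit, and both report 24K. That is an inference from the numbers, not something the commit states. The helper below is hypothetical and only counts how many IO slots a mask spans.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of IO slots spanned by a mask, i.e. the index
 * of the highest set bit plus one. Returns 0 for an empty mask. */
static unsigned demo_io_slots_spanned(uint64_t outputs_written)
{
    unsigned slots = 0;

    while (outputs_written) {
        slots++;
        outputs_written >>= 1;
    }
    return slots;
}
```

With this measure the "before" masks 1f1, 31 and 71 span 9, 6 and 7 slots, while the "after" masks 3f, 7 and f span 6, 3 and 4 slots, which tracks the reported drop in LDS usage.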
* radeonsi/gfx9: prevent a race when the previous shader's main part is missing (Marek Olšák, 2017-06-07; 1 file, -0/+2)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: wait for main part compilation of 1st shaders of merged shaders (Marek Olšák, 2017-06-07; 1 file, -0/+4)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: fix LS scratch buffer support without TCS for GFX9 (Marek Olšák, 2017-06-07; 1 file, -3/+18)
    LS is merged into TCS. If there is no TCS, LS is merged into fixed-func
    TCS. The problem is the fixed-func TCS was ignored by scratch update
    functions, so LS didn't have the scratch buffer set up.

    Note that Mesa 17.1 doesn't have merged shaders.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: move streamout state update out of si_update_shaders (Marek Olšák, 2017-06-07; 1 file, -16/+24)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: move handling of DBG_NO_OPT_VARIANT into si_shader_selector_key (Marek Olšák, 2017-06-07; 1 file, -4/+3)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: use a compiler queue with a low priority for optimized shaders (Marek Olšák, 2017-06-07; 1 file, -4/+4)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: drop unfinished shader compilations when destroying shaders (Marek Olšák, 2017-06-07; 1 file, -2/+3)
    If we enqueue too many jobs and destroy the GL context, it may take
    several seconds before the jobs finish. Just drop them instead.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: only upload (dump to L2) those descriptors that are used by shaders (Marek Olšák, 2017-05-18; 1 file, -0/+6)
    This decreases the size of CE RAM dumps to L2, or the size of descriptor
    uploads without CE.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: record which descriptor slots are used by shaders (Marek Olšák, 2017-05-18; 1 file, -0/+27)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: rename tcs_tes_uses_prim_id for clarity (Nicolai Hähnle, 2017-05-16; 1 file, -6/+6)
    What we care about is whether PrimID is used while tessellation is
    enabled; whether it's used in TCS/TES or further down the pipeline is
    irrelevant.

    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: fix gl_PrimitiveIDIn in geometry shader when using tessellation (Nicolai Hähnle, 2017-05-16; 1 file, -0/+2)
    This builds on commit 0549ea15ec38 ("radeonsi: fix primitive ID in
    fragment shader when using tessellation").

    Fixes piglit arb_tessellation_shader/execution/gs-primitiveid-instanced.shader_test

    Cc: 17.1 <[email protected]>
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: get rid of secondary input/output word (Nicolai Hähnle, 2017-05-12; 1 file, -19/+3)
    By keeping track of fewer generics, everything can fit into 64 bits.

    Tested-by: Dieter Nützel <[email protected]>
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: skip generic out/in indices without a shader IO index (Nicolai Hähnle, 2017-05-12; 1 file, -1/+5)
    OpenGL uses at most 32 generic outputs/inputs in any stage, and they
    always have a shader IO index and therefore fit into the
    outputs_written/inputs_read/kill_outputs fields.

    However, Nine uses semantic indices more liberally. We support that in
    VS-PS pipelines, except that the optimization of killing outputs must
    be skipped.

    Tested-by: Dieter Nützel <[email protected]>
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: use SI_MAX_IO_GENERIC instead of magic values (Nicolai Hähnle, 2017-05-12; 1 file, -2/+2)
    Tested-by: Dieter Nützel <[email protected]>
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: split per-patch from per-vertex indices (Nicolai Hähnle, 2017-05-08; 1 file, -3/+3)
    Make it a bit clearer that the index spaces are logically separate by
    having them defined in different functions.

    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: load patch_id for TES-as-ES when exporting for PS (Nicolai Hähnle, 2017-05-08; 1 file, -2/+2)
    For some reason, this change is only necessary on SI.

    Cc: [email protected]
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: fix primitive ID in fragment shader when using tessellation (Nicolai Hähnle, 2017-05-08; 1 file, -10/+17)
    In a VS->TCS->TES->PS pipeline, the primitive ID is read from TES
    exports, so it is as if TES were using the primitive ID.

    Specifically, this fixes a bug where the primitive ID is not reset at
    the start of a new instance.

    Cc: [email protected]
    Reviewed-by: Marek Olšák <[email protected]>
* radeonsi/gfx9: allow the scratch buffer in HS and GS (Marek Olšák, 2017-05-05; 1 file, -10/+0)
    It works now.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: prevent race conditions when doing scratch patching (Marek Olšák, 2017-05-05; 1 file, -2/+30)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: separate scratch state patching code into its own function (Marek Olšák, 2017-05-05; 1 file, -46/+55)
    Picked from a different branch. When we stop using the scratch patching,
    this function will not be called.

    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: also apply scratch relocations to the 1st shader of merged shaders (Marek Olšák, 2017-05-05; 1 file, -0/+3)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: remove unused parameters from si_shader_apply_scratch_relocs (Marek Olšák, 2017-05-05; 1 file, -1/+1)
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: fix gl_ViewportIndex (Marek Olšák, 2017-05-03; 1 file, -2/+11)
    v2: remove unnecessary LLVMBuildAnd calls

    Cc: 17.1 <[email protected]>
    Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: pass tessellation ring addresses via user SGPRs (Marek Olšák, 2017-04-28; 1 file, -12/+38)
    This removes s_load_dword latency for tess rings.

    We need just 1 SGPR for the address if we use 64K alignment. The final
    asm for recreating the descriptor is:

      // s2 is (address >> 16)
      s_mov_b32 s3, 0
      s_lshl_b64 s[4:5], s[2:3], 16
      s_mov_b32 s6, -1
      s_mov_b32 s7, 0x27fac

    v2: bitcast the descriptor type from v2i64 to v4i32

    Reviewed-by: Nicolai Hähnle <[email protected]>
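The five-instruction sequence in this commit message can be mirrored in C to show how one 32-bit value expands into a four-dword descriptor. The function name is hypothetical and this is a sketch of the arithmetic only; the constant 0x27fac is copied verbatim from the commit message without interpreting its bit fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the asm: rebuilds four descriptor dwords
 * from the single SGPR value that holds (address >> 16). This works because
 * a 64K-aligned address has its low 16 bits equal to zero. */
static void demo_build_ring_desc(uint32_t addr_shifted, uint32_t desc[4])
{
    /* s_mov_b32 s3, 0 ; s_lshl_b64 s[4:5], s[2:3], 16 */
    uint64_t address = (uint64_t)addr_shifted << 16;

    desc[0] = (uint32_t)address;          /* s4: low 32 bits of the address */
    desc[1] = (uint32_t)(address >> 32);  /* s5: remaining high address bits */
    desc[2] = 0xffffffffu;                /* s_mov_b32 s6, -1 */
    desc[3] = 0x27fac;                    /* s_mov_b32 s7, 0x27fac */
}
```

Passing the pre-shifted address costs one user SGPR and a few scalar ALU instructions, in place of a memory load of the full descriptor.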