aboutsummaryrefslogtreecommitdiffstats
path: root/src/gallium/drivers/radeonsi/si_state_shaders.c
Commit message (Collapse)AuthorAgeFilesLines
* radeonsi: make si_init_shader_selector_async staticNicolai Hähnle2017-09-131-1/+1
| | | | Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: optimize TCS epilog when invocation 0 writes tess factorsMarek Olšák2017-09-111-0/+3
| | | | | | | | | | This removes the barrier and LDS stores and loads for tess factors when it's possible. The removal of the barrier seems more important to me though. In one shader, it removes 17 * 4 bytes from the shader binary. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: don't read LS out vertex stride from an SGPR in monolithic HSMarek Olšák2017-09-071-1/+6
| | | | | | | -44 bytes in a monolithic LS-HS binary. Tested-by: Dieter Nützel <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: proper workaround for LS/HS VGPR initialization bugNicolai Hähnle2017-09-061-0/+9
| | | | | | | | | | | | | | | | | | | When the HS wave is empty, the hardware writes the LS VGPRs starting at v0 instead of v2. Workaround by shifting them back into place when necessary. For simplicity, this is always done in the LS prolog. According to the hardware team, this will be fixed in future chips, so take that into account already. Note that this is not a bug fix, as the bug was already worked around by commit 166823bfd26 ("radeonsi/gfx9: add a temporary workaround for a tessellation driver bug"). This change merely replaces the workaround by one that should be better. v2: add workaround code to shader only when necessary v3: clarify the prefer_mono comment Reviewed-by: Marek Olšák <[email protected]>
* radeonsi/gfx9: implement primitive binningMarek Olšák2017-09-051-0/+2
| | | | | | | This increases performance, but it was tuned for Raven, not Vega. We don't know yet how Vega will perform, hopefully not worse. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: eliminate PS color outputs when colormask kills themMarek Olšák2017-09-041-0/+1
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: stop leaking nirTimothy Arceri2017-08-291-0/+1
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* glsl: pass shader source keys to the disk cacheTimothy Arceri2017-08-251-1/+1
| | | | | | | We don't actually write them to disk here. That will happen in the following commit. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: emit VGT_REUSE_OFF in the right placeMarek Olšák2017-08-221-2/+9
| | | | | | | clip_regs aren't marked dirty when writes_viewport_index is changed. Cc: 17.2 <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: don't use GS scenario A for VS writing ViewportIndexMarek Olšák2017-08-221-7/+3
| | | | | | Vulkan doesn't do it anymore. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: prevent shader-db crashesMarek Olšák2017-08-221-1/+11
| | | | | | | - don't precompile LS and ES (they don't exist on GFX9), compile as VS instead - don't precompile HS and GS (we don't have LS and ES parts) Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: make si_shader_selector_reference globally visibleNicolai Hähnle2017-08-221-15/+2
| | | | Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: add a separate dirty mask for prefetchesMarek Olšák2017-08-071-2/+31
| | | | | | | | | | so that we don't rely on si_pm4_state_enabled_and_changed, allowing us to move prefetches after draw calls. v2: ckear the dirty mask after unbinding shaders Tested-by: Dieter Nützel <[email protected]> (v1) Reviewed-by: Nicolai Hähnle <[email protected]> (v1)
* radeonsi: add and use si_pm4_state_enabled_and_changedMarek Olšák2017-08-071-6/+6
| | | | | Tested-by: Dieter Nützel <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: de-atomize L2 prefetchMarek Olšák2017-08-071-1/+1
| | | | | | | I'd like to be able to move the prefetch call site around. Tested-by: Dieter Nützel <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: tweak next-shader assumptions when streamout is usedNicolai Hähnle2017-07-311-5/+11
| | | | | | VS with streamout is always a HW VS. Reviewed-by: Marek Olšák <[email protected]>
* radeonsi/nir: perform lowering of input/output driver locationsNicolai Hähnle2017-07-311-0/+2
| | | | Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: bypass the shader cache for NIR shadersNicolai Hähnle2017-07-311-2/+3
| | | | Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: scan NIR shaders to obtain required infoNicolai Hähnle2017-07-311-6/+17
| | | | | | v2: set num_instruction to 2, i.e. 1 + END (Marek) Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: simplify computation of tessellation offchip buffersMarek Olšák2017-07-171-15/+4
| | | | | | This is overly cautious, but better safe than sorry. Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium: use "ull" number suffix to keep the QtCreator parser happyMarek Olšák2017-07-101-5/+5
| | | | | | | It can't parse "llu". Reviewed-by: Thomas Helland <[email protected]> Reviewed-by: Eric Engestrom <[email protected]>
* radeonsi: move instance divisors into a constant bufferMarek Olšák2017-06-271-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shader key size: 107 -> 47 Divisors of 0 and 1 are encoded in the shader key. Greater instance divisors are loaded from a constant buffer. The shader code doing the division is huge. Is it something we need to worry about? Does any app use instance divisors >= 2? VS prolog disassembly: s_load_dwordx4 s[12:15], s[0:1], 0x80 ; C00A0300 00000080 s_nop 0 ; BF800000 s_waitcnt lgkmcnt(0) ; BF8C007F s_buffer_load_dword s14, s[12:15], 0x4 ; C0220386 00000004 s_waitcnt lgkmcnt(0) ; BF8C007F v_cvt_f32_u32_e32 v4, s14 ; 7E080C0E v_rcp_iflag_f32_e32 v4, v4 ; 7E084704 v_mul_f32_e32 v4, 0x4f800000, v4 ; 0A0808FF 4F800000 v_cvt_u32_f32_e32 v4, v4 ; 7E080F04 v_mul_hi_u32 v5, v4, s14 ; D2860005 00001D04 v_mul_lo_i32 v6, v4, s14 ; D2850006 00001D04 v_cmp_eq_u32_e64 s[12:13], 0, v5 ; D0CA000C 00020A80 v_sub_i32_e32 v5, vcc, 0, v6 ; 340A0C80 v_cndmask_b32_e64 v5, v6, v5, s[12:13] ; D1000005 00320B06 v_mul_hi_u32 v5, v5, v4 ; D2860005 00020905 v_add_i32_e32 v6, vcc, v5, v4 ; 320C0905 v_subrev_i32_e32 v4, vcc, v5, v4 ; 36080905 v_cndmask_b32_e64 v4, v4, v6, s[12:13] ; D1000004 00320D04 v_mul_hi_u32 v5, v4, v1 ; D2860005 00020304 v_add_i32_e32 v4, vcc, s8, v0 ; 32080008 v_mul_lo_i32 v6, v5, s14 ; D2850006 00001D05 v_add_i32_e32 v7, vcc, 1, v5 ; 320E0A81 v_cmp_ge_u32_e64 s[12:13], v1, v6 ; D0CE000C 00020D01 v_sub_i32_e32 v6, vcc, v1, v6 ; 340C0D01 v_cmp_le_u32_e32 vcc, s14, v6 ; 7D960C0E v_cndmask_b32_e64 v8, 0, -1, s[12:13] ; D1000008 00318280 v_cndmask_b32_e64 v6, 0, -1, vcc ; D1000006 01A98280 v_and_b32_e32 v6, v8, v6 ; 260C0D08 v_cmp_eq_u32_e32 vcc, 0, v6 ; 7D940C80 v_cndmask_b32_e32 v6, v7, v5, vcc ; 000C0B07 v_add_i32_e32 v5, vcc, -1, v5 ; 320A0AC1 v_cmp_eq_u32_e32 vcc, 0, v8 ; 7D941080 v_cndmask_b32_e32 v5, v6, v5, vcc ; 000A0B06 v_add_i32_e32 v5, vcc, s9, v5 ; 320A0A09 v2: set prefer_mono for fetched instance divisors Reviewed-by: Nicolai Hähnle <[email protected]>
* Revert "radeonsi: use uint32_t to declare si_shader_key.opt.kill_outputs"Marek Olšák2017-06-271-3/+2
| | | | | | This reverts commit 7b2240ac9ce3ba9bd86f4ae8aac53af8878c0b10. Reviewed-by: Nicolai Hähnle <[email protected]>
* Revert "radeonsi: remove 8 bytes from si_shader_key with uint32_t ↵Marek Olšák2017-06-271-6/+2
| | | | | | | | ff_tcs_inputs_to_copy" This reverts commit 6b6fed3a3c81c2b0d319ef121df20a0dc914705f. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: use the correct LLVMTargetMachineRef in si_build_shader_variantNicolai Hähnle2017-06-221-6/+22
| | | | | | | | | | | | si_build_shader_variant can actually be called directly from one of normal-priority compiler threads. In that case, the thread_index is only valid for the normal tm array. v2: - use the correct sel/shader->compiler_ctx_state Fixes: 86cc8097266c ("radeonsi: use a compiler queue with a low priority for optimized shaders") Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: track use of bindless samplers/images from tgsi_shader_infoSamuel Pitoiset2017-06-141-5/+26
| | | | | | | | | This adds some new helper functions to know if the current draw call (or dispatch compute) is using bindless samplers/images, based on TGSI analysis. Signed-off-by: Samuel Pitoiset <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: replace si_vertex_elements::elements with separate fieldsMarek Olšák2017-06-121-5/+2
| | | | | | It makes si_vertex_elements a little smaller. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: remove 8 bytes from si_shader_key with uint32_t ff_tcs_inputs_to_copyMarek Olšák2017-06-121-2/+6
| | | | | | The previous patch helps with this. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: use uint32_t to declare si_shader_key.opt.kill_outputsMarek Olšák2017-06-121-2/+3
| | | | | | the next patch will benefit from this Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: remove 8 bytes from si_shader_key by flattening opt.hw_vsMarek Olšák2017-06-121-6/+6
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: don't update dependent states if it has no effect (v2)Marek Olšák2017-06-081-5/+19
| | | | | | | | | | | | | | This and the previous clip_regs commit decrease IB sizes and the number of si_update_shaders invocations as follows: IB size si_update_shaders calls Borderlands 2 -10% -27% Deus Ex: MD -5% -11% Talos Principle -8% -30% v2: always dirty cb_render_state in set_framebuffer_state Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: update clip_regs on shader state changes only when it's neededMarek Olšák2017-06-071-3/+32
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: precompute some fields for PA_CL_VS_OUT_CNTL in si_shader_selectorMarek Olšák2017-06-071-0/+16
| | | | | Reviewed-by: Samuel Pitoiset <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: add a new helper si_get_vsMarek Olšák2017-06-071-4/+2
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: remove 8 bytes from si_shader_keyMarek Olšák2017-06-071-4/+4
| | | | | | | We can use a union in si_shader_key::mono. Reviewed-by: Samuel Pitoiset <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: move PSIZE and CLIPDIST unique IO indices after GENERICMarek Olšák2017-06-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Heaven LDS usage for LS+HS is below. The masks are "outputs_written" for LS and HS. Note that 32K is the maximum size. Before: heaven_x64: ls=1f1 tcs=1f1, lds=32K heaven_x64: ls=31 tcs=31, lds=24K heaven_x64: ls=71 tcs=71, lds=28K After: heaven_x64: ls=3f tcs=3f, lds=24K heaven_x64: ls=7 tcs=7, lds=13K heaven_x64: ls=f tcs=f, lds=17K All other apps have a similar decrease in LDS usage, because the "outputs_written" masks are similar. Also, most apps don't write POSITION in these shader stages, so there is room for improvement. (tight per-component input/output packing might help even more) It's unknown whether this improves performance. Tested-by: Edmondo Tommasina <[email protected]> Tested-by: Dieter Nützel <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: prevent a race when the previous shader's main part is missingMarek Olšák2017-06-071-0/+2
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: wait for main part compilation of 1st shaders of merged shadersMarek Olšák2017-06-071-0/+4
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi/gfx9: fix LS scratch buffer support without TCS for GFX9Marek Olšák2017-06-071-3/+18
| | | | | | | | | | LS is merged into TCS. If there is no TCS, LS is merged into fixed-func TCS. The problem is the fixed-func TCS was ignored by scratch update functions, so LS didn't have the scratch buffer set up. Note that Mesa 17.1 doesn't have merged shaders. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: move streamout state update out of si_update_shadersMarek Olšák2017-06-071-16/+24
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: move handling of DBG_NO_OPT_VARIANT into si_shader_selector_keyMarek Olšák2017-06-071-4/+3
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: use a compiler queue with a low priority for optimized shadersMarek Olšák2017-06-071-4/+4
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: drop unfinished shader compilations when destroying shadersMarek Olšák2017-06-071-2/+3
| | | | | | | If we enqueue too many jobs and destroy the GL context, it may take several seconds before the jobs finish. Just drop them instead. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: only upload (dump to L2) those descriptors that are used by shadersMarek Olšák2017-05-181-0/+6
| | | | | | | This decreases the size of CE RAM dumps to L2, or the size of descriptor uploads without CE. Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: record which descriptor slots are used by shadersMarek Olšák2017-05-181-0/+27
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: rename tcs_tes_uses_prim_id for clarityNicolai Hähnle2017-05-161-6/+6
| | | | | | | | What we care about is whether PrimID is used while tessellation is enabled; whether it's used in TCS/TES or further down the pipeline is irrelevant. Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: fix gl_PrimitiveIDIn in geometry shader when using tessellationNicolai Hähnle2017-05-161-0/+2
| | | | | | | | | | | This builds on commit 0549ea15ec38 ("radeonsi: fix primitive ID in fragment shader when using tessellation"). Fixes piglit arb_tessellation_shader/execution/gs-primitiveid-instanced.shader_test Cc: 17.1 <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: get rid of secondary input/output wordNicolai Hähnle2017-05-121-19/+3
| | | | | | | By keeping track of fewer generics, everything can fit into 64 bits. Tested-by: Dieter Nützel <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: skip generic out/in indices without a shader IO indexNicolai Hähnle2017-05-121-1/+5
| | | | | | | | | | | | | OpenGL uses at most 32 generic outputs/inputs in any stage, and they always have a shader IO index and therefore fit into the outputs_written/ inputs_read/kill_outputs fields. However, Nine uses semantic indices more liberally. We support that in VS-PS pipelines, except that the optimization of killing outputs must be skipped. Tested-by: Dieter Nützel <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: use SI_MAX_IO_GENERIC instead of magic valuesNicolai Hähnle2017-05-121-2/+2
| | | | | Tested-by: Dieter Nützel <[email protected]> Reviewed-by: Marek Olšák <[email protected]>