| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were assuming 32-bit elements. Also, In SIMD8 we pack 2 vector components
in a single SIMD register, so for example, component Y of a 16-bit vec2
starts is at byte offset 16B. This means that when we compute the offset of
the elements to be differentiated we should not stomp whatever base offset we
have, but instead add to it.
v2
- Use byte_offset() helper (Jason)
- Merge the fix for SIMD8: using byte_offset() fixes that too.
Reviewed-by: Jason Ekstrand <[email protected]> (v1)
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Source0 and Destination extract the floating-point precision automatically
from the SrcType and DstType instruction fields respectively when they are
set to types :F or :HF. For Source1 and Source2 operands, we use the new
1-bit fields Src1Type and Src2Type, where 0 means normal precision and 1
means half-precision. Since we always use the type of the destination for
all operands when we emit 3-source instructions, we only need set Src1Type
and Src2Type to 1 when we are emitting a half-precision instruction.
v2:
- Set the bit separately for each source based on its type so we can
do mixed floating-point mode in the future (Topi).
v3:
- Use regular citation style for the comment referencing the PRM (Matt).
- Decided not to add asserts in the emission code to check that only
mixed HF/F types are used since such checks would break negative tests
for brw_eu_validate.c (Matt)
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
| |
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We are now using these bits, so don't assert that they are not set. In gen8,
if these bits are set compaction is not possible. On gen9 and CHV platforms
set_3src_control_index() checks these bits (and others) against a table to
validate if the particular bit combination is eligible for compaction or not.
v2
- Add more detail in the commit message explaining the situation for SKL+
and CHV (Jason)
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is available since gen8.
v2: restore previously existing assertion.
v3: don't use separate tables for gen7 and gen8, just assert that we
don't use half-float before gen8 (Matt)
Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original SrcType is a 3-bit field that takes a subset of the types
supported for the hardware for 3-source instructions. Since gen8,
when the half-float type was added, 3-source floating point operations
can use use mixed precision mode, where not all the operands have the
same floating-point precision. While the precision for the first operand
is taken from the type in SrcType, the bits in Src1Type (bit 36) and
Src2Type (bit 35) define the precision for the other operands
(0: normal precision, 1: half precision).
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Acked-by: Jason Ekstrand <[email protected]>
|
|
|
|
| |
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
v2:
- make 16-bit be its own separate case (Jason)
v3:
- Drop the result_int temporary (Jason)
Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Extended math with half-float operands is only supported since gen9,
but it is limited to SIMD8. In gen8 we lower it to 32-bit.
v2: quashed together the following patches (Jason):
- intel/compiler: allow extended math functions with HF operands
- intel/compiler: lower 16-bit extended math to 32-bit prior to gen9
- intel/compiler: extended Math is limited to SIMD8 on half-float
Reviewed-by: Jason Ekstrand <[email protected]>
Reviewed-by: Topi Pohjolainen <[email protected]>
(allow extended math functions with HF operands,
extended Math is limited to SIMD8 on half-float)
|
|
|
|
|
|
|
| |
The hardware doesn't support half-float for these.
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
There are some hardware restrictions that brw_nir_lower_conversions should
have taken care of before we get here.
v2:
- rebased on top of regioning lowering pass
Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we handle booleans as integers this makes more sense.
v2:
- rebased to incorporate new boolean conversion opcodes
v3:
- rebased on top regioning lowering pass
Reviewed-by: Jason Ekstrand <[email protected]> (v1)
Reviewed-by: Topi Pohjolainen <[email protected]> (v2)
|
|
|
|
|
|
|
|
|
|
|
| |
Going forward having these split is a bit more convenient since these two
groups have different restrictions.
v2:
- Rebased on top of new regioning lowering pass.
Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some conversions are not directly supported in hardware and need to be
split in two conversion instructions going through an intermediary type.
Doing this at the NIR level simplifies a bit the complexity in the backend.
v2:
- Consider fp16 rounding conversion opcodes
- Properly handle swizzles on conversion sources.
v3
- Run the pass earlier, right after nir_opt_algebraic_late (Jason)
- NIR alu output types already have the bit-size (Jason)
- Use 'is_conversion' to identify conversion operations (Jason)
v4:
- Be careful about the intermediate types we use so we don't lose
range and avoid incorrect rounding semantics (Jason)
Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Similarly to 1cc17fb731466c68586915acbb916586457b19bc
Fixes gpu hangs with dEQP-VK.tessellation.shader_input_output.barrier
Reviewed-by: Anuj Phogat <[email protected]>
Signed-off-by: Topi Pohjolainen <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The i965 driver has a bunch of code to compare two sets of program keys
and print out the differences. This can be useful for debugging why a
shader needed to be recompiled on the fly due to non-orthogonal state
dependencies. anv doesn't do recompiles, so we didn't need to share
this in the past - but I'd like to use it in iris.
This moves the bulk of the code to the compiler where it can be reused.
To make that possible, we need to decouple it from i965 - we can't get
at the brw program cache directly, nor use brw_context to print things.
Instead, we use compiler->shader_perf_log(), and simply pass in keys.
We put all of this debugging code in brw_debug_recompile.c, and only
export a single function, for simplicity. I also tidied the code a
bit while moving it, now that it all lives in one file.
Reviewed-by: Jordan Justen <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
If we write to the flag register changing the swizzle would change
what channels are written to the flag register.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110201
Fixes: 4cd1a0be
Signed-off-by: Danylo Piliaiev <[email protected]>
Reviewed-by: <[email protected]>
|
|
|
|
|
|
|
|
|
| |
v2: remove & operator in a couple of memsets
add some memsets
v3: fixup lima
Signed-off-by: Karol Herbst <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]> (v2)
|
|
|
|
|
|
| |
v2: replace nir_zero_vec with nir_imm_zero (Karol Herbst)
Reviewed-by: Karol Herbst <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Karol Herbst <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
| |
This makes things a bit simpler and it's also more robust because it no
longer has a hard dependency on the offset being a 32-bit value.
|
|
|
|
|
|
|
|
|
| |
We will never hit a condition where we have src1 and src2 as immediate
operands.
Signed-off-by: Sagar Ghuge <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
| |
This required to calculate sizes correctly when we have bindless
samplers/images.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Karol Herbst <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
| |
libintel_common depends on libintel_compiler, but it contains debug
functionality that is needed by libintel_compiler. Break the circular
dependency by moving gen_debug files to libintel_dev.
Suggested-by: Kenneth Graunke <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This will make that step visible in NIR_PRINT=1.
v2: Also use the macro for the cleanup passes.
Reviewed-by: Ian Romanick <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This was needed when certain intrinsics were lowered to other ones
that were defined by the same pass. After 060817b2 "intel,nir: Move
gl_LocalInvocationID lowering to nir_lower_system_values" we don't
need the loop anymore.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When using quads, instead of mapping the elements to the next 4 local
invocation indices, we map the two next in the "current" row and two
next in the "next row". A side effect is that a thread will execute
the indices in a different order.
We now perform the lowering of both local invocation ID and index
together -- and don't rely anymore on lowering done by
nir_lower_system_values. That is convenient when doing the math for
quads, because we need X and Y to get the right invocation index.
When the pass progresses, fold the constants and clean up to reduce
the noise from the indexing math.
This implements the derivative_group_quadsNV semantics from
NV_compute_shader_derivatives.
v2: Take subgroup_id into account, otherwise only values in the first
subgroup would be used. (Jason)
v3: Calculate invocation index and ID together, to avoid duplicating
some math in the quads case when both index and ID are used. (Jason)
v4: Don't call cleanup passes as part of the lowering, let that to the
call site. (Jason)
Change calculation to use less instructions. (Jason)
Reviewed-by: Ian Romanick <[email protected]> (v3)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
| |
Make sure we include compute shaders that have a derivative group
defined.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When I implemented opt_if_loop_last_continue() I had restricted
this pass from moving other if-statements inside the branch opposite
the continue. At the time it was causing a bunch of spilling in
shader-db for i965.
However Samuel Pitoiset noticed that making this pass more aggressive
significantly improved the performance of Doom on RADV. Below are
the statistics he gathered.
28717 shaders in 14931 tests
Totals:
SGPRS: 1267317 -> 1267549 (0.02 %)
VGPRS: 896876 -> 895920 (-0.11 %)
Spilled SGPRs: 24701 -> 26367 (6.74 %)
Code Size: 48379452 -> 48507880 (0.27 %) bytes
Max Waves: 241159 -> 241190 (0.01 %)
Totals from affected shaders:
SGPRS: 23584 -> 23816 (0.98 %)
VGPRS: 25908 -> 24952 (-3.69 %)
Spilled SGPRs: 503 -> 2169 (331.21 %)
Code Size: 2471392 -> 2599820 (5.20 %) bytes
Max Waves: 586 -> 617 (5.29 %)
The codesize increases is related to Wolfenstein II it seems largely
due to an increase in phis rather than the existing jumps.
This gives +10% FPS with Doom on my Vega56.
Rhys Perry also benchmarked Doom on his VEGA64:
Before: 72.53 FPS
After: 80.77 FPS
v2: disable pass on non-AMD drivers
Reviewed-by: Ian Romanick <[email protected]> (v1)
Acked-by: Samuel Pitoiset <[email protected]>
|
|
|
|
|
|
|
| |
If we increase vector sizing later it would be nice to avoid
tripped over this again.
Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Almost all of the hurt shaders are repeated instances of the same shader
in synmark's compilation speed tests.
shader-db results:
All Gen6+ platforms had similar results. (Skylake shown)
total instructions in shared programs: 15256840 -> 15256389 (<.01%)
instructions in affected programs: 54137 -> 53686 (-0.83%)
helped: 288
HURT: 0
helped stats (abs) min: 1 max: 15 x̄: 1.57 x̃: 1
helped stats (rel) min: 0.06% max: 26.67% x̄: 1.99% x̃: 0.74%
95% mean confidence interval for instructions value: -1.76 -1.38
95% mean confidence interval for instructions %-change: -2.47% -1.50%
Instructions are helped.
total cycles in shared programs: 372286583 -> 372283851 (<.01%)
cycles in affected programs: 833829 -> 831097 (-0.33%)
helped: 265
HURT: 16
helped stats (abs) min: 2 max: 74 x̄: 11.81 x̃: 4
helped stats (rel) min: 0.04% max: 9.07% x̄: 0.99% x̃: 0.35%
HURT stats (abs) min: 2 max: 130 x̄: 24.88 x̃: 8
HURT stats (rel) min: <.01% max: 12.31% x̄: 1.44% x̃: 0.27%
95% mean confidence interval for cycles value: -12.30 -7.15
95% mean confidence interval for cycles %-change: -1.06% -0.64%
Cycles are helped.
Iron Lake and GM45 had similar results. (GM45 shown)
total instructions in shared programs: 5038653 -> 5038495 (<.01%)
instructions in affected programs: 13939 -> 13781 (-1.13%)
helped: 50
HURT: 1
helped stats (abs) min: 1 max: 15 x̄: 3.18 x̃: 4
helped stats (rel) min: 0.33% max: 13.33% x̄: 2.24% x̃: 1.09%
HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel) min: 0.83% max: 0.83% x̄: 0.83% x̃: 0.83%
95% mean confidence interval for instructions value: -3.73 -2.47
95% mean confidence interval for instructions %-change: -3.16% -1.21%
Instructions are helped.
total cycles in shared programs: 128118922 -> 128118228 (<.01%)
cycles in affected programs: 134906 -> 134212 (-0.51%)
helped: 50
HURT: 0
helped stats (abs) min: 2 max: 60 x̄: 13.88 x̃: 18
helped stats (rel) min: 0.06% max: 3.19% x̄: 0.74% x̃: 0.70%
95% mean confidence interval for cycles value: -16.54 -11.22
95% mean confidence interval for cycles %-change: -0.95% -0.53%
Cycles are helped.
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix the order of src0_alpha and sample mask in fb payload.
From SKL PRM Volume 7, "Data Payload Register Order
for Render Target Write Messages":
Type S0A oM sZ oS M2 M3 M4
SIMD8 1 1 0 0 s0A oM R
SIMD16 1 1 0 0 1/0s0A 3/2s0A oM
It also fixes working of alpha to coverage with sample mask
on GEN6 since now they are in correct order.
Signed-off-by: Danylo Piliaiev <[email protected]>
Signed-off-by: Francisco Jerez <[email protected]>
Reviewed-by: Francisco Jerez <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
From "Alpha Coverage" section of SKL PRM Volume 7:
"If Pixel Shader outputs oMask, AlphaToCoverage is disabled in
hardware, regardless of the state setting for this feature."
From OpenGL spec 4.6, "15.2 Shader Execution":
"The built-in integer array gl_SampleMask can be used to change
the sample coverage for a fragment from within the shader."
From OpenGL spec 4.6, "17.3.1 Alpha To Coverage":
"If SAMPLE_ALPHA_TO_COVERAGE is enabled, a temporary coverage value
is generated where each bit is determined by the alpha value at the
corresponding sample location. The temporary coverage value is then
ANDed with the fragment coverage value to generate a new fragment
coverage value."
Similar wording could be found in Vulkan spec 1.1.100
"25.6. Multisample Coverage"
Thus we need to compute alpha to coverage dithering manually in shader
and replace sample mask store with the bitwise-AND of sample mask and
alpha to coverage dithering.
The following formula is used to compute final sample mask:
m = int(16.0 * clamp(src0_alpha, 0.0, 1.0))
dither_mask = 0x1111 * ((0xfea80 >> (m & ~3)) & 0xf) |
0x0808 * (m & 2) | 0x0100 * (m & 1)
sample_mask = sample_mask & dither_mask
Credits to Francisco Jerez <[email protected]> for creating it.
It gives a number of ones proportional to the alpha for 2, 4, 8 or 16
least significant bits of the result.
GEN6 hardware does not have issue with simultaneous usage of sample mask
and alpha to coverage however due to the wrong sending order of oMask
and src0_alpha it is still affected by it.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109743
Signed-off-by: Danylo Piliaiev <[email protected]>
Reviewed-by: Francisco Jerez <[email protected]>
|
|
|
|
|
|
|
|
|
| |
And enable it on Intel.
v2:
- Squash the change to enable it on Intel (Jason)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
| |
And enable it on Intel.
v2:
- Squash the change to enable this lowering on Intel (Jason)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
| |
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
| |
v2: turn on for turnip as well (Karol Herbst)
Reviewed-by: Karol Herbst <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes a serious performance issue with DXVK:
https://github.com/doitsujin/dxvk/issues/937
This was caused by a recent change that to improve performance on RADV
which back-fired on ANV and killed performance for some apps:
https://github.com/doitsujin/dxvk/commit/e5a06d3f4a103a54cd4eb51970fedee405d1d698
Throwing in this bit of lowering lets us come along and CSE those UBO
loads (or copy-prop for SSBO load) and get one load where we previously
would have gotten several.
VkPipeline-db results on Kaby Lake:
total instructions in shared programs: 5115361 -> 5073185 (-0.82%)
instructions in affected programs: 1754333 -> 1712157 (-2.40%)
helped: 5331
HURT: 63
total cycles in shared programs: 2544501169 -> 2481144545 (-2.49%)
cycles in affected programs: 2531058653 -> 2467702029 (-2.50%)
helped: 9202
HURT: 4323
total loops in shared programs: 3340 -> 3331 (-0.27%)
loops in affected programs: 9 -> 0
helped: 9
HURT: 0
total spills in shared programs: 3246 -> 3053 (-5.95%)
spills in affected programs: 384 -> 191 (-50.26%)
helped: 10
HURT: 5
total fills in shared programs: 4626 -> 4452 (-3.76%)
fills in affected programs: 439 -> 265 (-39.64%)
helped: 10
HURT: 5
All of the shaders with hurt spilling were in Rise of the Tomb Raider
which also had shaders solidly helped in the spilling department. Not
shown in those results (because I've not had success dumping the
shaders) is Witcher 3 where this reduces spilling and improves over-all
perf by around 20-25%. There were no shader-db changes. Apparently,
this just isn't a pattern that happens in OpenGL.
Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
Cc: "19.0" [email protected]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead, we do UBO and SSBO deref lowering in NIR after we've given it a
chance to optimize SSBO access:
Shader-db results on Kaby Lake:
total instructions in shared programs: 15235775 -> 15235484 (<.01%)
instructions in affected programs: 14992 -> 14701 (-1.94%)
helped: 19
HURT: 20
total cycles in shared programs: 339220331 -> 339027307 (-0.06%)
cycles in affected programs: 79831981 -> 79638957 (-0.24%)
helped: 540
HURT: 602
total loops in shared programs: 4402 -> 4348 (-1.23%)
loops in affected programs: 186 -> 132 (-29.03%)
helped: 27
HURT: 0
total spills in shared programs: 23261 -> 23234 (-0.12%)
spills in affected programs: 38 -> 11 (-71.05%)
helped: 1
HURT: 0
total fills in shared programs: 31442 -> 31371 (-0.23%)
fills in affected programs: 98 -> 27 (-72.45%)
helped: 1
HURT: 0
LOST: 12
GAINED: 12
Most of the help and hurt in instruction counts was just churn caused by
re-ordering of optimizations and the fact that the NIR deref lowering
code is emitting slightly different instructions. Nothing was hurt by
more than three instructions and most things weren't helped by more than
four. The primary exception to this is one Car Chase shader:
shaders/non-free/gfxbench4/carchase/341.shader_test CS SIMD32: 1144 -> 821 (-28.23%)
There is also one compute shader in Manhattan 3.1 and a fragment shader
in the UE4 Shooter Game demo that now get a loop partially unrolled.
Those showed up in the results as hurt instructions but were manually
removed to get the results above.
The lost/gained was a dozen Car Chase shaders that went from SIMD8 to
SIMD16 thanks to improved register pressure:
shaders/non-free/gfxbench4/carchase/366.shader_test CS
shaders/non-free/gfxbench4/carchase/368.shader_test CS
shaders/non-free/gfxbench4/carchase/370.shader_test CS
shaders/non-free/gfxbench4/carchase/372.shader_test CS
shaders/non-free/gfxbench4/carchase/376.shader_test CS
shaders/non-free/gfxbench4/carchase/378.shader_test CS
shaders/non-free/gfxbench4/carchase/380.shader_test CS
shaders/non-free/gfxbench4/carchase/382.shader_test CS
shaders/non-free/gfxbench4/carchase/384.shader_test CS
shaders/non-free/gfxbench4/carchase/388.shader_test CS
shaders/non-free/gfxbench4/carchase/4.shader_test CS
shaders/non-free/gfxbench4/carchase/6.shader_test CS
Given how much it appeared to be improved, I ran Car Chase on my laptop.
Unfortunately, I wasn't able to see any measurable improvement. It
might be helped by 1-2% but it's in the noise. It does render correctly
as far as I can tell so the improvement is legitimate.
All of the loops that got delete were in dolphin uber shaders. I've had
no opportunity to test them for correctness or performance.
Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
ARB_fragment_shader_interlock depends on memory fences to
ensure fragment ordering and this ordering guarantee is
only supported from GEN9 onwards.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109980
Fixes: 939312702e35 "i965: Add ARB_fragment_shader_interlock support."
Signed-off-by: Plamena Manolova <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Due to lack of write mask in SPIR-V store, generators may produce
multiple stores to the same vector but using different array derefs.
Use the combining store pass to clean this up. For example,
layout(binding = 3) buffer block {
vec4 v;
};
void main() {
v.x = 11;
v.y = 22;
}
after going to SPIR-V and NIR, ends up with in two store_derefs to
v[0] and v[1]
vec2 32 ssa_4 = deref_struct &ssa_3->field0 (ssbo vec4) /* &((block *)ssa_2)->field0 */
vec2 32 ssa_6 = deref_array &(*ssa_4)[0] (ssbo float) /* &((block *)ssa_2)->field0[0] */
intrinsic store_deref (ssa_6, ssa_7) (1, 0) /* wrmask=x */ /* access=0 */
vec1 32 ssa_13 = load_const (0x00000001 /* 0.000000 */)
vec2 32 ssa_14 = deref_array &(*ssa_4)[1] (ssbo float) /* &((block *)ssa_2)->field0[1] */
intrinsic store_deref (ssa_14, ssa_15) (1, 0) /* wrmask=x */ /* access=0 */
producing two different sends instructions in skl. The combining pass
transform the snippet above into
vec2 32 ssa_4 = deref_struct &ssa_3->field0 (ssbo vec4) /* &((block *)ssa_2)->field0 */
vec4 32 ssa_18 = vec4 ssa_7, ssa_15, ssa_16, ssa_17
intrinsic store_deref (ssa_4, ssa_18) (3, 0) /* wrmask=xy */ /* access=0 */
producing a single sends instruction.
v2: Move this from spirv_to_nir into the general optimization pass for
intel compiler. (Jason)
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Shader-db results for skl:
total instructions in shared programs: 15232903 -> 15224781 (-0.05%)
instructions in affected programs: 61246 -> 53124 (-13.26%)
helped: 221
HURT: 0
total cycles in shared programs: 371440470 -> 371398018 (-0.01%)
cycles in affected programs: 281363 -> 238911 (-15.09%)
helped: 221
HURT: 0
Results for bdw are very similar.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
| |
v2: (all from Jason)
Reuse existing function for the end of the block combinations.
Check the SSA values are coming from the right place in tests.
Document the case when the store to array_deref is reused.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were not copying the saturate bit from the original instruction
to the new replacement instruction. This caused major misrendering
in DiRT Rally on iris, where comparisons leading to discards failed
due to the missing saturate, causing lots of extra garbage pixels to
be drawn in text rendering, trees, and so on.
This did not show up on i965 because st/nir performs a more aggressive
version of nir_opt_peephole_select, yielding more b32csel operations.
Fixes: 52c7df1643e i965/fs: Merge CMP and SEL into CSEL on Gen8+
Reviewed-by: Lionel Landwerlin <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The IO scalarization pass that we run to help with linking end up
turning some shader I/O such as that for tessellation and geometry
shaders into many scalar URB operations rather than one vector one. To
alleviate this, we now vectorize the I/O once again. This fixes a 10%
performance regression in the GfxBench tessellation test that was caused
by scalarizing.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15224023 -> 15220871 (-0.02%)
instructions in affected programs: 342009 -> 338857 (-0.92%)
helped: 1236
HURT: 443
total spills in shared programs: 23471 -> 23465 (-0.03%)
spills in affected programs: 6 -> 0
helped: 1
HURT: 0
total fills in shared programs: 31770 -> 31766 (-0.01%)
fills in affected programs: 4 -> 0
helped: 1
HURT: 0
Cycles was just a lot of churn do to moves being different places. Most
of the pure churn in instructions was +/- one or two instructions in
fragment shaders.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107510
Fixes: 4434591bf56a "intel/nir: Call nir_lower_io_to_scalar_early"
Fixes: 8d8222461f9d "intel/nir: Enable nir_opt_find_array_copies"
Reviewed-by: Connor Abbott <[email protected]>
|
|
|
|
|
|
|
|
|
| |
It doesn't really matter where this pass goes as long as it's after we
call nir_lower_explicit_io and before we go into the back-end. Putting
it brw_postprocess_nir lets us move nir_lower_explicit_io significantly
later in the pipeline.
Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
|
|
|
|
| |
Reviewed-by: Lionel Landwerlin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we have a loop unrolling cost function and loop unrolling isn't
going to kill us the moment we have a 64-bit op in a loop, we can go
ahead and move 64-bit lowering later. This gives us the opportunity to
do more optimizations and actually let the full optimizer run even on
64-bit ops rather than hoping one round of opt_algebraic will fix
everything. This substantially reduces both fp64 shader compile times
and the resulting code size. On the vs-isnan-dvec test from piglit:
Before this commit:
1684.63s user 17.29s system 99% cpu 28:28.24 total
101479 instructions. 0 loops. 802452 cycles. 79:369 spills:fills.
Peak memory usage (according to massif): 1.435 GB
After this commit:
179.64s user 7.75s system 99% cpu 3:07.92 total
57316 instructions. 0 loops. 459287 cycles. 0:0 spills:fills.
Peak memory usage (according to massif): 531.0 MB
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Jordan Justen <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of trusting the caller to already have created a softfp64
function shader and added all its functions to our shader, we simply
take the softfp64 shader as an argument and do the function inlining
ouselves. This means that there's no more nasty functions lying around
that the caller needs to worry about cleaning up.
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Jordan Justen <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|