| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
| |
Signed-off-by: Jonathan Marek <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Jonathan Marek <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
| |
The fdph opcode is not supported.
Signed-off-by: Jonathan Marek <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Fixes the following deqp test:
dEQP-GLES2.functional.shaders.builtin_variable.pointcoord
Signed-off-by: Jonathan Marek <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Some instructions generated by int/bool float lowering need to be lowered
by opt_algebraic.
Fixes: 43dbd7d6
Signed-off-by: Jonathan Marek <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Utgard PP has vector fcsel operation, but its condition is scalar. Add
filtering callback that checks whether {b,f}csel condition is not scalar
to lower {b,f}csel to scalar only in this case.
Reviewed-by: Qiang Yu <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Signed-off-by: Vasily Khoruzhick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Set of opcodes doesn't have enough flexibility in certain cases. E.g.
Utgard PP has vector conditional select operation, but condition is always
scalar. Lowering all the vector selects to scalar increases instruction
number, so we need a way to filter only those ops that can't be handled
in hardware.
Reviewed-by: Qiang Yu <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
Signed-off-by: Vasily Khoruzhick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that spilling ops can be inserted into existing instructions, it
makes sense to increase cost to spill registers that would cause the
creation of a new instruction.
Experimental results showed that penalizing too much due to this caused
worse results, however it is beneficial as a tie resolver between
registers with the same number of components.
Signed-off-by: Erico Nunes <[email protected]>
Reviewed-by: Vasily Khoruzhick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Avoid creating unnecessary instructions for the load/store temp nodes
when not required, to further reduce register pressure.
The store_temp operation seems to be unable to do any spilling.
At least the offline shader seems to never output instructions accessing
swizzled components, and attempting to output that in ppir results in
errors. So, force spilled registers to allocate a full vec4 register.
This seems to be the optimal way as it is possible to always keep stores
and temps in a single instruction that can be pipelined.
Signed-off-by: Erico Nunes <[email protected]>
Reviewed-by: Vasily Khoruzhick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
One ssa created in the spillinc code in ppir_update_spilled_src was not
properly being marked 'spilled', which made it a candidate for future
spilling attempts.
Since it was being inserted by the spilling code itself, let's mark it
unspillable to avoid an infinite spilling loop.
Signed-off-by: Erico Nunes <[email protected]>
Reviewed-by: Vasily Khoruzhick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
shader-db results:
Totals:
SGPRS: 3955968 -> 3954960 (-0.03 %)
VGPRS: 2220220 -> 2220092 (-0.01 %)
Spilled SGPRs: 11387 -> 11325 (-0.54 %)
Spilled VGPRs: 97 -> 97 (0.00 %)
Private memory VGPRs: 2528 -> 2528 (0.00 %)
Scratch size: 2656 -> 2656 (0.00 %) dwords per thread
Code Size: 76002204 -> 75994988 (-0.01 %) bytes
LDS: 740 -> 740 (0.00 %) blocks
Max Waves: 772776 -> 772787 (0.00 %)
Wait states: 0 -> 0 (0.00 %)
Totals from affected shaders:
SGPRS: 16840 -> 15832 (-5.99 %)
VGPRS: 16452 -> 16324 (-0.78 %)
Spilled SGPRs: 1416 -> 1354 (-4.38 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 2016 -> 2016 (0.00 %)
Scratch size: 2040 -> 2040 (0.00 %) dwords per thread
Code Size: 953624 -> 946408 (-0.76 %) bytes
LDS: 303 -> 303 (0.00 %) blocks
Max Waves: 1622 -> 1633 (0.68 %)
Wait states: 0 -> 0 (0.00 %)
There were a large number of regressions in code size, but they seem to
be because NIR unrolls some loop which results in the table being
replaced by a bunch of immediates on multiplies etc. -- this bloats code
size since the table size is now included, but means that there are less
loads so it's still a net positive.
Reviewed-by: Timothy Arceri <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
For radeonsi, we will prefer the NIR pass as it'll generate better code
(some index calculation and a single load vs. a load, then index
calculation, then another load) and oftentimes NIR optimization can kick
in and make all the access indices constant.
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Timothy Arceri <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
vkpipeline-db numbers:
Totals:
SGPRS: 1740306 -> 1741322 (0.06 %)
VGPRS: 1331124 -> 1331712 (0.04 %)
Spilled SGPRs: 21201 -> 21316 (0.54 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 256 -> 256 (0.00 %) dwords per thread
Code Size: 79022628 -> 78694788 (-0.41 %) bytes
LDS: 6500 -> 6500 (0.00 %) blocks
Max Waves: 301413 -> 301302 (-0.04 %)
Wait states: 0 -> 0 (0.00 %)
Totals from affected shaders:
SGPRS: 53633 -> 54649 (1.89 %)
VGPRS: 53000 -> 53588 (1.11 %)
Spilled SGPRs: 3454 -> 3569 (3.33 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 5284232 -> 4956392 (-6.20 %) bytes
LDS: 2 -> 2 (0.00 %) blocks
Max Waves: 4239 -> 4128 (-2.62 %)
Wait states: 0 -> 0 (0.00 %)
(The biggest VGPR and max wave regression is due to unrolling a loop,
which made the scheduler more aggressive, but in this case it's able to
effectively hide latency so it's actually probably a win.)
shader-db numbers with radeonsi NIR:
Totals:
SGPRS: 3526496 -> 3526512 (0.00 %)
VGPRS: 2198576 -> 2198576 (0.00 %)
Spilled SGPRs: 10463 -> 10463 (0.00 %)
Spilled VGPRs: 86 -> 86 (0.00 %)
Private memory VGPRs: 3182 -> 2528 (-20.55 %)
Scratch size: 3308 -> 2640 (-20.19 %) dwords per thread
Code Size: 74117280 -> 74106140 (-0.02 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 775846 -> 775844 (-0.00 %)
Wait states: 0 -> 0 (0.00 %)
Totals from affected shaders:
SGPRS: 856 -> 872 (1.87 %)
VGPRS: 680 -> 680 (0.00 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 654 -> 0 (-100.00 %)
Scratch size: 668 -> 0 (-100.00 %) dwords per thread
Code Size: 49652 -> 38512 (-22.44 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 182 -> 180 (-1.10 %)
Wait states: 0 -> 0 (0.00 %)
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
We usually use these counts as a simple way to figure out if a change
reduces the number of instructions or shrinks an instruction. However,
since .rodata sections aren't executed, we shouldn't be counting their
size for this analysis. Make the linker return the total executable
size, and use it to report the more useful size in both drivers.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
../mesa/src/gallium/state_trackers/clover/llvm/invocation.cpp: In function ‘std::unique_ptr<clang::CompilerInstance> {anonymous}::create_compiler_instance(const clover::device&, const std::vector<std::__cxx11::basic_string<char> >&, std::string&)’:
../mesa/src/gallium/state_trackers/clover/llvm/invocation.cpp:203:81: error: no matching function for call to ‘clang::CompilerInvocation::CreateFromArgs(clang::CompilerInvocation&, const char* const*, const char* const*, clang::DiagnosticsEngine&)’
203 | c->getInvocation(), copts.data(), copts.data() + copts.size(), diag))
| ^
In file included from /opt/llvm64/include/clang/Frontend/CompilerInstance.h:15,
from ../mesa/src/gallium/state_trackers/clover/llvm/codegen.hpp:37,
from ../mesa/src/gallium/state_trackers/clover/llvm/invocation.cpp:49:
/opt/llvm64/include/clang/Frontend/CompilerInvocation.h:157:15: note: candidate: ‘static bool clang::CompilerInvocation::CreateFromArgs(clang::CompilerInvocation&, llvm::ArrayRef<const char*>, clang::DiagnosticsEngine&)’
157 | static bool CreateFromArgs(CompilerInvocation &Res,
| ^~~~~~~~~~~~~~
/opt/llvm64/include/clang/Frontend/CompilerInvocation.h:157:15: note: candidate expects 3 arguments, 4 provided
Signed-off-by: Hal Gentz <[email protected]>
Reviewed-by: Aaron Watry <[email protected]>
|
|
|
|
| |
Reviewed-by: Timothy Arceri <[email protected]>
|
|
|
|
|
|
| |
Noticed while looking at other OSMesa bugs.
Reviewed-by: Timothy Arceri <[email protected]>
|
|
|
|
|
|
|
|
| |
Given that we occasionally touch this code and probably nobody really
wants to think about it, introduce a minimal test so that we know we
haven't completely broken OSMesa.
Reviewed-by: Timothy Arceri <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
| |
This is ported from the fragment shader code.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
| |
This is mostly ported from the fragment shader code.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
| |
This just adds the CS invocations counter.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This adds the dispatch code. It creates a job for the number
of blocks in the grid, and dispatches them to the threadpool
implementation. The threadpool then calls the JIT code to
execute the coroutines.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
This creates the coroutine execution environment and the
main compute shaders that get executed inside it.
Each compute shader block is executed in it's own coroutine
execution shader, which each "thread" being a coroutine executed
inside it in sequence.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
| |
This doesn't actually build any of the shaders yet, but just
builds up the framework necessary to start building the shaders
and variants.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
| |
Compute doesn't share dirty state with the fragment pipeline
so create a separate path for it.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
| |
This is mostly a port of the fragment shader framework
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
| |
This adds the jit interface for compute shaders, it's based
on the fragment shader one.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
| |
These mirror the fragment shader structs, this is just a framework.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
| |
The compute shader will need it's own context like the frag shader
has, this just introduces the framework struct and allocates/frees
for it in the right places.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
| |
When the code is executing an hits a barrier, it will suspend
the coroutine and return control to the coroutine dispatcher.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
In order to efficiently run a number of compute blocks, use
a threadpool that just allows for jobs with unique sequential
ids to be dispatched.
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
| |
In order to share the texture/image/sampler code with compute
shaders we need to reorg them to be at the front of context
same as draw does for vs/gs sharing.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
| |
coroutines require a proper pass manager, so add the passes
to the correct places
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
| |
These wrap the coroutine intrinsics and also add some higher
level wrappers around coroutine begin, end and suspend procedures
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
| |
This allows the counter value to be forced to a certain value
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
We were only handling the modifiers case and not counting the number of
planes in actual planar images.
Fixes Piglit's ext_image_dma_buf_import-export.
Fixes: fc12fd05f56 ("iris: Implement pipe_screen::resource_get_param")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111509
Reviewed-by: Jordan Justen <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Each BSD has slightly different sysctl for retrieving per-CPU times.
FreeBSD returns long while NetBSD returns uint64_t. On OpenBSD return
type differs between summation and per-CPU times. DragonFly is
compatible with FreeBSD.
Signed-off-by: Jan Beich <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Based on the vc4 implementation.
Fixes Android RenderEngine::flush() routine:
android.googlesource.com/platform/frameworks/native/+/refs/tags/android-o-mr1-iot-release-smart-clock-fcs/services/surfaceflinger/RenderEngine/RenderEngine.cpp#225
Signed-off-by: Roman Stratiienko <[email protected]>
Reviewed-by: Qiang Yu <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Try more aggressive approach with cloning uniform and coord loads.
Uniform load can be inserted into any instruction, so let's do that. ARM site
claim that penalty for cache miss is one clock, so we don't lose anything if
we merge it into instruction that uses the result. As side effect we can also
pipeline it and thus decrease reg pressure.
Do the same for varyings that hold texture coords, but for different reason:
looks like there's a special path for coords that increases precision if
varying that holds it is pipelined. If we don't pipeline it and load coords
from a register its precision is fp16 and thus only 10 bits which is not enough
to accurately sample textures of size 1024 or larger.
Since instruction can hold only one uniform load and one varying load,
node_to_instr now creates a move using helper introduced in previous commit if
slot is already taken. As side effect of this change we can also try to
pipeline texture loads and create a move if attempt fails.
Reviewed-by: Erico Nunes <[email protected]>
Signed-off-by: Vasily Khoruzhick <[email protected]>
|
|
|
|
|
|
|
|
|
| |
It can load value from varying directly as well. Also load_regs is the
only op that has a source, so add src_num field to load node and set it
accordingly.
Reviewed-by: Erico Nunes <[email protected]>
Signed-off-by: Vasily Khoruzhick <[email protected]>
|
|
|
|
|
|
|
| |
Introduce common helper for creating movs to avoid code duplication
Reviewed-by: Erico Nunes <[email protected]>
Signed-off-by: Vasily Khoruzhick <[email protected]>
|