| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
Fixes: 31fde8fa (nv50/ir: flip shl(add, imm) into add(shl, imm))
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
| |
It was accidentally using the store opcode.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
| |
This is enough for the plain TGSI BARRIER implementation.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we have a d24x8 format, there is no stencil. Therefore, we can always
clear these bits too, which means this will be some kind of memset rather
than read-modify-write.
This is good for some 7% increase or so in gears with huge window size -
seems to have a bigger effect if things aren't in caches. Of course, any
real app won't spend nearly as much time comparatively in clearing
depth buffer in the first place, so the speedup will be much lower.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
| |
Signed-off-by: Christian König <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Apparently, nobody has combined stippling with a fragment shader
containing immediates in almost five years...
Fixes a bug in Kodi with radeonsi reported by Christian König.
Cc: "11.0 11.1" <[email protected]>
Tested-by: Christian König <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The whole point of inlining sources is to reduce loads. We can end up in
a situation where one value is used a lot of times, and one value is
used only once per instruction. The once-per-instruction one is the one
that should get inlined, but with the previous algorithm, it was given
no preference.
This flips things around to preferring putting less-referenced values
into src1 which increases the likelihood of them being inlined.
While we're at it, adjust the heuristic to not treat 0 as an immediate,
as well as (effectively) check for situations where LIMMs can't be
loaded. All this yields improvements on nvc0:
total instructions in shared programs : 6261157 -> 6255985 (-0.08%)
total gprs used in shared programs : 945082 -> 943417 (-0.18%)
total local used in shared programs : 30372 -> 30288 (-0.28%)
total bytes used in shared programs : 50089256 -> 50047880 (-0.08%)
local gpr inst bytes
helped 21 822 3332 3332
hurt 0 278 565 565
And more importantly avoids generating really bad code with SSBOs, where
we end up checking a lot of different values (usually immediates) against
the length.
On nv50 we get comparable results, and even improve packing (bytes went
down more than instructions):
total instructions in shared programs : 6346564 -> 6341277 (-0.08%)
total gprs used in shared programs : 728719 -> 725131 (-0.49%)
total local used in shared programs : 3552 -> 3552 (0.00%)
total bytes used in shared programs : 43995688 -> 43932928 (-0.14%)
local gpr inst bytes
helped 0 1380 3252 3774
hurt 0 287 1710 1365
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
| |
Once we go past half of the "GPR" register file, it seems like we need
to run frag shader with smaller threadsize. (The vertex shader already
runs at TWO_QUADS, which is the minimum.)
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Some a4xx firmware doesn't implement the "PFD" (prefetch-disabled)
version of the CP_INDIRECT_BUFFER packet. So allow for PFD vs PFE per
generation. Switch a3xx and a4xx over to using prefetch-enabled version
(which is also what blob does.. it seems only on a2xx we cannot use
PFE).
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Emil Velikov <[email protected]>
|
|
|
|
|
|
| |
v2: minor cleanup
Signed-off-by: Christian König <[email protected]>
|
|
|
|
|
|
|
| |
And only call it from r600_invalidate_resource for buffer resources.
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
| |
Fixes crash in 4 EGL piglit tests with radeonsi.
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
|
|
|
|
|
|
|
| |
Say "LLVM" instead of "Compiler" for clarity.
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch fixes a bug when building a pack instruction.
For POWER (altivec), in case the destination is signed and the
src width is 32, we need to use vpkswss. The original code used vpkuwus,
which emits an unsigned result.
This fixes the following piglit tests on ppc64le:
- spec@arb_color_buffer_float@gl_rgba8-drawpixels
- shaders@glsl-fs-fogscale
I've also corrected some coding style issues in the function.
v2: Returned else statements to vmware style
Signed-off-by: Oded Gabbay <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
| |
Apparently the IPA op decided to stop working with offsets. Need to
figure out if we need to do an AL2P situation or something similar. For
now just turn it back off.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
| |
Spotted by Coverity.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Edward O'Callaghan <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
| |
We already deref it earlier. And these are all allocated on load.
Spotted by Coverity.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch fixes a classic "confuse the enemy" bug.
_mm_andnot_si128 (SSE) and vec_andc (VMX) do the same operation, but the
arguments are opposite.
_mm_andnot_si128 performs "r = (~a) & b" while
vec_andc performs "r = a & (~b)"
To make sure this error won't return in another place, I added a wrapper
function, vec_andnot_si128, in u_pwr8.h, which makes the swap inside.
Signed-off-by: Oded Gabbay <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
| |
In fad158a0 ("freedreno/ir3: array rework") the src # (n) shifted by
one, but missed updating delay-slot calc.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
| |
Detect arrays which don't conflict with each other and allow overlapping
register allocation.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It at least happens with some piglit tests, like
$piglit/bin/vp-address-01
VERT
DCL IN[0]
DCL IN[1]
DCL OUT[0], POSITION
DCL OUT[1], COLOR
DCL CONST[0..7]
DCL ADDR[0]
0: ARL ADDR[0].x, IN[1].xxxx
1: MOV_SAT OUT[1], CONST[ADDR[0].x-1]
2: DP4 OUT[0].x, CONST[4], IN[0]
3: DP4 OUT[0].y, CONST[5], IN[0]
4: DP4 OUT[0].z, CONST[6], IN[0]
5: DP4 OUT[0].w, CONST[7], IN[0]
6: END
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Seems like in certain cases, we cannot use c<a0.x+0> as the third src to
cat3 instructions. This may be slightly conservative, we may only have
this restriction when the first src is also const.
This fixes, for example, +24/-0 of the variable-indexing piglit tests.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
| |
Only user is freedreno, and after array-rework it can cope. Avoids
generating loads for a store.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
|
| |
If we handle separately the special case of eliminating output mov
(which includes keeps and various other cases where we don't have a
consuming instruction's src register to collapse things into), we
can simplify the logic.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
| |
Shuffle things slightly, passing instr-data to ra_name() to reduce the
number of places where we need to add support for array names.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
| |
Collapse two nested if's into one to reduce indent level.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
| |
This adds code that is basically the same as the code in umod, udiv and idiv.
However, unlike idiv we return -1.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The cleaning up was quite a performance hog (making pipe_resource_reference
the number two in profilers on the vertex path, and 3rd overall, with its
cousin pipe_reference_described not far behind) if there were lots
of tiny draw calls (ipers). Now the reason was really that it was blindly
calling this for all potential shader views (so 32 each for vs and gs) even
though the app never touched a single one which could have been fixed,
however I can't come up with a good reason why we refcount these. We've got
references, of course, in the sampler views, which should be quite sufficient
as we do all vertex and geometry shader execution fully synchronous.
(Calling prepare_shader_sampling for all draw calls even if there were no
changes looks quite suboptimal too, but generally we don't really expect vs/gs
shader sampling to be used much with llvmpipe, and there's even an early exit
if there aren't any views to avoid the "null loop" albeit it's now no longer
always trying to loop through all 32 slots. Maybe improve another time...).
Of course, if we manage to make vertex loads run asynchronously some day,
we need references again, but adding that back would be the least of the
problems...
Also only set LP_NEW_SAMPLER_VIEW for fragment sampler views. Nothing on the
vertex side depends on it (I suppose we'd really wanted a separate flag in
any case).
(Good for a 3% improvement or so in ipers under the right conditions.)
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was not really a leak per se, but we were referencing the textures for
longer than intended. If textures were set via llvmpipe_set_sampler_views()
(for fs) and then picked up by lp_setup_set_fragment_sampler_views(), they
were referenced in the setup state. However, the only way to unreference them
was by replacing them with another texture, and not when the texture slot
was replaced with a NULL sampler view. (They were then further also referenced
by the scene too which might have additional minor side effects as we limit
the memory size which is allowed to be referenced by a scene in a rather crude
way.) Only setup destruction (at context destruction time) then finally would
get rid of the references.
Fix this by noting the number of textures the last time, and unreference
things if the new view is NULL (avoiding having to unreference things
always up to PIPE_MAX_SHADER_SAMPLER_VIEWS which would also have worked).
Found by code inspection, no test...
v2: rename var
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reduces local memory usage in a lot of Metro 2033 Redux and a few KSP
shaders:
total local used in shared programs : 54116 -> 30372 (-43.88%)
Probably modest advantage to execution, but it's an imporant
prerequisite to dropping some of the TGSI optimizations done by the
state tracker.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously we were treating any indirect temp array usage to mean that
everything should end up in lmem. The MemoryOpt pass would clean a lot
of that up later, but in the meanwhile we would lose a lot of
opportunity for optimization.
This helps a lot of Metro 2033 Redux and a handful of KSP shaders:
total instructions in shared programs : 6288373 -> 6261517 (-0.43%)
total gprs used in shared programs : 944051 -> 945131 (0.11%)
total local used in shared programs : 54116 -> 54116 (0.00%)
A typical case is for register usage to double and for instructions to
halve. A future commit can also optimize local memory usage size to be
reduced with better packing.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Indirect constbuf indexing works by using very large offsets. However if
an indirect constbuf index load is const-propagated, it becomes a very
large const offset. Take that into account when legalizing the SSA by
moving the high parts of that offset into the file index. Also disallow
very large (or small) indices on most other instructions.
This fixes regressions in ubo_array_indexing/*-two-arrays piglit tests.
Fixes: abd326e81b (nv50/ir: propagate indirect loads into instructions)
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
I'm not sure about the consequences of this bug, but it's definitely
dangerous.
This applies to SI, CIK, VI.
Cc: 11.0 11.1 <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For those formats that support hw mipmap generation, use the
DXGenMips command. Otherwise fallback to the mipmap generation utility.
Tested with piglit, OpenGL apps (Heaven, Turbine, Cinebench)
v2: make sure the texture surface was created with the render target bind flag
set relocation flag to SVGA_RELOC_WRITE for the texture surface
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
| |
The actual increment of the num-generate-mipmap counter will be done
in a subsequent patch when hw generate mipmap is supported.
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds a new interface to support hardware mipmap generation.
PIPE_CAP_GENERATE_MIPMAP is added to allow a driver to specify
if this new interface is supported; if not supported, the state tracker will
fallback to mipmap generation by rendering/texturing.
v2: add PIPE_CAP_GENERATE_MIPMAP to the disabled section for all drivers
v3: add format to the generate_mipmap interface to allow mipmap generation
using a format other than the resource format
v4: fix return type of trace_context_generate_mipmap()
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
| |
The whole point of AMD_pinned_memory is that applications don't have to map
buffers via OpenGL - but they're still allowed to, so make sure we don't break
the link between buffer object and user memory unless explicitly instructed
to.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
| |
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
| |
This accomodates a streaming pattern where the discard flag is set when the
application wraps back to the beginning of the buffer.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
| |
It makes sense to re-use pipe->invalidate_resource for the purpose of
glInvalidateBufferData, but this function is already implemented in vc4
where it doesn't have the expected behavior. So add a capability flag
to indicate that the driver supports the expected behavior.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
| |
Some confusion between pb_buffer and radeon_bo as well as between
radeon_drm_winsys and radeon_winsys.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
And FACE becomes integer instead of float.
Reviewed-by: Edward O'Callaghan <[email protected]>
|