| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We've seen some cases where performance can hurt quite a bit.
Technically, the more simple the function the more overhead there is
for using a function for this (and the less benefits this provides).
Hence don't do this if we expect the generated code to be simple.
There's an even more important reason why this hurts performance,
which is shaders reusing the same unit with some of the same inputs,
as llvm cannot figure out the calculations are the same if they
are performned in the function (even just reusing the same unit without
any input being the same provides such optimization opportunities though
not very much). This is something which would need to be handled by IPO
passes however.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When nvc0_push_vbo calls nouveau_scratch_done it does not mean
scratch buffers can be freed immediately. It means "when hardware
advances to this place in the command stream the scratch buffers
can be freed".
To fix it, just postpone scratch runout destruction after current
fence is signalled.
The bug existed for a very long time. Nobody noticed, because
"scratch runout" code path is rarely executed.
Fixes hang at the very beginning of first mission in "Serious Sam 3"
on nve7/gk107. It manifested as:
nouveau E[ PFIFO][0000:01:00.0] read fault at 0x000a9e0000 [PTE] from GR/GPC0/PE_2 on channel 0x007f853000 [Sam3[17056]]
Cc: "10.4 10.5" <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
fails v2
v2:
- Don't use _errs map
Cc: 10.5 10.4 <[email protected]>
Reviewed-by: Francisco Jerez <[email protected]>
|
|
|
|
|
|
|
|
|
| |
types v2
v2:
- Fix typo
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This will help encoding VUI into the bitstream
v2: make backward compatible
Signed-off-by: Leo Liu <[email protected]>
Reviewed-by: Christian König <[email protected]>
|
|
|
|
|
|
|
| |
The framerate will be used for video usability info support by VCE driver
Signed-off-by: Leo Liu <[email protected]>
Reviewed-by: Christian König <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Just announce support for 4 components.
While here also increase the max/min texel offsets (the limit is completely
artificial, was chosen because that's what other hardware did, however there's
other drivers using larger limits).
Over a thousand little piglits skip->pass.
v2: update docs/GL3.txt
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is quite trivial, essentially just follow all the same code you'd
use with linear min/mag (and no mip) filter, then just skip the filtering
after looking up the texels in favor of direct assignment of the right channel
to the result. (This is though not true for the multi-offset version if we'd
want to support it - for this would probably need to do something along the
lines of 4x nearest sampling due to the necessity of doing coord wrapping
individually per texel.)
Supports multi-channel formats.
From the SM5 gather cap bit, should support non-constant offsets, plus shadow
comparisons (the former untested), but not component selection (should be
easy to implement but all this stuff is not really exposable anyway for now).
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
| |
Luckily thanks to the revamped interface this is a lot less work now...
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This has got a bit out of control with more and more parameters added.
Worse, whenever something in there changes all callees have to be updated
for that, even though they don't really do much with any parameter in there
except pass it on to the actual sampling function.
Hence simply put almost everything into a struct. Also instead of relying
on some arguments being NULL, be explicit and set this in a key (which is
just reused for function generation for simplicity). (The code still relies
on them being NULL in the end for now.)
Technically there is a minimal functional change here for shadow sampling:
if shadow sampling is done is now determined explicitly by the texture
function (either sample_c or the gl-style tex func inherit this from target)
instead of the static texture state. These two should always match, however.
Otherwise, it should generate all the same code.
Reviewed-by: Jose Fonseca <[email protected]>
|
| |
|
|
|
|
| |
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
| |
This cleans up more instructions generated by uniform array indexing
multiplies.
total instructions in shared programs: 39989 -> 39961 (-0.07%)
instructions in affected programs: 896 -> 868 (-3.12%)
|
|
|
|
|
|
|
|
|
|
|
|
| |
This cleans up some pointless operations generated by the in-driver mul24
lowering (commonly generated by making a vec4 index for a matrix in a
uniform array).
I could fill in other operations, but pretty much anything else ought to
be getting handled at the NIR level, I think.
total uniforms in shared programs: 13423 -> 13421 (-0.01%)
uniforms in affected programs: 346 -> 344 (-0.58%)
|
|
|
|
|
|
|
|
|
|
| |
The hardware just uses the low 24 lines, saving us an AND to drop the high
bits.
total uniforms in shared programs: 13433 -> 13423 (-0.07%)
uniforms in affected programs: 356 -> 346 (-2.81%)
total instructions in shared programs: 40003 -> 39989 (-0.03%)
instructions in affected programs: 910 -> 896 (-1.54%)
|
|
|
|
|
| |
The hardware uses the low 24 bits in integer multiplies, so we can have
fewer high bits (and so probably drop them more frequently).
|
|
|
|
|
|
|
|
|
|
| |
Fixes a crash in genymotion with several threads compiling shaders
concurrently.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=89746
Cc: 10.5 <[email protected]>
Reviewed-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
| |
This does not (yet) support different coordinate origins, so the tests
still fail due to fbo flipping.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This appears to need the A2XX version of the point list, so select it at
draw time if necessary.
Experimentally, always using the A2XX version causes hangs when PSIZE
isn't actually emitted.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
The division is probably a holdover from the days when the fixed point
inline functions generated by headergen were broken.
Also reduce the maximum point size to 4092 (vs 4096), which is what the
blob does.
Cc: "10.4 10.5" <[email protected]>
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The SZ2 field contains the layer size of a lower miplevel. It only
contains 4 bits, which limits the maximum layer size it can describe. In
situations where the next miplevel would be too big, the hardware
appears to keep minifying the size until it hits one of that size.
Unfortunately the hardware's ideas about sizes can differ from
freedreno's which can still lead to issues. Minimize those by stopping
to minify as soon as possible.
Signed-off-by: Ilia Mirkin <[email protected]>
Cc: "10.4 10.5" <[email protected]>
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These functions looked quite complicated, even though what they actually did
was trivial (ever since we dropped swizzled rendering). Also drop lookup of
format block per bytes done for each block, and do it once per scene instead.
This improves everybody's favorite "benchmark" by 3% or so, though
lp_rast_shade_quads_all() which calls this shows up still quite high for a
function which does little more than call the jit function.
(This would most likely be much better handled by the jit function itself,
the strides are passed through anyway already, though for being able to
handle layers it would definitely add some complexity.)
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
When using the texel fetch functions rather than ordinary texturing,
the arguments are all int vecs instead of float vecs, not to mention
the actual function would look completely different. Hence this must
be included in the texture function name (which serves as the key)
otherwise things crash badly when a shader accesses the same texture
and sampler unit with both txf/ld and ordinary texturing instructions
with otherwise matching keys.
|
|
|
|
|
| |
Cc: "10.4 10.5" <[email protected]>
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Multiply operations can have a post-factor on them, which other ops
don't support. Only perform the peephole optimizations when there is no
post-factor involved.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=89758
Cc: "10.4 10.5" <[email protected]>
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Jan Vesely <[email protected]>
Reviewed-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are issues with inlining everything, most notably llvm will use much
more memory (and be slower) when compiling. Ideally we'd probably use
functions for shader functions too but texture sampling usually is responsible
for quite some IR (it can easily reach 80% of total IR instructions) so this
seems like a good start.
This still generates a different function for all different combinations just
like before, however it is possible llvm is missing some optimization
opportunities - it is believed though such opportunities should be somewhat
rare, but at least for now it can still be switched off (at compile time only).
It should probably make compiled code also smaller because the same function
should be used for different variants in the same module (so for the
opaque/partial or linear/elts variants).
No piglit change (though it does indeed speed up unrealistic tests like
fp-indirections2 by a factor of 30 or so).
Has a small negative performance impact in openarena - I suspect this could
be fixed by running some IPO passes (despite the private linkage, llvm right
now does NO optimization at all wrt anything going past the call, even if
there's just one caller - so things like values stored before the call and then
always written by the function etc. will not be optimized away, nor will dead
arguments (which we mostly shouldn't have) be eliminated, always constant
arguments promoted etc.).
v2: use proper return values instead of pointer function arguments.
llvm supports aggregate return values, which do wonders here eliminating
unnecessary stack variables - everything in fact will be returned in registers
even without any IPO optimizations. It makes the code simpler too.
With this I could not measure a peformance impact in openarena any longer
(though since there's still no constant value propagation etc. into the tex
functions this does not mean it couldn't have a negative impact elsewhere).
v3: fix some minor issues suggested by Jose, and do disassembly (and the
profiling) without hacks.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The callbacks used for getting the dynamic texture/sampler state were using
the jit_context from the generated jit function. This works just fine, however
that way it's impossible to generate separate functions for texture sampling,
as will be done in the next commit. Hence, pass this pointer through all
interfaces so it can be passed to a separate function (technically, it would
probably be possible to extract this pointer from the current function instead,
but this feels hacky and would probably require some more hacks if we'd use
real functions instead of inlining all shader functions at some point).
There should be no difference in the generated code for now.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
| |
The data in memory is in big endian format and needs to be converted
into CPU byte order. So the patch actually reversed what needs to be done.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
The CUBE_ARRAY case uses r[4]. Make sure that the stack variable is
there.
Noticed by Coverity.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
| |
Does not appear to be used in tree. Coverity spotted some errors in the
bitmask stuff, but the whole thing appears to be unused.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
| |
Spotted by Coverity.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Make use of the builtin ffs macros and split out ffsll
to a seperate block. Needed for at least OpenBSD which
does not have ffsll in libc.
Signed-off-by: Jonathan Gray <[email protected]>
Reviewed-by: Emil Velikov <[email protected]>
|
|
|
|
|
| |
This has been useful once again while trying to debug stride issues
between render targets and texturing.
|
|
|
|
| |
The problem I'd seen before seems to be gone.
|
|
|
|
|
| |
Fixes some non-power-of-two texture rendering when I force ARGB8888 to
raster.
|
|
|
|
|
|
| |
16 / cpp happens to be the same as utile_w on the only raster format
supported (4 bytes per pixel), but simulator/hw source code generally
talks in terms of utiles.
|
|
|
|
| |
The enum compared to was 0, so it worked out, but it sure looked wrong.
|
|
|
|
|
|
| |
I'm experimenting with a workaround for raster texture misrendering on
hardware, and this lets me look at the format chosen when computing
strides.
|
|
|
|
|
|
| |
They don't do anything special for us, but I've been told by kernel
maintainers that relying on dumb for my acceleration-capable buffers
is not OK.
|
|
|
|
|
|
| |
I'd like to compile as much of the device-specific code as possible
when building for simulator, and using if (using_simulator) instead of
ifdefs helps.
|
|
|
|
| |
I keep rewriting these.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 20346808cf4f1ee4f320afaf18f94043fb146f2e.
The conversion is actually done since these are the *B macro variants
and no vtx format is supplied, which makes them go through the translate
module.
This restores the following piglit tests to passing:
draw-vertices user
gl-2.0-vertexattribpointer
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The storage size for local kernel args can be queried before the
arguments are set by using the CL_KERNEL_LOCAL_MEM_SIZE param
of clGetKernelWorkGroupInfo().
The spec says that if local kernel arguments have not been specified,
then we should assume their size is 0.
v2:
- Implement using c++11 member initialization.
Reviewed-by: Jan Vesely <[email protected]>
Reviewed-by: Francisco Jerez <[email protected]>
Cc: 10.5 10.4 <[email protected]>
|
|
|
|
| |
This fixes the build since llvm r232885 and also simplifies the code.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The pipe's get_vendor method returns something more akin to a driver
vendor string in most cases, instead of the actual device vendor. Use
get_device_vendor instead, which was introduced specifically for this
purpose.
Signed-off-by: Giuseppe Bilotta <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]>
Reviewed-by: Francisco Jerez <[email protected]>
|
|
|
|
|
|
|
|
|
| |
The only hackish ones are llvmpipe and softpipe, which currently return
the same string as for get_vendor(), while ideally they should return
the CPU vendor.
Signed-off-by: Giuseppe Bilotta <[email protected]>
Reviewed-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This will be needed by Clover to return the correct information
to CL_DEVICE_VENDOR info queries.
Signed-off-by: Giuseppe Bilotta <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Giuseppe Bilotta <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]>
|