| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
| |
In the vbuf_render::set_primitive() functions.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Culling tris with zero area seems like a great idea, but apparently with
fill mode line (and point) we're supposed to draw them, at least some tests
for some other state tracker complained otherwise.
Such tris also always seem to be back facing (not sure if this can be
inferred from anything, since in a mathematical sense it cannot really be
determined), so make sure to account for this when filling in the face
information.
(For solid tris, this is of course unnecessary, drivers will throw the tris
away later in any case.)
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With GALLIVM_DEBUG=perf set, output the relevant stats for shader cache usage
whenever we have to evict shader variants.
Also add some output when shaders are deleted (but not with the perf setting
to keep this one less noisy).
While here, also don't delete that many shaders when we have to evict. For fs,
there's potentially some cost if we have to evict due to the required flush,
however certainly shader recompiles have a high cost too so I don't think
evicting one quarter of the cache size makes sense (and, if we're evicting
based on IR count, we probably typically evict only very few or just one
shader too). For vs, I'm not sure it even makes sense to evict more than
one shader at a time, but keep the logic the same for now.
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We're not particularly concerned with memory usage, if the tradeoff is
shader recompiles. And it's common for apps to have a lot of shaders
nowadays (and, since our shaders include a LOT of context state of course
we may create quite a bit more shaders even).
So quadruple the amount of shaders draw will cache (from 128 to 512).
For llvmpipe (fs shaders) quadruple the number of instructions, keep the
number of variants the same for now (only with very simple, non-texturing
shaders the variant limit could really be reached), and simplify the
definition, it's probably easier to just have one different definition
per branch...
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
| |
Trivial.
|
|
|
|
| |
Reviewed-by: Charmaine Lee <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It could only handle indices 0/1, otherwise what happened was bad (accessing
array out of bounds, no crash but kind of random). This is enough for the gl
state tracker (primary/secondary color) but not enough for some other state
trackers (d3d9 has no limits on the number of color interpolants).
The complexity with color semantics are all due to the front/back mapping (2
outputs in the vs map to one input in the fs) so this isn't extended to
indices > 1 - d3d9 has no use for back colors, therefore this isn't needed and
still only 2 back colors can be handled correctly.
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
| |
We shouldn't use the wide line stage if the line width is 1.
This check isn't strictly needed because all drivers are (now)
specifying a line wide threshold of at least 1.0 pixels, but
let's play it safe.
Reviewed-by: Charmaine Lee <[email protected]>
|
|
|
|
| |
Trivial.
|
|
|
|
|
|
|
|
|
| |
Simple search for a backslash followed by two newlines.
If one of the newlines were to be removed, this would cause issues, so
let's just remove these trailing backslashes.
Signed-off-by: Eric Engestrom <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
pipe_draw_info::indexed is replaced with index_size. index_size == 0 means
non-indexed.
Instead of pipe_index_buffer::offset, pipe_draw_info::start is used.
For indexed indirect draws, pipe_draw_info::start is added to the indirect
start. This is the only case when "start" affects indirect draws.
pipe_draw_info::index is a union. Use either index::resource or
index::user depending on the value of pipe_draw_info::has_user_indices.
v2: fixes for nine, svga
|
| |
|
|
|
|
| |
Remove trailing whitespace, fix formatting, etc. Trivial.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes the following Clang warning.
draw/draw_pipe_wide_line.c:48:38: warning: unused function 'wideline_stage' [-Wunused-function]
static inline struct wideline_stage *wideline_stage( struct draw_stage *stage )
^
1 warning generated.
v2: - remove commented code (Roland Scheidegger)
v3: - remove half_line_width in the struct
Signed-off-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes the following Clang warning.
draw/draw_pipe_vbuf.c:102:1: warning: unused function 'overflow' [-Wunused-function]
overflow( void *map, void *ptr, unsigned bytes, unsigned bufsz )
^
1 warning generated.
Signed-off-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
| |
pointed out by clang (stored value never read)
|
|
|
|
|
|
| |
and some s/uint/enum pipe_shader_type/
Reviewed-by: Edward O'Callaghan <[email protected]>
|
|
|
|
| |
Reviewed-by: Edward O'Callaghan <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts changes 903eb09b5fb78d47d0f8a4bdf826a113ca2aff40..1a0aa468f354f0ee94dd383cd40ae915584624aa:
Tobias Droste (5):
configure.ac: Rename MESA_LLVM to FOUND_LLVM
configure.ac: Only set LLVM_LIBS if LLVM is used
configure.ac: Only define HAVE_LLVM if LLVM is used
configure.ac: Set and use HAVE_GALLIUM_LLVM define
configure.ac: Don't check LLVM version in gallium_require_llvm
They break scons build, and I'm not convinced this is the right fix. In
particular changing HAVE_LLVM in the C code is something I'd rather
avoid no matter what. So it's better to discuss without the pressure of
broken builds.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Gallium code used HAVE_LLVM to check if it needs to compile code for
LLVM in header and source files.
With the new logic HAVE_LLVM is always set. Use extra define to figure
out if LLVM is used.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99010
Signed-off-by: Tobias Droste <[email protected]>
|
|
|
|
|
|
| |
It's redundant with the source modifier.
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that there's some SoA fetch which never falls back, we should always get
results which are better or at least not worse (something like rgba32f will
stay the same).
For cases which get way better, think something like R16_UNORM with 8-wide
vectors: this was 8 sign-extend fetches, 8 cvt, 8 muls, followed by
a couple of shuffles to stitch things together (if it is smart enough,
6 unpacks) and then a (8-wide) transpose (not sure if llvm could even
optimize the shuffles + transpose, since the 16bit values were actually
sign-extended to 128bit before being cast to a float vec, so that would be
another 8 unpacks). Now that is just 8 fetches (directly inserted into
vector, albeit there's one 128bit insert needed), 1 cvt, 1 mul.
v2: ditch the old AoS code instead of just disabling it.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
By using a dst_type in the the gather interface, gather has some more
knowledge about how values should be fetched.
E.g. if this is a 3x32bit fetch and dst_type is 4x32bit vector gather
will no longer do a ZExt with a 96bit scalar value to 128bit, but
just fetch the 96bit as 3x32bit vector (this is still going to be
2 loads of course, but the loads can be done directly to simd vector
that way).
Also, we can now do some try to use the right int/float type. This should
make no difference really since there's typically no domain transition
penalties for such simd loads, however it actually makes a difference
since llvm will use different shuffle lowering afterwards so the caller
can use this to trick llvm into using sane shuffle afterwards (and yes
llvm is really stupid there - nothing against using the shuffle
instruction from the correct domain, but not at the cost of doing 3 times
more shuffles, the case which actually matters is refusal to use shufps
for integer values).
Also do some attempt to avoid things which look great on paper but llvm
doesn't really handle (e.g. fetching 3-element 8 bit and 16 bit vectors
which is simply disastrous - I suspect type legalizer is to blame trying
to extend these vectors to 128bit types somehow, so fetching these with
scalars like before which is suboptimal due to the ZExt).
Remove the ability for truncation (no point, this is gather, not conversion)
as it is complex enough already.
While here also implement not just the float, but also the 64bit avx2
gathers (disabled though since based on the theoretical numbers the benefit
just isn't there at all until Skylake at least).
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It turns out that noone actually cares if the address computations overflow,
be it the stride mul or the offset adds.
Wrap around seems to be explicitly permitted even by some other API (which
is a _very_ surprising result, as these overflow computations were added just
for that and made some tests pass at that time - I suspect some later fixes
fixed the actual root cause...). So the requirements in that other api were
actually sane there all along after all...
Still need to make sure the computed buffer size needed is valid, of course.
This ditches the shiny new widening mul from these codepaths, ah well...
And now that I really understand this, change the fishy min limiting
indices to what it really should have done. Which is simply to prevent
fetching more values than valid for the last loop iteration. (This makes
the code path in the loop minimally more complex for the non-indexed case
as we have to skip the optimization combining two adds. I think it should
be safe to skip this actually there, but I don't care much about this
especially since skipping that optimization actually makes the code easier
to read elsewhere.)
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Don't keep the ofbit. This is just a minor simplification, just adjust
the buffer size so that there will always be an overflow if buffers aren't
valid to fetch from.
Also, get rid of control flow from the instanced path too. Not worried about
performance, but it's simpler and keeps the code more similar to ordinary
fetch.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The code for elts and linear paths was nearly 100% identical by now - with
the elts path simply having some additional gather for the elements in the
main loop (with some additional small differences before the main loop).
Hence nuke the separate functions and decide this at jit shader execution
time (simply based on the presence of the elts pointer).
Some analysis shows that the generated vs jit functions seem to be just very
minimally more complex than the former elts functions, and almost none of the
additional complexity is in the main loop (basically just the branch logic
for the branch fetching the actual indices).
Compared to linear, the codesize of the function is of course a bit larger,
however the actual executed code in the main loop appears to be near 100%
identical (the additional code looking up indices is skipped as expected).
So, I would not expect a (meaningful) performance difference with the
generated code, neither with elts nor linear, this does however roughly
half the compilation time (the compiled shaders should also use only half
the memory of course).
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
| |
This is a bit simpler. Mostly to make it easier to unify the paths later...
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was kind of strange, since it replaced indices which were only
overflowing due to bias with MAX_UINT. This would cause an overflow later
in the shader, except if stride was 0, however the vertex id would be
essentially random then (-1 + eltBias). No test cared about it, though.
So, drop this and just use ordinary int arithmetic wraparound as usual.
This is much simpler to understand and the results are "more correct" or
at least more consistent (vertex id as well as actual fetch results just
correspond to wrapped around arithmetic).
There's only one catch, it is now possible to hit the cache initialization
value also with ushort and ubyte elts path (this wouldn't be an issue if
we'd simply handle the eltBias itself later in the shader). Hence, we need
to make sure the cache logic doesn't think this element has already been
emitted when it has not (I believe some seriously bad things could happen
otherwise). So, borrow the logic which handled this from the uint case, but
not before fixing it up...
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
vsplit_get_base_idx explicitly returned idx 0 and set the ofbit
in case of overflow. We'd then check the ofbit and use idx 0 instead of
looking it up. This was necessary because DRAW_GET_IDX used to return
DRAW_MAX_FETCH_IDX and not 0 in case of overflows.
However, this is all unnecessary, we can just let DRAW_GET_IDX return 0
in case of overflow. In fact before bbd1e60198548a12be3405fc32dd39a87e8968ab
the code already did that, not sure why this particular bit was changed
(might have been one half of an attempt to get these indices to actual draw
shader execution - in fact I think this would make things less awkward, it
would require moving the eltBias handling to the shader as well).
Note there's other callers of DRAW_GET_IDX - those code paths however
explicitly do not handle index buffer overflows, therefore the overflow
value doesn't matter for them.
Also do some trivial simplification - for (unsigned) a + b, checking res < a
is sufficient for overflow detection, we don't need to check for res < b too
(similar for signed).
And an index buffer overflow check looked bogus - eltMax is the number of
elements in the index buffer, not the maximum element which can be fetched.
(Drop the start check against the idx buffer though, this is already covered
by end check and end < start).
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
lp_build_any_true_range is just what we need, though it will only produce
optimal code with sse41 (ptest + set) - but even without it on 64bit x86
the code is still better (1 unpack, 2 movq + or + set), on 32bit x86 it's
going to be roughly the same as before.
While here also make it a "real" 8bit boolean - cuts one instruction but
more importantly similar to ordinary booleans.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of doing all the math with scalars, use vectors. This means the
overflow math needs to be done manually, albeit that's only really
problematic for the stride/index mul, the rest has been pretty much
moved outside the shader loop (albeit the mul could actually be optimized
away too), where things are still scalar.
To eliminate control flow in the main shader loop fetch, provide fake
buffers (so index 0 is always valid to fetch).
Still uses aos fetch though in the end - mostly because some more code
would be needed to handle unaligned fetches in that path, and because for
most formats it won't make a difference anyway (we generate some truly
horrendous code for things like R16G16_something for instance).
Instanced fetch however stays roughly the same as before, except that
no longer the same element is fetched multiple times (I've seen a reduction
of ~3 times in main shader loop size due to llvm not recognizing it's all
the same fetch, since it would have been possible some of the fetches
getting replaced with zeros in case vector size exceeds remaining fetch
count - the values of such fetches don't matter at all though).
Also, for elts gathering, use vectorized code as well.
The generated shaders are smaller and faster to compile (not entirely sure
about execution speed, but generally unless there's just single vertices
to handle I would expect it to be faster - there's more opportunities
for future improvements by using soa fetch).
v3: skip the fake index buffer, not needed due to the jit code never seeing
the real index buffer in the first place.
Fix a bug with mask expansion (needs SExt, not ZExt).
Also, be really really careful to keep the behavior the same, even in cases
where it looks wrong, and add comments why the code is doing the seemingly
wrong stuff... Fortunately it's not actually more complex in the end...
Also change function order slightly just to make the diff more readable.
No piglit change. Passes some internal testing with another api too...
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
v2:
Fix adding parameter attributes with LLVM < 4.0.
v3:
Fix typo.
Fix parameter index.
Add a gallivm enum for function attributes.
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
| |
Trivial. There's some regressions internally, related to overflow
behavior. I'll have to look at it at another time, some interactions
with vsplit/vcache are actually mind-blowing.
This reverts commit 3fa10ffb496cc4e6d1003891cf0381bb5bec2a74.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of doing all the math with scalars, use vectors. This means the
overflow math needs to be done manually, albeit that's only really
problematic for the stride/index mul, the rest has been pretty much
moved outside the shader loop (albeit the mul could actually be optimized
away too), where things are still scalar. Because llvm is complete fail
with the zero-extend widening mul, roll our own even...
To eliminate control flow in the main shader loop fetch, provide fake
buffers (so index 0 is always valid to fetch).
Still uses aos fetch though in the end - mostly because some more code
would be needed to handle unaligned fetches in that path, and because for
most formats it won't make a difference anyway (we generate some truly
horrendous code for things like R16G16_something for instance).
Instanced fetch however stays roughly the same as before, except that
no longer the same element is fetched multiple times (I've seen a reduction
of ~3 times in main shader loop size due to apparently llvm not being able
to deduce it's really all the same with a couple instanced elements).
Also, for elts gathering, use vectorized code as well - provide a fake
elt buffer if there's no valid one bound.
The generated shaders are smaller and faster to compile (not entirely sure
about execution speed, but generally unless there's just single vertices
to handle I would expect it to be faster - there's more opportunities
for future improvements by using soa fetch).
No piglit change.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previous fixes were incomplete - some code still iterated through the number
of elements provided by velem layout instead of the number stored in the key
(which is the same as the number defined by the vs). And also actually
accessed the elements from the layout directly instead of those in the key.
This mismatch could still cause crashes.
(Besides, it is a very good idea to only use data stored in the key anyway.)
v2: move null format check, remove now unnecessary function parameter,
some minor prettify
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The per-element fetch has quite some calculations which are constant,
these can be moved outside both the per-element as well as the main
shader loop (llvm can figure out it's constant mostly on its own, however
this can have a significant compile time cost).
Similarly, it looks easier swapping the fetch loops (outer loop per attrib,
inner loop filling up the per vertex elements - this way the aos->soa
conversion also can be done per attrib and not just at the end though again
this doesn't really make much of a difference in the generated code). (This
would also make it possible to vectorize the calculations leading to the
fetches.)
There's also some minimal change simplifying the overflow math slightly.
All in all, the generated code seems to look slightly simpler (depending
on the actual vs), but more importantly I've seen a significant reduction
in compile times for some vs (albeit with old (3.3) llvm version, and the
time reduction is only really for the optimizations run on the IR).
v2: adapt to other draw change.
No changes with piglit.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previous attempts to zero initialize all inputs were not really optimal
(though no performance impact was measurable). In fact this is not really
necessary, since we know the max number of inputs used.
Instead, just generate fetch for up to max inputs used by the shader,
directly replacing inputs for which there was no vertex element by zero.
This also cleans up key generation, which previously would have stored
some garbage for these elements.
And also drop the assertion which indicates such bogus usage by a
debug_printf (the whole point of initializing the undefined inputs was to
make this case safe to handle).
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This should make the code more robust if a shader tries to use inputs which
aren't defined by the vertex element layout (which usually shouldn't happen).
No piglit change.
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Kai Wasserbäch <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
v1 → v2:
- Fixed indentation (noted by Brian Paul)
- Removed second assert from nouveau's switch statements (suggested by
Brian Paul)
Signed-off-by: Kai Wasserbäch <[email protected]>
Reviewed-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
| |
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
| |
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
The way the HW works doesn't really fit with having
two semantics for this.
The GLSL compiler emits 2 vec4s and two properties,
this makes draw use those instead of CULLDIST semantics.
Reviewed-by: Roland Scheidegger <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
| |
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
| |
This will be used later to restart barriered execution
threads in compute, for now we just want to change the API.
Acked-by: Roland Scheidegger <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
For compute support some of the system values are .xyz types,
so move to using a vector instead of a single channel.
[airlied: squash swizzle fix from compute series].
Reviewed-by: Brian Paul <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was definitely bugs here mixing up the PIPE_ and TGSI_ defines,
hopefully they didn't cause any problems, since mostly it was special
cases for GEOMETRY.
This clarifies at shader machine create what type of shader this
machine will execute. This is needed also for compute shaders where
we don't want to allocate inputs/outputs.
Reviewed-by: Brian Paul <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
|
|
|
| |
It gets annoying that changing the tgsi exec rebuilds the state
tracker unnecessarily. Putting this include into draw_gs.h which
uses it causes a lot less rebuilds.
Reviewed-by: Brian Paul <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This isn't currently that easy to expand, so fix it up
before expanding it later to include dynamic samplers.
[airlied: use some local variables (Roland)]
Reviewed-by: Roland Scheidegger <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
|
|
| |
Like the image code, but for shader buffers this time.
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|