| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is for handling chained command buffers and secondary command
buffers. It doesn't handle the trace id for secondary command buffers
yet, but I don't think that is possible in general with just writes,
as we could call a secondary command buffer multiple times.
I think this is good enough for now, as the most useful case is the
chaining when we grow an IB.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Reviewed-by: Dave Airlie <[email protected]>
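A purely hypothetical sketch of the trace-id idea for chained IBs (none of these
names are the real radeonsi/radv ones, and the GPU write packet is replaced by a
plain store for illustration):

    #include <stdint.h>

    struct chained_cs {
       uint32_t trace_id;            /* last id emitted anywhere in the chain */
       volatile uint32_t *trace_va;  /* CPU-visible trace buffer the GPU writes */
    };

    /* After each call (and when chaining into a freshly grown IB), bump the id
     * and emit a GPU write of it into the shared trace buffer; on a hang,
     * *trace_va shows how far execution got across the whole chain. */
    static void
    emit_trace_id(struct chained_cs *cs)
    {
       cs->trace_id++;
       *cs->trace_va = cs->trace_id;  /* stands in for the real write packet */
    }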
|
|
|
|
|
|
| |
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Reviewed-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Reviewed-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
|
|
| |
v2: do it properly
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98238
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
| |
This has no effect because no code uses those members with ranged decls.
Tested-by: Edmondo Tommasina <[email protected]>
Acked-by: Edward O'Callaghan <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
| |
Tested-by: Edmondo Tommasina <[email protected]>
Acked-by: Edward O'Callaghan <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
| |
r600g doesn't set pipe_context::clear_buffer.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99303
Reviewed-by: Alex Deucher <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Generally we should do the transpose after conversion if the format has less than
32 bits per channel (if it has 32 bits, conversion is going to be a no-op
anyway...). This is obviously because there are fewer vectors to deal with.
Though the advantage for 16 bit formats isn't that big, and in fact with AVX
there isn't really any (as the 32bit unpacks can be done with 256bit, but
the smaller ones cannot, although that would change again with proper AVX2
support).
This only makes sense for the 2d cases, not the 1d ones. And to keep things easy,
only handle 1, 2 and 4 channels (rgbx is just fine).
For the rgba unorm8 format the backend conversion sums up to these instruction
totals (not counting the movs needed for SSE2 due to the 2-op syntax - generally
every 2 unpacks need an additional mov):
                      SSE2                AVX
transpose:            32 unpack           16 unpack
untwiddle:            0                   8 (128bit low/high permutes)
convert:              16 mul + 16 cvt     8 mul + 8 cvt
32->8bit:             12 pack             8 (128bit extract) + 12 pack
When doing transpose/untwiddle afterwards we get:
convert:              16 mul + 16 cvt     8 mul + 8 cvt
32->8bit:             12 pack             8 (128bit extract) + 12 pack
transpose/untwiddle:  12 unpack           12 unpack
So for SSE2, this drops 20 unpacks (total instruction count 76->56),
whereas for AVX it replaces the 16 256bit unpacks with 8 128bit ones
and drops the 8 lo/hi permutes (in total 60->48). (Albeit, to be fair,
the permutes could be dropped even when doing the transpose first -
they are extremely pointless - but we'd need to be able to tell
lp_build_conv to reorder the vectors; for AVX2 we're going to need to
be able to tell lp_build_conv about ordering in any case.)
(With different ordering going into conversion, it would be possible
to do 4 unpacks + 4 pshufbs instead of 12 unpacks, but that might not
be better, and not all cpus can do it. Proper AVX2 support should eliminate
the 8 128bit extracts, reduce these 12 packs to 6 and the 12 unpacks to 2
pshufb + 2 permq ideally (+ 2 final 128bit extracts).)
Reviewed-by: Jose Fonseca <[email protected]>
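As a worked check of the totals quoted above: with the transpose done first,
SSE2 is 32 + (16 + 16) + 12 = 76 instructions and AVX is
16 + 8 + (8 + 8) + (8 + 12) = 60; with conversion done first it is
(16 + 16) + 12 + 12 = 56 on SSE2 and (8 + 8) + (8 + 12) + 12 = 48 on AVX.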
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For rgbx formats, there is no point in doing alpha conversion again (and
with a different transpose even, so llvm can't eliminate it).
Albeit it looks like there are some minimal changes needed in the blend code
(found by code inspection, no test seemed to complain) if we do this -
the blend factors are already sanitized if we have no destination alpha,
however for src_alpha_saturate it looks like it still might make a
difference (note that we forced has_alpha to true before for some formats
and nothing complained, but this seems safer).
Reviewed-by: Jose Fonseca <[email protected]>
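For reference, GL's SRC_ALPHA_SATURATE blend factor is min(As, 1 - Ad) on the
color channels (and 1 on alpha), so a destination whose alpha reads back as an
implicit 1 collapses the factor to 0, while reading whatever happens to be stored
in the x channel generally does not - which is why this factor is the one case
the existing blend-factor sanitizing doesn't already cover.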
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
llvm has _huge_ problems trying to load things like <4 x i8> vectors and
stitching such loads together to form 128bit vectors. My understanding
of the problem is that the type legalizer tries to extend that to
really a <4 x i32> vector, and not a <16 x i8> vector with the 4 elements
first followed by padding, so the shuffles needed to then combine things
together are more or less impossible - you can in fact see the pmovzxd
llvm generates. Pre-4.0 llvm just gives up on it completely and does a 30+
pextrb/pinsrb sequence instead.
It looks like current llvm has fixed this behavior (my guess would be
due to better shuffle combination and load/shuffle folds), but we can
avoid this by just loading as <1 x i32> values, combining those and only
casting at the end. (I suspect it might also work if we padded the loaded
vectors immediately before shuffling them together, instead of directly
stitching 2 such vectors together pairwise before combining the pair.
But this _might_ lose the ability to load the values directly into
their right place in the vector with pinsrd.) But using 32bit values
is probably easier for llvm as it will never give it funny ideas about
what the vector should look like.
(This is possibly only a problem for 1x8bit formats, since 2x8bit will
end up fetching 64bit, hence only two vectors are stitched together,
not 4, but we use the same strategy anyway.)
Reviewed-by: Jose Fonseca <[email protected]>
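A minimal sketch of the load-as-i32 idea using the LLVM C API (this is not the
actual gallivm helper; the function name and the fixed four-texel shape are made
up for illustration):

    #include <llvm-c/Core.h>

    /* Fetch four packed rgba8 texels: load each one as a plain i32, insert it
     * into a <4 x i32> vector, and only bitcast to <16 x i8> at the end, so
     * llvm never sees <4 x i8> loads it would have to legalize. */
    static LLVMValueRef
    fetch_rgba8_x4(LLVMBuilderRef b, LLVMValueRef texel_ptr[4])
    {
       LLVMTypeRef i32 = LLVMInt32Type();
       LLVMValueRef vec = LLVMGetUndef(LLVMVectorType(i32, 4));

       for (unsigned i = 0; i < 4; i++) {
          LLVMValueRef texel = LLVMBuildLoad(b, texel_ptr[i], "texel");
          vec = LLVMBuildInsertElement(b, vec, texel,
                                       LLVMConstInt(i32, i, 0), "");
       }
       /* Reinterpret the combined words as 16 bytes only after stitching. */
       return LLVMBuildBitCast(b, vec, LLVMVectorType(LLVMInt8Type(), 16),
                               "rgba8x4");
    }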
|
|
|
|
|
| |
Reviewed-by: Edward O'Callaghan <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
| |
for L2 prefetch
Reviewed-by: Edward O'Callaghan <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Edward O'Callaghan <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
| |
it doesn't interact with compute shaders in any way
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
| |
it's exactly the same as the other ones
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
|
| |
Fix incorrect swizzling in SIMD16 Transpose_16_16 breaking the
two-channel 16-bpc formats like R16G16_FLOAT.
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
| |
Honor the colorHottileEnable mask when accessing colorBuffer pointers.
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
| |
Fix routines for 8-bit and 16-bit formats used by optimized tile store.
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixed Transpose_16 methods of following formats:
Transpose8_8_8_8
Transpose8_8
Transpose32_32
Transpose16_16_16_16
Transpose16_16_16
Transpose16_16
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
| |
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The T images are composed of effectively swizzled-around blocks of LT (4x4
utile) images, so we can reduce the t_utile_address() calls by 16x by
calling into the simpler LT loop.
This also adds support for calling down with non-utile-aligned
coordinates, which will be part of lifting the utile alignment requirement
on our callers and avoiding the RMW on non-utile-aligned stores.
Improves 1024x1024 TexSubImage by 2.55014% +/- 1.18584% (n=46)
Improves 1024x1024 GetTexImage by 2.242% +/- 0.880954% (n=32)
|
|
|
|
|
|
| |
I want these inlined in the callers, particularly with the tiling
changes coming up, but we're not building with LTO, so some caller
would suffer.
|
|
|
|
|
| |
They don't have any other callers outside of this file, and I'm hoping
they get inlined soon.
|
|
|
|
|
|
|
|
|
| |
They now have less of a dependency on the cpp, and don't have to do a
divide.
Hacking up mesa-demos teximage to do only one subtest and not draw
points, I saw 1024x1024 glTexSubImage2D() improve by 4.86939% +/-
1.40408% (n=30) and glGetTexImage() by 2.18978% +/- 0.140268% (n=5).
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
If we get up toward 256MB (or whatever the CMA area size is),
VC4_GEM_CREATE will start throwing errors. Even if we don't trigger
that, when we flush, the kernel's BO allocation for the CLs or bin
memory may end up throwing an error, at which point our job won't get
rendered at all.
Just flush early (at half of the maximum CMA size) so that hopefully we
never get to that point.
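A minimal sketch of the heuristic, assuming the 256MB figure above (the struct
and field names are invented, not the real vc4 ones):

    #include <stdint.h>
    #include <stdbool.h>

    #define VC4_MAX_CMA_SIZE (256u * 1024 * 1024)

    struct pending_job {
       uint32_t bo_space;  /* bytes of BOs referenced by the unflushed job */
    };

    /* Flush once the job's BOs pass half the assumed CMA pool, instead of
     * waiting for allocation failures at submit time. */
    static bool
    job_should_flush_early(const struct pending_job *job)
    {
       return job->bo_space > VC4_MAX_CMA_SIZE / 2;
    }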
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using bit replication. This path now resembles something which might make
sense. (The logic was mostly copied from llvmpipe fs backend.)
I am not convinced, though, that it is actually faster than SoA sampling
(actually I'm quite certain it's always a loss with AVX).
With SoA it's just shift/mask/cvt/mul for getting the colors, whereas
there are still roughly 3 shifts and 3 or/and per channel for AoS
(i.e. for SoA it's exactly the same as it would be for a rgba8 format,
whereas the extra effort for AoS is significant). The filtering
might still be faster (albeit with FMA the instruction count gets down
quite a bit there on the SoA float filtering path on new cpus). And those
small unorm formats often don't have an alpha channel (which makes things
relatively worse for the AoS path).
(This also fixes a trivial bug in the llvmpipe fs code this was derived
from, albeit it was only relevant for 4-bit channels.)
Reviewed-by: Jose Fonseca <[email protected]>
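As a scalar reference for what bit replication means for one of these small
unorm formats (plain C, not the vectorized gallivm code; the helper name is
made up):

    #include <stdint.h>

    /* Expand one rgb565 texel to rgba8888 by replicating the top bits of each
     * channel into the vacated low bits, so 0x1f -> 0xff and 0 -> 0 exactly. */
    static inline uint32_t
    rgb565_to_rgba8888(uint16_t p)
    {
       uint32_t r5 = (p >> 11) & 0x1f;
       uint32_t g6 = (p >>  5) & 0x3f;
       uint32_t b5 =  p        & 0x1f;
       uint32_t r = (r5 << 3) | (r5 >> 2);
       uint32_t g = (g6 << 2) | (g6 >> 4);
       uint32_t b = (b5 << 3) | (b5 >> 2);
       return r | (g << 8) | (b << 16) | (0xffu << 24);
    }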
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SIMD instruction sets usually have comparisons for equal, not unequal.
So use a different comparison against the mask itself - which also means
we don't need an all-zero as well as an all-one (for the pxor) reg.
Also add code to avoid scalar expansion of i1 values, which we definitely
shouldn't do. There are problems with this, though, due to llvm select
interaction, so it's disabled (basically using llvm select instead of
intrinsics may still produce atrocious code, even in cases where we
figured it should not; I think this could probably be fixed
with some better selection of optimization passes, but I have zero
idea there really).
Reviewed-by: Jose Fonseca <[email protected]>
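A small SSE2 intrinsics illustration of the point about integer comparisons
(not the gallivm code itself): there is a pcmpeqd but no integer "not equal",
so testing v != 0 needs a compare plus a pxor against an all-ones register,
while comparing for equality against the mask value directly is one pcmpeqd.

    #include <emmintrin.h>

    /* per-lane "v != 0": pcmpeqd against zero, then invert with pxor */
    static __m128i lanes_ne_zero(__m128i v)
    {
       __m128i zero = _mm_setzero_si128();
       __m128i ones = _mm_cmpeq_epi32(zero, zero);  /* all-ones constant */
       return _mm_xor_si128(_mm_cmpeq_epi32(v, zero), ones);
    }

    /* per-lane "v == mask": a single pcmpeqd, no extra constants needed */
    static __m128i lanes_eq_mask(__m128i v, __m128i mask)
    {
       return _mm_cmpeq_epi32(v, mask);
    }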
|
|
|
|
|
| |
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99214
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Draw calls no longer flush SDMA IBs. r600_need_dma_space is
responsible for synchronizing execution between both IBs.
Initial buffer clears and fast clears will stay unflushed in the SDMA IB
(up to 64 MB) as long as the GFX IB isn't flushed either.
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
| |
Call r600_dma_emit_wait_idle only when there is a possibility of
a read-after-write hazard. Buffers not yet used by the SDMA IB don't
have to wait.
Reviewed-by: Nicolai Hähnle <[email protected]>
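A hypothetical sketch of the rule (field and function names are invented, not
the real r600 ones): only a buffer already touched by the current SDMA IB can
create the read-after-write hazard, so only then is the wait-idle emitted.

    #include <stdbool.h>

    struct dma_buffer {
       bool used_in_dma_ib;  /* referenced by the current, unflushed SDMA IB */
    };

    static bool
    need_dma_wait_idle(const struct dma_buffer *dst, const struct dma_buffer *src)
    {
       return (dst && dst->used_in_dma_ib) || (src && src->used_in_dma_ib);
    }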
|
|
|
|
|
|
|
|
| |
r600_dma_emit_wait_idle is going away in its current form.
The only difference is that the moved code is executed before DMA calls
instead of after them.
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
| |
It's redundant with the source modifier.
Reviewed-by: Nicolai Hähnle <[email protected]>
|