| |
This code uses a vector shift which has to be emulated on x86 unless
AVX2 is available. Luckily, in some cases we can avoid the shift
altogether, so do that.
Also make sure we hit the fast lp_build_conv() path when applicable,
albeit that's quite a hack...
That said, this path is taken for AoS sampling of small unorm formats
(smaller than rgba8), and it is completely hopeless even with these
changes, with or without AVX.
(We should probably have code similar to what the llvmpipe fs backend
does, using bit replication to extend to rgba8888 - the rounding is not
quite 100% accurate, but if it's good enough there it should be good
enough here as well.)
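A minimal sketch of the bit replication mentioned above (the helper name
and the 5-bit width are purely illustrative, this is not the actual
llvmpipe code):

  #include <stdint.h>

  /* Extend a 5-bit unorm channel to 8 bits by replicating the top bits
   * into the freed-up low bits: 0b10110 -> 0b10110101. */
  static inline uint8_t unorm5_to_unorm8(uint8_t v5)
  {
     return (uint8_t)((v5 << 3) | (v5 >> 2));
  }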
Reviewed-by: Jose Fonseca <[email protected]>
| |
If we only feed one source vector at a time, we cannot use pack intrinsics
(as we only have a 64-bit destination vector). lp_bld_conv_auto is
specifically designed to alter the length and number of destination vectors,
so this works just fine (when we use a single source vector at a time, we
immediately reassemble the vectors afterwards).
For AVX though this isn't really possible, since we expect 128-bit output
already for a single 256-bit input. (One day we should handle AVX2, which
again would need multiple inputs, however there's the problem that we get
differently ordered output there and we don't want to reorder, so we would
need to be able to tell build_conv to handle the upper and lower halves
independently.)
A similar strategy would probably work for 32->8-bit too (if it doesn't hit
the special case), but I'm going to try something different for that...
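For reference, the reason multiple source vectors matter: the x86 pack
instructions inherently combine two full-width sources into one
narrower-element result. A minimal SSE2 illustration (not gallivm code,
names are made up):

  #include <emmintrin.h>

  /* Pack eight 32-bit values into eight 16-bit values (signed saturation).
   * The instruction needs two full source vectors to fill one destination,
   * which is why the conversion wants to see multiple inputs at once. */
  static __m128i pack_i32_to_i16(__m128i src_lo, __m128i src_hi)
  {
     return _mm_packs_epi32(src_lo, src_hi);
  }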
Reviewed-by: Jose Fonseca <[email protected]>
| |
SIMD instruction sets usually have comparisons for equal, not unequal.
So use a different comparison, against the mask itself - which also means
we don't need an all-zero as well as an all-one (for the pxor) reg.
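A rough sketch of that kind of transformation, assuming the lanes being
tested are either all-zero or all-ones (illustrative SSE2 only, not the
actual gallivm code):

  #include <emmintrin.h>

  /* Per-lane "is this lane set?" test, for lanes that are either 0 or ~0. */
  static __m128i lane_set_old(__m128i m)
  {
     /* m != 0 has to be built as (m == 0) ^ ~0: needs both an all-zero
      * and an all-ones register. */
     return _mm_xor_si128(_mm_cmpeq_epi32(m, _mm_setzero_si128()),
                          _mm_set1_epi32(-1));
  }

  static __m128i lane_set_new(__m128i m, __m128i all_ones)
  {
     /* comparing for equality against the mask value itself needs only
      * that one constant */
     return _mm_cmpeq_epi32(m, all_ones);
  }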
Also add code to avoid scalar expansion of i1 values, which we definitely
shouldn't do. There are problems with this though, due to the interaction
with llvm select, so it's disabled (basically, using llvm select instead of
intrinsics may still produce atrocious code, even in cases where we
figured it should not; I think this could probably be fixed
with a better selection of optimization passes, but I have zero
idea there really).
Reviewed-by: Jose Fonseca <[email protected]>
| |
It's redundant with the source modifier.
Reviewed-by: Nicolai Hähnle <[email protected]>
| |
It's redundant with the source modifier.
Reviewed-by: Nicolai Hähnle <[email protected]>
| |
Wrap lp_bld_type.h in extern "C".
Windows decorates global variables, so when they are used from .cpp files
an undecorated version is needed.
Also remove related, unneeded code from swr_screen.cpp.
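The wrapping follows the usual guard pattern (generic sketch only):

  #ifdef __cplusplus
  extern "C" {
  #endif

  /* ... the C declarations of lp_bld_type.h ... */

  #ifdef __cplusplus
  }
  #endif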
Reviewed-by: Ilia Mirkin <[email protected]>
| |
This can now handle rgtc (unorm) too - this path no longer handles plain
formats, but that's unnecessary since they now all have their proper SoA
unpack (this will still be dog-slow though, due to the actual fetch going
through the per-pixel util fallbacks).
Reviewed-by: Jose Fonseca <[email protected]>
| |
This previously always fell back to AoS conversion. Even for 4-float formats
(which are by far the optimal case for that fallback) this was suboptimal,
since it meant the conversion couldn't be done with 256-bit vectors. While
this may still only be partly possible for some formats (unless there's AVX2
support), at least the transpose can be done with half the unpacks
(and before the transpose was used for AoS fallbacks, it was worse still).
With fewer than 4 channels, things quickly got way worse with the AoS
fallback, even with 128-bit vectors.
The strategy is pretty much the same as the existing one for formats
which fit into 32 bits, except there are now multiple vectors to be
fetched (2 or 4, to be exact), which need to be shuffled first (for 4
vectors this amounts to a transpose, for 2 it's a bit different),
then the unpack is done the same way (with the exception that the shift
of the channels is now modulo 32, and we need to select the right
vector).
In fact the most complex part is getting the shuffles right
for separating into lo/hi parts for AVX/AVX2...
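A rough illustration of that channel selection, with made-up names rather
than the actual gallivm variables:

  /* A channel starting at bit offset 'start' within a 64- or 128-bit texel
   * now lives in one of the fetched 32-bit vectors. */
  static void locate_channel(unsigned start, unsigned *vec_idx, unsigned *shift)
  {
     *vec_idx = start / 32;   /* which fetched vector to select       */
     *shift   = start % 32;   /* shift within that vector (modulo 32) */
  }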
This also makes use of the new ability of gather to use provided type
information, which we abuse to outsmart llvm so we get decent shuffles,
and to fetch 3x32-bit vectors without having to ZExt the scalar.
And just because we can, we handle double formats too, albeit they are
a bit different (draw sometimes needs to handle those).
v2: fix a float/int typo (it generated inefficient code).
Reviewed-by: Jose Fonseca <[email protected]>
| |
By using a dst_type in the gather interface, gather has some more
knowledge about how values should be fetched.
E.g. if this is a 3x32-bit fetch and dst_type is a 4x32-bit vector, gather
will no longer do a ZExt of a 96-bit scalar value to 128 bits, but
just fetch the 96 bits as a 3x32-bit vector (this is still going to be
2 loads of course, but the loads can be done directly into a simd vector
that way).
Also, we now try to use the right int/float type. This should
make no difference really, since there are typically no domain transition
penalties for such simd loads, however it actually makes a difference
since llvm will use different shuffle lowering afterwards, so the caller
can use this to trick llvm into using sane shuffles afterwards (and yes,
llvm is really stupid there - nothing against using the shuffle
instruction from the correct domain, but not at the cost of doing 3 times
more shuffles; the case which actually matters is its refusal to use shufps
for integer values).
Also make some attempt to avoid things which look great on paper but which
llvm doesn't really handle (e.g. fetching 3-element 8-bit and 16-bit vectors,
which is simply disastrous - I suspect the type legalizer is to blame, trying
to extend these vectors to 128-bit types somehow; so fetch these with
scalars as before, which is suboptimal due to the ZExt).
Remove the ability to truncate (no point, this is gather, not conversion),
as it is complex enough already.
While here, also implement not just the float but also the 64-bit avx2
gathers (disabled, though, since based on the theoretical numbers the benefit
just isn't there at all until at least Skylake).
Reviewed-by: Jose Fonseca <[email protected]>
| |
We should do a transpose, not extract/insert, at least with a "sufficient"
number of channels (for 4 channels, the extract/insert shuffles generated
otherwise look truly terrifying). Albeit we shouldn't fall back to that so
often in any case.
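For illustration, the transpose approach as a generic SSE sketch (using the
standard helper macro, not the actual gallivm shuffles):

  #include <xmmintrin.h>

  /* AoS->SoA for 4 float channels: each input vector holds one texel's
   * rgba; after the in-place transpose each vector holds one channel of
   * all four texels. One 4x4 transpose built from shuffles instead of
   * 16 per-element extract/insert operations. */
  static void aos_to_soa4(__m128 *t0, __m128 *t1, __m128 *t2, __m128 *t3)
  {
     _MM_TRANSPOSE4_PS(*t0, *t1, *t2, *t3);
  }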
v2: ditch the extract/insert path, it's not worth keeping (we're going to
avoid hitting the fallback that often with future patches).
Reviewed-by: Jose Fonseca <[email protected]>
| |
The SoA fetch so far always assumed that data was aligned. However, we want
to use this for vertex fetch, and data might not be aligned there, so handle
it in this path too (basically just pass the alignment through to the other
functions). (It looks like it wouldn't work for cached s3tc, but this is
no different than with AoS fetch.)
Reviewed-by: Jose Fonseca <[email protected]>
| |
LLVM can't really optimize anything which crosses scalar/vector boundaries,
so help a bit with some particular gather operations when the width is
expanded (only do it for 16->32-bit expansion for now), by doing the
expansion after the fetch. That is probably a better solution anyway, even
if llvm could recognize the pattern - it makes for cleaner IR...
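A scalar-code caricature of the difference, with made-up helper names (the
real code keeps everything in vectors, of course):

  #include <stdint.h>

  /* Gather at the native 16-bit width first, then widen the whole vector
   * once afterwards, rather than zero-extending each element during the
   * gather itself. */
  static void gather16_then_expand32(const uint16_t *base,
                                     const int32_t offsets[4],
                                     uint32_t dst[4])
  {
     uint16_t tmp[4];
     for (int i = 0; i < 4; i++)
        tmp[i] = base[offsets[i]];   /* fetch at 16 bits */
     for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];             /* one 16->32 expansion at the end */
  }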
Reviewed-by: Jose Fonseca <[email protected]>
| |
Note that we really want to _never_ reach the bottom of the function, which
resorts to AoS fetch.
Half floats can be handled just like other formats which fit into 32-bit
vectors (so, only 1x16 and 2x16 formats, albeit with more channels things
are not THAT bad), with minimal plumbing. I've seen code size go down by
nearly a factor of 3 for a complete texture sampling function (including
bilinear filtering) using R16F.
(What we should do for everything not special-cased is an AoS gather,
then shuffle/shift things into SoA vectors, and do the conversion there.
Otherwise it's particularly bad with 1- or 2-channel formats - that r16f
format with either 4- or 8-wide vectors was still doing one element at a
time, essentially doing exactly the same work as for rgba16f. Also, replacing
the channels with SWIZZLE0/1 (particularly the latter) adds even more
work, as it has to be done per AoS vector, and not just straightforwardly
at the end with the SoA vector.)
Reviewed-by: Jose Fonseca <[email protected]>
| |
Use the llvm-provided API based on cpuid rather than our own manually
maintained list of mattr enabling/disabling.
Reviewed-by: Roland Scheidegger <[email protected]>
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
| |
Reviewed-by: Brian Paul <[email protected]>
| |
Reviewed-by: Nicolai Hähnle <[email protected]>
| |
The fix in commit 88f791db75e9f065bac8134e0937e1b76600aa36 was insufficient
for radeonsi because the vector case was not handled properly. It seems
piglit only covers the scalar case, unfortunately.
Fixes GL45-CTS.shader_bitfield_operation.[iu]mulExtended.*
Reviewed-by: Roland Scheidegger <[email protected]>
| |
v2:
Fix adding parameter attributes with LLVM < 4.0.
v3:
Fix typo.
Fix parameter index.
Add a gallivm enum for function attributes.
Reviewed-by: Nicolai Hähnle <[email protected]>
| |
This patch does two things:
1. It separates the host-CPU code generation from the generic code
generation. This guards against accidentally breaking things for
radeonsi in the future.
2. It makes sure we actually use both arguments and don't just compute
a square :-p
Fixes a regression introduced by commit 29279f44b3172ef3b84d470e70fc7684695ced4b
Cc: Roland Scheidegger <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
| |
This is used by the shader umul_hi/imul_hi functions (and soon by draw).
It's actually useful to separate this out on its own; however, the real
reason for doing it is that we use an optimized sse2 version, because
the code llvm generates is atrocious (there's no widening mul in llvm,
and it does not recognize the widening mul pattern, so it generates code
for a real 64x64->64-bit mul, which the cpu can't do natively, in contrast
to the 32x32->64-bit mul which it could do).
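The pattern in question, written out in plain C as a rough illustration (on
x86 the hardware provides exactly this 32x32->64-bit multiply, e.g. via
pmuludq):

  #include <stdint.h>

  /* High 32 bits of an unsigned 32x32 multiply: only the operands are
   * widened, the multiply itself is 32x32->64-bit, which the cpu can do
   * natively (unlike a full 64x64->64-bit multiply per lane). */
  static uint32_t umul_hi(uint32_t a, uint32_t b)
  {
     return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
  }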
Reviewed-by: Jose Fonseca <[email protected]>
| |
Reviewed-by: Brian Paul <[email protected]>
Tested-by: Brian Paul <[email protected]>
| |
radeonsi needs these.
Reviewed-by: Nicolai Hähnle <[email protected]>
| |
The per-element fetch has quite a few calculations which are constant;
these can be moved outside both the per-element and the main
shader loop (llvm can mostly figure out they are constant on its own,
however this can have a significant compile time cost).
Similarly, it looks easier to swap the fetch loops (outer loop per attrib,
inner loop filling up the per-vertex elements - this way the aos->soa
conversion can also be done per attrib and not just at the end, though again
this doesn't really make much of a difference in the generated code). (This
would also make it possible to vectorize the calculations leading to the
fetches.)
There's also a minimal change simplifying the overflow math slightly.
All in all, the generated code seems to look slightly simpler (depending
on the actual vs), but more importantly I've seen a significant reduction
in compile times for some vs (albeit with an old (3.3) llvm version, and the
time reduction is only really for the optimizations run on the IR).
v2: adapt to the other draw change.
No changes with piglit.
Reviewed-by: Jose Fonseca <[email protected]>
| |
Compilation to actual machine code can easily take as much time as the
optimization passes on the IR if not more, so print this out too.
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
| |
For the texturing packs, things looked pretty terrible. For every
lerp, we were repacking the values, and while those look sort of cheap
with 128-bit vectors, with 256-bit vectors we end up with 2 of them instead
of just 1, plus, worse, 2 extracts on top (the unpack, however, works fine
with a single instruction, albeit only with llvm 3.8 - the vpmovzxbw).
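For illustration, the relevant AVX2 operations via intrinsics (not the
gallivm code itself; note the pack works per 128-bit lane, which is related
to the ordering issue mentioned below):

  #include <immintrin.h>

  /* vpmovzxbw: 16 x u8 -> 16 x u16 in a single instruction. */
  static __m256i unpack_u8_to_u16(__m128i narrow)
  {
     return _mm256_cvtepu8_epi16(narrow);
  }

  /* vpackuswb: 2 x (16 x u16) -> 32 x u8, but interleaved per 128-bit
   * lane, so the result is not in simple linear order. */
  static __m256i pack_u16_to_u8(__m256i lo, __m256i hi)
  {
     return _mm256_packus_epi16(lo, hi);
  }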
Ideally we'd use a more clever pack for the llvmpipe backend conversion too,
since we actually use the "wrong" shuffle (which is more work) when doing
the fs twiddle, which leaves us with the wrong order for being able to
do a native pack when converting from 2x8f -> 1x16b. But this requires some
refactoring, since the untwiddle is separate from the conversion.
This is only used for the avx2 256-bit pack/unpack for now.
Improves openarena scores by 8% or so, though overall it's still pretty
disappointing how much faster 256-bit vectors are even with avx2 (or
rather, aren't...). And, of course, eliminating the needless
packs/unpacks in the first place would eliminate most (though not quite
all) of the advantage from this patch.
Reviewed-by: Jose Fonseca <[email protected]>
| |
v2: Use AVX2 gather for non-aligned loads too.
Reviewed-by: Roland Scheidegger <[email protected]>
| |
v2: Make sure that with num_lods > 1 and min_filter != mag_filter we
still enter the splitting path, so this case still uses the 4-wide aos
path (as a side note, the 4-wide aos sampling path could actually be
improved quite a bit if we have avx2, by just doing the filtering with
256-bit vectors).
Reviewed-by: Jose Fonseca <[email protected]>
| |
v2: pblendb -> pblendvb
Reviewed-by: Roland Scheidegger <[email protected]>
| |
This should be analogous to 32-bit integers.
Reviewed-by: Edward O'Callaghan <[email protected]>
Signed-off-by: Nicolai Hähnle <[email protected]>
| |
This enables 64-bit integer support in gallivm and
llvmpipe.
v2: add conversion opcodes.
v3:
- PIPE_CAP_INT64 is not there yet
- restrict DIV/MOD defaults to the CPU, as for 32 bits
- TGSI_OPCODE_I2U64 becomes TGSI_OPCODE_U2I64
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Edward O'Callaghan <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Signed-off-by: Nicolai Hähnle <[email protected]>
| |
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
| |
Reduces code duplication.
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
| |
It's used from both mesa main and gallium.
Signed-off-by: Jason Ekstrand <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
| |
Not sure if this is the right way to do it, but it seems to work.
v2: make it a no-op on LLVM <= 3.5
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
| |
RadeonSI stats: Mostly 0% difference, but Valley shows a small improvement:
Application Files SGPRs VGPRs SpillSGPR SpillVGPR Code Size LDS Max Waves Waits
unigine_valley 278 0.00 % -0.29 % 0.00 % 0.00 % 0.01 % 0.00 % 0.17 % 0.00 %
Reviewed-by: Jose Fonseca <[email protected]>
| |
Apparently, these are deprecated. There's some AutoUpgrade feature which
is supposed to promote these to cmp/select, which apparently doesn't work
with jit code. It is possible it's not actually even meant to work (see
the bug filed against llvm, which couldn't provide an answer either),
but in any case this is only meant to be temporary unless the intrinsics
are really illegal. So, just use the fallback code (which should be
cmp/select; we're actually doing cmp/sext/trunc/select, but in any case
llvm 3.9 manages to optimize this back to pmin/pmax in the end).
This addresses https://llvm.org/bugs/show_bug.cgi?id=28176
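For reference, the cmp/select form of a vector min that the fallback
amounts to - a plain SSE2 illustration, not the gallivm code:

  #include <emmintrin.h>

  /* Per-lane signed 32-bit minimum built from compare + select, the
   * pattern llvm 3.9 recognizes and turns back into pmin where available. */
  static __m128i min_epi32_cmpsel(__m128i a, __m128i b)
  {
     __m128i a_lt_b = _mm_cmplt_epi32(a, b);            /* mask: a < b      */
     return _mm_or_si128(_mm_and_si128(a_lt_b, a),      /* take a where set */
                         _mm_andnot_si128(a_lt_b, b));  /* else take b      */
  }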
CC: <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
Tested-by: Vinson Lee <[email protected]>
Tested-by: Aaron Watry <[email protected]>
| |
v2: include whitespace fixes
Signed-off-by: Jan Vesely <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
| |
This converts one other place to using the new helper.
Reviewed-by: Nicolai Hähnle <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
| |
This just uses the same form across the fetches.
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
| |
This just makes some generic code that currently emits doubles
suitable for emitting 64-bit values as well.
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
| |
Besides the old JIT bug, it seems the X86 backend on LLVM 3.3 doesn't
handle llvm.fmuladd and instead falls back to a C function, which in
turn causes a segfault on Windows.
Reviewed-by: Roland Scheidegger <[email protected]>
| |
Reviewed-by: Roland Scheidegger <[email protected]>
| |
As suggested by Roland Scheidegger.
Use the same logic as f16c, since fma requires VEX encoding.
But disable FMA on LLVM 3.3 without MCJIT.
Reviewed-by: Roland Scheidegger <[email protected]>
| |
Signed-off-by: Marek Olšák <[email protected]>
| |
Instead of doing an add and then masking out the upper bits, we can
simply do the add with a half-wide type (this, of course, assumes
the hw can actually do it...), so we'll get the required zero
in the upper bits automatically.
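A rough sketch of the idea with SSE2 intrinsics, assuming the values
already fit in the lower 16 bits of each 32-bit lane (illustrative only,
not the gallivm code):

  #include <emmintrin.h>

  /* before: 32-bit add, then mask the result back to 16 bits per lane */
  static __m128i add_and_mask(__m128i a, __m128i b)
  {
     return _mm_and_si128(_mm_add_epi32(a, b), _mm_set1_epi32(0xffff));
  }

  /* after: add with half-wide (16-bit) lanes; carries can't reach the
   * upper halves, so they stay zero without any masking */
  static __m128i add_half_wide(__m128i a, __m128i b)
  {
     return _mm_add_epi16(a, b);
  }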
Reviewed-by: Jose Fonseca <[email protected]>
| |
Use GALLIVM_DEBUG=dumpbc to dump modules as bitcode.
Instead of a fixed llvmpipe.bc name, use ir_<modulename>.bc so multiple
modules can be dumped (though it might still overwrite previous modules;
in particular, the modules from draw tend to always have the same name).
Reviewed-by: Jose Fonseca <[email protected]>
| |
Those aren't really interesting; however, outputting them is helpful when
trying to feed the IR to llvm's llc (or opt) for debugging.
Reviewed-by: Jose Fonseca <[email protected]>
| |
At least with MCJIT the disassembler will crash otherwise when trying to
disassemble such functions.
Reviewed-by: Jose Fonseca <[email protected]>
| |
We don't target this yet, and some llvm versions incorrectly enable it based
on the cpu string, causing crashes.
(This is a losing battle, though: it is pretty much guaranteed that when the
next new feature comes along, llvm will mistakenly enable it on some future
cpu, so we would have to proactively disable all new features as llvm adds
them.)
This should fix https://bugs.freedesktop.org/show_bug.cgi?id=94291 (untested)
Tested-by: Timo Aaltonen <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
CC: <[email protected]>