| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
| |
The conversion code for srgb was tuned for n x 4x8bit AoS -> 4 x nxfloat SoA
(and vice versa); fix it to also handle 16bit 565-style srgb formats.
Still not really all that generic; things like r10g10b10a2_srgb or
r4g4b4a4_srgb wouldn't work (the latter would be trivial to fix, the former
would not just require more work to not crash but almost certainly some
higher precision calculation), but those aren't needed right now.
The code is not fully optimized for this (could use more direct calculation
instead of expanding to 8-bit range first) but should be good enough.
Reviewed-by: Jose Fonseca <[email protected]>
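For illustration, a minimal scalar sketch of the "expand to 8-bit range first"
step mentioned above, using the usual bit-replication trick (not the actual
gallivm code, which generates vectorized LLVM IR; helper names and channel
order are hypothetical):
#include <stdint.h>
/* widen 5-/6-bit channels to the 8-bit range by bit replication so the
 * existing 8-bit srgb-to-linear path can be reused */
static inline uint8_t expand5(uint8_t v) { return (v << 3) | (v >> 2); }
static inline uint8_t expand6(uint8_t v) { return (v << 2) | (v >> 4); }
static void unpack_565_to_rgb8(uint16_t p, uint8_t rgb[3])
{
   rgb[0] = expand5((p >> 11) & 0x1f);
   rgb[1] = expand6((p >> 5) & 0x3f);
   rgb[2] = expand5(p & 0x1f);
}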
|
|
|
|
|
|
| |
Similar to the other cases, shift some weight/coord calculations to int
space. This should be slightly faster (on x86 sse it should actually save one
instruction, and generally int instructions are cheaper).
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous code used coords which were calculated as
(int) (f_coord * tex_size * 256) >> 8.
This is not only unnecessarily complex but can give the wrong texel due to
rounding for negative coords (as an example, after denormalization coords
from -1.0 to 0.0 should give -1, but this gives -1 only for numbers from
-1.0-1/256 to 0.0-1/256).
Instead, just use ifloor, dropping the shift stuff.
Unfortunately, this will most likely be slower - with arch rounding available
it shouldn't be too bad (it trades an int shift for a round and also saves an
int mul, which is shared by all coords), but otherwise it's a mess.
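A tiny standalone example of the rounding problem (ignoring the texture-size
multiply; purely illustrative, not the gallivm code):
#include <math.h>
#include <stdio.h>
int main(void)
{
   float f = -0.001f;                      /* denormalized coord just below 0.0 */
   int old_way = ((int)(f * 256)) >> 8;    /* fptosi truncates toward zero: 0 */
   int new_way = (int)floorf(f);           /* ifloor: -1, the correct texel */
   printf("%d %d\n", old_way, new_way);    /* prints "0 -1" */
   return 0;
}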
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous method for converting coords to ints was slightly inaccurate
(effectively losing 1bit from the 8bit lerp weight). This is probably
especially noticeable when trying to draw a pixel-aligned texture.
As an example, for a 100x100 texture after denormalization the texture
coords in this case would turn up as
0.5, 1.5, 2.5, 3.5, 4.5, ...
After the mul by 256, conversion to int and 128 subtraction, they end up as
0, 256, 512, 768, 1024, ...
which gets us the correct coords/weights of
0/0, 1/0, 2/0, 3/0, 4/0, ...
But even LSB errors (which are unavoidable) in the input coords may cause
these coords/weights to be wrong, e.g. for a coord of 3.49999 we'd get a
coord/weight of 2/255 instead.
Fix this by using round-to-nearest int instead of FPToSI (trunc). Should be
equally fast on x86 sse though other archs probably suffer a little.
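The 3.49999 example worked through in scalar C (illustrative only; the real
code emits the equivalent LLVM IR):
#include <math.h>
#include <stdio.h>
int main(void)
{
   float coord = 3.49999f;                           /* ideally 3.5: texel 3, weight 0 */
   int trunc_fp = (int)(coord * 256.0f) - 128;       /* FPToSI: 895 - 128 = 767 */
   int round_fp = (int)lrintf(coord * 256.0f) - 128; /* round: 896 - 128 = 768 */
   printf("trunc: texel %d weight %d\n", trunc_fp >> 8, trunc_fp & 0xff); /* 2 / 255 */
   printf("round: texel %d weight %d\n", round_fp >> 8, round_fp & 0xff); /* 3 / 0 */
   return 0;
}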
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous code relied on cpu denorm support for converting small float
formats (such as r11g11b10_float and r16_float) to floats, otherwise denorms
are flushed to zero. We worked around that in llvmpipe blend code by
reenabling denorms, but this did nothing for texture sampling. Now it would
be possible to reenable it there too but I'm not really a fan of messing
with fpu flags (and it seems we can't actually do it reliably with llvm in
any case looking at some bug reports). (Not to mention if you actually have
a lot of denorms in there, you can expect some order-of-magnitude slowdown
with x86 cpus.)
So instead use code which adjusts exponents etc. directly hence not relying
on cpu denorm support for the rescaling mul.
(We still need the fpu flag handling as we can't do float-to-smallfloat
without using cpu denorms at least for now - I actually wanted to keep
both the old and new code and use one or the other depending on where it's
called from, but that didn't work out as the parameter would have to be
passed through more layers than I'd like.)
Reviewed-by: Zack Rusin <[email protected]>
Reviewed-by: Si Chen <[email protected]>
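The general idea of handling the denormal range by adjusting the exponent
directly, shown as a scalar C sketch for a 16-bit half float (the actual
change builds equivalent LLVM IR and also covers formats like
r11g11b10_float; this is purely illustrative):
#include <stdint.h>
#include <string.h>
static float half_to_float_no_denorm_mul(uint16_t h)
{
   uint32_t sign = (uint32_t)(h & 0x8000) << 16;
   uint32_t exp  = (h >> 10) & 0x1f;
   uint32_t mant = h & 0x3ff;
   uint32_t bits;
   float f;
   if (exp == 0x1f) {
      bits = sign | 0x7f800000 | (mant << 13);          /* inf / nan */
   } else if (exp != 0) {
      bits = sign | ((exp + 112) << 23) | (mant << 13); /* normal: rebias 15 -> 127 */
   } else if (mant == 0) {
      bits = sign;                                      /* +-0 */
   } else {
      /* denormal half: normalize it ourselves instead of relying on a
       * rescaling mul that would produce a cpu denormal */
      int shift = 0;
      while (!(mant & 0x400)) {
         mant <<= 1;
         shift++;
      }
      bits = sign | ((uint32_t)(113 - shift) << 23) | ((mant & 0x3ff) << 13);
   }
   memcpy(&f, &bits, sizeof f);
   return f;
}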
|
|
|
|
|
|
|
|
|
|
|
| |
We need to handle a lot more immediates and in order to do that
we also switch from allocating this structure on the stack to
allocating it on the heap.
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We only supported up to 256 immediates, which isn't enough. We had
code which allocated immediates as a dynamically allocated array, but it
was always used alongside a statically backed array for performance
reasons. This commit adds code to skip that performance optimization
and always use just the dynamically allocated immediates if the
number of them is too great.
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
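A hypothetical sketch of the fallback pattern described above (names are
made up, not the actual lp_bld_tgsi code):
#include <stdlib.h>
#define MAX_INLINED_IMMEDIATES 256   /* hypothetical static limit */
struct imm_storage {
   float static_imms[MAX_INLINED_IMMEDIATES][4];
   float (*imms)[4];                 /* points at static_imms or a heap array */
   unsigned num;
};
static int imm_storage_init(struct imm_storage *s, unsigned num)
{
   s->num = num;
   if (num <= MAX_INLINED_IMMEDIATES) {
      s->imms = s->static_imms;      /* common case: skip the allocation */
      return 0;
   }
   s->imms = malloc(num * sizeof *s->imms);
   return s->imms ? 0 : -1;
}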
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The number of allowed temporaries increases with almost every
iteration of an API. We used to support 128, then we started
increasing it, and the newer APIs support 4096+. So if we notice
that the number of temporaries is larger than our statically
allocated storage would allow we just treat them as indexable
temporaries and allocate them as an array from the start.
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, we were really doing F2I. Also move it to the generic section.
(Note that for llvmpipe the generated code is definitely bad, due to the lack
of unsigned conversions with sse. I think though that what llvm does (using
scalar conversions to 64bit signed, either with the x87 fpu (32bit) or sse
(64bit), including lots of domain changes) is quite suboptimal; it could do
something like
is_large = arg >= 2^31
half_arg = 0.5 * arg
small_c = fptoint(arg)
large_c = fptoint(half_arg) << 1
res = select(is_large, large_c, small_c)
which should be far fewer instructions, but that's something llvm should do
itself.)
This fixes piglit fs/vs-float-uint-conversion.shader_test (maybe more, needs
GL 3.0 version override to run.)
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
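The pseudo-code above, written out as a scalar C sketch (the message is
about the vectorized code LLVM emits, so this just shows the arithmetic;
a vector version would use a branchless select):
#include <stdint.h>
static uint32_t f2u_via_signed(float arg)
{
   if (arg >= 2147483648.0f) {        /* 2^31: too big for a signed convert */
      /* halve first, convert signed, shift back; the lost low bit is zero
       * anyway for floats this large */
      return ((uint32_t)(int32_t)(arg * 0.5f)) << 1;
   }
   return (uint32_t)(int32_t)arg;     /* small case: plain signed convert */
}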
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
gallivm soa code supported only a single level of nesting for
control flow opcodes (if, switch, loops...) but the d3d10 spec
clearly states that those are nested within functions. To support
nesting of conditionals inside functions we need to store the
nesting data inside function contexts and keep a stack of those.
Furthermore we make sure that if nesting for subroutines is deeper
than 32 then we simply ignore all subsequent 'call' invocations.
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
| |
Trivial.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have code generation paths that carry out swizzles of AoS vectors via
bitwise shifts, as these tend to generate more efficient code than
straightforward byte shuffles. But when the input is a constant the
additional bitwise arithmetic operations somehow don't really get
constant propagated properly, eventually causing an assertion failure in
the InstCombine pass.
Therefore avoid the bug by using the trivial shuffles for constant
inputs.
Although the sample LLVM IR can cause a crash with any LLVM version,
this was only seen in practice with LLVM 3.2.
Reviewed-by: Matthew McClure <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Tungsten Graphics Inc. was acquired by VMware Inc. in 2008. Leaving the
old copyright name is creating unnecessary confusion, hence this change.
This was the sed script I used:
$ cat tg2vmw.sed
# Run as:
#
# git reset --hard HEAD && find include scons src -type f -not -name 'sed*' -print0 | xargs -0 sed -i -f tg2vmw.sed
#
# Rename copyrights
s/Tungsten Gra\(ph\|hp\)ics,\? [iI]nc\.\?\(, Cedar Park\)\?\(, Austin\)\?\(, \(Texas\|TX\)\)\?\.\?/VMware, Inc./g
/Copyright/s/Tungsten Graphics\(,\? [iI]nc\.\)\?\(, Cedar Park\)\?\(, Austin\)\?\(, \(Texas\|TX\)\)\?\.\?/VMware, Inc./
s/TUNGSTEN GRAPHICS/VMWARE/g
# Rename emails
s/[email protected]/[email protected]/
s/[email protected]/[email protected]/g
s/jrfonseca-at-tungstengraphics-dot-com/jfonseca-at-vmware-dot-com/
s/jrfonseca\[email protected]/[email protected]/g
s/keithw\[email protected]/[email protected]/g
s/[email protected]/[email protected]/g
s/thomas-at-tungstengraphics-dot-com/thellstom-at-vmware-dot-com/
s/[email protected]/[email protected]/
# Remove dead links
s@Tungsten Graphics (http://www.tungstengraphics.com)@Tungsten Graphics@g
# C string src/gallium/state_trackers/vega/api_misc.c
s/"Tungsten Graphics, Inc"/"VMware, Inc"/
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's possible to bind a smaller buffer as a constant buffer than
what the shader actually uses/requires. This could cause nasty
crashes. This patch adds the architecture to pass the maximum
allowable constant buffer index to the jit to let it make
sure that the constant buffer indices are always within bounds.
The behavior follows the d3d10 spec, which says the overflow
should always return all zeros, and overflow is only defined
as access beyond the size of the currently bound buffer. Accesses
beyond the declared shader constant register size are not
considered an overflow and expected to return garbage but consistent
garbage (we follow the behavior which some wlk tests expect which
is to return the actual values from the bound buffer).
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
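A hypothetical scalar sketch of the overflow rule being implemented (the
real code emits the equivalent clamp/select in LLVM IR; names here are
made up):
static void fetch_const_vec4(const float *buf, unsigned num_vec4s /* bound size */,
                             unsigned index, float out[4])
{
   if (index >= num_vec4s) {
      /* access beyond the currently bound buffer: return all zeros */
      out[0] = out[1] = out[2] = out[3] = 0.0f;
   } else {
      /* within the bound buffer (even past the declared shader size):
       * return whatever the buffer holds */
      out[0] = buf[index * 4 + 0];
      out[1] = buf[index * 4 + 1];
      out[2] = buf[index * 4 + 2];
      out[3] = buf[index * 4 + 3];
   }
}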
|
|
|
|
|
|
|
|
| |
The argument is an i8 pointer, not an i32 pointer (even though the value
actually stored/loaded IS i32). Older llvm versions didn't care, but 3.2 and
newer do, leading to crashes.
Reviewed-by: Zack Rusin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The fact that we flush denorms to zero breaks our half-float
conversion and blending. This patch enables denorms for
blending. It's a little tricky due to the llvm bug that makes
it incorrectly reorder the mxcsr intrinsics:
http://llvm.org/bugs/show_bug.cgi?id=6393
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: José Fonseca <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
Signed-off-by: Zack Rusin <[email protected]>
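On x86 the denorm behavior lives in the FTZ/DAZ bits of MXCSR; a minimal
sketch of saving, clearing and restoring them with the SSE intrinsics (the
actual change does the equivalent from generated code, which is where the
llvm reordering bug bites):
#include <xmmintrin.h>
static unsigned enable_denorms(void)
{
   unsigned saved = _mm_getcsr();
   /* clear FTZ (bit 15) and DAZ (bit 6) so denormals are kept */
   _mm_setcsr(saved & ~(0x8000u | 0x0040u));
   return saved;
}
static void restore_fp_state(unsigned saved)
{
   _mm_setcsr(saved);
}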
|
|
|
|
|
|
|
|
|
| |
Ever since introducing separate sampler and sampler view max this was really
missing.
Every driver but llvmpipe reports the same number as the number of samplers
for now, so nothing should break.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
| |
Fixes "Uninitialized pointer read" defect reported by Coverity.
Signed-off-by: Vinson Lee <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
The exec_mask must be taken into consideration, just like emit_kill above.
The tgsi_exec module has the same bug and should be fixed in a future
change.
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: José Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is similar to tgsi_exec.c's DEBUG_EXECUTION compile flag.
I had prototyped this a while ago while debugging an issue, but finally
cleaned it up and added a few more bells and whistles.
v2: Use '$' as marker; better output. Thanks to Brian, Zack, and Roland for
the reviews.
Here is a sample output.
CONST[0].x = 0.00625000009 0.00625000009 0.00625000009 0.00625000009
CONST[0].y = -0.00714285718 -0.00714285718 -0.00714285718 -0.00714285718
CONST[0].z = -1 -1 -1 -1
CONST[0].w = 1 1 1 1
IN[0].x = 143.5 175.5 175.5 143.5
IN[0].y = 123.5 123.5 155.5 155.5
IN[0].z = 0 0 0 0
IN[0].w = 1 1 1 1
$ 1: RCP TEMP[0].w, IN[0].wwww
TEMP[0].w = 1 1 1 1
$ 2: MAD TEMP[0].xy, IN[0], CONST[0], CONST[0].zwzw
TEMP[0].x = -0.103124976 0.0968750715 0.0968750715 -0.103124976
TEMP[0].y = 0.117857158 0.117857158 -0.110714316 -0.110714316
$ 3: MUL OUT[0].xy, TEMP[0], TEMP[0].wwww
OUT[0].x = -0.103124976 0.0968750715 0.0968750715 -0.103124976
OUT[0].y = 0.117857158 0.117857158 -0.110714316 -0.110714316
$ 4: MUL OUT[0].z, IN[0].zzzz, TEMP[0].wwww
OUT[0].z = 0 0 0 0
$ 5: MOV OUT[0].w, TEMP[0]
OUT[0].w = 1 1 1 1
$ 6: END
OUT[0].x = -0.103124976 0.0968750715 0.0968750715 -0.103124976
OUT[0].y = 0.117857158 0.117857158 -0.110714316 -0.110714316
OUT[0].z = 0 0 0 0
OUT[0].w = 1 1 1 1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
d3d10 requires us to convert NaNs to zero for any float->int conversion.
We don't really do that but it mostly seems to work. In particular I suspect the
very common float->unorm8 path only really passes because it relies on sse2
pack intrinsics which just happen to work by luck for NaNs (float->int
conversion in hw gives integer indeterminate value, which just happens to be
-0x80000000 hence gets converted to zero in the end after pack intrinsics).
However, float->srgb didn't get so lucky, because we need to clamp before
blending and clamping resulted in NaN behavior being undefined (and actually
got converted to 1.0 by clamping with sse2). Fix this by using a zero/one clamp
with defined nan behavior as we can handle the NaN for free this way.
I suspect there's more bugs lurking in this area (e.g. converting floats to
snorm) as we don't really use defined NaN behavior everywhere but this seems
to be good enough.
While here respecify nan behavior modes a bit, in particular the return_second
mode didn't really do what we wanted. From the caller's perspective, we really
wanted to say we need the non-nan result, but we already know the second arg
isn't a NaN. So we use this now instead, which means that cpu architectures
which actually implement min/max by always returning non-nan (that is adhering
to ieee754-2008 rules) don't need to bend over backwards for nothing.
Reviewed-by: Jose Fonseca <[email protected]>
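The "handle the NaN for free" clamp ordering, sketched with SSE intrinsics
(illustrative; the real code goes through the gallivm min/max helpers with
the nan-behavior mode):
#include <xmmintrin.h>
static __m128 clamp_zero_one_nan_to_zero(__m128 x)
{
   /* sse min/max return the second operand when either input is a NaN,
    * so doing the max against 0.0 first turns NaN into 0.0, and the
    * following min against 1.0 then sees no NaN at all */
   const __m128 zero = _mm_setzero_ps();
   const __m128 one  = _mm_set1_ps(1.0f);
   return _mm_min_ps(_mm_max_ps(x, zero), one);
}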
|
|
|
|
|
|
|
|
|
| |
There's only one minor functional change: for immediates the pixel offsets
are no longer added, since the values are all the same for all elements in
any case (it might be better if those weren't stored as soa vectors in the
first place).
Reviewed-by: Zack Rusin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
We weren't adding the soa offsets when constructing the indices
for the gather functions. That meant that we were always returning
the data in the first element.
(Copied straight from the same fix for temps.)
While here fix up a couple of broken comments in the fetch functions,
plus don't name a straight float type float4 which is just confusing.
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SSE can't handle true vector shifts (with variable shift count),
so llvm is turning them into a mess of extracts, scalar shifts and inserts.
It is however possible to emulate them in lp_build_minify with float muls,
which should be way faster (saves over 20 instructions per 8-wide
lp_build_minify). This wouldn't work for "generic" 32bit shifts though
since we've got only 24bits of mantissa (actually for left shifts it would
work by using sse41 int mul instead of float mul but not for right shifts).
Note that this has very limited scope for now, since this is only used with
per-pixel lod (otherwise we're avoiding the non-constant shift count by doing
per-quad shifts manually), and only 1d textures even then (though the latter
should change).
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
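A scalar sketch of the shift-as-float-multiply trick used in lp_build_minify
(illustrative only; the real thing is vectorized LLVM IR):
#include <stdint.h>
#include <string.h>
static uint32_t shr_via_float_mul(uint32_t x, unsigned n)
{
   /* build 2^-n straight from its IEEE-754 bit pattern (biased exponent
    * 127 - n), multiply and truncate; exact as long as x fits in the
    * 24-bit mantissa, which texture sizes do */
   uint32_t bits = (127u - n) << 23;
   float scale;
   memcpy(&scale, &bits, sizeof scale);
   return (uint32_t)((float)x * scale);
}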
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
LLVM 3.4 r193971 removed llvm::DisablePrettyStackTrace and made the
pretty stack trace opt-in rather than opt-out.
The default value of DisablePrettyStackTrace has changed to true in LLVM
3.4 and newer.
Signed-off-by: Vinson Lee <[email protected]>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=60929
Reviewed-by: Tom Stellard <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
d3d10 requires that cube corners are filtered with accurate weights (that
is, the weight of the non-existing corner texel should be evenly distributed
to the other 3 texels). OpenGL does not require this (but recommends it).
This requires us to use different filtering code, since we need per-texel
weights which our 2d lerp doesn't (and can't) do. And of course the (now
per element) weights need to be adjusted too for it to work.
Invoke the new filtering code whenever there's an edge to keep things simpler,
as it will work for edges too not just corners but of course it's only needed
with corners.
More ugly code for not much gain but at least a hacked up cubemap demo
shows very nice corners now... Not sure yet if and how this should be
configurable...
v2: incorporate feedback from Jose, only use special corner filtering code
when there's a corner not when there's only an edge (as corner filtering code
is slower, though a perf difference was only measurable when always
forcing edge code). Plus some minor style fixes.
Reviewed-by: Jose Fonseca <[email protected]>
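A scalar sketch of the corner weight redistribution described above, for a
single channel (hypothetical helper; the real code does this per element on
soa vectors):
/* the corner texel doesn't exist, so its bilinear weight is split evenly
 * over the three texels that do */
static float filter_corner(float t00, float t10, float t01,
                           float w00, float w10, float w01, float w_missing)
{
   float share = w_missing * (1.0f / 3.0f);
   return t00 * (w00 + share) + t10 * (w10 + share) + t01 * (w01 + share);
}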
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For seamless cube filtering it is necessary to determine new faces and new
coords per sample. The logic for this is _seriously_ complex (what needs
to happen is very "asymmetric" wrt face, x/y under/overflow), further
complicated by the fact that if the 4 samples are in a corner (meaning we
only have actually 3 samples, and all 3 are on different faces) then
falling off the edge is happening _both_ on x and y axis simultaneously.
There was a noticeable performance hit in mesa's cubemap demo when seamless
filtering was forced on (just below 10 percent or so in a debug build, when
disabling all filtering hacks, otherwise it would probably be a bit more) and
when always doing the logic, hence use a branch so it is only done if any
of the pixels in a quad (or in two quads) actually hit this. With that there
was no measurable performance hit in the cubemap demo (neither in a debug nor
release build), but this will vary (cubemap demo very rarely hits edges).
Might also be different on other cpus, as this forces SoA sampling path which
potentially can be quite a bit slower.
Note that as for corners, this code gets all the 3 samples which actually
exist right, and the 4th texel will simply be the same as one of the others,
meaning that filter weights will be a bit wrong. This however should be
enough for full OpenGL (but not d3d10) compliance.
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Not used for ages, and it wouldn't work at all with explicit derivatives now
(not that it did before as it ignored them but now the code would just use
the derivs pre-projected which would be quite random numbers).
v2: also get rid of 3 helper functions no longer used.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
They need some special handling. Quite complicated.
Additionally, use the same code for implicit derivatives too if no_rho_approx
and no_quad_lod is set, because it seems while generally it should be ok
to use per quad lod for implicit derivatives there's at least some test which
insists that in case of cubemaps the shared lod value MUST come from a pixel
inside the primitive (due to the derivatives becoming different if a different
larger major axis is chosen).
v2: based on Brian's feedback, clean up code a bit.
And use sign bit of major axis instead of pre-select s/t/r sign for coord
mirroring (which should be the same in the end, saves 2 ands).
Also fix two bugs with select/mirror of derivatives: the minor axes need to
use major axis sign as well (instead of major derivative axis sign), and
don't mistakenly use absolute values of major derivative and inverse major
values.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's two reasons for this:
1) even when ignoring rho approximation for cube maps, the result is still
not correct, but it's better as the max error at edges is now sqrt(2) instead
of 2 (which was a full mip level), same as it is for ordinary 2d maps when
doing rho approximations (so the error actually goes from factor 2 at edges and
sqrt(2) completely inside a face to sqrt(2) at edges and 0 inside a face).
2) I want to repurpose rho_no_approx for cubemaps for fully correct cubemap
derivatives (so don't need yet another debug var).
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Both the imul_hi and umul_hi are working with this patch.
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: José Fonseca <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
| |
Only 8 and 32 bit integers were supported before.
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: José Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Technically without seamless filtering enabled GL allows any wrap mode, which
made sense when supporting true borders (can get seamless effect with border
and CLAMP_TO_BORDER), but gallium doesn't support borders and d3d9 requires
wrap modes to be ignored and it's a pain to fix up the sampler state (as it
makes it texture dependent). It is difficult to imagine a situation where an
app really wants another behavior so just cheat here. (It looks like some
graphics hw (intel) actually requires this too hence it should be safe.)
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Simply adjust wrap mode to clamp_to_edge. This is all that's needed for a
correct implementation for nearest filtering, and it's way better than
using repeat wrap for instance for linear filtering (though obviously this
doesn't actually do seamless filtering).
v2: fix s/t wrap not r/s...
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
| |
It was wrong for EXP.y, as we clamped the source before computing the
fractional part, and this opcode should be rarely used, so it's not
worth the hassle.
|
|
|
|
|
|
|
|
|
|
|
| |
We support indirect addressing only on the vertex index, but some
shaders also use indirect addressing on attributes. This patch
adds support for indirect addressing on both dimensions inside
gs arrays.
Signed-off-by: Zack Rusin <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: José Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Turns out we don't need to do much extra work for detecting this case,
since we are guaranteed to get an empty static texture state in this case,
hence just rely on the format being 0 and return all zeros then.
Previously this needed dummy textures (it would just have crashed on the format
being 0 otherwise), which cannot return the correct result for size queries and when
sampling textures with wrap modes using border.
As a bonus this should hugely increase performance when sampling unbound textures -
too bad it isn't a useful feature :-).
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
|
|
|
|
|
|
|
| |
The only reason this was needed was because the fetch texel function had to
get the (dynamic) border color, but this is now done much earlier.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
| |
Instead of enhancing the AoS path so it can deal with it, just use SoA. Fixing
the AoS path wouldn't be all that difficult (use all the same logic as SoA) but
it's considered not worth it for now.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we can have per-pixel lod we should also honor the filter per-pixel
(in fact we didn't even honor it per quad in the multiple quad case).
Do this by running the linear path and simply beating the weights into shape
(the sample with the higher weight is the one which should have been chosen
with nearest filtering, hence adjust the filter weight to 1.0/0.0 based on
that).
If all pixels use a nearest filter (be it min or mag) then still run just a
nearest filter, as this is way cheaper (probably around 4 times faster for 2d,
more for the 3d case) and it should be relatively rare that pixels really need
different filtering. OTOH if all pixels require linear don't do anything
special, since the linear path with filter adjustments shouldn't really be all
that much more expensive than ordinary linear, and we think it's rare that
min/mag filters are configured differently, so there doesn't seem to be much
value in trying to optimize this further.
This does not yet fix the AoS path (though currently AoS is only used for
single quads hence it could be considered less broken, just never honoring
per-pixel filter decision but doing it per quad).
v2: simplify code a bit (unify min linear and min nearest cases)
Reviewed-by: Jose Fonseca <[email protected]>
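The "beating the weights into shape" step, as a scalar sketch (illustrative;
the real code does this on per-element vectors):
/* when a pixel wants nearest filtering, snap its lerp weight to 0.0 or
 * 1.0 so the shared linear path returns exactly the nearer texel */
static float adjust_lerp_weight(float w, int want_nearest)
{
   if (want_nearest)
      return (w > 0.5f) ? 1.0f : 0.0f;
   return w;
}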
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While a sqrt here and there shouldn't hurt much (depending on the cpu) it is
possible to completely omit it since rho is only used for calculating lod and
there log2(x) == 0.5*log2(x^2). Depending on the exact path taken for
calculating lod this means we get a simple mul instead of sqrt (in case of
nearest mip filter in fact we don't need to replace the sqrt with something
else at all), only in some not very useful path this doesn't work (combined
brilinear calculation of int level and fractional lod, accurate rho calc but
brilinear filtering seems odd).
Apart from being faster, as an added bonus this should increase our crappy
fractional accuracy of lod, since fast_log2 is only good for ~3 bits and this
should increase accuracy by one bit (though this isn't used if the dimension is
just one, since we'd need an extra mul there as we never had the squared rho in
the first place).
v2: use separate ilog2_sqrt function if we have squared rho.
Reviewed-by: Jose Fonseca <[email protected]>
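In short, since log2(x) == 0.5 * log2(x*x) the sqrt can be folded into the
log; a trivial scalar sketch:
#include <math.h>
/* lod computed from the squared rho: the sqrt becomes a 0.5 multiply
 * (or nothing at all for a nearest mip filter, where only the integer
 * level is needed) */
static float lod_from_rho_squared(float rho_sq, float lod_bias)
{
   return 0.5f * log2f(rho_sq) + lod_bias;
}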
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is just preparation for a per-pixel (or per-quad in case of multiple
quads) min/mag filter, since some assumptions about the number of miplevels
being equal to the number of lods no longer hold true.
This change does not change behavior yet (though theoretically when forcing
the per-element path it might be slower with different min/mag filters, since
the code will now respect this setting in this case even when there are no
mipmaps, so some lod calcs will be done per-element though ultimately still
the same filter is used for all pixels).
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, the min/mag switchover point when using nearest/none mip
filter was effectively -0.5 which can't be right. Looks like new OpenGL
thinks it's ok if it's always 0.0 (older versions required 0.5 in some
cases), let's hope everybody else thinks that's fine too.
Refactor this slightly and get the per-quad/per-pixel min/mag decision
values further down to sampling, though still only the first component
is used yet.
While here also fix code trying to skip lod bias application etc. when
mipfilter is none, as this is still needed for determining min/mag filter.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Except for explicit derivs with cube maps which are very bogus anyway.
Just like explicit lod this is only used if no_quad_lod is set in
GALLIVM_DEBUG env var.
Minification is terrible on cpus which don't support true vector shifts
(but should work correctly). Cannot do the min/mag filter decision (if
they are different) per pixel though, only selecting different mip levels
works.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Just a copy & paste error.
Fixes https://bugs.freedesktop.org/show_bug.cgi?id=68409.
Note that the test passing before probably simply means it doesn't verify
clamping of the border color itself as required by the OpenGL spec.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
|
| |
Block size depth is always 1, even for compressed formats (unless someone
invents true 3d compressed formats, which we couldn't represent anyway).
The nearest (and soa) path had it right.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
| |
Same as PIPE_FORMAT_B10G10R10A2_UINT but without the swizzling.
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
The (complicated!) math is all identical; there are just minimal differences
in how the sign bit is calculated, plus there's an additional subtraction for
the argument going into the polynomial for cos.
The logic stays 100% the same (with a small exception, sign bit calculation for
sin is minimally simplified, applying sign mask after xoring the arguments
instead of applying it to each argument).
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
| |
Detected this while hunting some other bug; not sure if it really needs fixing,
but it is definitely wrong.
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
|
|
| |
Was using wrong (undefined) vector element (the elements are at 0/2 position,
not 0/1).
Reviewed-by: Jose Fonseca <[email protected]>
|