| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
| |
This will allow Raspbian's ARMv6 builds to take advantage of the new NEON
code, and could prevent problems if vc4 ends up getting used on a v7 CPU
without NEON.
v2: Drop dead NEON_SUFFIX (noted by Erik Faye-Lund)
|
|
|
|
|
|
|
|
|
| |
Android.mk was setting the flag across the entire driver, so we didn't
have non-NEON versions getting built. This was going to be a problem with
the next commit, when I start auto-detecting NEON support and use the
non-NEON version when appropriate.
Reviewed-by: Rob Herring <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
NEON is sufficiently different on arm64 that we can't just reuse this
code. Disable it on arm64 for now.
v2: Use PIPE_ARCH_ARM instead, as __ARM_ARCH may be 8 for a 32-bit build
for a v8 CPU.
Signed-off-by: Eric Anholt <[email protected]>
Cc: <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
| |
Most drivers don't need it and shouldn't need it because it can't be used
in some cases (indirect draws, primitive restart, count from streamout).
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
| |
This version of the chip is present on the Cygnus-based 911360 enterprise
phone platform. It appears to be completely backwards compatible.
|
|
|
|
|
| |
Reviewed-by: Marek Olšák <[email protected]>
Reviewed-by: Edward O'Callaghan <[email protected]>
|
|
|
|
| |
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
| |
v2:
- explain the resource_commit interface in more detail
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
| |
Changes since v1:
- Add pipe caps for etnaviv, freedreno, swr and virgl
Signed-off-by: Lyude <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Neved used.
v2: gallivm: rename "pred" -> "exec_mask"
etnaviv: remove the cap
gallium: fix tgsi_instruction::Padding
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Lyude <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The NIR story on conversion opcodes is a mess. We've had way too many
of them, naming is inconsistent, and which ones have explicit sizes was
sort-of random. This commit re-organizes things and makes them all
consistent:
- All non-bool conversion opcodes now have the explicit size in the
destination and are named <src_type>2<dst_type><size>.
- Integer <-> integer conversion opcodes now only come in i2i and u2u
forms (i2u and u2i have been removed) since the only difference
between the different integer conversions is whether or not they
sign-extend when up-converting.
- Boolean conversion opcodes all have the explicit size on the bool and
are named <src_type>2<dst_type>.
Making things consistent also allows nir_type_conversion_op to be moved
to nir_opcodes.c and auto-generated using mako. This will make adding
int8, int16, and float16 versions much easier when the time comes.
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Math results land in r4, regardless of the condition. To implement them,
we just need to ensure that the results are moved out of r4 (as often
happens anyway, the values is live across another math instruction), so
that we can attach the condition to the MOV.
Fixes dEQP-GLES2.functional.shaders.random.all_features.fragment.93 and a
couple others, that were assertion failing that their conditions hadn't
been handled during the QIR->QPU stage.
|
|
|
|
|
|
|
|
|
|
| |
This ended up confusing the scheduler for things like fabs (implemented as
fmaxabs x, x) or squaring a number, and it would try to avoid scheduling
them because it appeared more expensive than other instructions.
Fixes failure to register allocate in
dEQP-GLES2.functional.uniform_api.random.3 with almost no shader-db
effects (+.35% max temps)
|
|
|
|
|
| |
Doing instruction count analysis when we emit the thread switches that
will save us from tons of stalls is kind of missing the point.
|
|
|
|
|
|
| |
This reverts commit 292c24ddac5acc35676424f05291c101fcd47b3e. It broke a
lot of GLES2 deqp, and I see at least one problem that will require some
serious rework to fix.
|
|
|
|
| |
Reviewed-by: Edward O'Callaghan <[email protected]>
|
|
|
|
| |
Reviewed-by: Edward O'Callaghan <[email protected]>
|
|
|
|
| |
Reviewed-by: Edward O'Callaghan <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
pipe_mutex_unlock() was made unnecessary with fd33a6bcd7f12.
Replaced using:
find ./src -type f -exec sed -i -- \
's:pipe_mutex_unlock(\([^)]*\)):mtx_unlock(\&\1):g' {} \;
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
replace pipe_mutex_lock() was made unnecessary with fd33a6bcd7f12.
Replaced using:
find ./src -type f -exec sed -i -- \
's:pipe_mutex_lock(\([^)]*\)):mtx_lock(\&\1):g' {} \;
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
pipe_mutex_init() was made unnecessary with fd33a6bcd7f12.
Replace was done using:
find ./src -type f -exec sed -i -- \
's:pipe_mutex_init(\([^)]*\)):(void) mtx_init(\&\1, mtx_plain):g' {} \;
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
pipe_mutex was made unnecessary with fd33a6bcd7f12.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reduces register pressure in both types of shaders, by reordering the
input loads from the var->data.driver_location order to whatever order
they appear first in the NIR shader. These instructions aren't
reorderable at our QIR scheduling level because the FS takes two in
lockstep to do an interpolation, and the VS takes multiple read
instructions in a row to get a whole vec4-level attribute read.
shader-db impact:
total instructions in shared programs: 76666 -> 76590 (-0.10%)
instructions in affected programs: 42945 -> 42869 (-0.18%)
total max temps in shared programs: 9395 -> 9208 (-1.99%)
max temps in affected programs: 2951 -> 2764 (-6.34%)
Some programs get their max temps hurt, depending on the order that the
load_input intrinsics appear, because we end up being unable to copy
propagate an older VPM read into its only use.
|
|
|
|
| |
It's going gain most of ntq_setup_inputs(), so simplify it first.
|
|
|
|
|
| |
This will be used for delaying our VPM reads (which must be unconditional)
until just before they're used.
|
|
|
|
|
|
|
| |
We need to be paying attention to optimization's impact on this -- even if
we reduce instruction count, increasing max temps in general is likely to
cause us to fail to register allocate on some shaders, which means that
those won't run at all.
|
|
|
|
|
|
|
|
| |
all drivers support it
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
Tested-by: Brian Paul <[email protected]> (VMware driver only)
|
|
|
|
|
|
| |
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Andreas Boll <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Notes:
- make sure the default size is large enough to handle all state trackers
- pipe wrappers don't receive transfer calls from stream_uploader, because
pipe_context::stream_uploader points directly to the underlying driver's
stream_uploader (to keep it simple for now)
v2: add error handling to nv50, nvc0, noop
v3: set const_uploader
Reviewed-by: Nicolai Hähnle <[email protected]>
Tested-by: Edmondo Tommasina <[email protected]> (v1)
Tested-by: Charmaine Lee <[email protected]>
|
|
|
|
|
|
|
|
| |
gallium's blitter expects that it can set the sample mask even when the
rasterizer doesn't have the flag on.
Between this and the previous test, 10 new ext_framebuffer_multisample
tests start passing.
|
|
|
|
|
|
| |
gallium's quad-based blitter for copying MSAA depth textures expects to be
able to do 4 passes updating a sample at a time using glSampleMask, and
there's no color buffer bound when it's doing that.
|
| |
|
|
|
|
|
| |
We probably shouldn't be emitting different scaled viewport coordinates
between vertex and coord.
|
|
|
|
|
|
|
| |
In the hardware we only get to declare 8 vertex elements (GLES2's
minimum), so we should be exposing that number here. Fixes an assertion
failure in piglit texrect-many, at the expense of various GL 2.0-ish
minmax tests now complaining that our count is too low.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The kernel will reject our shader if we emit one here, and having 4, 8, or
12 as the top end of our UBO clamp rare is enough that it's not worth
making the kernel let us.
Fixes piglit fs-const-array-of-struct and
fs-const-array-of-struct-of-array since recent GLSL linking changes made
us get this as an indirect load of a uniform, instead of a tempoary.
Cc: "13.0 17.0" <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Nouveau does not currently have logic to implement this as a library
function. Even though such a library could be written, there's no big
advantage to do it that way for now given that int64 is a very uncommon
use-case. Allow a driver to expose INT64 without supporting division and
modulo operations.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make the cap consistent with PIPE_CAP_INT64.
Aside from the hypothetical case of using draw for vertex shaders (and
actually caring about doubles...), every implementation supports doubles
either nowhere or everywhere.
Also, st/mesa didn't even check the cap correctly in all supported
shader stages.
While at it, add a missing LLVM version check for 64-bit integers in
radeonsi. This is conservative: judging by the log, LLVM 3.8 might be
sufficient, but there are probably bugs that have been fixed since then.
v2: fix clover (Marek)
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
| |
The addition of Neon assembly breaks on arm64 builds because the assembly
syntax is different. For now, restrict Neon to ARMv7 builds.
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
clang throws an error on "%r2" and similar. I couldn't find any
documentation on what "%r?" is supposed to mean and I've never seen any
use like that as far as I remember. The parameter is supposed to be
cpu_stride and just %2/%3 should be sufficient.
There's no need for trailing ";" either, so remove those, too.
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
| |
This generally cuts an instruction when blending is enabled and we thus
have a single instruction generating the color value.
total instructions in shared programs: 91759 -> 91634 (-0.14%)
instructions in affected programs: 5338 -> 5213 (-2.34%)
|
|
|
|
|
|
|
|
|
|
| |
shader-db results:
total instructions in shared programs: 92611 -> 91764 (-0.91%)
instructions in affected programs: 27417 -> 26570 (-3.09%)
The star is one shader in glmark2's terrain (drops 16% of its
instructions), but there are also wins in mupen64plus and glb2.7.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This has almost no effect on shader-db:
total instructions in shared programs: 92572 -> 92611 (0.04%)
instructions in affected programs: 4486 -> 4525 (0.87%)
Looking at 2 of the 7 different shaders that were hurt (all of which were
in mupen64), they all appear to be just differences in order of
instructions at the NIR level.
The advantage is that this should significantly reduce time in the compiler.
|
|
|
|
|
|
|
|
|
| |
v1.1: move to using a normal CAP. (Marek)
v2: fill in the cap everywhere
Signed-off-by: Dave Airlie <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
| |
Improves 1024x1024 TexSubImage2D by 41.2371% +/- 3.52799% (n=10).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We had a lot of memcpy call overhead because gpu_stride wasn't being
inlined. But if you split out the stride==8 and stride==16 cases like
this code does while still using memcpy, you'd no longer have glibc's
NEON memcpy applied at which point we'd be doing 16 uncached reads
instead of 64/(NEON memcpy granularity), for about a 30% performance
hit. By hand writing the assembly, we can get a whole cacheline
loaded at a time.
Unfortunately, NEON intrinsics turned out to be unusable -- they
didn't have the vldm instruction available.
Note that, for now, the NEON code is only enabled when building for ARMv7
(Pi 2+). We may want to do runtime detection for the Raspbian case, in
the future.
Improves 1024x1024 GetTexImage by 208.256% +/- 7.07029% (n=10).
|