| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Apparently, Geminilake requires you to whack a chicken bit to select
either compute or tessellation mode for barriers. The recommendation
is to switch between them at PIPELINE_SELECT time.
We may not need to do this all the time, but I don't know that it hurts
either. PIPELINE_SELECT is already a pretty giant stall.
This appears to fix hangs in tessellation control shaders with barriers
on Geminilake. Note that this requires a corresponding kernel change,
drm/i915: Whitelist SLICE_COMMON_ECO_CHICKEN1 on Geminilake.
in order for the register write to actually happen. Without an updated
kernel, this register write will be noop'd and the fix will not work.
Reviewed-by: Rafael Antognolli <[email protected]>
|
|
|
|
|
|
|
|
|
| |
We need to be able to emit PIPE_CONTROLs from genX_state_upload.c,
which can't safely include brw_defines.h because it conflicts with
genxml. Move all the PIPE_CONTROL related stuff together into a
separate header.
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
| |
Fixes: 6165fda59b8 ("i965: Program DWord Length in MI_FLUSH_DW")
Cc: <[email protected]>
Signed-off-by: Anuj Phogat <[email protected]>
Reviewed-by: Nanley Chery <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This workaround doesn't fix any of the piglit hangs we've seen
on CNL. But it might be fixing something we haven't tested yet.
V2: Remove the bits enabling Float blend optimization. It is
enabled through CACHE_MODE_SS register.
Update the comment.
Move gen10 if block on top of gen9 if block.
Signed-off-by: Anuj Phogat <[email protected]>
Reviewed-by: Rafael Antognolli <[email protected]>
Reviewed-by: Nanley Chery <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This optimization is enabled for previous generations too.
See Mesa commit c17e214a6b
On CNL this bit has been moved to CACHE_MODE_SS register.
Signed-off-by: Anuj Phogat <[email protected]>
Reviewed-by: Nanley Chery <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are few other (duplicate) workarounds which have similar recommendations:
WaFlushHangWhenNonPipelineStateAndMarkerStalled
WaCSStallBefore3DSamplePattern
WaPipeControlBefore3DStateSamplePattern
WaPipeControlBefore3DStateSamplePattern has some extra recommendations if
driver is using mid batch context restore. Ignoring it for now because We're
not doing mid-batch context restore in Mesa.
This workaround doesn't fix any of the piglit hangs we've seen
on CNL. But it might be fixing something we haven't tested yet.
V2: Use brw_load_register_imm32() to program CACHE_MODE_0.
Get rid of brw_flush_gpu_caches().
V3: Make the workaround helper functions static.
Signed-off-by: Anuj Phogat <[email protected]>
Reviewed-by: Rafael Antognolli <[email protected]>
Reviewed-by :Nanley Chery <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As of 4.11, the kernel isn't bothering to set the subslice hashing mode
on Apollolake, leaving it at the default of 8x8. (It initializes it to
16x4 on most platforms.)
Performance data for GPUTest Triangle on Apollolake at 1024x640:
X-tiled RT:
-----------
8x8 -> 16x4: 2.4325% +/- 0.383683% (n=107)
8x8 -> 8x4: -3.75105% +/- 0.592491% (n=40)
8x8 -> 16x16: 6.17238% +/- 0.67157% (n=30)
Y-tiled RT:
-----------
8x8 -> 16x4: 1.30307% +/- 0.297292% (n=205)
8x8 -> 8x4: -0.769282% +/- 0.729557% (n=35)
8x8 -> 16x16: 3.00254% +/- 0.715503% (n=40)
8x MSAA RT (INTEL_FORCE_MSAA=8):
--------------------------------
8x8 -> 16x4: 1.38889% +/- 0.93729% (n=7)
8x8 -> 8x4: -2.10643% +/- 1.15153% (n=3)
8x8 -> 16x16: 3.87183% +/- 1.08851% (n=5)
Based on this, we choose 16x16 for Apollolake.
Skylake GT2 with X-tiled buffers appears to be a toss-up between 16x4
and 16x16, and with Y-tiled buffers it doesn't seem to really matter.
So we'll leave Skylake alone for now.
The hashing mode doesn't seem to make a measurable impact on more
complex benchmarks.
Acked-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
By default, 3DSTATE_CONSTANT_* Constant Buffer 0 is relative to dynamic
state base address. This makes it unusable for pushing UBOs. I'd like
to be able to use all four push buffers.
There is a bit in the INSTPM register (or CS_DEBUG_MODE2 on Skylake)
which controls whether buffer 0 is relative to dynamic state base
address, or simply a normal pointer. Setting that gives us full
flexibility.
We can't currently write this on Haswell and earlier, and will need
to update the kernel command parser, and then do the whole version
checking song and dance.
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Enables access to OA unit metrics on Gen8+ via INTEL_performance_query.
v2: make use of new parameters coming from gen_device_info (Lionel)
Signed-off-by: Robert Bragg <[email protected]>
Signed-off-by: Lionel Landwerlin <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is woefully undocumented. It's some kind of optimization that
avoids unnecessary render target reads when blending with a floating
point render target, using independent alpha blending modes.
The internal documentation indicates that this bit exists on Cherryview
as well, but the other driver doesn't appear to set it on that platform.
There's also some confusing wording that indicates that it may exist on
Broadwell, but the documentation says it's reserved, so who knows.
I was not able to find any workload that benefited from setting this
bit, but it seems like a good idea to set it nonetheless.
Reviewed-by: Plamena Manolova <[email protected]>
Acked-by: Jason Ekstrand <[email protected]>
|
|
|
|
| |
Reviewed-by: Topi Pohjolainen <[email protected]>
|
|
|
|
| |
Reviewed-by: Topi Pohjolainen <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
These macros are defined in brw_defines.h, which contains a lot of
macros that conflict with autogenerated code from genxml. But we need to
use them (the MOCS macros) in some of that same genxml code.
Moving them to brw_context.h solves that problem and we don't have to
include brw_defines.h.
Signed-off-by: Rafael Antognolli <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Yf/Ys tiling never got used in i965 due to not delivering
the expected performance benefits. So, this patch is deleting
this dead code in favor of adding it later in ISL when we
actually find it useful. ISL can then share this code between
vulkan and GL.
Signed-off-by: Anuj Phogat <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
| |
compiler/brw_vec4_gs_visitor.cpp:744:39: error:
‘GEN7_MAX_GS_OUTPUT_VERTEX_SIZE_BYTES’ was not declared in this scope
output_vertex_size_bytes <= GEN7_MAX_GS_OUTPUT_VERTEX_SIZE_BYTES);
Fixes: d0d4a5f43b4 ("i965: split EU defines to brw_eu_defines.h")
Reviewed-by: Emil Velikov <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Split out the EU defines from the 'generic' ones, as the former are more
compiler oriented.
With a later commit we'll move brw_eu_defines.h alongside the compiler
infra to src/intel/. Pulling all the defines in there seems overzealous.
Some defines are used by both i965 and the i965 compiler. Those are
moved to brw_eu_defines.h, and annotated accordingly. The i965 users
were updated to have the extre include to indicate that.
With future work we might provide a better, split but for now this seems
reasonable.
Cc: Kenneth Graunke <[email protected]>
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
| |
Cc: [email protected]
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
The follow three groups are not used by neither the DRI module nor the
compiler.
BRW_POLYGON_*_FACING
BRW_POLYGON_FACING_*
BRW_STATELESS_BUFFER_*
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We never actually used the resource streamer in any shipping build
of Mesa. We have no plans to do so in the future. We looked into
using it in Vulkan, and concluded that it was unusable. We're not
the only ones to arrive at the conclusion that it's not worth using.
So, drop the last vestiges of resource streamer support and move on.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
Reviewed-by: Emil Velikov <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
One less set of enums. Dropped the #defines from brw_defines.h and ran:
$ for file in *.cpp *.c *.h; do sed -i \
-e 's/BRW_SURFACEFORMAT_/ISL_FORMAT_/g' \
-e 's/ISL_FORMAT_ASTC_[A-Zxs0-9_]*/\U&/g' $file; \
done
Signed-off-by: Kenneth Graunke <[email protected]>
Acked-by: Jason Ekstrand <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
| |
This is a relic of when we wired up meta to be able to use RECTLIST
primitives. It's no longer needed.
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
The opcodes are not specific for conversions to/from float since we need
the same for conversions to/from other 32-bit types. Rename the opcodes
accordingly and change the asserts to check the size of the types involved
instead.
v2:
- Rename to VEC4_OPCODE_TO_DOUBLE and VEC4_OPCODE_FROM_DOUBLE (Matt)
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These opcodes will set the low/high 32-bit in each 64-bit data element
using Align1 mode. We will use this to implement packDouble2x32.
We use Align1 mode because in order to implement this in Align16 mode
we would need to use 32-bit logical swizzles (XZ for low, YW for high),
but the IR works in terms of 64-bit logical swizzles for DF operands
all the way up to codegen.
v2:
- use suboffset() instead of get_element_ud()
- no need to set the width on the dst
Reviewed-by: Ian Romanick <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These opcodes will pick the low/high 32-bit in each 64-bit data element
using Align1 mode. We will use this, for example, to do things like
unpackDouble2x32.
We use Align1 mode because in order to implement this in Align16 mode
we would need to use 32-bit logical swizzles (XZ for low, YW for high),
but the IR works in terms of 64-bit logical swizzles for DF operands
all the way up to codegen.
v2:
- use suboffset() instead of get_element_ud()
- no need to set the width on the dst
Reviewed-by: Ian Romanick <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These need to be emitted as align1 MOV's, since they need to have a
stride of 2 on the float register (whether src or dest) so that data
from another thread doesn't cross the middle of a SIMD8 register.
v2 (Iago):
- The float-to-double needs to align 32-bit data to 64-bit before doing the
conversion. This was doable in align16 when we tried to use an execsize
of 4, but with an execsize of 8 we would need another align1 opcode to do
that (since we need data to cross the middle of a SIMD register). Just
making the opcode handle this internally seems more practical that adding
another opcode just for this purpose and having the caller know about this
before converting.
- The double-to-float conversion produces 32-bit elements aligned to 64-bit
so we make the opcode re-pack the result to 32-bit and fit in one register,
as expected by SIMD4x2 operation. This still requires that callers reserve
two registers for the float data destination because we need to produce
64-bit aligned data first, and repack it later on the same destination
register, but it saves the need for a re-pack opcode only to achieve this
making the operation complete in a single opcode. Hopefully that is worth
the weirdness of the double register allocation...
Signed-off-by: Connor Abbott <[email protected]>
Signed-off-by: Iago Toral Quiroga <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
| |
Not used anymore. It was just a scalar MOV.
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
| |
We'll need roughly the same logic in other places and it would be
annoying to duplicate it. Instead factor it out into a function-like
macro that takes the number of dwords per block (which will prove more
convenient than taking the same value in owords or some other unit).
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
git grep -l comparitor | xargs sed -i 's/comparitor/comparator/g'
Just happened to notice this in a patch that was sent and included one
of the tokens in question.
Signed-off-by: Ilia Mirkin <[email protected]>
Acked-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Lionel Landwerlin <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, we had an OFFSET_VALUE source for logical texture instructions
that was intended to mean exactly what it says, "offset". In reality, we
only fully used it for tg4 offsets. We used offset_value.file == IMM to
mean, "you have a constant offset, go look in instr->offset" and didn't
actually use the contents of the register at all in that case except for
in nir_emit_texture where we used it as a temporary before we copy it into
instr->offset.
This commit renames OFFSET_VALUE to TG4_OFFSET and restricts its usage to
indirect tg4 offsets only. The nir_emit_texture code is refactored so that
we explicitly build a header_bits value which is placed in instr->offset
and the constant offset values (both for tg4 and regular texture
operations) are used to construct header_bits and don't go through the
offset source at all. Finally, we stop passing offset_value in to
lower_sampler_logical_send_gen5 because we can't do indirect offsets until
gen7 anyway.
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Gen6-7.5 specify the user clip distance enable bitmask in 3DSTATE_CLIP.
Gen8+ normally uses the new internal signalling mechanism to select the
one specified in the last enabled shader stage (3DSTATE_VS, DS, or GS).
This is a pretty good fit for Vulkan, or even newer GL, where the
bitmask comes entirely from the shader. But with glClipPlane(),
this is dynamic state, and we have to listen to _NEW_TRASNFORM.
Clip plane enables are the only reason the VS/DS/GS atoms need to
listen to _NEW_TRANSFORM. 3DSTATE_CLIP already has to listen to it
in order to support ARB_clip_control settings.
Setting the "Use the 3DSTATE_CLIP bitmask" force enable bit allows
us to drop _NEW_TRANSFORM from all the shader stage atoms, so we can
re-emit them less often.
Improves performance of OglBatch7 (version 6) by 2.70773% +/- 0.491257%
(n = 38) at 1024x768 on Cherryview.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
| |
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
| |
Most likely we had only ever used this macro on bitfields of less than
31 bits -- That's going to change shortly.
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
| |
More than half of the stuff in intel_reg.h had nothing whatsoever to do
with registers and really belongs in brw_defines.h anyway.
Signed-off-by: Jason Ekstrand <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
| |
We no longer use this message. As far as I can tell, it's fairly
useless - the equivalent information is provided in the payload.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
brw_wm_barycentric_interp_mode is wordy, brw_barycentric_mode is less
typing and suffers from fewer line wrapping problems.
The enum values themselves don't really benefit from "WM" in the name,
either. Put "BARYCENTRIC" first instead of at the end and drop "WM".
Generated by:
for file in *.c *.cpp *.h; do sed -i \
-e 's/brw_wm_barycentric_interp_mode/brw_barycentric_mode/g' \
-e 's/BRW_WM_\([A-Z_]*\)_BARYCENTRIC/BRW_BARYCENTRIC_\1/g' \
-e 's/BRW_WM_BARYCENTRIC_INTERP_MODE_COUNT/BRW_BARYCENTRIC_MODE_COUNT/g' \
$file;
done
with a few whitespace changes.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
v2 (Matt):
- Take a DF source argument for the DIM instruction emission
in the visitors.
- Indentation.
Signed-off-by: Samuel Iglesias Gonsálvez <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
| |
Signed-off-by: Eric Engestrom <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The cross thread constant support appears on Haswell. It allows us to
upload a set of uniform data for all threads without duplicating it
per thread.
We also support per-thread data which allows us to store a per-thread
ID in one of the uniforms that can be used to calculate the
gl_LocalInvocationIndex and gl_LocalInvocationID variables.
v4:
* Support the old local ID push constant layout as well (Jason)
Cc: "12.0" <[email protected]>
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes the whole LOAD_PAYLOAD munging unnecessary which simplifies
the code and will allow the optimization to succeed in more cases
independent of whether the LOAD_PAYLOAD instruction can be found or
not.
The following patch is squashed in:
SQUASH: i965/fs: Add basic dataflow check to opt_sampler_eot().
The sampler EOT optimization pass naively assumes that the texturing
instruction provides all the data used by the FB write just because
they're standing next to each other. The least we should be checking
is whether the source and destination regions of the FB write and
texturing instructions match. Without this the previous seemingly
harmless patch would have caused opt_sampler_eot() to misoptimize a
shader from dota-2 causing DCE to eliminate all of its 78 instructions
except for the final sampler EOT message (!).
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Alternatively we could have extended the current semantics to 32-wide
mode by changing brw_broadcast() to emit multiple indexed MOV
instructions in the generator copying the selected value to all
destination registers, but it seemed rather silly to waste EU cycles
unnecessarily copying the exact same value 32 times in the GRF.
The vstride change in the Align16 path is required to avoid assertions
in validate_reg() since the change causes the execution size of the
MOV and SEL instructions to be equal to the source region width.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
| |
It's just a byte MOV with strided source.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
| |
These can be easily represented in the IR as a MOV instruction with
strided source so they seem rather redundant.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Seems like this texturing opcode was missing its logical counterpart
which would prevent it from taking advantage of the SIMD lowering
infrastructure, define it and plumb it through the back-end. At some
point we'll likely want to emit a single SAMPLEINFO message shared
among all channels irrespective of this change, but for the moment
this should be enough to get the intrinsic working in SIMD32 mode.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
| |
For consistency with the Gen7 variant. I'm not doing the same to the
uniform pull constant message at this point because the non-GEN7 one
is still overloaded to be either an expression-like logical
instruction or a Gen4-specific physical send message.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
| |
This will allow the SIMD lowering pass to split 32-wide varying pull
constant loads (not natively supported by the hardware) into 16-wide
instructions.
Reviewed-by: Jason Ekstrand <[email protected]>
|