| Commit message | Author | Age | Files | Lines |
| |
Background: Prior to Skylake, and since Ivybridge, Intel hardware has had the
ability to use an MCS (Multisample Control Surface) as auxiliary data in
"compression" operations on the surface. This reduces memory bandwidth. This
hardware was used either for MSAA compression or for fast clear operations. On
Gen8, a similar mechanism exists to allow the HiZ buffer to be sampled from, and
therefore this feature is sometimes referred to more generally as "AUX buffers".
Skylake adds the ability to have the display engine directly source compressed
surfaces on top of the ability to sample from them. Inference dictates that
enabling this display feature adds a restriction on the formats which can
actually be compressed. This is backed up by a blurb in the AUX_CCS_D section
of RENDER_SURFACE_STATE: "In addition, if the surface is bound to the
sampling engine, Surface Format must be supported for Render Target Compression
for surfaces bound to the sampling engine." The current set of supported surface
formats seems to be a subset of what previous gens allowed (see the next patch).
Also, if I had to guess, I would guess that future gens add support for more
surface formats. To make handling this a bit easier to read, and more
future-proof, the support for this is moved into the surface formats table.

Along with the modifications to the table, a helper function is also provided to
determine if a surface format is CCS_E compatible. Because fast clears are
currently disabled on SKL, we can plumb the helper all the way through here and
not actually have anything break.
v2:
- rename ccs to ccs_e; Requested-by: Chad
- rename lossless_compression to lossless_compression Requested-by: Chad
- change meaning of brw_losslessly_compressible_format Requested-by: Chad
- related changes to the code to reflect this.
- remove excess ccs (Chad)
v3:
- Commit message changes (Topi)
- Const some things which could be const (Topi)
Requested-by: Chad Versace <[email protected]>
Requested-by: Neil Roberts <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Chad Versace <[email protected]>
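A minimal sketch of the idea, with illustrative names rather than the exact
in-tree table layout: each surface-format row gains a column recording the
first gen that can losslessly compress the format, and a small helper
consults it.

   #include <stdbool.h>
   #include <stdint.h>

   /* Sketch only: one extra column per format row; gens stored times ten
    * here (90 == Skylake), mirroring the style of the other columns. */
   struct surface_format_info {
      int ccs_e;   /* first gen (x10) supporting CCS_E, 0 = never */
      /* ... existing sampling/rendering/filtering columns ... */
   };

   extern const struct surface_format_info surface_formats[];

   /* Is this format losslessly compressible on the current hardware? */
   static bool
   format_supports_ccs_e(int gen_x10, uint32_t brw_format)
   {
      const struct surface_format_info *f = &surface_formats[brw_format];
      return f->ccs_e != 0 && gen_x10 >= f->ccs_e;
   }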
| |
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Tapani Pälli <[email protected]>
| |
A while back, we moved to directly emitting the Gen7+ state when
constructing the binding tables. These flags are only used on
Gen4-6, which emit all the binding table pointers at once.
We gain nothing by having separate flags, so combine them.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Topi Pohjolainen <[email protected]>
| |
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Iago Toral Quiroga <[email protected]>
| |
There is only one function that can be called, which is well known at
compilation time.
The abstraction used here seems unnecessary, so let's use a direct call
to brw_stage_prog_data_free() when appropriate, cutting down the size of
struct brw_cache.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
| |
This is more efficient, as it means we no longer have to upload the same
set of ABO surfaces to all stages in the program.
This also fixes a bug where, since commit c0cd5b, var->data.binding was
being used as a replacement for the atomic buffer index; the two don't
have to be the same value, they just happened to end up the same when
binding is 0.
Reviewed-by: Francisco Jerez <[email protected]>
Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]>
Cc: Ilia Mirkin <[email protected]>
Cc: Alejandro Piñeiro <[email protected]>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=90175
| |
v2: externalize pred_ctrl_align16 from brw_disasm.c instead of adding
a copy in brw_vec4.c, as suggested by Matt Turner
Reviewed-by: Matt Turner <[email protected]>
| |
At this point, the compiler API has been substantially simplified. In the
spirit of Kristian's work toward a separate compiler library, this commit
makes a single header file that contains, more or less, the entire
compiler API.
There's still a bit of cleanup to do, particularly in the area of geometry
shaders. However, this gets us much closer to having a separate compiler.
Reviewed-by: Topi Pohjolainen <[email protected]>
| |
Reviewed-by: Kristian Høgsberg <[email protected]>
| |
For scalar VS, I'll need this in brw_fs.cpp as well. It seems silly to
redeclare it in three places.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
| |
Both the vec4 and scalar VS backends had virtually identical URB entry
size and read length calculations. We can move those up a level to
backend-agnostic code and reuse it for both.
Unfortunately, the backends need to know nr_attributes to compute
first_non_payload_grf, so I had to store that in prog_data. We could
use urb_read_length, but that's nr_attributes rounded up to a multiple
of two, so doing so would waste a register in some cases.
There's more code to be removed in the vec4 backend, but that will
come in a follow-on patch.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
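A sketch of the shared helper, under the assumption (stated above) that the
URB delivers attributes in vec4 pairs; names are illustrative:

   /* Read length is nr_attributes rounded up to a multiple of two, in
    * units of pairs.  Keeping nr_attributes itself in prog_data lets the
    * backends compute first_non_payload_grf without the rounding waste. */
   static inline unsigned
   vs_urb_read_length(unsigned nr_attributes)
   {
      return (nr_attributes + 1) / 2;
   }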
| |
This function computes the next power of two, but at least 1024. We can
do that by bitwise or'ing in 1023 and calling util_next_power_of_two().
We use brw_get_scratch_size() from the compiler, so we need it out of
brw_program.c. We could move it to brw_shader.cpp, but let's make it a
small inline function instead.
Reviewed-by: Topi Pohjolainen <[email protected]>
Signed-off-by: Kristian Høgsberg Kristensen <[email protected]>
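The resulting inline is essentially a one-liner; a sketch, with
util_next_power_of_two() being the existing utility helper:

   #include <stdint.h>
   #include "util/u_math.h"   /* util_next_power_of_two(); path approximate */

   /* OR-ing in 1023 forces the result up to at least 1024 while leaving
    * larger sizes to round up to their own next power of two. */
   static inline uint32_t
   brw_get_scratch_size(int size)
   {
      return util_next_power_of_two(size | 1023);
   }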
| |
The initial motivation for this patch was to avoid calling
brw_cs_prog_local_id_payload_dwords() in gen7_cs_state.c from the
compiler. This commit ends up refactoring things a bit more so as to
split out the logic to build the local id payload to brw_fs.cpp. This
moves the payload building closer to the compiler code that uses the
payload layout and makes it available to other users of the compiler.
Reviewed-by: Topi Pohjolainen <[email protected]>
Signed-off-by: Kristian Høgsberg Kristensen <[email protected]>
| |
Reviewed-by: Matt Turner <[email protected]>
Signed-off-by: Ilia Mirkin <[email protected]>
| |
In theory we can't break this assertion since the compiler frontend checks
that we don't exceed any of the individual limits, but it does not hurt to
be extra safe.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
These share space with UBO surfaces, but we need to make sure we
allocate enough space for both sets (12 of each).
Reviewed-by: Kenneth Graunke <[email protected]>
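In other words, the shared binding-table block has to be sized for both
sets; roughly (macro names hypothetical):

   #define MAX_UBOS_PER_STAGE   12
   #define MAX_SSBOS_PER_STAGE  12

   /* UBOs and SSBOs share one run of binding-table entries, so the block
    * is sized for both sets rather than just the UBOs. */
   #define UBO_SSBO_SURFACE_COUNT (MAX_UBOS_PER_STAGE + MAX_SSBOS_PER_STAGE)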
| |
Instead of using hard-coded values.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
Instead of using hard-coded values.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]>
| |
They are no longer used.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
They haven't been used since 1bba29ed403e735ba0bf04ed8aa2e571884fcaaf so
there's no good reason to keep them around.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
This will only be set up when the prog_data uses_num_work_groups
boolean is set.
At this point nothing will set uses_num_work_groups, but soon code
will set it when emitting code for the intrinsic that loads
gl_NumWorkGroups.
We can't emit this surface information earlier at the start of the
DispatchCompute* call because we may not have generated the program
yet. Until we generate the program, we don't know if the
gl_NumWorkGroups variable is accessed.
We also can't emit the surface as part of the brw_cs_state atom,
because we might not need the surface if gl_NumWorkGroups is not used
by the program.
Lastly, we cannot emit the surface later (after state upload) in the
DispatchCompute* call, because it needs to be run before the
brw_cs_state atom is emitted, since it changes the surface state.
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
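A rough sketch of the guard being described, with hypothetical function
names; the point is that it runs in the dispatch path after program
generation and before the brw_cs_state atom:

   /* Called from the DispatchCompute* path, after brw_upload_programs()
    * and before state upload emits brw_cs_state. */
   static void
   emit_num_work_groups_surface(struct brw_context *brw)
   {
      const struct brw_cs_prog_data *prog_data = brw->cs.prog_data;

      if (!prog_data->uses_num_work_groups)
         return;   /* the program never reads gl_NumWorkGroups */

      /* ... emit a buffer surface pointing at the three work-group
       * counts and mark the CS surface state dirty so brw_cs_state
       * picks up the change ... */
   }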
| |
If glDispatchComputeIndirect is used, then the value for this variable
must be read from the indirect BO.
To allow the same generated code to support both glDispatchComputeIndirect
and glDispatchCompute, we will also set up a BO for the number of work
groups using the intel_upload_data mechanism. This will only be required
if the gl_NumWorkGroups variable is accessed.
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
| |
We will need this in an atom to set up a surface to read the
gl_NumWorkGroups values from.
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
| |
Broadwell's 3DSTATE_GS contains new "Static Output" and "Static Vertex
Count" fields, which control a new optimization. Normally, geometry
shaders can output arbitrary numbers of vertices, which means that
resource allocation has to be done on the fly. However, if the number
of vertices is statically known, the hardware can pre-allocate resources
up front, which is more efficient.
Thanks to the new NIR GS intrinsics, this is easy. We just call the
function introduced in the previous commit to get the vertex count.
If it obtains a count, we stop emitting the extra 32-bit "Vertex Count"
field in the VUE, and instead fill out the 3DSTATE_GS fields.
Improves performance of Gl32GSCloth by 5.16347% +/- 0.12611% (n=91)
on my Lenovo X250 laptop (Broadwell GT2) at 1024x768.
shader-db statistics for geometry shaders only:
total instructions in shared programs: 3227 -> 3207 (-0.62%)
instructions in affected programs: 242 -> 222 (-8.26%)
helped: 10
v2: Don't break non-NIR paths (just skip this optimization).
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Jordan Justen <[email protected]>
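A sketch of how the count flows, assuming nir_gs_count_vertices() is the
helper from the previous commit; the state-field programming is
paraphrased:

   /* Compile time: -1 means the vertex count couldn't be proven static. */
   prog_data->static_vertex_count = nir_gs_count_vertices(nir);

   /* State emission time: */
   if (gs_prog_data->static_vertex_count != -1) {
      /* Program the 3DSTATE_GS "Static Output" / "Static Vertex Count"
       * fields and skip the extra 32-bit Vertex Count VUE slot. */
   } else {
      /* Dynamic allocation as before (the non-NIR paths stay here). */
   }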
| |
The old code was disastrously complex - spread across multiple atoms
which may not even run, inspecting the dirty bits to try and decide
whether it was necessary to do checks...storing VS information in
brw_context...extra flagging...
This code tripped me and Carl up very badly when working on the
shader cache code. It's very fragile and hard to maintain.
Now that geometry shaders only depend on their inputs and don't have
to worry about the VS VUE map, we can dramatically simplify this:
just compute the VUE map coming out of the geometry shader stage
in brw_upload_programs. If it changes, flag it. Done.
v2: Also check vue_map.separable.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
| |
Previously, our VUE map code always assigned slots to varyings
sequentially, in one contiguous block.
This was a bad fit for separate shaders - the GS input layout depended
on the VS output layout, so if we swapped out vertex shaders, we might
have to recompile the GS on the fly - which rather defeats the point of
using separate shader objects. (Tessellation would suffer from this
as well - we could have to recompile the HS, DS, and GS.)
Instead, this patch makes the VUE map for separate shaders use a fixed
layout, based on the input/output variable's location field. (This is
either specified by layout(location = ...) or assigned by the linker.)
Corresponding inputs/outputs will match up by location; if there's a
mismatch, we're allowed to have undefined behavior.
This may be less efficient - depending what locations were chosen, we
may have empty padding slots in the VUE. But applications presumably
use small consecutive integers for locations, so it hopefully won't be
much worse in practice.
3% of Dota 2 Reborn shaders are hurt, but only by 2 instructions.
This seems like a small price to pay for avoiding recompiles.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
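A sketch of the two assignment strategies (vue_map/slots_valid as in the
existing VUE map code; first_generic_slot and the loop structure are
illustrative):

   /* Non-separable: pack written varyings into consecutive slots, so the
    * map depends on exactly which varyings the other stage produces. */
   for (int varying = 0; varying < VARYING_SLOT_MAX; varying++) {
      if (slots_valid & BITFIELD64_BIT(varying))
         vue_map->varying_to_slot[varying] = slot++;
   }

   /* Separable: a fixed slot per location, independent of the shader on
    * the other side, at the cost of possible holes in the VUE. */
   for (int varying = VARYING_SLOT_VAR0; varying < VARYING_SLOT_MAX; varying++) {
      if (slots_valid & BITFIELD64_BIT(varying))
         vue_map->varying_to_slot[varying] =
            first_generic_slot + (varying - VARYING_SLOT_VAR0);
   }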
| |
Since these are a special kind of UBO, we emit them together, reusing the
same infrastructure. However, we use a RAW surface so we can reuse the
existing untyped read/write/atomic messages, which include a pixel mask
header that we need to set to obtain correct behavior with helper
invocations of the fragment shader.
Reviewed-by: Jordan Justen <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
| |
Enable barrier in MEDIA_INTERFACE_DESCRIPTOR if the program uses the
barrier() GLSL function.
On Ivy Bridge and Haswell, this allows the piglit test
tests/spec/arb_compute_shader/execution/simple-barrier-atomics.shader_test
to pass. On gen8, this enables a similar test with a local group size
of 896 to pass.
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
| |
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
| |
We only support geometry shaders in core profiles, where gl_ClipVertex
doesn't exist. Presumably the even older behavior of clipping to
gl_Position isn't supported either. In fact, GLSL 1.50 page 76 claims:
"The shader must also set all values in gl_ClipDistance that have been
enabled via the OpenGL API, or results are undefined."
So we don't need to handle legacy clipping in geometry shaders. I think
Paul added this back when we were considering supporting the old
GL_ARB_geometry_shader4 extension.
This removes a non-orthogonal state dependency on GS compilation.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
| |
brw_upload_cs_push_constants was based on gen6_upload_push_constants.
v2:
* Add FINISHME comments about more efficient ways to push uniforms
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
| |
load/store.
v2: Store early fragment test mode in brw_wm_prog_data instead of
getting it from core mesa data structures (Ken).
Reviewed-by: Kenneth Graunke <[email protected]>
| |
v2: Add CS support. Move the image_params array back to
brw_stage_prog_data.
Reviewed-by: Topi Pohjolainen <[email protected]>
Acked-by: Jason Ekstrand <[email protected]>
| |
This will be used to pass image meta-data to the shader when we cannot
use typed surface reads and writes. All entries except surface_idx
and size are otherwise unused and will get eliminated by the uniform
packing pass. size will be used for bounds checking with some image
formats and will be useful for ARB_shader_image_size too. surface_idx
is always used.
v2: Add CS support. Move the image_params array back to
brw_stage_prog_data.
v3: Improve documentation.
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
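The meta-data block being described has roughly this shape (field list and
array sizes recalled from the tree, so treat as a sketch):

   #include <stdint.h>

   struct brw_image_param {
      uint32_t surface_idx;   /* always used */
      uint32_t offset[2];     /* intra-tile offset applied to coordinates */
      uint32_t size[3];       /* used for bounds checking / image size */
      uint32_t stride[4];     /* row/image strides in the underlying BO */
      uint32_t tiling[3];     /* tiling parameters for address calculation */
      uint32_t swizzling[2];  /* bit-6 swizzling parameters */
   };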
| |
v2: Add SKL support.
Reviewed-by: Jason Ekstrand <[email protected]>
| |
Fixes the following compiler warning:
brw_cs.cpp:386:27: warning: comparison between signed and unsigned
integer expressions [-Wsign-compare]
Signed-off-by: Anuj Phogat <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
| |
This patch implements the binding table enable command, which is also
used to allocate a binding table pool into which hardware-generated
binding table entries are flushed. Each binding table offset in the
binding table pool is unique per shader stage enabled within a batch.
Also insert the required brw_tracked_state objects to enable
hw-generated binding tables in normal render path.
v2: - Use MOCS in binding table pool alloc for GEN8
- Fix spurious offset when allocating binding table pool entry
and start from zero instead.
v3: - Include GEN8 fix for spurious offset above.
v4: - Fixup wrong packet length in enable/disable hw-binding table
for GEN8 (Ville).
- Don't invoke the HW-binding table disable command when we don't
have the resource streamer (Chris).
v5: - Reorder the state cache invalidate flush so it happens in-between
enabling hw-generated binding tables and the previous sw-binding
table GPU state (Chris).
v6: - Do the same fix in v5 for gen7_disable_hw_binding_tables().
- Adhere to coding guidelines and make comments more informative.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Kenneth Graunke <[email protected]>
Signed-off-by: Abdiel Janulgue <[email protected]>
| |
Check first if the hardware and kernel support the resource streamer. If
so, tell the kernel to set the resource streamer enable bit on
MI_BATCHBUFFER_START by specifying the I915_EXEC_RESOURCE_STREAMER
execbuffer flag.
v2: - Use new I915_PARAM_HAS_RESOURCE_STREAMER ioctl to check if kernel
supports RS (Ken).
- Add brw_device_info::has_resource_streamer and toggle it for
Haswell, Broadwell, Cherryview, Skylake, and Broxton (Ken).
v3: - Update I915_PARAM_HAS_RESOURCE_STREAMER to match updated kernel.
v4: - Always inspect the getparam.value (Chris Wilson).
v5: - Fold redundant devinfo->has_resource_streamer check in context create
into init screen.
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Kenneth Graunke <[email protected]>
Signed-off-by: Abdiel Janulgue <[email protected]>
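A sketch of the detection path, using the standard i915 getparam ioctl
(error handling trimmed):

   #include <stdbool.h>
   #include <xf86drm.h>
   #include <i915_drm.h>

   static bool
   hw_has_resource_streamer(int fd)
   {
      int value = 0;
      struct drm_i915_getparam gp = {
         .param = I915_PARAM_HAS_RESOURCE_STREAMER,
         .value = &value,
      };

      /* Always inspect the returned value, not just the ioctl result
       * (the v4 note above). */
      if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) != 0)
         return false;
      return value != 0;
   }

   /* At submission time the flag is then OR'd into the execbuffer flags:
    *    flags |= I915_EXEC_RESOURCE_STREAMER;
    */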
| |
Previously, OUT_BATCH was just a macro around an inline function which
does
   brw->batch.map[brw->batch.used++] = dword;
When making consecutive calls to intel_batchbuffer_emit_dword() the
compiler isn't able to recognize that we're writing consecutive memory
locations or that it doesn't need to write batch.used back to memory
each time.
We can avoid both of these problems by making a local pointer to the
next location in the batch in BEGIN_BATCH().
Cuts 18k from the .text size.
text data bss dec hex filename
4946956 195152 26192 5168300 4edcac i965_dri.so before
4928956 195152 26192 5150300 4e965c i965_dri.so after
This series (including commit c0433948) improves performance of Synmark
OglBatch7 by 8.01389% +/- 0.63922% (n=83) on Ivybridge.
Reviewed-by: Chris Wilson <[email protected]>
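A condensed sketch of the before/after (hypothetical type and macro names;
the real macros also handle debug bookkeeping and out-of-space checks):

   #include <stdint.h>

   struct batch {
      uint32_t *map;
      unsigned used;
   };

   /* Before: every dword re-loads and re-stores batch->used. */
   static inline void
   emit_dword(struct batch *batch, uint32_t dword)
   {
      batch->map[batch->used++] = dword;
   }

   /* After: BEGIN_BATCH caches a local cursor, consecutive OUT_BATCH
    * calls become plain consecutive stores, and ADVANCE_BATCH writes
    * the cursor back once. */
   #define BEGIN_BATCH(batch, n)  uint32_t *out = (batch)->map + (batch)->used
   #define OUT_BATCH(dword)       (*out++ = (dword))
   #define ADVANCE_BATCH(batch)   ((batch)->used = out - (batch)->map)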
| |
It's only used inside #ifdef DEBUG. Cuts ~1.7k of .text, and more
importantly prevents a larger code size regression in the next commit
when the .used field is replaced and calculated on demand.
text data bss dec hex filename
4945468 195152 26192 5166812 4ed6dc i965_dri.so before
4943740 195152 26192 5165084 4ed01c i965_dri.so after
And surround the emit and total fields with #ifdef DEBUG to prevent
such mistakes from happening again.
Reviewed-by: Iago Toral Quiroga <[email protected]>
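The guard amounts to something like this (placement and comments
illustrative):

   struct intel_batchbuffer {
      /* ... map, used, reserved space, ... */
   #ifdef DEBUG
      /* Only consulted by debug annotation code; hiding these in release
       * builds turns accidental uses into compile errors. */
      uint16_t emit;    /* batch offset recorded by the last BEGIN_BATCH */
      uint16_t total;   /* dword count declared by that BEGIN_BATCH */
   #endif
   };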
| |
With the exception of gen8, the sole user of the workaround BO is the
code that emits pipe controls. Move it out of the purview of the
batchbuffer and into the pipe control code.
Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Martin Peres <[email protected]>
| |
On Gen9+ there is a new bit in 3DSTATE_PS_EXTRA that must be set if
the shader sends a message to the pixel interpolator. This fixes the
interpolateAt* tests on SKL, apart from interpolateatsample-nonconst,
but that is not implemented anywhere so it's not a regression.
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
Cc: "10.6 10.5" <[email protected]>
| |
The thread counts and URB information are all speculative numbers that were
based on some CHV numbers at the time.
v2:
Originally this patch had PCI IDs. I've moved that to a new patch at the end of
the series.
Remove is_cherryview hack.
Add PCI ids. These match the ones defined in the kernel. The only one tested by
us is 0x0a84.
Capitalize the hex string (Mark)
Signed-off-by: Ben Widawsky <[email protected]>
Tested-by: "Lecluse, Philippe" <[email protected]>
Reviewed-by: Mark Janes <[email protected]>
| |
Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
| |
Start trimming the fat from intel_batchbuffer.c, first by moving the set
of routines for emitting PIPE_CONTROLs (along with the lore concerning
hardware workarounds) to a separate brw_pipe_control.c.
Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
| |
Previously, each shader took 3 shader time indices which were potentially
at arbitrary points in the shader time buffer. Now, each shader gets a
single index which refers to 3 consecutive locations in the buffer. This
simplifies some of the logic at the cost of having a magic 3 in a few places.
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
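Concretely, the addressing becomes (names, stride, and slot meanings
illustrative):

   #define SHADER_TIME_STRIDE 64   /* e.g. one slot per cacheline */

   /* One index per shader; its three related counters (e.g. accumulated
    * time, "written" count, "reset" count) live at consecutive slots. */
   unsigned base    = shader_time_index * 3 * SHADER_TIME_STRIDE;
   unsigned time    = base + 0 * SHADER_TIME_STRIDE;
   unsigned written = base + 1 * SHADER_TIME_STRIDE;
   unsigned reset   = base + 2 * SHADER_TIME_STRIDE;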
| |
This creates the options at screen creation time and then we just copy
them into the context at context creation time. We also move is_scalar to
the brw_compiler structure.
We also end up manually setting some values that the core would have set
by default for us. Fortunately, there are only two non-zero shader compiler
option defaults that we aren't overriding anyway, so this isn't a big deal.
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
| |
This function will be utilised in later patches.
V2: Make both pointers constants (Topi)
Signed-off-by: Anuj Phogat <[email protected]>
Reviewed-by: Topi Pohjolainen <[email protected]>
| |
We used to store the GS dispatch mode in brw_gs_prog_data while
separately storing the VS dispatch mode in brw_vue_prog_data::simd8.
This patch introduces an enum to represent all possible dispatch modes,
and stores it in brw_vue_prog_data::dispatch_mode, unifying the two.
Based on a suggestion by Matt Turner.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
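The unified enum might look roughly like this (names and values recalled
from that era of the tree, so treat as a sketch):

   enum shader_dispatch_mode {
      DISPATCH_MODE_4X1_SINGLE = 0,        /* classic vec4 VS */
      DISPATCH_MODE_4X2_DUAL_INSTANCE = 1, /* vec4 GS, dual instance */
      DISPATCH_MODE_4X2_DUAL_OBJECT = 2,   /* vec4 GS, dual object */
      DISPATCH_MODE_SIMD8 = 3,             /* scalar SIMD8 */
   };

   /* brw_vue_prog_data then stores one of these in dispatch_mode instead
    * of the GS-specific field plus the VS "simd8" bool. */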