| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
| |
More than half of the stuff in intel_reg.h had nothing whatsoever to do
with registers and really belongs in brw_defines.h anyway.
Signed-off-by: Jason Ekstrand <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Eric Engestrom <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We currently use CL_INVOCATION_COUNT for the GL_PRIMITIVES_GENERATED
query, which involves passing all primitives to the clipper. When
rasterizer discard is enabled, we program the clipper in REJECT_ALL
mode, rather than using the SOL stage's "Rendering Disable" feature.
See commit f09b91f78247409f54c975f56cb10d5f350fe64e for an explanation
of why we implement GL_PRIMITIVES_GENERATED this way.
Apparently the SOL stage's "Rendering Disable" feature is a lot faster
than having the clipper reject all primitives. It's safe to use when
no GL_PRIMITIVES_GENERATED query is active, as we don't care about
CL_INVOCATION_COUNT incrementing.
This patch makes us use SO_RENDERING_DISABLE when no query is active,
but continues falling back to the clipper in REJECT_ALL mode when the
queries are enabled. It brings back the perf_debug for the clipper
case (which I removed in commit 1f9445ff57b, thinking it wasn't useful).
Improves performance in Gl32GSCloth by 84.8303% +/- 2.07132% (n = 10)
on my Broadwell GT2 laptop.
Cc: [email protected]
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
v2:
* Declare loop index variable at loop site (idr)
* Make arrays of MI_MATH instructions 'static const' (idr)
* Remove commented debug code (idr)
* Updated comment in set_query_availability (Ken)
* Replace switch with if/else in hsw_result_to_gpr0 (Ken)
* Only divide GL_FRAGMENT_SHADER_INVOCATIONS_ARB by 4 on
hsw and gen8 (Ken)
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This matches the byte based offset of brw_load_register_mem*.
The function is also moved into intel_batchbuffer.c like
brw_load_register_mem*.
Signed-off-by: Jordan Justen <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
We basically just need to uncomment Ben's code.
v2: Fix obvious bugs caught by Ben.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the number needed to be divided by 4 to get the proper results. Now
the hardware does the right thing. Through experimentation it seems Braswell
(CHV) does also need the division by 4.
Fixes piglit test:
arb_pipeline_statistics_query-frag
Signed-off-by: Ben Widawsky <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In gen6 we need to compute the primitive count in the generated GS program.
The current implementation only counts full primitives, that is, if the
output primitive type is a triangle strip, it won't count individual
triangles in the strip, only complete strips.
If we want to count basic primitives instead we have two options: rework
the assembly code we generate for strip primitives or simply use
CL_INVOCATION_COUNT to resolve the query and let the hardware do that work
for us. This patch implements the latter approach.
Fixes the following piglit test:
bin/arb_pipeline_statistics_query-geom -auto
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=89210
Tested-by: Mark Janes <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
NOTE: The implementation was initially one patch, this. All the history is kept
here, even though all the core mesa changes were moved to the parent of this
patch.
This patch implements ARB_pipeline_statistics_query. This addition to GL does
not add a new API. Instead, it adds new tokens to the existing query APIs. The
work to hook up the new tokens is trivial due to it's similarity to the previous
work done for the query APIs. I've implemented all the new tokens to some
degree, but have stubbed out the untested ones at the entry point for Begin().
Doing this should allow the remainder of the code to be left in.
The new tokens give GL clients a way to obtain stats about the GL pipeline.
Generally, you get the number of things going in, invocations, and number of
things coming out, primitives, of the various stages. There are two immediate
uses for this, performance information, and debugging various types of
misrendering. I doubt one can use these for debugging very complex applications,
but for piglit tests, it should be quite useful.
Tessellation shaders, and compute shaders are not addressed in this patch
because there is no upstream implementation. I've implemented how I believe
tessellation shader stats will work for Intel hardware (though there is a bit of
ambiguity). Compute shaders are a bit more interesting though, and I don't yet
know what we'll do there.
For the lazy, here is a link to the relevant part of the spec:
https://www.opengl.org/registry/specs/ARB/pipeline_statistics_query.txt
Running the piglit tests
http://lists.freedesktop.org/archives/piglit/2014-November/013321.html
(http://cgit.freedesktop.org/~bwidawsk/piglit/log/?h=pipe_stats)
yield the following results:
> piglit-run.py -t stats tests/all.py output/pipeline_stats
> [5/5] pass: 5 Running Test(s): 5
v2:
- Don't allow pipeline_stats to be per stream (Ilia). This may (not sure) be
needed for AMD_transform_feedback4, which we do not support.
> If AMD_transform_feedback4 is supported then GEOMETRY_SHADER_PRIMITIVES_-
> EMITTED_ARB counts primitives emitted to any of the vertex streams for
> which STREAM_RASTERIZATION_AMD is enabled.
- Remove comment from GL3.txt because it is only used for extensions that are
part of required versions (Ilia)
- Move the new tokens to a new XML doc instead of using the main GL4x.xml (Ilia)
- Add a fallthrough comment (Ilia)
- Only divide PS invocations by 4 on HSW+ (Ben)
v3:
- Add ARB_pipeline_statistics_query to relnotes.html
- Add ARB_pipeline_statistics_query.xml to the Makefile.am, and master XML (Ilia)
- Correct extension number (Ilia)
- Add link to xml in the main GL API xml (Ilia)
- remove special GS case from gen6_end_query (Ian)
- Make lookup table static so gcc doesn't initialized it on every call (Ian)
- Use if (_mesa_has_geometry_shaders(ctx)) instead of explicit checks (Ian)
- Core mesa parts moved into a prep patch (Ilia)
v4:
- Change to 10.6 relnotes since we missed 10.5 window
- Moved compute shader stuff into the switch statement (Jordan)
- Jordan: Add compute shader support
v5:
- Fixed relnote style (Ilia)
v6:
- Rebased on master which beat me to adding the first relnotes - essentially
this undoes v5 (which had a typo anyway)
- Some code style fixes (Ken)
- Remove some excess comments (Ken)
- Unify tessellation failure style - unreachable (Ken)
- Fix workaround comment for PS invocations (Ken)
Signed-off-by: Ben Widawsky <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Chris Wilson noted that repeated calls to CheckQuery() would call
drm_intel_bo_references(brw->batch.bo, query->bo) on each invocation,
which is expensive. Once we've flushed, we know that future batches
won't reference query->bo, so there's no point in asking more than once.
This patch adds a brw_query_object::flushed flag, which is a
conservative estimate of whether the batch has been flushed.
On the first call to CheckQuery() or WaitQuery(), we check if the
batch references query->bo. If not, it must have been flushed for
some reason (such as being full). We record that it was flushed.
If it does reference query->bo, we explicitly flush, and record that
we did so.
Any subsequent checks will simply see that query->flushed is set,
and skip the drm_intel_bo_references() call.
Inspired by a patch from Chris Wilson.
According to Eero, this does not affect the performance of Witcher 2
on Haswell, but approximately halves the userspace CPU usage.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=86969
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This is less code and also measures the duration of the stall for us.
Our old code predates the existance of brw_bo_map().
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
CheckQuery calls drm_intel_bo_references to see if the batch references
the query BO, and if so, flushes. It then checks if the query BO is
busy, and if not, calls gen6_queryobj_get_results().
Stupidly, gen6_queryobj_get_results() immediately did a second redundant
drm_intel_bo_references check, even though we know the buffer is not
referenced and in fact idle.
This patch moves the batch-flush check out of gen6_queryobj_get_results
and into WaitQuery() (the other caller). That way, both callers do a
single batch-flush check.
This should only be a minor improvement, since it would only affect
the first CheckQuery call where the result is actually available.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=86969
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
If query->bo == NULL, this is a redundant CheckQuery call, and we
should simply return. We didn't do anything anyway - we skipped the
batch flushing block, and although we called get_results(), it has an
early return and does nothing. Why bother?
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
q->Ready means that the results are in, and core Mesa is free to return
them to the application. gen6_queryobj_get_results() is a natural place
to set that flag; doing so means callers don't have to.
The older non-hardware-context aware code couldn't do this, because we
had to call brw_queryobj_get_results() to gather intermediate results
when we ran out of space for snapshots in the query buffer. We only
gather complete results in the Gen6+ code, however.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
So far we have been using CL_INVOCATION_COUNT to resolve this query but this
is no good with streams, as only stream 0 reaches the clipping stage.
From ARB_transform_feedback3:
"When a generated primitive query for a vertex stream is active, the
primitives-generated count is incremented every time a primitive emitted to
that stream reaches the Discarding Rasterization stage (see Section 3.x)
right before rasterization. This counter is incremented whether or not
transform feedback is active."
Unfortunately, we don't have any registers that provide the number of primitives
written to a specific stream other than the ones that track the number of
primitives written to transform feedback in the SOL stage, so we can't
implement this exactly as specified.
In the past we implemented this feature by activating the SOL unit even if
transform feeback was disabled, but making it so that all buffers were
disabled and it only recorded statistics, which gave us the right semantics
(see 3178d2474ae5bdd1102fb3d76a60d1d63c961ff5). Unfortunately, this came with
a significant performance impact and had to be reverted.
This new take does not intend to implement the exact semantics required by
the spec, but improves what we have now, since now we return the primitive
count for stream 0 in all cases. With this patch we use
GEN7_SO_PRIM_STORAGE_NEEDED to resolve GL_PRIMITIVES_GENERATED queries
for non-zero streams. This would return the number of primitives written
to transform feedback for each stream instead. Since non-zero streams are
only useful in combination with transform feedback this should not be too
bad, and the only case that I think we would not be supporting would be
the one in which we want to use both GL_PRIMITIVES_GENERATED and
GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN on the same non-zero stream to
detect buffer overflow.
This patch also fixes the following piglit test:
arb_gpu_shader5-xfb-streams-without-invocations
This test uses both GL_PRIMITIVES_GENERATED and
GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN queries on non-zero streams, but it
does never hit the overflow case, so both queries are always expected to return
the same value.
Reviewed-by: Kenneth Graunke <[email protected]>
Cc: "10.3" <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 3178d2474ae5bdd1102fb3d76a60d1d63c961ff5.
This caused GPU hangs on Ivybridge for some users and huge (80%)
performance regressions across the board on multiple platforms.
We need to find a better solution. I've made several attempts, but none
of them have worked yet. In the meantime, we should revert this.
Reverting it breaks GL_PRIMITIVES_GENERATED for non-zero streams, but
that's okay, since we don't expose GL_ARB_gpu_shader5 yet.
Fixes Piglit's EXT_transform_feedback/generatemipmap prims_generated
test case on Haswell.
|
|
|
|
| |
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
So far we have been using CL_INVOCATION_COUNT to resolve this query but this
is no good with streams, as only stream 0 reaches the clipping stage. Instead
we will use SO_PRIM_STORAGE_NEEDED which can keep track of the primitives sent
to each individual stream.
Since SO_PRIM_STORAGE_NEEDED is related to the SOL stage and according to
ARB_transform_feedback3 we need to be able to query primitives generated in
each stream whether transform feedback is active or not what we do is to
enable the SOL unit even if transform feedback is not active but disable all
output buffers in that case. This effectively disables transform feedback
but permits activation of statistics enabling SO_PRIM_STORAGE_NEEDED even
when transform feedback is not active.
Reviewed-by: Chris Forbes <[email protected]>
|
|
|
|
| |
Reviewed-by: Chris Forbes <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Now that we have a helper function that handles the PIPE_CONTROL
variations between the various platforms, these are basically the same.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are a lot of places that use PIPE_CONTROL to write a value to a
buffer (either an immediate write, TIMESTAMP, or PS_DEPTH_COUNT).
Creating a single function to do this seems convenient.
As part of this refactor, we now set the PPGTT/GTT selection bit
correctly on Gen7+. Previously, we set bit 2 of DW2 on all platforms.
This is correct for Sandybridge, but actually part of the address on
Ivybridge and later!
Broadwell will also increase the length of these packets by 1; with the
refactoring, we should have to adjust that in substantially fewer
places, giving us confidence that we've hit them all.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
| |
It now takes a 48-bit address.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
brw_queryobj.c needs a version of write_timestamp that works on all
generations for the QueryCounter() driver hook. So there's no point in
duplicating it in gen6_queryobj.c.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
| |
Previously, the write of each 32-bit half might land in separate batch
buffers, which is insane.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise the gen6 w/a in the kernel won't kick in and the write will
land nowhere.
Inspired by a patch Ken pointed me at which had the same issue (but
isn't yet merged and also for a gen7+ feature). An audit of the entire
driver didn't reveal any other case than the one in in the write_reg
helper used by the gen6 queryobj code.
Acked-by: Kenneth Graunke <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Tested-by: Xinkai Chen <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Cc: "9.2" <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Writing a 64-bit register value to memory is sufficiently complicated
that it makes sense to reuse this function rather than duplicating it.
Exposing it outside of gen6_queryobj.c means it needs a more descriptive
function name. It could probably be moved to brw_util.c or somewhere
else, but this works too.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
The current callers just want to write a single register, so combining
the register read with a pipeline flush made sense. However, in the
future we'll want to do multiple register reads back to back, and we'll
only want to flush once.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Gen7+ supports four transform feedback streams. Using a function-like
macro makes it easy to access them by stream number or loop over them.
"GEN7_" prefixes are more common than "_IVB" suffixes, so we use that.
Gen6 only supports a single stream, so the single #define should be
fine. However, SO_NUM_PRIMS_WRITTEN was confusingly generic, as it
doesn't exist on Gen7+. Add a "GEN6_" prefix for clarity.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Paul Berry <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Most functions no longer use intel_context, so this patch additionally
removes the local "intel" variables to avoid compiler warnings.
Signed-off-by: Kenneth Graunke <[email protected]>
Acked-by: Chris Forbes <[email protected]>
Acked-by: Paul Berry <[email protected]>
Acked-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
| |
Signed-off-by: Kenneth Graunke <[email protected]>
Acked-by: Chris Forbes <[email protected]>
Acked-by: Paul Berry <[email protected]>
Acked-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
| |
Signed-off-by: Kenneth Graunke <[email protected]>
Acked-by: Chris Forbes <[email protected]>
Acked-by: Paul Berry <[email protected]>
Acked-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
| |
Signed-off-by: Kenneth Graunke <[email protected]>
Acked-by: Chris Forbes <[email protected]>
Acked-by: Paul Berry <[email protected]>
Acked-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes brw_context available in every function that used
intel_context. This makes it possible to start migrating fields from
intel_context to brw_context.
Surprisingly, this actually removes some code, as functions that use
OUT_BATCH don't need to declare "intel"; they just use "brw."
Signed-off-by: Kenneth Graunke <[email protected]>
Acked-by: Chris Forbes <[email protected]>
Acked-by: Paul Berry <[email protected]>
Acked-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we have hardware contexts and can use MI_STORE_REGISTER_MEM,
we can use the GPU's pipeline statistics counters rather than going out
of our way to count primitives in software.
Aside from being simpler, this also paves the way for Geometry Shaders,
which can output an arbitrary number of primitives on the GPU. It will
also allow us to use hardware primitive restart when these queries are
in use.
The GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN query is easy: it
corresponds to the SO_NUM_PRIMS_WRITTEN/SO_NUM_PRIMS_WRITTEN0_IVB
counters.
The GL_PRIMITIVES_GENERATED query is trickier. Gen provides several
statistics registers which /almost/ match the semantics required:
- IA_PRIMITIVES_COUNT
The number of primitives fetched by the VF or IA (input assembler).
This undercounts when GS is enabled, as it can output many primitives.
- GS_PRIMITIVES_COUNT
The number of primitives output by the GS. Unfortunately, this
doesn't increment unless the GS unit is actually enabled, and it
usually isn't.
- SO_PRIM_STORAGE_NEEDED*_IVB
The amount of space needed to write primitives output by transform
feedback. These naturally only work when transform feedback is on.
We'd also have to add the counters for all four streams.
- CL_INVOCATION_COUNT
The number of primitives processed by the clipper. This doesn't work
if the GS or SOL throw away primitives for rasterizer discard.
However, it does increment even if the clipper is in REJECT_ALL mode.
Dynamically switching between counters would be painfully complicated,
especially since GS, rasterizer discard, and transform feedback can all
be switched on and off repeatedly during a single query.
The most usable counter is CL_INVOCATION_COUNT. The previous two
patches reworked rasterizer discard support so that all primitives hit
the clipper, making this work.
v2: Occlusion query bug fixes removed and squashed in earlier patches.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Paul Berry <[email protected]>
|
|
Hardware contexts greatly simplify the query object code. The pipeline
statistics counters get saved and restored with the context, which means
that we don't need to worry about other workloads polluting them.
This means that we can simply write a single pair of values (one at
BeginQuery and one at EndQuery) rather than a series of pairs. This
also means we don't need to worry about the BO getting full. We also
don't need to delay BO allocation and starting snapshot until the first
draw.
The generation split here is a little off: technically, Ironlake can also
support hardware contexts. However, the kernel currently doesn't, and
even if it were to do so someday, we'd need to wait a while before
bumping the kernel requirement to take advantage of it.
v2: Incorporate Paul's feedback.
- Clarify which functions are Gen4/5-only via assertions and comments.
- Change how driver hook initialization happens.
- Update comments.
- Squash a bug fix from a later commit here where it belongs.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]> [v1]
Acked-by: Paul Berry <[email protected]>
|