| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
total instructions in shared programs: 41168 -> 40976 (-0.47%)
instructions in affected programs: 18156 -> 17964 (-1.06%)
|
|
|
|
|
| |
This will let me coalesce the VPM writes into the instructions generating
the values.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Small immediates have the downside of taking over the raddr B field, so
you might have less chance to pack instructions together thanks to raddr B
conflicts. However, it also reduces some register pressure since it lets
you load 2 "uniform" values in one instruction (avoiding a previous load
of the constant value to a register), and increases some pairing for the
same reason.
total uniforms in shared programs: 16231 -> 13374 (-17.60%)
uniforms in affected programs: 10280 -> 7423 (-27.79%)
total instructions in shared programs: 40795 -> 41168 (0.91%)
instructions in affected programs: 25551 -> 25924 (1.46%)
In a previous version of this patch I had a reduction in instruction count
by forcing the other args alongside a SMALL_IMM to be in the A file or
accumulators, but that increases register pressure and had a bug in
handling FRAG_Z. In this patch is I just use raddr conflict resolution,
which is more expensive. I think I'd rather tweak allocation to have some
way to slightly prefer good choices for files in general, rather than risk
failing to register allocate by forcing things into register classes.
|
|
|
|
| |
I want this from other passes.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Since our kernel BOs require CMA allocation, and the use of them requires
new mmaps, it's pretty expensive and we should avoid it if possible.
Copying my original design for Intel, make a userspace cache that reuses
BOs that haven't been shared to other processes but frees BOs that have
sat in the cache for over a second.
Improves glxgears framerate on RPi by around 30%.
|
|
|
|
|
|
| |
This gets DRI3 working on modesetting with glamor. It's not enabled under
simulation, because it looks like handing our dumb-allocated buffers off
to the server doesn't actually work for the server's rendering.
|
| |
|
|
|
|
|
| |
total instructions in shared programs: 43053 -> 40795 (-5.24%)
instructions in affected programs: 37996 -> 35738 (-5.94%)
|
| |
|
|
|
|
| |
We're deciding about the WS bit, not PM.
|
|
|
|
| |
This is the same basic logic from the original Broadcom driver.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Plus a new PIPE_CAP_VERTEXID_NOBASE query. The idea is that drivers not
supporting vertex ids with base vertex offset applied (so, only support
d3d10-style vertex ids) will get such a d3d10-style vertex id instead -
with the caveat they'll also need to handle the basevertex system value
too (this follows what core mesa already does).
Additionally, this is also useful for other state trackers (for instance
llvmpipe / draw right now implement the d3d10 behavior on purpose, but
with different semantics it can just do both).
Doesn't do anything yet.
And fix up the docs wrt similar values.
v2: incorporate feedback from Brian and others, better names, better docs.
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Jose Fonseca <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
| |
32-bit unsigned would require some adjustments to handle values >=
0x80000000.
|
| |
|
|
|
|
|
| |
It's only an f16 conversion if you're doing a float operation, otherwise
it's 16 bit signed to 32-bit signed.
|
| |
|
|
|
|
| |
There was just way too much indentation.
|
|
|
|
|
| |
We're actually allocating out of r3 now, and I missed it because I'd typed
this one as qpu_rn(3) instead of qpu_r3().
|
|
|
|
|
| |
There is an equivalent unpack function without conversion to float if you
use an integer operation instead.
|
| |
|
|
|
|
| |
I typoed this when rebasing the memory leak fixes.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
They're copied into a vc4_bo after compiling is done.
|
|
|
|
|
| |
No performance difference on a microbenchmark with norast that should hit it
enough to have mattered, n=220.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, the hash_table API required the user to do all of the hashing
of keys as it passed them in. Since the hashing function is intrinsically
tied to the comparison function, it makes sense for the hash table to know
about it. Also, it makes for a somewhat clumsy API as the user is
constantly calling hashing functions many of which have long names. This
is especially bad when the standard call looks something like
_mesa_hash_table_insert(ht, _mesa_pointer_hash(key), key, data);
In the above case, there is no reason why the hash table shouldn't do the
hashing for you. We leave the option for you to do your own hashing if
it's more efficient, but it's no longer needed. Also, if you do do your
own hashing, the hash table will assert that your hash matches what it
expects out of the hashing function. This should make it harder to mess up
your hashing.
v2: change to call the old entrypoint "pre_hashed" rather than
"with_hash", like cworth's equivalent change upstream (change by
anholt, acked-in-general by Jason).
Signed-off-by: Jason Ekstrand <[email protected]>
Signed-off-by: Eric Anholt <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
| |
While the pipe_reference_* helpers set the pointer, a bare pipe_reference
doesn't. Fixes 5 ARB_sync tests.
|
|
|
|
|
| |
This fixes flatshading of backface color in 4 of the piglit interpolation
tests.
|
|
|
|
| |
This is already done at set_index_buffer() time.
|
|
|
|
|
| |
When we upload shadow indices at draw time, we need the source offset.
Fixes the piglit draw-elements test.
|
|
|
|
| |
The original Broadcom driver also did this with the viewport.
|
| |
|
|
|
|
|
|
|
|
| |
This means another pass of reordering the uniform data store, but it lets
us pair up a lot more instructions.
total instructions in shared programs: 44639 -> 43176 (-3.28%)
instructions in affected programs: 36938 -> 35475 (-3.96%)
|
|
|
|
|
|
|
| |
This is a standard scheduling heuristic, and clearly helps.
total instructions in shared programs: 46418 -> 44467 (-4.20%)
instructions in affected programs: 42531 -> 40580 (-4.59%)
|
|
|
|
| |
These don't have raddr fields.
|
|
|
|
|
| |
Fixes assertion failures if we adjust scheduling priorities to emphasize
VPM reads more.
|
|
|
|
|
| |
An immediate load overwrites the mul and add operations, so you can't
merge with them.
|
|
|
|
|
|
|
|
|
|
| |
This increases the cost of a raddr b conflict spill (save r3 to rb31, move
src1 to r3, move rb31 back to r3 when done, instead of just move src1 to
r3), but on average thanks to instruction pairing it's more worthwhile to
have another accumulator.
total instructions in shared programs: 46428 -> 46171 (-0.55%)
instructions in affected programs: 38030 -> 37773 (-0.68%)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The register allocator walks from the end of the nodes array looking for
trivially-allocatable things to put on the stack, meaning (assuming
everything is trivially colorable and gets put on the stack in a single
pass) the low node numbers get allocated first. The things allocated
first happen to get the lower-numbered registers, which is to say the fast
accumulators that can be paired more easily.
When we previously made the nodes match the temporary register numbers,
we'd end up putting the shader inputs (VS or FS) in the accumulators,
which are often long-lived values. By prioritizing the shortest-lived
values for allocation, we can get a lot more instructions that involve
accumulators, and thus fewer conflicts for raddr and WS.
total instructions in shared programs: 52870 -> 46428 (-12.18%)
instructions in affected programs: 52260 -> 45818 (-12.33%)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The register allocator prefers low-index registers from vc4_regs[] in the
configuration we're using, which is good because it means we prioritize
allocating the accumulators (which are faster). On the other hand, it was
causing raddr conflicts because everything beyond r0-r2 ended up in
regfile A until you got massive register pressure. By interleaving, we
end up getting more instruction pairing from getting non-conflicting
raddrs and QPU_WSes.
total instructions in shared programs: 55957 -> 52719 (-5.79%)
instructions in affected programs: 46855 -> 43617 (-6.91%)
|
| |
|
|
|
|
|
|
|
|
| |
We can avoid it by carefully ordering the packing. This is important as a
step in giving r3 to the register allocator.
total instructions in shared programs: 56087 -> 55957 (-0.23%)
instructions in affected programs: 18368 -> 18238 (-0.71%)
|
| |
|
|
|
|
| |
This is being emitted now from st_glsl_to_tgsi.cpp.
|
|
|
|
|
| |
total instructions in shared programs: 56995 -> 56087 (-1.59%)
instructions in affected programs: 40503 -> 39595 (-2.24%)
|