summaryrefslogtreecommitdiffstats
path: root/src/gallium/drivers/vc4
Commit message (Collapse)AuthorAgeFilesLines
* vc4: Coalesce MOVs into VPM with the instructions generating the values.Eric Anholt2014-12-184-15/+143
| | | | | total instructions in shared programs: 41168 -> 40976 (-0.47%) instructions in affected programs: 18156 -> 17964 (-1.06%)
* vc4: Redefine VPM writes as a (destination) QIR register file.Eric Anholt2014-12-173-7/+19
| | | | | This will let me coalesce the VPM writes into the instructions generating the values.
* vc4: Add support for turning constant uniforms into small immediates.Eric Anholt2014-12-1713-46/+283
| | | | | | | | | | | | | | | | | | | | | | Small immediates have the downside of taking over the raddr B field, so you might have less chance to pack instructions together thanks to raddr B conflicts. However, it also reduces some register pressure since it lets you load 2 "uniform" values in one instruction (avoiding a previous load of the constant value to a register), and increases some pairing for the same reason. total uniforms in shared programs: 16231 -> 13374 (-17.60%) uniforms in affected programs: 10280 -> 7423 (-27.79%) total instructions in shared programs: 40795 -> 41168 (0.91%) instructions in affected programs: 25551 -> 25924 (1.46%) In a previous version of this patch I had a reduction in instruction count by forcing the other args alongside a SMALL_IMM to be in the A file or accumulators, but that increases register pressure and had a bug in handling FRAG_Z. In this patch is I just use raddr conflict resolution, which is more expensive. I think I'd rather tweak allocation to have some way to slightly prefer good choices for files in general, rather than risk failing to register allocate by forcing things into register classes.
* vc4: Move follow_movs() to common QIR code.Eric Anholt2014-12-173-11/+12
| | | | I want this from other passes.
* vc4: Fix missing newline for load immediate instruction disasm.Eric Anholt2014-12-171-4/+4
|
* vc4: Add a userspace BO cache.Eric Anholt2014-12-174-4/+175
| | | | | | | | | | Since our kernel BOs require CMA allocation, and the use of them requires new mmaps, it's pretty expensive and we should avoid it if possible. Copying my original design for Intel, make a userspace cache that reuses BOs that haven't been shared to other processes but frees BOs that have sat in the cache for over a second. Improves glxgears framerate on RPi by around 30%.
* vc4: Add dmabuf support.Eric Anholt2014-12-173-24/+73
| | | | | | This gets DRI3 working on modesetting with glamor. It's not enabled under simulation, because it looks like handing our dumb-allocated buffers off to the server doesn't actually work for the server's rendering.
* vc4: Drop a weird argument in the BOs-from-handles API.Eric Anholt2014-12-173-7/+5
|
* vc4: Add support for turning add-based MOVs to muls for pairing.Eric Anholt2014-12-161-2/+49
| | | | | total instructions in shared programs: 43053 -> 40795 (-5.24%) instructions in affected programs: 37996 -> 35738 (-5.94%)
* vc4: Add a helper for changing a field in an instruction.Eric Anholt2014-12-162-11/+12
|
* vc4: Fix the name of qpu_waddr_ignores_ws().Eric Anholt2014-12-161-5/+5
| | | | We're deciding about the WS bit, not PM.
* vc4: Add support for enabling early Z discards.Eric Anholt2014-12-161-0/+18
| | | | This is the same basic logic from the original Broadcom driver.
* gallium: add TGSI_SEMANTIC_VERTEXID_NOBASE and TGSI_SEMANTIC_BASEVERTEXRoland Scheidegger2014-12-161-0/+1
| | | | | | | | | | | | | | | | | | | Plus a new PIPE_CAP_VERTEXID_NOBASE query. The idea is that drivers not supporting vertex ids with base vertex offset applied (so, only support d3d10-style vertex ids) will get such a d3d10-style vertex id instead - with the caveat they'll also need to handle the basevertex system value too (this follows what core mesa already does). Additionally, this is also useful for other state trackers (for instance llvmpipe / draw right now implement the d3d10 behavior on purpose, but with different semantics it can just do both). Doesn't do anything yet. And fix up the docs wrt similar values. v2: incorporate feedback from Brian and others, better names, better docs. Reviewed-by: Brian Paul <[email protected]> Reviewed-by: Jose Fonseca <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* vc4: Add support for 32-bit signed norm/scaled vertex attrs.Eric Anholt2014-12-152-0/+18
| | | | | 32-bit unsigned would require some adjustments to handle values >= 0x80000000.
* vc4: Add support for 16-bit signed/unsigned norm/scaled vertex attrs.Eric Anholt2014-12-156-6/+94
|
* vc4: Rename the 16-bit unpack #define.Eric Anholt2014-12-152-4/+4
| | | | | It's only an f16 conversion if you're doing a float operation, otherwise it's 16 bit signed to 32-bit signed.
* vc4: Add support for 8-bit unnormalized vertex attrs.Eric Anholt2014-12-156-11/+69
|
* vc4: Refactor vertex attribute conversions a bit.Eric Anholt2014-12-151-25/+40
| | | | There was just way too much indentation.
* vc4: Fix use of r3 as a temp in 8-bit unpacking.Eric Anholt2014-12-151-10/+7
| | | | | We're actually allocating out of r3 now, and I missed it because I'd typed this one as qpu_rn(3) instead of qpu_r3().
* vc4: Rename UNPACK_8* to UNPACK_8*_F.Eric Anholt2014-12-155-20/+20
| | | | | There is an equivalent unpack function without conversion to float if you use an integer operation instead.
* vc4: Add support for UMAD.Eric Anholt2014-12-151-0/+9
|
* vc4: 0-initialize the screen again.Eric Anholt2014-12-151-1/+1
| | | | I typoed this when rebasing the memory leak fixes.
* vc4: Fix leaks of the compiled shaders' keys.Eric Anholt2014-12-141-1/+1
|
* vc4: Fix leaks of the CL contents.Eric Anholt2014-12-142-1/+6
|
* vc4: Fix leak of vc4_bos stashed in the context.Eric Anholt2014-12-141-0/+5
|
* vc4: Fix leak of the compiled shader programs in the cache.Eric Anholt2014-12-143-0/+24
|
* vc4: Fix leak of a copy of the scheduled QPU instructions.Eric Anholt2014-12-141-2/+3
| | | | They're copied into a vc4_bo after compiling is done.
* vc4: Switch to using the util/ hash table.Eric Anholt2014-12-142-54/+33
| | | | | No performance difference on a microbenchmark with norast that should hit it enough to have mattered, n=220.
* vc4: Fix leak of simulator memory on screen cleanup.Eric Anholt2014-12-142-3/+6
|
* vc4: Fix a leak of the simulator's exec BO's actual vc4_bo.Eric Anholt2014-12-141-0/+1
|
* util/hash_table: Rework the API to know about hashingJason Ekstrand2014-12-141-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the hash_table API required the user to do all of the hashing of keys as it passed them in. Since the hashing function is intrinsically tied to the comparison function, it makes sense for the hash table to know about it. Also, it makes for a somewhat clumsy API as the user is constantly calling hashing functions many of which have long names. This is especially bad when the standard call looks something like _mesa_hash_table_insert(ht, _mesa_pointer_hash(key), key, data); In the above case, there is no reason why the hash table shouldn't do the hashing for you. We leave the option for you to do your own hashing if it's more efficient, but it's no longer needed. Also, if you do do your own hashing, the hash table will assert that your hash matches what it expects out of the hashing function. This should make it harder to mess up your hashing. v2: change to call the old entrypoint "pre_hashed" rather than "with_hash", like cworth's equivalent change upstream (change by anholt, acked-in-general by Jason). Signed-off-by: Jason Ekstrand <[email protected]> Signed-off-by: Eric Anholt <[email protected]> Reviewed-by: Eric Anholt <[email protected]>
* vc4: Fix referencing of sync objects.Eric Anholt2014-12-121-0/+1
| | | | | While the pipe_reference_* helpers set the pointer, a bare pipe_reference doesn't. Fixes 5 ARB_sync tests.
* vc4: Consider FS backface color loads as color inputs as well.Eric Anholt2014-12-111-1/+4
| | | | | This fixes flatshading of backface color in 4 of the piglit interpolation tests.
* vc4: Drop redundant index size setting.Eric Anholt2014-12-111-1/+0
| | | | This is already done at set_index_buffer() time.
* vc4: Don't throw out the index offset in the shadow index buffer path.Eric Anholt2014-12-111-2/+1
| | | | | When we upload shadow indices at draw time, we need the source offset. Fixes the piglit draw-elements test.
* vc4: Fix triangle-guardband-viewport piglit test.Eric Anholt2014-12-111-5/+14
| | | | The original Broadcom driver also did this with the viewport.
* vc4: Fix a memory leak in setting up QPU instructions for scheduling.Eric Anholt2014-12-111-1/+2
|
* vc4: Do QPU scheduling across uniform loads.Eric Anholt2014-12-091-28/+60
| | | | | | | | This means another pass of reordering the uniform data store, but it lets us pair up a lot more instructions. total instructions in shared programs: 44639 -> 43176 (-3.28%) instructions in affected programs: 36938 -> 35475 (-3.96%)
* vc4: Populate the delay field better, and schedule high delay first.Eric Anholt2014-12-091-1/+49
| | | | | | | This is a standard scheduling heuristic, and clearly helps. total instructions in shared programs: 46418 -> 44467 (-4.20%) instructions in affected programs: 42531 -> 40580 (-4.59%)
* vc4: Skip raddr dependencies for 32-bit immediate loads.Eric Anholt2014-12-091-2/+5
| | | | These don't have raddr fields.
* vc4: Mark VPM read setup as impacting VPM reads, not writes.Eric Anholt2014-12-091-1/+7
| | | | | Fixes assertion failures if we adjust scheduling priorities to emphasize VPM reads more.
* vc4: Refuse to merge instructions involving 32-bit immediate loads.Eric Anholt2014-12-091-0/+5
| | | | | An immediate load overwrites the mul and add operations, so you can't merge with them.
* vc4: Reserve rb31 instead of r3 for raddr conflict spills.Eric Anholt2014-12-092-11/+45
| | | | | | | | | | This increases the cost of a raddr b conflict spill (save r3 to rb31, move src1 to r3, move rb31 back to r3 when done, instead of just move src1 to r3), but on average thanks to instruction pairing it's more worthwhile to have another accumulator. total instructions in shared programs: 46428 -> 46171 (-0.55%) instructions in affected programs: 38030 -> 37773 (-0.68%)
* vc4: Prioritize allocating accumulators to short-lived values.Eric Anholt2014-12-091-14/+59
| | | | | | | | | | | | | | | | | | The register allocator walks from the end of the nodes array looking for trivially-allocatable things to put on the stack, meaning (assuming everything is trivially colorable and gets put on the stack in a single pass) the low node numbers get allocated first. The things allocated first happen to get the lower-numbered registers, which is to say the fast accumulators that can be paired more easily. When we previously made the nodes match the temporary register numbers, we'd end up putting the shader inputs (VS or FS) in the accumulators, which are often long-lived values. By prioritizing the shortest-lived values for allocation, we can get a lot more instructions that involve accumulators, and thus fewer conflicts for raddr and WS. total instructions in shared programs: 52870 -> 46428 (-12.18%) instructions in affected programs: 52260 -> 45818 (-12.33%)
* vc4: Interleave register allocation from regfile A and B.Eric Anholt2014-12-081-39/+38
| | | | | | | | | | | | | The register allocator prefers low-index registers from vc4_regs[] in the configuration we're using, which is good because it means we prioritize allocating the accumulators (which are faster). On the other hand, it was causing raddr conflicts because everything beyond r0-r2 ended up in regfile A until you got massive register pressure. By interleaving, we end up getting more instruction pairing from getting non-conflicting raddrs and QPU_WSes. total instructions in shared programs: 55957 -> 52719 (-5.79%) instructions in affected programs: 46855 -> 43617 (-6.91%)
* vc4: Fix decision for whether the MIN operation writes to the B regfile.Eric Anholt2014-12-081-3/+3
|
* vc4: Drop dependency on r3 for color packing.Eric Anholt2014-12-081-4/+27
| | | | | | | | We can avoid it by carefully ordering the packing. This is important as a step in giving r3 to the register allocator. total instructions in shared programs: 56087 -> 55957 (-0.23%) instructions in affected programs: 18368 -> 18238 (-0.71%)
* vc4: Add support for GL 1.0 logic ops.Eric Anholt2014-12-081-2/+60
|
* vc4: Add support for TGSI_OPCODE_UCMP.Eric Anholt2014-12-081-0/+12
| | | | This is being emitted now from st_glsl_to_tgsi.cpp.
* vc4: Try swapping the regfile A to B to pair instructions.Eric Anholt2014-12-051-2/+62
| | | | | total instructions in shared programs: 56995 -> 56087 (-1.59%) instructions in affected programs: 40503 -> 39595 (-2.24%)