path: root/src/gallium/drivers/vc4
Commit message (Author, Age, Files, Lines)
* vc4: Fix math with a condition flag set. (Eric Anholt, 2017-03-08, 2 files, -3/+18)
  Math results land in r4, regardless of the condition. To implement them, we just need to ensure that the results are moved out of r4 (as often happens anyway when the value is live across another math instruction), so that we can attach the condition to the MOV.

  Fixes dEQP-GLES2.functional.shaders.random.all_features.fragment.93 and a couple of others, which were failing an assertion that their conditions hadn't been handled during the QIR->QPU stage.
* vc4: Fix register pressure cost estimates when a src appears twice. (Eric Anholt, 2017-03-08, 1 file, -3/+13)
  This ended up confusing the scheduler for things like fabs (implemented as fmaxabs x, x) or squaring a number, and it would try to avoid scheduling them because they appeared more expensive than other instructions.

  Fixes failure to register allocate in dEQP-GLES2.functional.uniform_api.random.3 with almost no shader-db effects (+0.35% max temps).
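The double-counting problem described above can be illustrated with a minimal sketch. It assumes a bottom-up scheduler in which each source temp that is not yet live adds one to the pressure cost; `toy_inst` and `toy_pressure_cost` are hypothetical names for this illustration and are not the driver's actual structures.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SRC 2

/* Hypothetical, simplified instruction: up to two temp sources. */
struct toy_inst {
    int nsrc;
    int src[TOY_MAX_SRC]; /* temp indices */
};

/* Count how many temps this instruction would newly make live.
 * live[t] is true if temp t is already live at this point.
 * A temp that appears in both source slots is counted once. */
static int
toy_pressure_cost(const struct toy_inst *inst, const bool *live)
{
    int cost = 0;

    assert(inst->nsrc >= 0 && inst->nsrc <= TOY_MAX_SRC);

    for (int i = 0; i < inst->nsrc; i++) {
        if (live[inst->src[i]])
            continue;

        /* Skip a source that an earlier slot already accounted for. */
        bool already_counted = false;
        for (int j = 0; j < i; j++) {
            if (inst->src[j] == inst->src[i])
                already_counted = true;
        }
        if (!already_counted)
            cost++;
    }
    return cost;
}
```

With this check, an fabs-style instruction reading the same temp twice costs the same as any other instruction with a single new source.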
* vc4: Report to shader-db how many threads a fragment shader has. (Eric Anholt, 2017-03-08, 1 file, -0/+7)
  Doing instruction count analysis when we emit the thread switches that will save us from tons of stalls is kind of missing the point.
* Revert "vc4: Lazily emit our FS/VS input loads." (Eric Anholt, 2017-03-08, 4 files, -93/+75)
  This reverts commit 292c24ddac5acc35676424f05291c101fcd47b3e.

  It broke a lot of GLES2 deqp, and I see at least one problem that will require some serious rework to fix.
* gallium: s/uint/enum pipe_shader_type/ for set_constant_buffer() (Brian Paul, 2017-03-08, 1 file, -1/+2)
  Reviewed-by: Edward O'Callaghan <[email protected]>
* gallium: s/unsigned/enum pipe_shader_type/ for get_compiler_options() (Brian Paul, 2017-03-08, 2 files, -2/+4)
  Reviewed-by: Edward O'Callaghan <[email protected]>
* gallium: s/unsigned/enum pipe_shader_type/ for pipe_screen::get_shader_param() (Brian Paul, 2017-03-08, 1 file, -2/+3)
  Reviewed-by: Edward O'Callaghan <[email protected]>
* gallium/util: replace pipe_mutex_unlock() with mtx_unlock() (Timothy Arceri, 2017-03-07, 2 files, -7/+7)
  pipe_mutex_unlock() was made unnecessary with fd33a6bcd7f12.

  Replaced using:
    find ./src -type f -exec sed -i -- \
    's:pipe_mutex_unlock(\([^)]*\)):mtx_unlock(\&\1):g' {} \;

  Reviewed-by: Marek Olšák <[email protected]>
* gallium/util: replace pipe_mutex_lock() with mtx_lock() (Timothy Arceri, 2017-03-07, 2 files, -6/+6)
  pipe_mutex_lock() was made unnecessary with fd33a6bcd7f12.

  Replaced using:
    find ./src -type f -exec sed -i -- \
    's:pipe_mutex_lock(\([^)]*\)):mtx_lock(\&\1):g' {} \;

  Reviewed-by: Marek Olšák <[email protected]>
* gallium/util: replace pipe_mutex_init() with mtx_init() (Timothy Arceri, 2017-03-07, 1 file, -1/+1)
  pipe_mutex_init() was made unnecessary with fd33a6bcd7f12.

  Replace was done using:
    find ./src -type f -exec sed -i -- \
    's:pipe_mutex_init(\([^)]*\)):(void) mtx_init(\&\1, mtx_plain):g' {} \;

  Reviewed-by: Marek Olšák <[email protected]>
* gallium/util: replace pipe_mutex with mtx_t (Timothy Arceri, 2017-03-07, 1 file, -2/+2)
  pipe_mutex was made unnecessary with fd33a6bcd7f12.

  Reviewed-by: Marek Olšák <[email protected]>
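The effect of the four mutex replacements above on calling code can be sketched as follows. The old wrappers took the mutex by name, while the sed expressions rewrite each call to pass its address to the C11 function. This is only an illustration: `toy_cache` is a hypothetical structure, and the standard `<threads.h>` header is assumed here rather than whatever header the tree itself uses.

```c
#include <assert.h>
#include <threads.h>

/* Hypothetical object guarded by a C11 mutex.
 * Before the series the member would have been: pipe_mutex lock; */
struct toy_cache {
    mtx_t lock;
    int entries;
};

static void
toy_cache_init(struct toy_cache *cache)
{
    /* was: pipe_mutex_init(cache->lock); */
    (void) mtx_init(&cache->lock, mtx_plain);
    cache->entries = 0;
}

static int
toy_cache_add(struct toy_cache *cache)
{
    int count;

    /* was: pipe_mutex_lock(cache->lock); */
    mtx_lock(&cache->lock);
    count = ++cache->entries;
    /* was: pipe_mutex_unlock(cache->lock); */
    mtx_unlock(&cache->lock);

    return count;
}
```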
* vc4: Lazily emit our FS/VS input loads. (Eric Anholt, 2017-02-24, 4 files, -75/+93)
  This reduces register pressure in both types of shaders, by reordering the input loads from the var->data.driver_location order to whatever order they appear first in the NIR shader. These instructions aren't reorderable at our QIR scheduling level because the FS takes two in lockstep to do an interpolation, and the VS takes multiple read instructions in a row to get a whole vec4-level attribute read.

  shader-db impact:
  total instructions in shared programs: 76666 -> 76590 (-0.10%)
  instructions in affected programs: 42945 -> 42869 (-0.18%)
  total max temps in shared programs: 9395 -> 9208 (-1.99%)
  max temps in affected programs: 2951 -> 2764 (-6.34%)

  Some programs get their max temps hurt, depending on the order that the load_input intrinsics appear, because we end up being unable to copy propagate an older VPM read into its only use.
* vc4: Refactor the load_input code out of the intrinsic code. (Eric Anholt, 2017-02-24, 1 file, -25/+42)
  It's going to gain most of ntq_setup_inputs(), so simplify it first.
* vc4: Track the last block we emitted at the top level. (Eric Anholt, 2017-02-24, 3 files, -5/+10)
  This will be used for delaying our VPM reads (which must be unconditional) until just before they're used.
* vc4: Emit max number of temps in the shader-db output. (Eric Anholt, 2017-02-24, 1 file, -0/+23)
  We need to be paying attention to each optimization's impact on this. Even if we reduce instruction count, increasing max temps in general is likely to cause us to fail to register allocate on some shaders, which means that those won't run at all.
* gallium: remove PIPE_CAP_USER_INDEX_BUFFERS (Marek Olšák, 2017-02-25, 1 file, -1/+0)
  all drivers support it

  Reviewed-by: Nicolai Hähnle <[email protected]>
  Reviewed-by: Brian Paul <[email protected]>
  Tested-by: Brian Paul <[email protected]> (VMware driver only)
* vc4: automake: add the kernel/README to the tarball (Emil Velikov, 2017-02-24, 1 file, -0/+2)
  Signed-off-by: Emil Velikov <[email protected]>
  Reviewed-by: Eric Anholt <[email protected]>
  Reviewed-by: Andreas Boll <[email protected]>
* gallium: set pipe_context uploaders in drivers (v3) (Marek Olšák, 2017-02-14, 1 file, -5/+6)
  Notes:
  - make sure the default size is large enough to handle all state trackers
  - pipe wrappers don't receive transfer calls from stream_uploader, because pipe_context::stream_uploader points directly to the underlying driver's stream_uploader (to keep it simple for now)

  v2: add error handling to nv50, nvc0, noop
  v3: set const_uploader

  Reviewed-by: Nicolai Hähnle <[email protected]>
  Tested-by: Edmondo Tommasina <[email protected]> (v1)
  Tested-by: Charmaine Lee <[email protected]>
* vc4: Enable glSampleMask() even when !rasterizer->multisample. (Eric Anholt, 2017-02-10, 1 file, -2/+1)
  gallium's blitter expects that it can set the sample mask even when the rasterizer doesn't have the flag on. Between this and the previous test, 10 new ext_framebuffer_multisample tests start passing.
* vc4: Respect glSampleMask() even when we're not writing color. (Eric Anholt, 2017-02-10, 1 file, -3/+13)
  gallium's quad-based blitter for copying MSAA depth textures expects to be able to do 4 passes updating a sample at a time using glSampleMask, and there's no color buffer bound when it's doing that.
* vc4: Use the nir_builder helper for loading sample mask. (Eric Anholt, 2017-02-10, 1 file, -10/+1)
* vc4: Use accurate 1/w in coordinate shader as well as vert shader. (Eric Anholt, 2017-02-10, 1 file, -1/+1)
  We probably shouldn't be emitting different scaled viewport coordinates between vertex and coord.
* vc4: Drop VS inputs to 8. (Eric Anholt, 2017-02-10, 1 file, -4/+1)
  In the hardware we only get to declare 8 vertex elements (GLES2's minimum), so we should be exposing that number here. Fixes an assertion failure in piglit texrect-many, at the expense of various GL 2.0-ish minmax tests now complaining that our count is too low.
* vc4: Avoid emitting small immediates for UBO indirect load address guards. (Eric Anholt, 2017-02-10, 5 files, -4/+20)
  The kernel will reject our shader if we emit one here, and having 4, 8, or 12 as the top end of our UBO clamp is rare enough that it's not worth making the kernel let us.

  Fixes piglit fs-const-array-of-struct and fs-const-array-of-struct-of-array since recent GLSL linking changes made us get this as an indirect load of a uniform, instead of a temporary.

  Cc: "13.0 17.0" <[email protected]>
* gallium: add separate PIPE_CAP_INT64_DIVMOD (Ilia Mirkin, 2017-02-09, 1 file, -0/+1)
  Nouveau does not currently have logic to implement this as a library function. Even though such a library could be written, there's no big advantage to do it that way for now given that int64 is a very uncommon use-case. Allow a driver to expose INT64 without supporting division and modulo operations.

  Signed-off-by: Ilia Mirkin <[email protected]>
  Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium: turn PIPE_SHADER_CAP_DOUBLES into a screen capability (Nicolai Hähnle, 2017-02-02, 1 file, -1/+1)
  Make the cap consistent with PIPE_CAP_INT64. Aside from the hypothetical case of using draw for vertex shaders (and actually caring about doubles...), every implementation supports doubles either nowhere or everywhere.

  Also, st/mesa didn't even check the cap correctly in all supported shader stages.

  While at it, add a missing LLVM version check for 64-bit integers in radeonsi. This is conservative: judging by the log, LLVM 3.8 might be sufficient, but there are probably bugs that have been fixed since then.

  v2: fix clover (Marek)

  Reviewed-by: Marek Olšák <[email protected]>
* vc4: Enable Neon on arm android builds (Rob Herring, 2017-01-31, 1 file, -0/+2)
  Signed-off-by: Rob Herring <[email protected]>
  Reviewed-by: Eric Anholt <[email protected]>
* vc4: fix arm64 build with Neon (Rob Herring, 2017-01-31, 1 file, -1/+1)
  The addition of Neon assembly breaks on arm64 builds because the assembly syntax is different. For now, restrict Neon to ARMv7 builds.

  Signed-off-by: Rob Herring <[email protected]>
  Reviewed-by: Eric Anholt <[email protected]>
* vc4: Make Neon inline assembly clang compatible (Rob Herring, 2017-01-31, 1 file, -35/+35)
  clang throws an error on "%r2" and similar. I couldn't find any documentation on what "%r?" is supposed to mean and I've never seen any use like that as far as I remember. The parameter is supposed to be cpu_stride and just %2/%3 should be sufficient.

  There's no need for trailing ";" either, so remove those, too.

  Signed-off-by: Rob Herring <[email protected]>
  Reviewed-by: Eric Anholt <[email protected]>
* vc4: Coalesce into TLB writes as well as VPM/tex. (Eric Anholt, 2017-01-28, 1 file, -1/+5)
  This generally cuts an instruction when blending is enabled and we thus have a single instruction generating the color value.

  total instructions in shared programs: 91759 -> 91634 (-0.14%)
  instructions in affected programs: 5338 -> 5213 (-2.34%)
* vc4: Avoid an extra temporary and mov in ffloor/ffract/fceil. (Eric Anholt, 2017-01-28, 1 file, -13/+18)
  shader-db results:
  total instructions in shared programs: 92611 -> 91764 (-0.91%)
  instructions in affected programs: 27417 -> 26570 (-3.09%)

  The star is one shader in glmark2's terrain (drops 16% of its instructions), but there are also wins in mupen64plus and glb2.7.
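The percentages quoted in these shader-db summaries appear to be plain relative changes against the "before" count. A minimal sketch of that arithmetic, assuming this interpretation (`toy_percent_change` is a hypothetical helper, not part of shader-db):

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Negative values mean the count went down. */
static double
toy_percent_change(long before, long after)
{
    assert(before != 0);
    return 100.0 * (double)(after - before) / (double)before;
}
```

For the entry above, 92611 -> 91764 is a drop of 847 instructions, which is about 0.91% of 92611, and the same 847 instructions are about 3.09% of the 27417 in affected programs.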
* vc4: Flip the switch to run the GLSL compiler optimization loop once. (Eric Anholt, 2017-01-28, 1 file, -1/+1)
  This has almost no effect on shader-db:
  total instructions in shared programs: 92572 -> 92611 (0.04%)
  instructions in affected programs: 4486 -> 4525 (0.87%)

  Looking at 2 of the 7 different shaders that were hurt (all of which were in mupen64), they all appear to be just differences in order of instructions at the NIR level.

  The advantage is that this should significantly reduce time in the compiler.
* gallium: Add integer 64 capability (Dave Airlie, 2017-01-27, 1 file, -0/+1)
  v1.1: move to using a normal CAP. (Marek)
  v2: fill in the cap everywhere

  Signed-off-by: Dave Airlie <[email protected]>
  Reviewed-by: Marek Olšák <[email protected]>
* vc4: Use NEON to speed up utile stores on Pi2+. (Eric Anholt, 2017-01-26, 1 file, -5/+50) [refs: cros-mesa-17.1.0-r2-vanilla, cros-mesa-17.1.0-r1-vanilla, chadv/cros-mesa-17.1.0-r2-vanilla, chadv/cros-mesa-17.1.0-r1-vanilla]
  Improves 1024x1024 TexSubImage2D by 41.2371% +/- 3.52799% (n=10).
* vc4: Use NEON to speed up utile loads on Pi2. (Eric Anholt, 2017-01-26, 3 files, -18/+115)
  We had a lot of memcpy call overhead because gpu_stride wasn't being inlined. But if you split out the stride==8 and stride==16 cases like this code does while still using memcpy, you'd no longer have glibc's NEON memcpy applied at which point we'd be doing 16 uncached reads instead of 64/(NEON memcpy granularity), for about a 30% performance hit.

  By hand writing the assembly, we can get a whole cacheline loaded at a time. Unfortunately, NEON intrinsics turned out to be unusable -- they didn't have the vldm instruction available.

  Note that, for now, the NEON code is only enabled when building for ARMv7 (Pi 2+). We may want to do runtime detection for the Raspbian case, in the future.

  Improves 1024x1024 GetTexImage by 208.256% +/- 7.07029% (n=10).
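For readers unfamiliar with the data movement involved, here is a portable sketch of what a utile load does, with the stride==8 and stride==16 cases split so each copy has a constant size. It assumes a utile is 64 contiguous bytes on the GPU side; `toy_load_utile` and `TOY_UTILE_SIZE` are hypothetical names, and this is the plain-C shape of the operation, not the hand-written NEON assembly the commit adds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_UTILE_SIZE 64 /* bytes per utile; an assumption of this sketch */

/* Copy one utile from contiguous GPU memory into a CPU image whose
 * rows are cpu_stride bytes apart. gpu_stride is the number of bytes
 * in one row of the utile (8 or 16 in this sketch). */
static void
toy_load_utile(uint8_t *cpu, const uint8_t *gpu,
               uint32_t cpu_stride, uint32_t gpu_stride)
{
    assert(gpu_stride == 8 || gpu_stride == 16);

    for (uint32_t offset = 0; offset < TOY_UTILE_SIZE; offset += gpu_stride) {
        /* Constant-size copies let the compiler inline each memcpy. */
        if (gpu_stride == 8)
            memcpy(cpu, gpu + offset, 8);
        else
            memcpy(cpu, gpu + offset, 16);
        cpu += cpu_stride;
    }
}
```

A store is the mirror image: the same loop with source and destination swapped.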
* vc4: Move LT tiling code to a separate file. (Eric Anholt, 2017-01-26, 4 files, -80/+122)
  This paves the way for building it twice, with NEON assembly or not.
* vc4: Use unreachable() in an unreachable codepath for tiling. (Eric Anholt, 2017-01-26, 1 file, -4/+2)
* gallium: add PIPE_CAP_TGSI_MUL_ZERO_WINS (Ilia Mirkin, 2017-01-23, 1 file, -0/+1)
  Signed-off-by: Ilia Mirkin <[email protected]>
  Reviewed-by: Nicolai Hähnle <[email protected]>
  Reviewed-by: Axel Davy <[email protected]>
* gallium: add PIPE_CAP_TGSI_FS_FBFETCH (Ilia Mirkin, 2017-01-16, 1 file, -0/+1)
  Signed-off-by: Ilia Mirkin <[email protected]>
  Reviewed-by: Nicolai Hähnle <[email protected]>
* vc4: Rewrite T image handling based on calling the LT handler. (Eric Anholt, 2017-01-05, 1 file, -34/+75)
  The T images are composed of effectively swizzled-around blocks of LT (4x4 utile) images, so we can reduce the t_utile_address() calls by 16x by calling into the simpler LT loop. This also adds support for calling down with non-utile-aligned coordinates, which will be part of lifting the utile alignment requirement on our callers and avoiding the RMW on non-utile-aligned stores.

  Improves 1024x1024 TexSubImage by 2.55014% +/- 1.18584% (n=46)
  Improves 1024x1024 GetTexImage by 2.242% +/- 0.880954% (n=32)
* vc4: Move the utile_width/height functions to header inlines. (Eric Anholt, 2017-01-05, 2 files, -37/+36)
  I want these inlined in the callers, particularly with the tiling changes coming up, but we're not building with LTO so some caller would suffer.
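As an illustration of why such helpers are natural header inlines, here is a sketch of utile dimension lookups keyed on bytes per pixel (cpp). The specific dimensions are an assumption based on a 64-byte utile (8x8 at 1 byte per pixel down to 2x4 at 8 bytes per pixel); the `toy_` names are hypothetical and the driver's real helpers may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Width of one utile in pixels for a given bytes-per-pixel value. */
static inline uint32_t
toy_utile_width(int cpp)
{
    switch (cpp) {
    case 1:
    case 2:
        return 8;
    case 4:
        return 4;
    case 8:
        return 2;
    default:
        assert(!"unsupported cpp");
        return 0;
    }
}

/* Height of one utile in pixels for a given bytes-per-pixel value. */
static inline uint32_t
toy_utile_height(int cpp)
{
    switch (cpp) {
    case 1:
        return 8;
    case 2:
    case 4:
    case 8:
        return 4;
    default:
        assert(!"unsupported cpp");
        return 0;
    }
}
```

Because each function is a small switch on a value that callers often know at compile time, inlining lets the compiler fold the result into a constant at the call site.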
* vc4: Make the load/store utile functions static. (Eric Anholt, 2017-01-05, 2 files, -4/+2)
  They don't have any other callers outside of this file, and I'm hoping they get inlined soon.
* vc4: Simplify the load/store utile functions. (Eric Anholt, 2017-01-05, 1 file, -10/+22)
  They now have less of a dependency on the cpp, and don't have to do a divide. Hacking up mesa-demos teximage to do only one subtest and not draw points, I saw 1024x1024 glTexSubImage2D() improve by 4.86939% +/- 1.40408% (n=30) and glGetTexImage() by 2.18978% +/- 0.140268% (n=5).
* vc4: Reuse a list function to simplify bufmgr code. (Eric Anholt, 2017-01-05, 1 file, -11/+2)
* vc4: Flush the job early if we're referencing too many BOs. (Eric Anholt, 2017-01-05, 3 files, -0/+16)
  If we get up toward 256MB (or whatever the CMA area size is), VC4_GEM_CREATE will start throwing errors. Even if we don't trigger that, when we flush, the kernel's BO allocation for the CLs or bin memory may end up throwing an error, at which point our job won't get rendered at all.

  Just flush early (half of maximum CMA size) so that hopefully we never get to that point.
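The early-flush heuristic can be sketched as a running total of the memory referenced by a job, compared against half of an assumed CMA size. Everything here is hypothetical: `toy_job`, `toy_job_reference_bo`, and the 256MB constant (the commit itself notes the real CMA size may differ).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed CMA area size; the real value depends on system configuration. */
#define TOY_CMA_SIZE (256ull * 1024 * 1024)

/* Hypothetical job state: total bytes of buffer objects referenced so far. */
struct toy_job {
    uint64_t bo_space;
};

/* Account for a newly referenced BO. Returns true when the job has
 * crossed half of the CMA size and should be flushed early. */
static bool
toy_job_reference_bo(struct toy_job *job, uint64_t bo_size)
{
    job->bo_space += bo_size;
    return job->bo_space > TOY_CMA_SIZE / 2;
}

/* Flushing submits the job (not modeled here) and resets the total. */
static void
toy_job_flush(struct toy_job *job)
{
    job->bo_space = 0;
}
```

Flushing at the halfway mark leaves headroom for allocations made at flush time, which is the failure mode the commit message is concerned about.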
* gallium: add PIPE_CAP_GLSL_OPTIMIZE_CONSERVATIVELY (Marek Olšák, 2017-01-05, 1 file, -0/+1)
  Drivers with good compilers don't need aggressive optimizations before TGSI.

  Reviewed-by: Eric Anholt <[email protected]>
* nir: Rename convert_to_ssa lower_regs_to_ssa (Jason Ekstrand, 2016-12-29, 1 file, -1/+1)
  This matches the naming of nir_lower_vars_to_ssa, the other to-SSA pass.
* vc4: Rework scheduling of thread switch to cut one more NOP. (Eric Anholt, 2016-12-29, 1 file, -46/+75)
  Jonas's patch got us most of the benefit of scheduling instructions into the delay slots of thread switch, but if there had been nothing to pair the thrsw with, it would move the thrsw up and leave a NOP where the thrsw was.

  Instead, don't pair anything with thrsw through the normal scheduling path, and have a separate helper function that inserts the thrsw earlier if possible and inserts any necessary NOPs.

  total instructions in shared programs: 93027 -> 92643 (-0.41%)
  instructions in affected programs: 14952 -> 14568 (-2.57%)
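The idea of a helper that hoists the thread-switch signal onto an earlier instruction and only pads with as many NOPs as are still needed can be modeled with a toy instruction stream. This is a sketch under stated assumptions: each instruction carries at most one signal, the number of delay slots is a parameter rather than a hardware fact asserted here, and `toy_sig` and `toy_insert_thrsw` are hypothetical names unrelated to the driver's scheduler code.

```c
#include <assert.h>

/* Toy model: each already-emitted instruction carries at most one signal. */
enum toy_sig { TOY_SIG_NONE, TOY_SIG_OTHER, TOY_SIG_THRSW };

/* Attach a thread switch to the earliest trailing instruction that can
 * take it, so that the instructions after it fill the delay slots.
 * Returns how many extra instructions (NOPs) the caller must append. */
static int
toy_insert_thrsw(enum toy_sig *sigs, int count, int delay_slots)
{
    int back = 0;

    assert(count >= 0 && delay_slots >= 0);

    /* Walk back over trailing instructions that have no signal set. */
    while (back < delay_slots + 1 && back < count &&
           sigs[count - 1 - back] == TOY_SIG_NONE)
        back++;

    if (back == 0) {
        /* Nothing to pair with: one NOP carries the thrsw signal,
         * followed by one NOP per delay slot. */
        return 1 + delay_slots;
    }

    /* The signal moves up; the back - 1 instructions after it are
     * the filled delay slots. */
    sigs[count - back] = TOY_SIG_THRSW;
    return delay_slots - (back - 1);
}
```

In this model no NOP is left behind at the original switch position, because the switch never occupies an instruction of its own unless nothing earlier can carry it.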
* vc4: Fill thread switching delay slots (Jonas Pfeil, 2016-12-29, 1 file, -7/+38)
  Scan for instructions without a signal set in front of the switching instruction and move the signal up there.

  shader-db results:
  total instructions in shared programs: 94494 -> 93027 (-1.55%)
  instructions in affected programs: 23545 -> 22078 (-6.23%)

  v2: Fix re-emitting of the instruction in the loop trying to emit NOPs, drop a scheduling change from branch delay slots. (by anholt)

  Signed-off-by: Jonas Pfeil <[email protected]>
* vc4: Enable NIR-based loop unrolling. (Eric Anholt, 2016-12-29, 1 file, -0/+5)
  This successfully unrolls a new shader in GLB2.7, which also gets that shader to successfully compile in multithreaded mode.