path: root/src/intel/compiler/brw_fs.h
Commit message | Author | Age | Files | Lines
* intel/fs: Extend thread payload layout to SIMD32 | Francisco Jerez | 2018-06-28 | 1 | -11/+31
    And handle 32-wide payload register reads in fetch_payload_reg().

    v2 (Jason Ekstrand):
     - Fix some whitespace and brace placement

    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* intel/fs: Wrap FS payload register look-up in a helper function. | Francisco Jerez | 2018-06-28 | 1 | -0/+13
    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* intel/fs: Remove program key argument from generator. | Francisco Jerez | 2018-06-28 | 1 | -2/+0
    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* intel/fs: Use image_deref intrinsics instead of image_var | Jason Ekstrand | 2018-06-22 | 1 | -1/+1
    Since we had to rewrite the deref walking loop anyway, I took the
    opportunity to make it a bit clearer and more efficient. In particular,
    in the AoA case, we will now emit one minmax instead of one per array
    level.

    Acked-by: Rob Clark <[email protected]>
    Acked-by: Bas Nieuwenhuizen <[email protected]>
    Acked-by: Dave Airlie <[email protected]>
    Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: shuffle_64bit_data_for_32bit_write is not used anymore | Jose Maria Casanova Crespo | 2018-06-16 | 1 | -4/+0
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/fs: shuffle_32bit_load_result_to_64bit_data is not used anymore | Jose Maria Casanova Crespo | 2018-06-16 | 1 | -5/+0
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/fs: Remove old 16-bit shuffle/unshuffle functions | Jose Maria Casanova Crespo | 2018-06-16 | 1 | -11/+0
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/fs: New shuffle_for_32bit_write and shuffle_from_32bit_read | Jose Maria Casanova Crespo | 2018-06-16 | 1 | -0/+11
    These new shuffle functions deal with the shuffle/unshuffle operations
    needed for read/write operations using 32-bit components when the
    read/written components have a different bit-size (8, 16, 64-bits).
    Shuffle from 32-bit to 32-bit becomes a simple MOV.

    shuffle_src_to_dst takes care of doing a shuffle when the source type is
    smaller than the destination type and an unshuffle when the source type
    is bigger than the destination. So these new read/write functions just
    need to call shuffle_src_to_dst, assuming that writes use a 32-bit
    destination and reads use a 32-bit source.

    As shuffle_for_32bit_write/from_32bit_read take components in units of
    the source/destination types and shuffle_src_to_dst takes units of the
    smallest type component, we adjust the components and first_component
    parameters.

    To enable these new functions, there must be no source/destination
    overlap in the case of shuffle_from_32bit_read. That never happens with
    shuffle_for_32bit_write, as it allocates a new destination register, as
    shuffle_64bit_data_for_32bit_write did.

    v2: Reword commit log and add comments to explain why first_component
        and components parameters are adjusted. (Jason Ekstrand)

    Reviewed-by: Jason Ekstrand <[email protected]>
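The unit adjustment described in this commit message can be sketched with a small host-side model. This is not the Mesa implementation: the function name `shuffle_from_32bit_read_model`, the constant `kWidth`, and the component-major register layout are assumptions made for illustration only. It shows the case of reading 64-bit data through a 32-bit source, where `first_component` and `components` arrive in 64-bit units and are doubled to index 32-bit components.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical SIMD width for this model; not taken from the commit.
constexpr std::size_t kWidth = 8;

// Model of a "shuffle from 32-bit read" for 64-bit data.
// Assumed layout: src is component-major, so 32-bit component c of
// channel ch lives at src[c * kWidth + ch].
// first_component and components are in units of the 64-bit type.
std::vector<std::uint64_t> shuffle_from_32bit_read_model(
    const std::vector<std::uint32_t>& src,
    std::size_t first_component,
    std::size_t components)
{
    // Convert to units of the smaller (32-bit) type: every 64-bit
    // component spans two consecutive 32-bit components.
    const std::size_t first32 = first_component * 2;
    assert((first32 + components * 2) * kWidth <= src.size());

    std::vector<std::uint64_t> dst(components * kWidth);
    for (std::size_t c = 0; c < components; ++c) {
        for (std::size_t ch = 0; ch < kWidth; ++ch) {
            const std::uint64_t lo = src[(first32 + 2 * c) * kWidth + ch];
            const std::uint64_t hi = src[(first32 + 2 * c + 1) * kWidth + ch];
            // Gather both halves of the channel's value into one element.
            dst[c * kWidth + ch] = lo | (hi << 32);
        }
    }
    return dst;
}
```

The model writes into a freshly allocated destination, which sidesteps the source/destination overlap that the commit message says must be avoided for shuffle_from_32bit_read.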
* intel/fs: Use the ATTR file for FS inputs | Francisco Jerez | 2018-05-29 | 1 | -1/+1
    This replaces the special magic opcodes which implicitly read inputs
    with explicit use of the ATTR file.

    v2 (Jason Ekstrand):
     - Break into multiple patches
     - Change the units of the FS ATTR to be in logical scalars

    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* i965/fs: Merge CMP and SEL into CSEL on Gen8+ | Ian Romanick | 2018-03-08 | 1 | -0/+1
    v2: Fix several problems handling inverted predicates. Add a much
        bigger comment around the BRW_CONDITIONAL_NZ case.

    v3: Allow uniforms and shader inputs as sources for the original SEL
        and CMP instructions. This enables a LOT more shaders to receive
        CSEL merging (5816 vs 8564 on SKL).

    v4: Report progress.

    Broadwell and Skylake had similar results. (Broadwell shown)

    helped: 8527
    HURT: 0
    helped stats (abs) min: 1 max: 27 x̄: 2.44 x̃: 1
    helped stats (rel) min: 0.03% max: 17.80% x̄: 1.12% x̃: 0.70%
    95% mean confidence interval for instructions value: -2.51 -2.36
    95% mean confidence interval for instructions %-change: -1.15% -1.10%
    Instructions are helped.

    total cycles in shared programs: 559442317 -> 558288357 (-0.21%)
    cycles in affected programs: 372699860 -> 371545900 (-0.31%)
    helped: 6748
    HURT: 1450
    helped stats (abs) min: 1 max: 32000 x̄: 182.41 x̃: 12
    helped stats (rel) min: <.01% max: 66.08% x̄: 3.42% x̃: 0.70%
    HURT stats (abs) min: 1 max: 2538 x̄: 53.08 x̃: 14
    HURT stats (rel) min: <.01% max: 96.72% x̄: 3.32% x̃: 0.90%
    95% mean confidence interval for cycles value: -179.01 -102.51
    95% mean confidence interval for cycles %-change: -2.37% -2.08%
    Cycles are helped.

    LOST: 0
    GAINED: 6

    No changes on earlier platforms.

    Signed-off-by: Ian Romanick <[email protected]>
    Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]> [v1]
    Reviewed-by: Kenneth Graunke <[email protected]> [v3]
    Reviewed-by: Matt Turner <[email protected]>
* i965/fs: Add support for nir_intrinsic_shuffle | Jason Ekstrand | 2018-03-07 | 1 | -0/+5
    Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel: Drop program size pointer from vec4/fs assembly getters. | Kenneth Graunke | 2018-03-02 | 1 | -1/+1
    These days, we're just passing a pointer to a prog_data field, which we
    already have access to. We can just use it directly. (In the past, it
    was a pointer to a separate value.)

    Reviewed-by: Iago Toral Quiroga <[email protected]>
* i965/fs: shuffle_32bit_load_result_to_16bit_data now skips components | Jose Maria Casanova Crespo | 2018-02-28 | 1 | -0/+1
    This helper, used to load 16-bit components from a 32-bit read, now
    allows skipping components with the new parameter first_component. The
    semantics are now to skip components until we reach first_component,
    and then read the number of components passed to the function.

    All previous uses of the helper are updated to use 0 as
    first_component.

    This allows reading 16-bit components when the first one is not 32-bit
    aligned, enabling more uses of untyped_reads with 16-bit types.

    v2: (Jason Ekstrand) Change parameters order to first_component,
        num_components

    Reviewed-by: Jason Ekstrand <[email protected]>
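The skipping semantics above can be illustrated with a host-side model. It is a sketch under stated assumptions, not the i965 helper: the name `load_16bit_from_32bit_model`, the constant `kLanes`, the component-major layout, and the choice that the even-indexed 16-bit component sits in the low half of each 32-bit word are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical number of SIMD channels in this model.
constexpr std::size_t kLanes = 8;

// Extract num_components 16-bit components, starting at first_component,
// from the result of a 32-bit read.
// Assumed layout: 32-bit component c of channel ch is src32[c * kLanes + ch],
// and 16-bit component k is stored in 32-bit component k / 2, in the low
// half when k is even and in the high half when k is odd.
std::vector<std::uint16_t> load_16bit_from_32bit_model(
    const std::vector<std::uint32_t>& src32,
    std::size_t first_component,
    std::size_t num_components)
{
    std::vector<std::uint16_t> dst(num_components * kLanes);
    for (std::size_t c = 0; c < num_components; ++c) {
        // Skip everything before first_component.
        const std::size_t k = first_component + c;
        for (std::size_t ch = 0; ch < kLanes; ++ch) {
            assert((k / 2) * kLanes + ch < src32.size());
            const std::uint32_t word = src32[(k / 2) * kLanes + ch];
            dst[c * kLanes + ch] =
                static_cast<std::uint16_t>(word >> (16 * (k % 2)));
        }
    }
    return dst;
}
```

With `first_component = 1`, the first value returned comes from the high half of a 32-bit word, which is the "not 32-bit aligned" case the commit message mentions.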
* intel/compiler/fs: Pass fs_inst to generate_ddx/ddy instead of opcode | Matt Turner | 2018-02-28 | 1 | -2/+4
    In a future patch, generate_ddy will want to inspect inst->exec_size.
    Change generate_ddx as well for consistency.

    Reviewed-by: Kenneth Graunke <[email protected]>
* intel/compiler/fs: Return multiple_instructions_emitted from generate_linterp | Matt Turner | 2018-02-28 | 1 | -1/+1
    If multiple instructions are emitted, special handling of things like
    conditional mod and NoDDClr/NoDDChk needs to be performed.

    Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs/bank_conflicts: Don't touch Gen7 MRF hack registers. | Francisco Jerez | 2017-12-12 | 1 | -1/+1
    Fixes: af2c320190f3c731 "intel/fs: Implement GRF bank conflict mitigation pass."
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104199
    Reported-by: Darius Spitznagel <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* intel/fs: Teach instruction scheduler about GRF bank conflict cycles. | Francisco Jerez | 2017-12-07 | 1 | -0/+1
    This should allow the post-RA scheduler to do a slightly better job at
    hiding latency in presence of instructions incurring bank conflicts.
    The main purpose of this patch is not to improve performance though,
    but to get conflict cycles to show up in shader-db statistics in order
    to make sure that regressions in the bank conflict mitigation pass
    don't go unnoticed.

    Acked-by: Matt Turner <[email protected]>
* intel/fs: Implement GRF bank conflict mitigation pass. | Francisco Jerez | 2017-12-07 | 1 | -0/+1
    Unnecessary GRF bank conflicts increase the issue time of ternary
    instructions (the overwhelmingly most common of which is MAD) by
    roughly 50%, leading to reduced ALU throughput. This pass attempts to
    minimize the number of bank conflicts by rearranging the layout of the
    GRF space post-register allocation. It's in general not possible to
    eliminate all of them without introducing extra copies, which are
    typically more expensive than the bank conflict itself.

    In a shader-db run on SKL this helps roughly 46k shaders:

      total conflicts in shared programs: 1008981 -> 600461 (-40.49%)
      conflicts in affected programs: 816222 -> 407702 (-50.05%)
      helped: 46234
      HURT: 72

    The running time of shader-db itself on SKL seems to be increased by
    roughly 2.52%±1.13% with n=20 due to the additional work done by the
    compiler back-end.

    On earlier generations the pass is somewhat less effective in relative
    terms because the hardware incurs a bank conflict anytime the last two
    sources of the instruction are duplicate (e.g. while trying to square a
    value using MAD), which is impossible to avoid without introducing
    copies. E.g. for a shader-db run on SNB:

      total conflicts in shared programs: 944636 -> 623185 (-34.03%)
      conflicts in affected programs: 853258 -> 531807 (-37.67%)
      helped: 31052
      HURT: 19

    And on BDW:

      total conflicts in shared programs: 1418393 -> 987539 (-30.38%)
      conflicts in affected programs: 1179787 -> 748933 (-36.52%)
      helped: 47592
      HURT: 70

    On SKL GT4e this improves performance of GpuTest Volplosion by 3.64%
    ±0.33% with n=16.

    NOTE: This patch intentionally disregards some i965 coding conventions
          for the sake of reviewability. This is addressed by the next
          squash patch which introduces an amount of (for the most part
          boring) boilerplate that might distract reviewers from the
          non-trivial algorithmic details of the pass.

    The following patch is squashed in:

    SQUASH: intel/fs/bank_conflicts: Roll back to the nineties.

    Acked-by: Matt Turner <[email protected]>
* i965/fs: Helpers for un/shuffle 16-bit pairs in 32-bit components | Jose Maria Casanova Crespo | 2017-12-06 | 1 | -0/+11
    These helpers are used to load/store 16-bit types from/to 32-bit
    components.

    The functions shuffle_32bit_load_result_to_16bit_data and
    shuffle_16bit_data_for_32bit_write are implemented similarly to the
    analogous functions for handling 64-bit types.

    v1: Explain need of temporary in shuffle operations. (Jason Ekstrand)

    Reviewed-by: Jason Ekstrand <[email protected]>
* i965/fs: Add remove_extra_rounding_modes optimization | Alejandro Piñeiro | 2017-12-06 | 1 | -0/+1
    Although from the SPIR-V point of view rounding modes are attached to
    the operation/destination, on i965 the rounding mode is a state, so we
    don't need to explicitly set it if the one we want is already set.

    Taking into account that the default mode is RTE, one possible
    optimization would be to optimize out the first RTE set for each
    block. For that to work, we would need to take block
    interrelationships into account. At this point, it is not worth
    complicating the optimization for such a small gain.

    v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
        with the rounding mode (Curro)
    v3: Reset optimization for every block. (Jason Ekstrand)

    Signed-off-by: Jose Maria Casanova Crespo <[email protected]>
    Signed-off-by: Alejandro Piñeiro <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
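The idea of this pass can be sketched on a toy instruction list. The types `Op`, `Rounding`, `Inst`, and `Block` and the function `remove_extra_rounding_modes_model` are hypothetical stand-ins, not the i965 IR. The sketch tracks the last rounding mode set within a block, drops a set-mode instruction that repeats it, and forgets the tracked mode at every block boundary, as the v3 note describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy IR; the real pass works on the i965 backend IR.
enum class Op { RndMode, Other };
enum class Rounding { RTE, RTZ };

struct Inst {
    Op op;
    Rounding mode;  // only meaningful when op == Op::RndMode
};

using Block = std::vector<Inst>;

// Remove rounding-mode sets that repeat the mode already in effect within
// the same block. Returns true if anything was removed.
bool remove_extra_rounding_modes_model(std::vector<Block>& blocks)
{
    bool progress = false;
    for (Block& block : blocks) {
        // Reset per block: the mode on entry is treated as unknown, so the
        // first set in a block is always kept (even if it is RTE).
        bool known = false;
        Rounding current = Rounding::RTE;

        Block kept;
        for (const Inst& inst : block) {
            if (inst.op == Op::RndMode) {
                if (known && inst.mode == current) {
                    progress = true;  // redundant set, drop it
                    continue;
                }
                known = true;
                current = inst.mode;
            }
            kept.push_back(inst);
        }
        block.swap(kept);
    }
    return progress;
}
```

Treating the entry mode as unknown is the conservative choice the message argues for: removing the first RTE set in a block would require knowing what mode every predecessor block leaves behind.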
* intel/cs: Push subgroup ID instead of base thread ID | Jason Ekstrand | 2017-11-07 | 1 | -1/+1
    We're going to want subgroup ID for SPIR-V subgroups eventually anyway.
    We really only want to push one and calculate the other from it. It
    makes a bit more sense to push the subgroup ID because it's simpler to
    calculate and because it's a real API thing. The only advantage to
    pushing the base thread ID is to avoid a single SHL in the shader.

    Reviewed-by: Iago Toral Quiroga <[email protected]>
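The "single SHL" trade-off can be made concrete with a one-line model. The function name and the assumption that the base thread ID equals the subgroup ID multiplied by a power-of-two dispatch width are illustrative and are not taken from the commit itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: derive the base thread ID from the pushed subgroup ID.
// Assumes the dispatch width is a power of two, passed as its log2
// (3 for SIMD8, 4 for SIMD16, 5 for SIMD32), so the multiply is one shift.
constexpr std::uint32_t base_thread_id_model(std::uint32_t subgroup_id,
                                             unsigned log2_dispatch_width)
{
    return subgroup_id << log2_dispatch_width;
}
```

Under that assumption, pushing only the subgroup ID costs the shader one shift whenever it needs the base thread ID, which matches the cost the commit message describes.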
* intel/cs: Rework the way thread local ID is handled | Jason Ekstrand | 2017-11-07 | 1 | -0/+1
    Previously, brw_nir_lower_intrinsics added the param and then emitted a
    load_uniform intrinsic to load it directly. This commit switches things
    over to use a specific NIR intrinsic for the thread id. The one thing I
    don't like about this approach is that we have to copy thread_local_id
    over to the new visitor in import_uniforms.

    Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Remove min_dispatch_width from fs_visitor | Jason Ekstrand | 2017-11-07 | 1 | -3/+2
    It's 8 for everything except compute shaders. For compute shaders,
    there's no need to duplicate the computation and it's just a possible
    source of error.

    Reviewed-by: Iago Toral Quiroga <[email protected]>
* i965/fs: Return a fs_reg from shuffle_64bit_data_for_32bit_write | Jason Ekstrand | 2017-11-07 | 1 | -4/+3
    All callers of this function allocate a fs_reg expressly to pass into
    it. It's much easier if we just let the helper allocate the register.
    While we're here, we switch it to doing the MOVs with an integer type
    so that we don't accidentally canonicalize floats on half of a double.

    Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel: Rewrite the world of push/pull params | Jason Ekstrand | 2017-10-12 | 1 | -3/+3
    This moves us away from the array of pointers model and onto a model
    where each param is represented by a generic uint32_t handle. We
    reserve 2^16 of these handles for builtins that get generated by
    somewhere inside the compiler and have well-defined meanings. Generic
    params have handles whose meanings are defined by the driver.

    The primary downside to this new approach is that it moves a little bit
    of the work that we would normally do at compile time to draw time. On
    my laptop this hurts OglBatch6 by no more than 1% and doesn't seem to
    have any measurable effect on OglBatch7. So, while this may come back
    to bite us, it doesn't look too bad.

    Reviewed-by: Jordan Justen <[email protected]>
    Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Use pushed UBO data in the scalar backend. | Kenneth Graunke | 2017-07-13 | 1 | -0/+2
    This actually takes advantage of the newly pushed UBO data, avoiding
    pull loads.

    Improves performance in GLBenchmark Manhattan 3.1 by:
    HSW: ~1%, BDW/SKL/KBL GT2: 3-4%, SKL GT4: 7-8%, APL: 4-5%.
    (thanks to Eero Tamminen for these numbers)

    shader-db results on Skylake, ignoring programs with spill/fill
    changes:

      total instructions in shared programs: 13963994 -> 13651893 (-2.24%)
      instructions in affected programs: 4250328 -> 3938227 (-7.34%)
      helped: 28527
      HURT: 0

      total cycles in shared programs: 179808608 -> 172535170 (-4.05%)
      cycles in affected programs: 79720410 -> 72446972 (-9.12%)
      helped: 26951
      HURT: 1248

      LOST: 46
      GAINED: 21

    Many "Deus Ex: Mankind Divided" shaders which already spilled end up
    spilling a lot more (about 240 programs hurt, 9 helped). The cycle
    estimator suggests this is still overall a win (-0.23% in cycle counts)
    presumably because we trade pull loads for fills.

    v2: Drop "PULL" environment variable left in for initial debugging
        (caught by Matt).

    Reviewed-by: Matt Turner <[email protected]>
* i965: Factor out push locations. | Kenneth Graunke | 2017-07-13 | 1 | -0/+2
    With UBOs, the answer of "have we decided to push this uniform" gets a
    bit more complicated - for one, we have multiple surfaces.

    This patch refactors things so we can add the new code in a single
    place.

    Reviewed-by: Matt Turner <[email protected]>
* i965/fs: Lower gl_VertexID and friends to inputs at the NIR level | Jason Ekstrand | 2017-05-09 | 1 | -1/+0
    NIR calls these system values but they come in from the VF unit as
    vertex data. It's terribly convenient to just be able to treat them as
    such in the back-end.

    Reviewed-by: Alejandro Piñeiro <[email protected]>
    Reviewed-by: Kenneth Graunke <[email protected]>
* i965/fs: rename lower_d2x to lower_conversions | Samuel Iglesias Gonsálvez | 2017-04-14 | 1 | -1/+1
    v2:
     - Change the name to lower_conversions.

    Signed-off-by: Samuel Iglesias Gonsálvez <[email protected]>
    Reviewed-by: Francisco Jerez <[email protected]>
* intel/compiler: consistently use ifndef guards over pragma once | Emil Velikov | 2017-03-22 | 1 | -1/+4
    Signed-off-by: Emil Velikov <[email protected]>
    Acked-by: Lionel Landwerlin <[email protected]>
    Acked-by: Vedran Miletić <[email protected]>
    Acked-by: Juha-Pekka Heikkila <[email protected]>
    Reviewed-by: Edward O'Callaghan <[email protected]>
* i965: Move the back-end compiler to src/intel/compiler | Jason Ekstrand | 2017-03-13 | 1 | -0/+500
    Mostly a dummy git mv with a couple of noticeable parts:
     - With the earlier header cleanups, nothing in src/intel depends on
       files from src/mesa/drivers/dri/i965/
     - Both Autoconf and Android builds are addressed. Thanks to Mauro and
       Tapani for the fixups in the latter
     - brw_util.[ch] is not really compiler specific, so it's moved to i965.

    v2:
     - move brw_eu_defines.h instead of brw_defines.h
     - remove no-longer applicable includes
     - add missing vulkan/ prefix in the Android build (thanks Tapani)

    v3:
     - don't list brw_defines.h in src/intel/Makefile.sources (Jason)
     - rebase on top of the oa patches

    [Emil Velikov: commit message, various small fixes throughout]
    Signed-off-by: Emil Velikov <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>