aboutsummaryrefslogtreecommitdiffstats
path: root/src/mesa/drivers/dri/i965
Commit message (Collapse)AuthorAgeFilesLines
* i965/fs: Add empirically-determined instruction latencies for gen7.Eric Anholt2012-12-141-3/+179
| | | | | | | | | | | | | | | | v2: Actually switch on the other math instructions mentioned in the comment. v3: Add timing data for textureSize(), and clean up some long comment lines. Testing shader_time of fs16 shaders on a few frames of various apps: nexuiz improved by 2.9% +/- 1.5% (n=10) no difference on GLB2.5 (n=36, outliers removed) no difference on GLB2.7 (n=25) etqw improved by 2.6% +/- 2.2% (n=25) no difference on lightsmark (n=25) Acked-by: Kenneth Graunke <[email protected]>
* i965/fs: Fix the clock increment in scheduling.Eric Anholt2012-12-141-3/+15
| | | | | | | I've tested this to be true with various ALU ops on gen7 (with the exception of MADs, which go at either 3 or 4 cycles per dispatch). Acked-by: Kenneth Graunke <[email protected]>
* i965/fs: Move the old gen4 bspec-based scheduling info to a helper func.Eric Anholt2012-12-141-33/+41
| | | | | | For gen7 everything changes, and we have actual information on latency. Acked-by: Kenneth Graunke <[email protected]>
* i965/fs: Set up gen7 UBO loads as sends from GRFs.Eric Anholt2012-12-145-7/+114
| | | | | | | | | | | | This gives the instruction scheduler a chance to schedule between the loads, whereas before it was restricted due to the dependencies between the MRFs for setting them up. For one shader in gles3conform, it goes from getting stuck in register allocation for as long as anybody's bothered to leave it running down to 23 seconds, thanks to the LIFO scheduling. Acked-by: Kenneth Graunke <[email protected]>
* i965/fs: Before reg alloc, schedule instructions to reduce live ranges.Eric Anholt2012-12-141-6/+41
| | | | | | | | | | | | | | | | | | | | | | | | | This came from an idea by Ben Segovia. 16-wide pixel shaders are very important for latency hiding on i965, so we want to try really hard to get them. If scheduling an instruction makes some set of instructions available, those are probably the ones that make the instruction's result dead. By choosing those first, we'll have a tendency to reduce the amount of live data as opposed to creating more. Previously, we were sometimes getting this behavior out of the scheduler, which was what produced the scheduler's original performance wins on lightsmark. Unfortunately, that was mostly an accident of the lame instruction latency information that I had, which made it impossible to fix the actual scheduling for performance. Now that we've fixed the scheduling for setup for register allocation, we can safely update the latency parameters for the final schedule. In shader-db, we lose 37 16-wide shaders, but gain 90 new ones. 4 shaders that were spilling change how many registers spill, for a reduction of 70/3899 instructions. v2: Simplify the new loop. Acked-by: Kenneth Graunke <[email protected]>
* i965/fs: Add some optional debug printfs to scheduling.Eric Anholt2012-12-141-0/+21
| | | | | | Seeing when instructions become available to schedule is really useful. Acked-by: Kenneth Graunke <[email protected]>
* i965/fs: Schedule instructions both before and after register allocation.Eric Anholt2012-12-143-18/+78
| | | | Acked-by: Kenneth Graunke <[email protected]>
* i965: Make sure that the shader_time report at context destroy happens.Eric Anholt2012-12-141-0/+3
| | | | | | Otherwise, you end up with some report from within a second of context destroy, which is now what you really want for testing the impact of changes
* i965: Print a total time for the different shader stages.Eric Anholt2012-12-141-10/+38
| | | | | | | | | Sometimes I've got a patch for a performance optimization that's not showing a statistically significant performance difference on reported FPS, but still seems like a good idea because it ought to reduce time spent in the shader. If I can see the total number of cycles spent in the shader stage being optimized, it may show that the patch is still worthwhile (or point out that it's actually broken in some way).
* i965: Scale shader_time to compensate for resets.Eric Anholt2012-12-144-9/+83
| | | | | | | | | | Some shaders experience resets more than others, which skews the numbers reported. Attempt to correct for this by linearly scaling according to the number of resets that happen. Note that will not be accurate if invocations of shaders have varying times and longer invocations are more likely to reset. However, this should at least be better than the previous situation.
* i965: Adjust the split between shader_time_end() and shader_time_write().Eric Anholt2012-12-144-51/+55
| | | | | | I'm about to emit other kinds of writes besides time deltas, and it turns out with the frequency of resets, we couldn't really use the old time delta write() function more than once in a shader.
* i965: Fix disassembly of jump targets on Gen7.Kenneth Graunke2012-12-121-4/+9
| | | | | | Gen7 stores the JIP/UIP bits in different places. Reviewed-by: Eric Anholt <[email protected]>
* i965: Make try_rewrite_rhs_to_dst compare VGRF size to regs written.Kenneth Graunke2012-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | try_rewrite_rhs_to_dst is a quick optimization to avoid generating new temporaries (and MOVs from those temporaries to the dest) for every expression tree we visit. By generating better code in simple cases, we reduce the burden on later optimization passes like register coalescing. Previously, we compared inst->regs_written() to lhs->vector_elements to make sure the instruction generating our value wrote the same number of components as our destination register. However, this fails in some cases. One example is texturing (which produces a vec4) into gl_FragData[i]. Technically, gl_FragData[i] is also a vec4. However, the destination VGRF actually has size 4n (where n is the size of the array). split_virtual_grfs() can't split VGRFs that are used by SEND messages which require contiguous destination registers (like texturing), and register allocation needs all VGRFs to have sizes between 1 and 4. Amnesia: The Dark Descent hits this case: a texturing instruction (4 components) gets rewritten to the gl_FragData output register (which was 4*3 = 12 components), causing the register allocator to hit the "we rely on split_virtual_grfs" assertion. This makes it possible to play Amnesia. Reviewed-by: Eric Anholt <[email protected]>
* i965/fs: Improve performance of shaders that start out with a discard.Eric Anholt2012-12-116-7/+148
| | | | | | | | | | | | | | I had tried this in the past, but ran into trouble with applications that sample from undiscarded pixels in the same subspan. To fix that issue, only jump to the end for an entire subspan at a time. Improves GLbenchmark 2.7 (1024x768) performance by 7.9 +/- 1.5% (n=8). v2: Drop the br variable in the jump instruction -- if I ever do jumps pre-gen6, it'll be a different code block anyway since we don't have HALT until gen6. Reviewed-by: Kenneth Graunke <[email protected]>
* i965/fs: Rewrite discards to use a flag subreg to track discarded pixels.Eric Anholt2012-12-118-73/+46
| | | | | | | | | This makes much more sense on gen6+, and will also prove useful for early exit of shaders on discard. v2: fix up a stale comment from before converting gen4-5. Reviewed-by: Kenneth Graunke <[email protected]>
* i965/fs: Add an instruction flag for choosing the flag subregister.Eric Anholt2012-12-116-13/+42
| | | | | | | | We're going to redo discard handling to track discards in the other flag subregister, saving instructions in the discard and allowing predicated jumps out to the end of the shader. Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Let brw_flag_reg() choose the flag reg and subreg.Eric Anholt2012-12-114-7/+7
| | | | | | We're about to start using the f0.1 subregister. Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Print the flag reg updated by conditional modifiers.Eric Anholt2012-12-111-1/+15
| | | | | | | This makes our output more consistent with other disasm tools, and will be necessary when we start using f0.1. Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Add the new flag_reg_nr instruction field from IVB.Eric Anholt2012-12-112-5/+9
| | | | Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Correct the name and usage of the flag subregister number field.Eric Anholt2012-12-114-15/+15
| | | | | | | We've been calling it a register number, it's actually the subregister, and things will get confusing once we start using it if it isn't fixed. Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Remove bogus flag_reg_nr field from bits3.Eric Anholt2012-12-111-4/+2
| | | | | | | There's a flag subreg nr field in bits2 next to src0.vertstride, but there shouldn't be anything in bits3 next to src1.vertstride. Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Add missing _NEW_BUFFERS dirty bit in Gen7 SBE state.Kenneth Graunke2012-12-081-1/+2
| | | | | | | | This is needed to compute render_to_fbo. It even has the comment. NOTE: This is a candidate for stable branches. Reviewed-by: Eric Anholt <[email protected]>
* intel: Enable ETC2 support on intel hardwareAnuj Phogat2012-12-071-0/+15
| | | | | | | | | | | | | | | | | | | | This patch enables support for ETC2 compressed textures on all intel hardware. At present, ETC2 texture decoding is not available on intel hardware. So, compressed ETC2 texture data is decoded in software and stored in a suitable uncompressed MESA_FORMAT at the time of glCompressedTexImage2D. Currently, ETC2 formats are only exposed in OpenGL ES 3.0. V2: Use single etc_wraps variable for both etc1 and etc2. V3: Remove redundant code and use just one intel_miptree_map_etc() and intel_miptree_unmap_etc() function. Choose MESA_FORMAT_SIGNED_{R16, GR1616} for ETC2 signed-{r11, rg11} formats Signed-off-by: Anuj Phogat <[email protected]> Tested-by: Matt Turner <[email protected]> Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Add a debug flag for counting cycles spent in each compiled shader.Eric Anholt2012-12-0515-9/+517
| | | | | | | | | | | | | | | | | | | | | | This can be used for two purposes: Using hand-coded shaders to determine per-instruction timings, or figuring out which shader to optimize in a whole application. Note that this doesn't cover the instructions that set up the message to the URB/FB write -- we'd need to convert the MRF usage in these instructions to GRFs so that our offsets/times don't overwrite our shader outputs. Reviewed-by: Kenneth Graunke <[email protected]> (v1) v2: Check the timestamp reset flag in the VS, which is apparently getting set fairly regularly in the range we watch, resulting in negative numbers getting added to our 32-bit counter, and thus large values added to our uint64_t. v3: Rebase on reladdr changes, removing a new safety check that proved impossible to satisfy. Add a comment to the AOP defs from Ken's review, and put them in a slightly more sensible spot. v4: Check timestamp reset in the FS as well.
* i965: Add a flag for instructions with normal writemasking disabled.Eric Anholt2012-12-054-0/+4
| | | | | For getting values from the new timestamp register, the channels we load have nothing to do with the pixels dispatched.
* i965/fs: Add support for uniform array access with a variable index.Eric Anholt2012-12-044-24/+216
| | | | | | | | | | | | | Serious Sam 3 had a shader hitting this path, but it's used rarely so it didn't show a significant performance difference (n=7). It does reduce compile time massively, though -- one shader goes from 14s compile time and 11723 instructions generated to .44s and 499 instructions. Note that some shaders lose 16-wide mode because we don't support 16-wide and pull constants at the moment (generally, things looping over a few-element array where the loop isn't getting unrolled). Given that those shaders are being generated with 15-20% fewer instructions, it probably outweighs the loss of 16-wide.
* i965/fs: Conditionalize constant-index UBO load code and add comments.Eric Anholt2012-12-041-28/+33
| | | | | I wanted to separate this step for easier reviewing when I add the variable-index case next.
* i965/fs: Restrict optimization that would fail for gen7's SENDs from GRFsEric Anholt2012-12-043-8/+28
| | | | | | v2: Fix SNB math bug in register_coalesce() where I was looking at the instruction to be removed, not the instruction to be copy propagated into.
* i965/fs: Allow source mods on gen7+ math.Eric Anholt2012-12-041-1/+1
| | | | | This gen6 restriction was removed in gen7 as the mathbox merge to act more like a normal instruction was finished in the hardware.
* i965/fs: Add instruction emit for varying-index reads of uniforms.Eric Anholt2012-12-044-0/+105
| | | | | | | The gen7 send-from-GRF path is sufficiently different from the perspective of IR generation and optimization that I just made it a separate opcode. v2: fix whitespace, rebase on Ken's recent refactor.
* i965/fs: Rename the existing pull constant load opcode.Eric Anholt2012-12-046-14/+16
| | | | | We're going to use another send message for handling loads with a varying per-fragment array index.
* i965: Add a header_present flag for setting up dp read messages.Eric Anholt2012-12-043-1/+7
| | | | | | As of gen7, we can skip the header on some messages, and this can make optimization on those messages much nicer when you've got GRFs instead of MRFs as the source.
* i965/gen7: Add some safety checks for send messages from GRFs.Eric Anholt2012-12-041-0/+15
|
* i965: Include codegen time in the INTEL_DEBUG=perf stall detection.Eric Anholt2012-12-032-12/+18
| | | | | | | In the VS case, we were missing the entire compile time in the stall detection! Reviewed-by: Kenneth Graunke <[email protected]>
* i965: Don't leak the IR annotation into later instructions.Eric Anholt2012-12-032-0/+2
| | | | | | | | | After walking our IR instructions (Mesa or GLSL), we don't want to also mark the start of the FB/URB writes or whatever as being that IR. This can end up being misleading when the end of the IR visit got copy propagated out to a later instruction in the URB writes. Reviewed-by: Kenneth Graunke <[email protected]>
* i965/vp: Fix crashes with INTEL_DEBUG=vs.Eric Anholt2012-12-031-0/+1
| | | | | | | The VP generation doesn't set up the output reg strings, so if you didn't happen to get these values as 0 on the stack, you'd lose. Reviewed-by: Kenneth Graunke <[email protected]>
* i965/vs: Fix uninitialized shader pointer used in debug output.Eric Anholt2012-12-031-0/+2
| | | | Reviewed-by: Kenneth Graunke <[email protected]>
* i965/fs: Add fs_reg::is_zero() and is_one(); use for opt_algebraic().Kenneth Graunke2012-11-302-7/+24
| | | | | | | | | | | | | | | | | | These helper macros save you from writing nasty expressions like: if ((inst->src[1].type == BRW_REGISTER_TYPE_F && inst->src[1].imm.f == 1.0) || ((inst->src[1].type == BRW_REGISTER_TYPE_D || inst->src[1].type == BRW_REGISTER_TYPE_UD) && inst->src[1].imm.u == 1)) { Instead, you simply get to write inst->src[1].is_one(). Simple. Also, this makes the FS backend match the VS backend (which has these). This patch also converts opt_algebraic to use the new helper functions. As a consequence, it will now also optimize integer-typed expressions. Reviewed-by: Eric Anholt <[email protected]>
* i965/fp: Fix segfault on gen4 TXB instructions.Eric Anholt2012-11-291-0/+2
| | | | | | | | | | The gen4 simd16 workaround looks at ir->type to determine how much storage to allocate for the simd16 value. In fragment programs, texturing only ever returns float vec4s (unlike GLSL, which can also have scalar floats or vector integers), so this is the right type. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=56962 Reviewed-by: Kenneth Graunke <[email protected]>
* mesa: Rename API_OPENGL to API_OPENGL_COMPAT.Paul Berry2012-11-291-1/+1
| | | | | | | | | | This should help avoid confusion now that we're using the gl_api enum to distinguishing between core and compatibility API's. The corresponding enum value for core API's is API_OPENGL_CORE. Acked-by: Eric Anholt <[email protected]> Acked-by: Matt Turner <[email protected]> Acked-by: Kenneth Graunke <[email protected]>
* i965/vs: Move struct brw_compile (p) entirely inside vec4_generator.Kenneth Graunke2012-11-283-4/+3
| | | | | | | | | | | | | | The brw_compile structure contains the brw_instruction store and the brw_eu_emit.c state tracking fields. These are only useful for the final assembly generation pass; the earlier compilation stages doesn't need them. This also means that the code generator for future hardware won't have access to the brw_compile structure, which is extremely desirable because it prevents accidental generation of Gen4-7 code. Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/vs: Split final assembly code generation out of vec4_visitor.Kenneth Graunke2012-11-284-53/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compiling shaders requires several main steps: 1. Generating VS IR from either GLSL IR or Mesa IR 2. Optimizing the IR 3. Register allocation 4. Generating assembly code This patch splits out step 4 into a separate class named "vec4_generator." There are several reasons for doing so: 1. Future hardware has a different instruction encoding. Splitting this out will allow us to replace vec4_generator (which relies heavily on the brw_eu_emit.c code and struct brw_instruction) with a new code generator that writes the new format. 2. It reduces the size of the vec4_visitor monolith. (Arguably, a lot more should be split out, but that's left for "future work.") 3. Separate namespaces allow us to make helper functions for generating instructions in both classes: ADD() can exist in vec4_visitor and create IR, while ADD() in vec4_generator() can create brw_instructions. (Patches for this upcoming.) Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/vs: Abort on unsupported opcodes rather than failing.Kenneth Graunke2012-11-281-3/+4
| | | | | | | | | | Final code generation should never fail. This is a bug, and there should be no user-triggerable cases where this could occur. Also, we're not going to have a fail() method after the split. Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/vs: Move uses of brw_compile from do_vs_prog to brw_vs_emit.Kenneth Graunke2012-11-283-14/+19
| | | | | | | | | | | | The brw_compile structure is closely tied to the Gen4-7 hardware encoding. However, do_vs_prog is very generic: it just calls out to get a compiled program and then uploads it. This isn't ultimately where we want it, but it's a step in the right direction: it's now closer to the code generator. Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/vs: Rework memory contexts for shader compilation data.Kenneth Graunke2012-11-285-8/+12
| | | | | | | | | | | | | | During compilation, we allocate a bunch of things: the IR needs to last at least until code generation...and then the program store needs to last until after we upload the program. For simplicity's sake, just keep it all around until we upload the program. After that, it can all be freed. This will also save a lot of headaches during the upcoming refactoring. Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/vs: Pass the brw_context pointer into brw_compute_vue_map().Kenneth Graunke2012-11-281-3/+2
| | | | | | | | We used to steal it out of the brw_compile struct, but that won't be initialized in time soon (and is eventually going away). Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/vs: Pass the brw_context pointer into vec4_visitor and do_vs_prog.Kenneth Graunke2012-11-285-9/+14
| | | | | | | | We used to steal it out of the brw_compile struct...but vec4_visitor isn't going to have one of those in the future. Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/vs: Move some functions from brw_vec4_emit.cpp to brw_vec4.cpp.Kenneth Graunke2012-11-282-263/+265
| | | | | | | | | | | This leaves only the final code generation stage in brw_vec4_emit.cpp, moving the payload setup, run(), and brw_vs_emit functions to brw_vec4.cpp. The fragment shader backend puts these functions in brw_fs.cpp, so this patch also helps with consistency. Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/gen4-5: Fix segfaults with stencil-only depth/stencil setups.Eric Anholt2012-11-281-1/+3
| | | | | | | | Fixes a ton of piglit regressions since the depthstencil fixes for gen6+. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=57309 Reviewed-by: Chad Versace <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]>
* i965/fs: Don't generate saturates over existing variable values.Eric Anholt2012-11-281-0/+1
| | | | | | | | | | | | Fixes a crash in http://workshop.chromeexperiments.com/stars/ on i965, and the new piglit test glsl-fs-clamp-5. We were trying to emit a saturating move into a uniform, which the code generator appropriately choked on. This was broken in the change in 32ae8d3b321185a85b73ff703d8fc26bd5f48fa7. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=57166 NOTE: This is a candidate for the 9.0 branch. Reviewed-by: Kenneth Graunke <[email protected]>