path: root/src/intel
Commit message | Author | Age | Files | Lines
* mesa: Add new fast mtx_t mutex type for basic use casesTimothy Arceri2017-11-091-23/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While modern pthread mutexes are very fast, they still incur a call to an external DSO and the overhead of the generality and features of pthread mutexes. Most mutexes in mesa only need lock/unlock, and the idea here is that we can inline the atomic operation and make the fast case just two instructions. Mutexes are subtle and finicky to implement, so we carefully copy the implementation from Ulrich Drepper's well-written and well-reviewed paper: "Futexes Are Tricky" http://www.akkadia.org/drepper/futex.pdf We implement "mutex3", which gives us a mutex that has no syscalls on uncontended lock or unlock. Further, the uncontended lock boils down to a cmpxchg and an untaken branch and the uncontended unlock is just a locked decr and an untaken branch. We use __builtin_expect() to indicate that contention is unlikely so that gcc will put the contention code out of the main code flow. A fast mutex only supports lock/unlock and can't be recursive or used with condition variables. We keep the pthread mutex implementation around for the few places where we use condition variables or recursive locking. For platforms or compilers where futex and atomics aren't available, simple_mtx_t falls back to the pthread mutex. The pthread mutex lock/unlock overhead shows up on benchmarks for CPU bound applications. Most CPU bound cases are helped and some of our internal bind_buffer_object heavy benchmarks gain up to 10%. Signed-off-by: Kristian Høgsberg <[email protected]> Signed-off-by: Timothy Arceri <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
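The three-state scheme described in this message can be sketched in C11 as follows. This is a minimal illustration, not the code from the commit: the names `demo_mtx`, `demo_futex_wait` and `demo_futex_wake` are hypothetical, and the two futex helpers are stand-ins that yield the CPU instead of sleeping in the kernel, so the sketch stays portable. It assumes a GCC-compatible compiler for `__builtin_expect()`.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for futex(FUTEX_WAIT) and futex(FUTEX_WAKE).
 * A real implementation would sleep in the kernel; this sketch only yields. */
static void demo_futex_wait(_Atomic uint32_t *addr, uint32_t expected)
{
    if (atomic_load(addr) == expected)
        sched_yield();
}

static void demo_futex_wake(_Atomic uint32_t *addr)
{
    (void)addr; /* nothing sleeps in this sketch, so there is nothing to wake */
}

/* 0 = unlocked, 1 = locked without waiters, 2 = locked, possibly with waiters */
typedef struct { _Atomic uint32_t val; } demo_mtx;

static inline void demo_mtx_lock(demo_mtx *m)
{
    uint32_t c = 0;
    /* Fast path: one compare-and-swap from 0 to 1 and an untaken branch. */
    if (__builtin_expect(atomic_compare_exchange_strong(&m->val, &c, 1), 1))
        return;
    /* Contended path: mark the lock as contended, then wait until it is free. */
    if (c != 2)
        c = atomic_exchange(&m->val, 2);
    while (c != 0) {
        demo_futex_wait(&m->val, 2);
        c = atomic_exchange(&m->val, 2);
    }
}

static inline void demo_mtx_unlock(demo_mtx *m)
{
    /* Fast path: one atomic decrement from 1 to 0 and an untaken branch. */
    uint32_t c = atomic_fetch_sub(&m->val, 1);
    if (__builtin_expect(c != 1, 0)) {
        /* The lock was marked contended: release it and wake one waiter. */
        atomic_store(&m->val, 0);
        demo_futex_wake(&m->val);
    }
}
```

The state value 2 is what keeps the uncontended unlock free of syscalls: a wake is only issued when some thread has announced that it may be waiting.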
* intel/nir: Break the linking code into a helper in brw_nir.cJason Ekstrand2017-11-082-0/+36
| | | | | Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com> Cc: [email protected]
* intel/nir: Add a helper for getting the NoIndirect maskJason Ekstrand2017-11-081-14/+19
| | | | | Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com> Cc: [email protected]
* automake: intel: correctly append to the LIBADD variableEmil Velikov2017-11-081-1/+1
| | | | | | | | | | | | | | Commit 05fc62d89f5 sets the variable, yet it forgot to update the existing reference to append (instead of assign). Thus as-is the expat library was discarded from the link chain when building with Android. Fixes: 05fc62d89f5 ("automake: intel: move expat handling where it's used") Cc: Hongxu Jia <[email protected]> Signed-off-by: Emil Velikov <[email protected]> Reviewed-by: Eric Engestrom <[email protected]>
* intel/fs/nir: Return Q types from brw_reg_type_for_bit_sizeJason Ekstrand2017-11-071-2/+2
| | | | Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]>
* intel/fs/nir: Use Q immediates for load_const on gen8+Jason Ekstrand2017-11-071-3/+11
| | | | Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]>
* intel/fs/nir: Setup immediates based on type in i2b and f2bJason Ekstrand2017-11-071-1/+2
| | | | Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]>
* intel/reg: Add helpers for 64-bit integer immediatesJason Ekstrand2017-11-071-0/+18
| | | | Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]>
* nir,intel/compiler: Use a fixed subgroup sizeJason Ekstrand2017-11-072-4/+2
| | | | | | | | | | | | | | | | The GL_ARB_shader_ballot spec says that gl_SubGroupSizeARB is declared as a uniform. This means that it cannot change across an invocation such as a draw call or a compute dispatch. For compute shaders, we're ok because we only ever use one dispatch size. For fragment, however, the hardware dynamically chooses between SIMD8 and SIMD16 which violates the spec. Instead, let's just pick a subgroup size based on the shader stage. The fixed size we choose for compute shaders is a bit higher than strictly needed but there's no real harm in that. The advantage is that, if they do anything interesting with the value, NIR will see it as an immediate and can optimize better. Acked-by: Lionel Landwerlin <[email protected]> Reviewed-by: Iago Toral Quiroga <[email protected]>
* nir/lower_subgroups: Lower ballot intrinsics to the specified bit sizeJason Ekstrand2017-11-072-1/+1
| | | | | | | | | | | | | | Ballot intrinsics return a bitfield of subgroups. In GLSL and some SPIR-V extensions, they return a uint64_t. In SPV_KHR_shader_ballot, they return a uvec4. Also, some back-ends would rather pass around 32-bit values because it's easier than messing with 64-bit all the time. To solve this mess, we make nir_lower_subgroups take a new parameter called ballot_bit_size and it lowers whichever thing it gets in from the source language (uint64_t or uvec4) to a scalar with the specified number of bits. This replaces a chunk of the old lowering code. Reviewed-by: Lionel Landwerlin <[email protected]> Reviewed-by: Iago Toral Quiroga <[email protected]>
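As a rough illustration of the idea, the sketch below builds a ballot as a 32-bit scalar and then widens it to the two source-language shapes mentioned above. It assumes a `ballot_bit_size` of 32 and a subgroup of at most 32 invocations; all `demo_` names are hypothetical and none of this is the NIR pass itself.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t x, y, z, w; } demo_uvec4;

/* Back-end view: one bit per invocation, packed into a 32-bit scalar. */
static uint32_t demo_ballot32(const bool *pred, unsigned count)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < count && i < 32; i++) {
        if (pred[i])
            mask |= 1u << i;
    }
    return mask;
}

/* GLSL-style view: the scalar is zero-extended to 64 bits. */
static uint64_t demo_ballot_as_u64(uint32_t ballot)
{
    return (uint64_t)ballot;
}

/* uvec4-style view: the scalar lands in .x, the remaining components are 0. */
static demo_uvec4 demo_ballot_as_uvec4(uint32_t ballot)
{
    demo_uvec4 v = { ballot, 0, 0, 0 };
    return v;
}
```

With this split, the back-end only ever handles the scalar, and the conversion to whatever the source language expects is a cheap, uniform step.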
* nir: Add a new subgroups lowering passJason Ekstrand2017-11-072-4/+7
| | | | | | | | | | | | This commit pulls nir_lower_read_invocations_to_scalar along with most of the guts of nir_opt_intrinsics (which mostly does subgroup lowering) into a new nir_lower_subgroups pass. There are various other bits of subgroup lowering that we're going to want to do so it makes a bit more sense to keep it all together in one pass. We also move it in i965 to happen after nir_lower_system_values because we want to handle the subgroup mask system value intrinsics here. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Don't use automatic exec size inferenceJason Ekstrand2017-11-071-3/+9
| | | | | | | | | | | | | | The automatic exec size inference can accidentally mess things up if we're not careful. For instance, if we have add(4) g38.2<4>D g38.1<8,2,4>D g38.2<8,2,4>D then the destination register will end up having a width of 2 with a horizontal stride of 4 and a vertical stride of 8. The EU emit code sees the width of 2 and decides that we really wanted an exec size of 2 which doesn't do what we wanted. Reviewed-by: Iago Toral Quiroga <[email protected]>
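The failure mode can be modelled with a small sketch. A region `<vstride;width,hstride>` places channel `i` at row `i / width` and column `i % width`; the hypothetical `demo_infer_exec_size()` below mimics an inference that simply takes the destination width. The struct and function names are assumptions made for illustration, not the EU emit code.

```c
/* A register region <vstride;width,hstride>, in units of elements. */
struct demo_region { unsigned vstride, width, hstride; };

/* Element offset of channel i within the region. */
static unsigned demo_channel_offset(struct demo_region r, unsigned i)
{
    return (i / r.width) * r.vstride + (i % r.width) * r.hstride;
}

/* Hypothetical model of the inference: exec size = destination width. */
static unsigned demo_infer_exec_size(struct demo_region dst)
{
    return dst.width;
}
```

For a destination of `<8;2,4>`, four channels land at element offsets 0, 4, 8 and 12, which is exactly a stride-4 write of four elements, yet the modelled inference returns 2. Setting the exec size explicitly avoids that mismatch.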
* intel/fs: Explicitly set EXECUTE_1 where neededJason Ekstrand2017-11-074-9/+15
| | | | Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/eu: Explicitly set EXECUTE_1 where neededJason Ekstrand2017-11-071-0/+9
| | | | Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/eu: Make automatic exec sizes a configurable optionJason Ekstrand2017-11-073-14/+29
| | | | | | | | | | | | | | We have had a feature in codegen for some time that tries to automatically infer the execution size of an instruction from the width of its destination. For things such as fixed function GS, clipper, and SF programs, this is very useful because they tend to have lots of hand-rolled register setup and trying to specify the exec size all the time would be prohibitive. For things that come from a higher-level IR, however, it's easier to just set the right size all the time and the automatic exec sizes can, in fact, cause problems. This commit makes it optional while enabling it by default. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Rework zero-length URB write handlingJason Ekstrand2017-11-071-29/+31
| | | | | | | | | | | | | | Originally we tried to handle this case based on slots_valid. However, there are a number of ways that this can go wrong. For one, we throw away any trailing slots which either aren't written or are set to VARYING_SLOT_PAD. Second, even if PSIZ is a valid slot, we may not actually write anything there. Between the lot of these, it was possible to end up in a case where we tried to do a regular URB write but ended up with a length of 1 which is invalid. This commit moves it to the end and makes it based on a new boolean flag urb_written. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/compiler/fs: Set up subgroup invocation as a system valueJason Ekstrand2017-11-071-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Subgroup invocation is computed using a vector immediate and some dispatch-aware arithmetic. Unfortunately, due to the vector arithmetic, and the fact that it's frequently read 16-wide, it's not something that can easily be CSEd by the back-end compiler. There are a few different possible approaches to this problem: 1) Emit the code to calculate the subgroup invocation on-the-fly and trust NIR to do the CSE. This is what we were doing. 2) Add a back-end instruction for the subgroup ID. This has the advantage of helping the back-end compiler with CSE but has the downside of very poor scheduling for the calculation because it has to be emitted in the back-end. 3) Emit the calculation at the top of the program and re-use the result. This gets rid of the CSE problem but comes at the cost of an extra live register. This commit switches us from 1) to 3). We choose to store the subgroup invocation values as a W type to reduce the impact of the extra live register. Trusting NIR and using 1) was fine but we're soon going to want to use the subgroup invocation value for other things in the back-end compiler and this makes it much easier to do without having to worry about CSE problems. Reviewed-by: Iago Toral Quiroga <[email protected]>
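One way to picture "a vector immediate and some dispatch-aware arithmetic" is the sketch below. It assumes the immediate packs the values 0 through 7 as 4-bit fields (written here as `0x76543210`) and that wider dispatches are filled by adding 8 and then 16 to the lower channels; both details are assumptions for illustration, as is the name `demo_subgroup_invocation`. The result is stored as 16-bit words, matching the W type mentioned above.

```c
#include <stdint.h>

/* Fill out[0 .. dispatch_width) with each channel's subgroup invocation.
 * dispatch_width is expected to be 8, 16 or 32. */
static void demo_subgroup_invocation(int16_t *out, unsigned dispatch_width)
{
    /* Eight 4-bit values packed into one immediate, lowest nibble first. */
    const uint32_t packed = 0x76543210u;

    for (unsigned i = 0; i < 8; i++)
        out[i] = (int16_t)((packed >> (4 * i)) & 0xf);

    /* Channels 8..15 are channels 0..7 plus 8. */
    if (dispatch_width > 8) {
        for (unsigned i = 0; i < 8; i++)
            out[8 + i] = (int16_t)(out[i] + 8);
    }

    /* Channels 16..31 are channels 0..15 plus 16. */
    if (dispatch_width > 16) {
        for (unsigned i = 0; i < 16; i++)
            out[16 + i] = (int16_t)(out[i] + 16);
    }
}
```

Computing this once at the top of the program, as option 3) does, costs one live register but means later users just read the stored values.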
* intel/cs: Push subgroup ID instead of base thread IDJason Ekstrand2017-11-077-30/+36
| | | | | | | | | | We're going to want subgroup ID for SPIR-V subgroups eventually anyway. We really only want to push one and calculate the other from it. It makes a bit more sense to push the subgroup ID because it's simpler to calculate and because it's a real API thing. The only advantage to pushing the base thread ID is to avoid a single SHL in the shader. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/cs: Re-run final NIR optimizations for each SIMD sizeJason Ekstrand2017-11-071-41/+69
| | | | | | | | | | | | With the advent of SPIR-V subgroup operations, compute shaders will have to be slightly different depending on the SIMD size at which they execute. In order to allow us to do dispatch-width specific things in NIR, we re-run the final NIR stages for each SIMD width. One side-effect of this change is that we start rallocing fs_visitors which means we need DECLARE_RALLOC_CXX_OPERATORS. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/compiler: Move the destructor from vec4_visitor to backend_shaderJason Ekstrand2017-11-074-5/+5
| | | | Reviewed-by: Kenneth Graunke <[email protected]>
* i965/fs: Get rid of the early return in brw_compile_csJason Ekstrand2017-11-071-13/+14
| | | | Reviewed-by: Kenneth Graunke <[email protected]>
* intel/cs: Rework the way thread local ID is handledJason Ekstrand2017-11-075-46/+29
| | | | | | | | | | Previously, brw_nir_lower_intrinsics added the param and then emitted a load_uniform intrinsic to load it directly. This commit switches things over to use a specific NIR intrinsic for the thread id. The one thing I don't like about this approach is that we have to copy thread_local_id over to the new visitor in import_uniforms. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Mark 64-bit values as being contiguousJason Ekstrand2017-11-071-1/+4
| | | | | | | | | | | This isn't often a problem. However, when we're in a compute shader, we must push the thread local ID so we decrement the amount of available push space by 1 and it's no longer even and 64-bit data can, in theory, span it. By marking those uniforms contiguous, we ensure that they never get split in half between push and pull constants. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/cs: Ignore runtime_check_aads_emit for CSJason Ekstrand2017-11-071-2/+1
| | | | | | It's only set on gen4-5 which clearly don't support compute shaders. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/cs: Stop setting dispatch_grf_start_regJason Ekstrand2017-11-072-3/+0
| | | | | | Nothing ever reads it for compute shaders because it's always 1. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/cs: Drop max_dispatch_width checks from compile_csJason Ekstrand2017-11-071-4/+8
| | | | | | | | The only things that adjust fs_visitor::max_dispatch_width are render target writes which don't happen in compute shaders so they're pointless. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Remove min_dispatch_width from fs_visitorJason Ekstrand2017-11-073-33/+25
| | | | | | | | It's 8 for everything except compute shaders. For compute shaders, there's no need to duplicate the computation and it's just a possible source of error. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: use pull constant locations to check for first compile of a shaderJason Ekstrand2017-11-072-2/+7
| | | | | | | | | | Before, we were bailing in assign_constant_locations based on the minimum dispatch size. The more direct thing to do is simply to check for whether or not we have constant locations and bail if we do. For nir_setup_uniforms, it's completely safe to do it multiple times because we just copy a value from the NIR shader. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Retype dest to match value in read[First]InvocationJason Ekstrand2017-11-071-4/+2
| | | | | | | | | This is what we really wanted all along. Always retyping to D works because that's what get_nir_src() always gives us, at least for 32-bit types. The SPIR-V variants of these operations accept arbitrary types and we need this if we're going to handle 64 or 16-bit values. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Uniformize the index in readInvocationJason Ekstrand2017-11-071-1/+1
| | | | | | | | The index is any value provided by the shader and this can be called in non-uniform control flow so we can't just take component 0. Found by inspection. Reviewed-by: Iago Toral Quiroga <[email protected]>
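The reason component 0 is not safe can be shown with a small model: in non-uniform control flow channel 0 may be disabled, so its value is stale. The hypothetical `demo_uniformize()` below instead picks the value from the lowest-numbered active channel; the execution mask representation and the function name are assumptions made for this sketch.

```c
#include <stdint.h>

/* Return the value held by the lowest-numbered active channel.
 * exec_mask has one bit per channel and must be non-zero. */
static uint32_t demo_uniformize(const uint32_t *per_channel, uint32_t exec_mask)
{
    unsigned first = 0;
    /* Find the first set bit; stop at bit 31 to stay in range. */
    while (first < 31 && !((exec_mask >> first) & 1u))
        first++;
    return per_channel[first];
}
```

With values `{7, 9, 9, 9}` and channel 0 disabled (mask `0xE`), taking component 0 would yield the stale 7, while the sketch returns 9.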
* intel/fs: Protect opt_algebraic from OOB BROADCAST indicesJason Ekstrand2017-11-071-2/+11
| | | | Reviewed-by: Iago Toral Quiroga <[email protected]>
* i965/fs/nir: Don't stomp 64-bit values to D in get_nir_srcJason Ekstrand2017-11-071-13/+24
| | | | Reviewed-by: Iago Toral Quiroga <[email protected]>
* i965/fs/nir: Minor refactor of store_outputJason Ekstrand2017-11-071-4/+3
| | | | | | | | | | | Stop retyping the output of shuffle_64bit_data_for_32bit_write. It's always BRW_REGISTER_TYPE_D which is perfectly fine for writing out. Also, when we change get_nir_src to return something with a 64-bit type for 64-bit values, the retyping will not be at all what we want. Also, retyping the output based on src.type before we whack it back to 32 bits is a problem because the output is always 32 bits. Reviewed-by: Iago Toral Quiroga <[email protected]>
* i965/fs: Return a fs_reg from shuffle_64bit_data_for_32bit_writeJason Ekstrand2017-11-072-29/+12
| | | | | | | | | All callers of this function allocate a fs_reg expressly to pass into it. It's much easier if we just let the helper allocate the register. While we're here, we switch it to doing the MOVs with an integer type so that we don't accidentally canonicalize floats on half of a double. Reviewed-by: Iago Toral Quiroga <[email protected]>
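A sketch of what such a shuffle does, under the assumption that each channel's 64-bit value must be split so that all low dwords form one 32-bit component and all high dwords form the next. The name `demo_shuffle_64bit_for_32bit_write` and the flat-array layout are illustrative only. Using integer types throughout mirrors the point made above: the halves of a double are moved as raw bits and never interpreted as floats.

```c
#include <stdint.h>

/* src holds n 64-bit channel values. dst receives 2 * n dwords:
 * dst[0 .. n) are the low halves, dst[n .. 2n) are the high halves. */
static void demo_shuffle_64bit_for_32bit_write(uint32_t *dst,
                                               const uint64_t *src,
                                               unsigned n)
{
    for (unsigned c = 0; c < n; c++) {
        dst[c]     = (uint32_t)(src[c] & 0xffffffffu); /* low dword  */
        dst[n + c] = (uint32_t)(src[c] >> 32);         /* high dword */
    }
}
```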
* i965/fs/nir: Simplify 64-bit store_outputJason Ekstrand2017-11-071-19/+6
| | | | | | | | | The swizzles weren't doing any good because swiz is just XYZW. Also, we were emitting an extra set of MOVs because shuffle_64bit_data_for_32bit already does a MOV for us. Finally, the temporary was only ever used inside the inner loop so there's no need for it to actually be an array. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Use the original destination region for int MUL loweringJason Ekstrand2017-11-071-7/+9
| | | | | | | Some hardware (CHV, BXT) has special restrictions on register regions when doing integer multiplication. We want to respect those when we lower to DxW multiplication. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
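The arithmetic identity behind a DxW lowering can be sketched as follows: a 32x32-bit multiply is split into two multiplies of a 32-bit value by a 16-bit value, and the partial products are recombined. This shows only the math, with a hypothetical function name; it says nothing about the register regions the commit is concerned with.

```c
#include <stdint.h>

/* Low 32 bits of a * b, built from two 32-bit by 16-bit multiplies. */
static uint32_t demo_mul32_via_dxw(uint32_t a, uint32_t b)
{
    uint32_t low  = a * (b & 0xffffu); /* D x W with the low word of b  */
    uint32_t high = a * (b >> 16);     /* D x W with the high word of b */
    return low + (high << 16);         /* recombine modulo 2^32 */
}
```

Because `b = b_low + (b_high << 16)`, the sum equals `a * b` modulo 2^32, which is all a 32-bit integer multiply is required to produce.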
* intel/fs: Fix integer multiplication lowering for src/dst hazardsJason Ekstrand2017-11-071-2/+8
| | | | | Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/fs: Fix MOV_INDIRECT for 64-bit values on little-coreJason Ekstrand2017-11-071-36/+39
| | | | | | | | | The same workaround we need for 64-bit values on little core also takes care of the Ivy Bridge problem and does so a bit more efficiently so we can drop that code while we're here. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/eu: Fix broadcast instruction for 64-bit values on little-coreJason Ekstrand2017-11-071-2/+24
| | | | | | We're not using broadcast for any 64-bit types right now since we mostly use it for emit_uniformize on 32-bit buffer indices. However, SPIR-V subgroups are going to need it for 64-bit so let's make it work. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/eu/reg: Add a subscript() helperJason Ekstrand2017-11-071-0/+16
| | | | | | | This is similar to the identically named fs_reg helper. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/eu: Just modify the offset in brw_broadcastJason Ekstrand2017-11-071-4/+5
| | | | | | | This means we have to drop const from a variable but it also means that 100% of the code which deals with the offset limit is in one place. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/compiler: Add some restrictions to MOV_INDIRECT and BROADCASTJason Ekstrand2017-11-073-0/+20
| | | | | | | These restrictions effectively already existed due to the way we use indirect sources but weren't being directly enforced. Reviewed-by: Iago Toral Quiroga <[email protected]>
* intel/fs: Use a pair of 1-wide MOVs instead of SEL for any/allJason Ekstrand2017-11-071-9/+33
| | | | | | | | | | | For some reason, the any/all predicates don't work properly with SIMD32. In particular, it appears that a SEL with a QtrCtrl of 2H doesn't read the correct subset of the flag register and you end up getting garbage in the second half. Work around this by using a pair of 1-wide MOVs and scattering the result. This fixes the any/all instructions for SIMD32. Reviewed-by: Matt Turner <[email protected]> Cc: [email protected]
* intel/fs: Use an explicit D type for vote any/all/eq intrinsicsJason Ekstrand2017-11-071-0/+6
| | | | | | | | | | | The any/all intrinsics return a boolean value so D or UD is the correct type. Unfortunately, get_nir_dest has the annoying behavior of returning a float type by default. This causes format conversion which gives us -1.0f or 0.0f in the register. If the consumer of the result does an integer comparison to zero, it will give you the right boolean value but if we do something more clever based on the 0/~0 assumption for booleans, this will give the wrong value. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
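The difference between the two encodings is easy to demonstrate. The sketch below inspects the bit pattern of `-1.0f` and applies a mask-based select that relies on booleans being exactly 0 or ~0; the function names are hypothetical and the select is just one example of a "more clever" use.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer. */
static uint32_t demo_bits_of_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Branch-free select that assumes cond is either 0 or ~0. */
static uint32_t demo_select(uint32_t cond, uint32_t a, uint32_t b)
{
    return (cond & a) | (~cond & b);
}
```

`-1.0f` is stored as `0xbf800000`, which is non-zero, so a comparison against zero still reads as true, but it is not `0xffffffff`, so `demo_select()` returns the wrong operand when fed the float encoding.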
* intel/fs: Don't stomp f0.1 in SIMD16 ballotJason Ekstrand2017-11-071-2/+9
| | | | | | | | | In fragment shaders f0.1 is used for discards so doing ballot after a discard can potentially cause the discard to not happen. However, we don't support SIMD32 fragment shaders yet so this isn't a problem. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/fs: Use ANY/ALL32 predicates in SIMD32Jason Ekstrand2017-11-071-12/+30
| | | | | | | | | | We have ANY/ALL32 predicates and, for the most part, they work just fine. (See the next commit for more details.) Also, due to the way that flag registers are handled in hardware, instruction splitting is able to split the CMP correctly. Specifically, the hardware looks at the execution group and knows to shift its flag usage up correctly so a 2H instruction will write to f0.1 instead of f0.0. Reviewed-by: Matt Turner <[email protected]> Cc: [email protected]
* intel/fs: Be more explicit about our placement of [un]zipJason Ekstrand2017-11-071-3/+17
| | | | | | | | | | | Before, we were careful to place the zip after the last of the split instructions but did unzip on-demand. This changes things so that the unzips go before all of the split instructions and the zip comes explicitly after all the split instructions. As a side-effect of this change, we now emit the split instructions from highest SIMD group to lowest instead of low to high. We could have kept the old behavior, but it shouldn't matter and this made the code easier. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/fs: Pass builders instead of blocks into emit_[un]zipJason Ekstrand2017-11-071-26/+35
| | | | | | | | | This makes it far more explicit where we're inserting the instructions rather than the magic "before and after" stuff that the emit_[un]zip helpers did based on block and inst. Reviewed-by: Iago Toral Quiroga <[email protected]> Cc: [email protected]
* intel/fs: Use a pure vertical stride for large register stridesJason Ekstrand2017-11-071-3/+13
| | | | | | | | | | | | | | Register strides higher than 4 are uncommon but they can happen. For instance, if you have a 64-bit extract_u8 operation, we turn that into UB -> UQ MOV with a source stride of 8. Our previous calculation would try to generate a stride of <32;8,8>:ub which is invalid because the maximum horizontal stride is 4. To solve this problem, we instead use a stride of <8;1,0>. As noted in the comment, this does not work as a destination but that's ok as very few things actually generate that stride. Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]> Cc: [email protected]
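The region choice described here can be sketched as a small function. For strides up to 4 it builds the usual `<width*stride;width,stride>` region; for larger strides it falls back to a pure vertical stride `<stride;1,0>`, where every channel is its own row. The struct and the name `demo_region_for_stride` are assumptions for illustration, not the generator code.

```c
/* A register region <vstride;width,hstride>, in units of elements. */
struct demo_region { unsigned vstride, width, hstride; };

/* Pick a source region for a register read with the given element stride. */
static struct demo_region demo_region_for_stride(unsigned stride, unsigned width)
{
    struct demo_region r;
    if (stride > 4) {
        /* Horizontal stride cannot exceed 4: use one element per row. */
        r.vstride = stride;
        r.width   = 1;
        r.hstride = 0;
    } else {
        r.vstride = width * stride;
        r.width   = width;
        r.hstride = stride;
    }
    return r;
}
```

With a width of 1 each channel `i` reads element `i * vstride`, so `<8;1,0>` still steps through memory 8 elements at a time without ever needing a horizontal stride above 4.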
* anv: Suffix anv-private 'VK' tokens with 'ANV'Chad Versace2017-11-075-31/+31
| | | | | | | | | | | | | | | | | I saw VK_IMAGE_ASPECT_ANY_COLOR_BIT while hacking anv_formats.c and got confused. "Huh? What extension added that?". No extension defines it; anv_private.h defines it. To remove confusion, rename the anv-private VK tokens as if they were extension tokens with the ANV vendor suffix. I found only two such tokens: VK_IMAGE_ASPECT_ANY_COLOR_BIT VK_IMAGE_ASPECT_PLANES_BITS Reviewed-by: Jason Ekstrand <[email protected]> Reviewed-by: Lionel Landwerlin <[email protected]>