path: root/src/gallium
Commit message | Author | Age | Files | Lines
* clover: Check for executables before enqueueing a kernelPierre Moreau2017-01-111-1/+4
| | | | | | | | | | Without this check, the kernel::bind() method would fail with a std::out_of_range exception, letting an exception escape from the library into the client, rather than returning the corresponding error code CL_INVALID_PROGRAM_EXECUTABLE. Signed-off-by: Pierre Moreau <[email protected]> Reviewed-by: Francisco Jerez <[email protected]>
* gallium/tgsi: fix overflow in parse propertyLi Qiang2017-01-111-3/+6
| | | | | | | | | | | parse_identifier does not stop copying from '*pcur' until it encounters the terminating NUL. Because 'ret' is a fixed-size buffer, a long string at '*pcur' overflows it. This patch avoids the overflow. Signed-off-by: Li Qiang <[email protected]> Signed-off-by: Marek Olšák <[email protected]> Reviewed-by: Marc-André Lureau <[email protected]>
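The kind of bounded copy this fix describes can be sketched as follows. This is only an illustration, not the actual TGSI parser code: the helper name `copy_identifier`, its signature, and the rule for what counts as an identifier character are all assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the real parse_identifier): copy an identifier
 * from *pcur into ret, writing at most len bytes including the NUL.
 * Returns false, leaving *pcur untouched, if the identifier is empty or
 * does not fit into the buffer. */
static bool
copy_identifier(const char **pcur, char *ret, size_t len)
{
   const char *cur = *pcur;
   size_t i = 0;

   assert(len > 0);

   while (isalnum((unsigned char)*cur) || *cur == '_') {
      if (i == len - 1)
         return false;      /* one more byte would overflow ret */
      ret[i++] = *cur++;
   }
   ret[i] = '\0';

   if (i == 0)
      return false;         /* nothing that looks like an identifier */

   *pcur = cur;
   return true;
}
```

The caller passes the buffer size explicitly, so an overlong token becomes a parse failure instead of a write past the end of the buffer.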
* st/dri: remove trailing whitespaceMauro Rossi2017-01-111-1/+1
| | | | | Reviewed-by: Tapani Pälli <[email protected]> Reviewed-by: Emil Velikov <[email protected]>
* freedreno: add "nogrow" debug paramRob Clark2017-01-103-1/+4
| | | | | | | Sometimes it is useful to disable the "growable" cmdstream buffers for debugging. (See 419a154d in libdrm) Signed-off-by: Rob Clark <[email protected]>
* freedreno/a5xx: remove hack for glamorRob Clark2017-01-101-3/+0
| | | | | | | Now that the issue glamor was hitting w/ glsl>=130 (aka missing INSTANCED bit in vertex attribute state) is fixed, remove the hack. Signed-off-by: Rob Clark <[email protected]>
* freedreno/a5xx: fixed instancedRob Clark2017-01-101-0/+1
| | | | | | Add missing bit, now that we know where it is. Signed-off-by: Rob Clark <[email protected]>
* freedreno/a5xx: use the non-_ZERO_BASE for vertexidRob Clark2017-01-104-6/+20
| | | | Signed-off-by: Rob Clark <[email protected]>
* freedreno/a5xx: add texture MIPLVLSRob Clark2017-01-101-3/+3
| | | | Signed-off-by: Rob Clark <[email protected]>
* freedreno/a5xx: fix fragcoord related hangsRob Clark2017-01-102-2/+6
| | | | Signed-off-by: Rob Clark <[email protected]>
* freedreno: update generated headersRob Clark2017-01-106-13/+22
| | | | Signed-off-by: Rob Clark <[email protected]>
* ac/debug: Dump indirect buffers.Bas Nieuwenhuizen2017-01-091-3/+6
| | | | | | | | | | | | | | This is for handling chained command buffers and secondary command buffers. It doesn't handle the trace id for secondary command buffers yet, but I don't think that is possible in general with just writes, as we could call a secondary command buffer multiple times. I think this is good enough for now, as the most useful case is the chaining when we grow an IB. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]> Reviewed-by: Dave Airlie <[email protected]>
* ac/debug: Move IB decode to common code.Bas Nieuwenhuizen2017-01-093-332/+15
| | | | | | Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]> Reviewed-by: Dave Airlie <[email protected]>
* ac/debug: Move sid_tables.h generation to common code.Bas Nieuwenhuizen2017-01-093-308/+1
| | | | | | Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]> Reviewed-by: Dave Airlie <[email protected]>
* radeonsi: fix the Witcher 2 black transitionsMarek Olšák2017-01-091-2/+13
| | | | | | | | v2: do it properly Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98238 Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: set si_shader_context::input_decls for ranged decls correctlyMarek Olšák2017-01-091-1/+4
| | | | | | | | This has no effect because no code uses those members with ranged decls. Tested-by: Edmondo Tommasina <[email protected]> Acked-by: Edward O'Callaghan <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: cleanly communicate whether si_shader_dump should check R600_DEBUGMarek Olšák2017-01-095-13/+15
| | | | | | Tested-by: Edmondo Tommasina <[email protected]> Acked-by: Edward O'Callaghan <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium/radeon: use the internal clear_buffer callback to fix r600gMarek Olšák2017-01-061-1/+3
| | | | | | | | r600g doesn't set pipe_context::clear_buffer. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99303 Reviewed-by: Alex Deucher <[email protected]>
* llvmpipe: do transpose/untwiddle after conversion for 8bit formatsRoland Scheidegger2017-01-061-7/+143
| Generally we should do the transpose after conversion if the format has
| less than 32 bits per channel (if it has 32 bits, conversion is going to
| be a no-op anyway...). This is obviously because there are fewer vectors
| to deal with. Though the advantage for 16 bit formats isn't that big, and
| in fact with AVX there isn't really any (as the 32bit unpacks can be done
| with 256bit, but the smaller ones cannot, although that would change again
| with proper AVX2 support).
| Only makes sense for 2d and not 1d cases. And to keep things easy, only
| handle 1, 2 and 4 channels (rgbx is just fine).
|
| For rgba unorm8 format the backend conversion sums up to these instruction
| totals (not counting the movs for SSE2 due to 2-op syntax - generally every
| 2 unpacks need an additional mov).
|
|                       SSE2               AVX
| transpose:            32 unpack          16 unpack
| untwiddle:            0                  8 (128bit low/high permutes)
| convert:              16 mul + 16 cvt    8 mul + 8 cvt
| 32->8bit:             12 pack            8 (128bit extract) + 12 pack
|
| When doing transpose/untwiddle afterwards we get:
|
|                       SSE2               AVX
| convert:              16 mul + 16 cvt    8 mul + 8 cvt
| 32->8bit:             12 pack            8 (128bit extract) + 12 pack
| transpose/untwiddle:  12 unpack          12 unpack
|
| So for SSE2, this drops 20 unpacks (total instruction count 76->56) whereas
| for AVX it replaces the 16 256bit unpacks with 8 128bit ones and drops the
| 8 lo/hi permutes (in total 60->48).
| (Albeit to be fair, the permutes could be dropped even when doing the
| transpose first, they are extremely pointless but we'd need to be able to
| tell lp_build_conv to reorder the vectors, for AVX2 we're going to need to
| be able to tell lp_build_conv about ordering in any case.)
| (With different ordering going into conversion, it would be possible to do
| 4 unpacks + 4 pshufbs instead of 12 unpacks, but that might not be better,
| and not all cpus can do it. Proper AVX2 support should eliminate the 8
| 128bit extracts, reduce these 12 packs to 6 and the 12 unpacks to
| 2 pshufb + 2 permq ideally (+ 2 final 128bit extracts).)
|
| Reviewed-by: Jose Fonseca <[email protected]>
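The totals quoted in this message (76->56 for SSE2, 60->48 for AVX) can be re-derived from the per-step counts. The small sketch below only redoes that arithmetic; the struct and function names are made up for illustration, and the numbers are copied from the per-step table in the message.

```c
#include <assert.h>

/* Per-step instruction counts for the rgba unorm8 case, as listed in the
 * commit message (SSE2 movs not counted). Names are illustrative only. */
struct step_counts {
   int transpose;   /* unpacks */
   int untwiddle;   /* 128bit low/high permutes */
   int convert;     /* mul + cvt */
   int narrow;      /* 32->8bit: packs, plus 128bit extracts on AVX */
};

static const struct step_counts sse2_before = { 32, 0, 16 + 16, 12 };
static const struct step_counts sse2_after  = { 12, 0, 16 + 16, 12 };
static const struct step_counts avx_before  = { 16, 8, 8 + 8, 8 + 12 };
static const struct step_counts avx_after   = { 12, 0, 8 + 8, 8 + 12 };

/* Sum all steps of one conversion variant. */
static int
total_instructions(struct step_counts c)
{
   assert(c.transpose >= 0 && c.untwiddle >= 0 &&
          c.convert >= 0 && c.narrow >= 0);
   return c.transpose + c.untwiddle + c.convert + c.narrow;
}
```

Summing each variant gives 76 and 56 for SSE2 and 60 and 48 for AVX, matching the totals in the message.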
* gallivm: generalize 4x4f->1x16ub special case conversionRoland Scheidegger2017-01-061-56/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | This special packing path can be easily extended to handle not just float->unorm8 but also float->snorm8 and uint32->uint8 and int32->int8 (i.e. all interesting cases for llvmpipe fs backend code). The packing parts all stay the same (only the last step packing will be signed->signed instead of signed->unsigned but luckily even sse2 can do both). While here also note some bugs with that (we keep the bugs identical to what we did before on x86, albeit other archs may differ). In particular, for float->unorm8, values that are too large will still get clamped to 0, not 255, and for float->snorm8 NaNs will end up as -1, not 0 (but we do the clamp against 1.0 there to prevent too large values ending up as -1.0 - this is inconsistent with the unorm8 handling but is what we ended up with before, I'm not sure we can get away without it). This is quite fishy in any case as we depend on arch-dependent behavior of the iround (my understanding is in fact with altivec the conversion would actually saturate although I've no idea about NaNs, so probably wouldn't need to do anything for snorm). (There are only minimal piglit tests for unorm clamping behavior AFAICT, in particular nothing seems to test values which are too large to be handled by the float->int conversion.) For uint32->uint8 we also do a min against MAX_INT, since the source for the packs is always signed (again, on x86 - should probably be able to express these arch-dependent bits better some day). Reviewed-by: Jose Fonseca <[email protected]>
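Why the min against MAX_INT matters for uint32->uint8 can be shown with a scalar model of the packs. This is a sketch under the assumption that the two pack steps behave like SSE2 packssdw (signed 32->16bit saturation) followed by packuswb (signed 16bit input, unsigned 8bit saturation); the function names are hypothetical and this is not the gallivm code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a signed-saturating 32->16bit pack (packssdw-like). */
static int16_t
pack_s32_to_s16(int32_t v)
{
   if (v > INT16_MAX) return INT16_MAX;
   if (v < INT16_MIN) return INT16_MIN;
   return (int16_t)v;
}

/* Scalar model of a 16->8bit pack that reads signed input and saturates
 * to the unsigned range (packuswb-like). */
static uint8_t
pack_s16_to_u8(int16_t v)
{
   if (v > UINT8_MAX) return UINT8_MAX;
   if (v < 0) return 0;
   return (uint8_t)v;
}

/* Reinterpret the bits as signed, which is how the packs see their input. */
static int32_t
as_signed(uint32_t v)
{
   int32_t s;
   memcpy(&s, &v, sizeof s);
   return s;
}

/* Without the min: values >= 2^31 look negative and end up as 0. */
static uint8_t
uint32_to_uint8_naive(uint32_t v)
{
   return pack_s16_to_u8(pack_s32_to_s16(as_signed(v)));
}

/* With an unsigned min against INT32_MAX first, large values saturate
 * to 255 as expected. */
static uint8_t
uint32_to_uint8(uint32_t v)
{
   uint32_t clamped = v < (uint32_t)INT32_MAX ? v : (uint32_t)INT32_MAX;

   assert(as_signed(clamped) >= 0);
   return pack_s16_to_u8(pack_s32_to_s16(as_signed(clamped)));
}
```

In this model 0xFFFFFFFF maps to 0 in the naive variant and to 255 in the clamped one, while small values pass through unchanged in both.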
* llvmpipe: use alpha from already converted color if possibleRoland Scheidegger2017-01-062-18/+54
| | | | | | | | | | | For rgbx formats, there is no point in doing alpha conversion again (and with a different transpose even, so llvm can't eliminate it). Albeit it looks like there are some minimal changes needed in the blend code (found by code inspection, no test seemed to complain) if we do this - the blend factors are already sanitized if we have no destination alpha, however for src_alpha_saturate it looks like it still might make a difference (note that we forced has_alpha to true before for some formats and nothing complained, but this seems safer). Reviewed-by: Jose Fonseca <[email protected]>
* llvmpipe: use scalar load instead of vectors for small vectors in fs backendRoland Scheidegger2017-01-061-6/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | llvm has _huge_ problems trying to load things like <4 x i8> vectors and stitching such loads together to form 128bit vectors. My understanding of the problem is that the type legalizer tries to extend that to really a <4 x i32> vector and not a <16 x i8> vector with the 4 elements first then followed by padding, so the shuffles for then combining things together are more or less impossible - you can in fact see the pmovzxd llvm generates. Pre-4.0 llvm just gives up on it completely and does a 30+ pextrb/pinsrb sequence instead. It looks like current llvm has fixed this behavior (my guess would be due to better shuffle combination and load/shuffle folds), but we can avoid this by just loading as <1 x i32> values, combining those and only casting at the end. (I suspect it might also work if we'd pad the loaded vectors immediately before shuffling them together, instead of directly stitching 2 such vectors together pairwise before combining the pair. But this _might_ lose the ability to load the values directly into their right place in the vector with pinsrd.) But using 32bit values is probably easier for llvm as it will never give it funny ideas about what the vector should look like. (This is possibly only a problem for 1x8bit formats, since 2x8bit will end up fetching 64bit hence only two vectors are stitched together, not 4, but we use the same strategy anyway.) Reviewed-by: Jose Fonseca <[email protected]>
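The load strategy described here can be modelled in plain C: treat each group of four bytes as a single 32bit scalar, combine four such scalars, and only view the result as sixteen bytes at the end. This is a scalar sketch of the idea, not the LLVM IR the backend emits; `gather_4x4_bytes` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Gather four 4-byte groups into one 16-byte block.
 * Each group is loaded as one 32bit value (the "<1 x i32>" idea), the four
 * values are combined, and the byte view is only taken at the very end. */
static void
gather_4x4_bytes(const uint8_t *const src[4], uint8_t dst[16])
{
   uint32_t lanes[4];

   for (int i = 0; i < 4; i++) {
      assert(src[i] != NULL);
      memcpy(&lanes[i], src[i], sizeof lanes[i]);   /* one scalar load */
   }

   memcpy(dst, lanes, sizeof lanes);                /* "cast" to 16 bytes */
}
```

Because both directions go through memcpy, the byte order in dst matches the order in memory of each source group regardless of endianness.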
* winsys/amdgpu: fix a race condition between fence updates and IB submissionsMarek Olšák2017-01-062-18/+22
| | | | | | | | | | The CS thread is needed to ensure proper ordering of operations and can't be disabled (without complicating the code). Discovered by Nine CSMT, which ended up in a deadlock. Acked-by: Edward O'Callaghan <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: add TC L2 prefetch for shaders and VBO descriptorsMarek Olšák2017-01-063-1/+50
| | | | | Reviewed-by: Edward O'Callaghan <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: add CP DMA flags for greater control over synchronizationMarek Olšák2017-01-063-16/+31
| | | | | | | for L2 prefetch Reviewed-by: Edward O'Callaghan <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: cleanly communicate which CP DMA packet is firstMarek Olšák2017-01-061-11/+21
| | | | | Reviewed-by: Edward O'Callaghan <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium/radeon: add new HUD query num-SDMA-IBsMarek Olšák2017-01-069-2/+21
| | | | | Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium/radeon: rename the num-ctx-flushes query to num-GFX-IBsMarek Olšák2017-01-069-14/+14
| | | | | Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: add HUD queries for cache flush statsMarek Olšák2017-01-064-0/+32
| | | | | Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: don't count fast clears and prefetches into CP DMA statsMarek Olšák2017-01-061-2/+6
| | | | | Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: don't wait for compute shaders in texture_barrierMarek Olšák2017-01-061-2/+1
| | | | | | | it doesn't interact with compute shaders in any way Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: assume that a TES without POSITION precedes GSMarek Olšák2017-01-061-1/+2
| | | | | Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: unduplicate VS color export codeMarek Olšák2017-01-061-9/+2
| | | | | | | it's exactly the same as the other ones Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* radeonsi: clean up more HAVE_LLVM #ifdefsMarek Olšák2017-01-062-13/+20
| | | | | Reviewed-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium/radeon: clean up HAVE_LLVM #ifdefs in r600_get_llvm_processor_nameMarek Olšák2017-01-061-17/+11
| | | | Reviewed-by: Nicolai Hähnle <[email protected]>
* swr: [rasterizer core] rename OutputMerger functionsTim Rowley2017-01-062-9/+9
| | | | Reviewed-by: Bruce Cherniak <[email protected]>
* swr: [rasterizer core] fix SIMD16 Transpose_16_16Tim Rowley2017-01-061-2/+2
| | | | | | | Fix incorrect swizzling in SIMD16 Transpose_16_16 breaking the two-channel 16-bpc formats like R16G16_FLOAT. Reviewed-by: Bruce Cherniak <[email protected]>
* swr: [rasterizer core] fix SIMD16 output mergerTim Rowley2017-01-062-16/+22
| | | | | | Honor the colorHottileEnable mask when accessing colorBuffer pointers. Reviewed-by: Bruce Cherniak <[email protected]>
* swr: [rasterizer core] fix SIMD16 PackTraits pack() and unpack()Tim Rowley2017-01-063-48/+82
| | | | | | Fix routines for 8-bit and 16-bit formats used by optimized tile store. Reviewed-by: Bruce Cherniak <[email protected]>
* swr: [rasterizer core] fix SIMD16 transpose functionsTim Rowley2017-01-063-113/+225
| | | | | | | | | | | | | Fixed Transpose_16 methods of following formats: Transpose8_8_8_8 Transpose8_8 Transpose32_32 Transpose16_16_16_16 Transpose16_16_16 Transpose16_16 Reviewed-by: Bruce Cherniak <[email protected]>
* swr: [rasterizer core] whitespace adjustmentsTim Rowley2017-01-061-2/+1
| | | | Reviewed-by: Bruce Cherniak <[email protected]>
* vc4: Rewrite T image handling based on calling the LT handler.Eric Anholt2017-01-051-34/+75
| | | | | | | | | | | | | The T images are composed of effectively swizzled-around blocks of LT (4x4 utile) images, so we can reduce the t_utile_address() calls by 16x by calling into the simpler LT loop. This also adds support for calling down with non-utile-aligned coordinates, which will be part of lifting the utile alignment requirement on our callers and avoiding the RMW on non-utile-aligned stores. Improves 1024x1024 TexSubImage by 2.55014% +/- 1.18584% (n=46) Improves 1024x1024 GetTexImage by 2.242% +/- 0.880954% (n=32)
* vc4: Move the utile_width/height functions to header inlines.Eric Anholt2017-01-052-37/+36
| | | | | | I want these inlined in the callers, particularly with the tiling changes coming up, but we're not building with lto so some caller would suffer.
* vc4: Make the load/store utile functions static.Eric Anholt2017-01-052-4/+2
| | | | | They don't have any other callers outside of this file, and I'm hoping they get inlined soon.
* vc4: Simplify the load/store utile functions.Eric Anholt2017-01-051-10/+22
| | | | | | | | | They now have less of a dependency on the cpp, and don't have to do a divide. Hacking up mesa-demos teximage to do only one subtest and not draw points, I saw 1024x1024 glTexSubImage2D() improve by 4.86939% +/- 1.40408% (n=30) and glGetTexImage() by 2.18978% +/- 0.140268% (n=5).
* vc4: Reuse a list function to simplify bufmgr code.Eric Anholt2017-01-051-11/+2
|
* vc4: Flush the job early if we're referencing too many BOs.Eric Anholt2017-01-053-0/+16
| | | | | | | | | If we get up toward 256MB (or whatever the CMA area size is), VC4_GEM_CREATE will start throwing errors. Even if we don't trigger that, when we flush, the kernel's BO allocation for the CLs or bin memory may end up throwing an error, at which point our job won't get rendered at all. Just flush early (half of maximum CMA size) so that hopefully we never get to that point.
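The early-flush heuristic can be sketched roughly like this. The struct, the function name and the hard-coded 256MB figure are assumptions for illustration (the message itself notes the real CMA size may differ); this is not the vc4 driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed CMA area size, taken from the 256MB figure in the message. */
#define ASSUMED_CMA_SIZE (256u * 1024u * 1024u)

/* Hypothetical job bookkeeping: total size of all BOs referenced so far. */
struct job {
   uint64_t bo_space;
};

/* Account for a newly referenced BO. Returns true once the job references
 * more than half of the assumed CMA area, meaning the caller should flush
 * the job now rather than risk allocation failures later. */
static bool
job_add_bo(struct job *job, uint32_t bo_size)
{
   assert(job);

   job->bo_space += bo_size;
   return job->bo_space > ASSUMED_CMA_SIZE / 2;
}
```

A caller would flush and reset `bo_space` whenever this returns true, which leaves the other half of the area available for the kernel's own CL and bin allocations.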
* gallivm: (trivial) fix typo bug with small AoS format unpackingRoland Scheidegger2017-01-061-1/+1
| | | | | | | Fix typo using wrong (uninitialized) build context introduced by 4634cb5921b985f04f2daf00cda2d28036143bd3. (This only affects very rare small packed formats which have a PIPE_SWIZZLE_0 channel, such as r4a4, which is never used by mesa/st. Nevertheless it broke lp_test_format.)
* gallivm: implement aos unpack (to unorm8) for small unorm formatsRoland Scheidegger2017-01-052-17/+155
| | | | | | | | | | | | | | | | | | | Using bit replication. This path now resembles something which might make sense. (The logic was mostly copied from llvmpipe fs backend.) I am not convinced though it is actually faster than SoA sampling (actually I'm quite certain it's always a loss with AVX). With SoA it's just shift/mask/cvt/mul for getting the colors, whereas there's still roughly 3 shifts, 3 or/and per channel for AoS (i.e. for SoA it's exactly the same as it would be for a rgba8 format, whereas the extra effort for AoS is significant). The filtering might still be faster (albeit with FMA the instruction count gets down quite a bit there on the SoA float filtering path on new cpus). And those small unorm formats often don't have an alpha channel (which makes things worse relatively for AoS path). (This also fixes a trivial bug in the llvmpipe fs code this was derived from, albeit it was only relevant for 4-bit channels.) Reviewed-by: Jose Fonseca <[email protected]>
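Bit replication, as mentioned above, widens a small unorm channel to 8 bits by repeating its bit pattern until all 8 bits are filled. A scalar sketch, assuming channel widths from 1 to 8 bits (the function name is made up; the real code works on vectors):

```c
#include <assert.h>
#include <stdint.h>

/* Expand a 'bits'-wide unorm value to 8 bits by replicating its bits.
 * E.g. a 5-bit value abcde becomes abcdeabc. */
static uint8_t
unorm_to_unorm8(uint32_t value, int bits)
{
   uint32_t result = 0;
   int shift = 8 - bits;

   assert(bits >= 1 && bits <= 8);
   assert(value < (1u << bits));

   /* Place full copies from the top of the byte downwards. */
   while (shift > 0) {
      result |= value << shift;
      shift -= bits;
   }

   /* Last copy: exact fit (shift == 0) or only its top bits (shift < 0). */
   result |= value >> -shift;

   return (uint8_t)result;
}
```

For 5-bit input this reduces to `(v << 3) | (v >> 2)`, so 31 maps to 255 and 0 maps to 0 without any multiply or divide.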
* gallivm: optimize lp_build_unpack_arith_rgba_aos slightlyRoland Scheidegger2017-01-051-19/+97
| | | | | | | | | | | | | | | | | This code uses a vector shift which has to be emulated on x86 unless there's AVX2. Luckily in some cases we can actually avoid the shift altogether, so do that. Also make sure we hit the fast lp_build_conv() path when applicable, albeit that's quite the hack... That said, this path is taken for AoS sampling for small unorm (smaller than rgba8) formats, and it is completely hopeless even with those changes, with or without AVX. (Probably should have some code similar to the one in the llvmpipe fs backend code, using bit replication to extend to rgba8888 - rounding is not quite 100% accurate but if it's good enough there it should be here as well.) Reviewed-by: Jose Fonseca <[email protected]>
* gallivm: use 2 srcs for 32->16bit conversions in lp_bld_conv_autoRoland Scheidegger2017-01-051-2/+19
| | | | | | | | | | | | | | | | | If we only feed one source vector at a time, we cannot use pack intrinsics (as we only have a 64bit destination vector). lp_bld_conv_auto is specifically designed to alter the length and number of destination vectors, so this works just fine (if we use single source vectors at a time, afterwards we immediately reassemble the vectors). For AVX though this isn't really possible, since we expect 128bit output already for a single 256bit input. (One day we should handle AVX2 which again would need multiple inputs, however there's the problem that we get differently ordered output there and we don't want to reorder, so we would need to be able to tell build_conv to handle upper and lower halves independently.) A similar strategy would probably work for 32->8bit too (if it doesn't hit the special case) but I'm going to try something different for that... Reviewed-by: Jose Fonseca <[email protected]>