mesa.git

	Commit message	Author	Age	Files	Lines
*	gallium: remove TGSI_OPCODE_ABS	Marek Olšák	2017-01-05	21	-88/+8
\| \| \| \| \| \|	It's redundant with the source modifier. Reviewed-by: Nicolai Hähnle <[email protected]>
*	st/nine: Remove all usage of ureg_SUB in nine_shader	Axel Davy	2017-01-05	1	-8/+8
\| \| \| \| \| \| \|	This is required to drop gallium SUB. Signed-off-by: Axel Davy <[email protected]> Signed-off-by: Marek Olšák <[email protected]>
*	st/nine: Remove all usage of ureg_SUB in nine_ff	Axel Davy	2017-01-05	1	-20/+20
\| \| \| \| \| \| \|	This is required to remove gallium SUB. Signed-off-by: Axel Davy <[email protected]> Signed-off-by: Marek Olšák <[email protected]>
*	st/nine: Do not map SUB and ABS to their gallium equivalent.	Axel Davy	2017-01-05	1	-2/+23
\| \| \| \| \| \| \|	This is required for gallium SUB and ABS to be removed. Signed-off-by: Axel Davy <[email protected]> Signed-off-by: Marek Olšák <[email protected]>
*	st/va: fix incorrect argument in vl_compositor_cleanup	Nayan Deshmukh	2017-01-05	1	-1/+1
\| \| \| \| \| \| \| \|	This fixes the mistake introduced in commit b6737a8bcd03ea68952799144c0c6e6e6679bee9 Signed-off-by: Nayan Deshmukh <[email protected]> Reviewed-by: Christian König <[email protected]>
*	swr: remove unneeded llvm version check	Tim Rowley	2017-01-05	1	-4/+0
\| \| \| \| \| \| \|	The old test caused breakage with llvm-svn (4.0.0svn), and it is not needed because the minimum required llvm version for swr is 3.6. Reviewed-by: George Kyriazis <[email protected]>
*	swr: fix windows build break	George Kyriazis	2017-01-05	2	-4/+7
\| \| \| \| \| \| \| \| \| \|	Wrap the lp_bld_type.h include in extern "C". Windows decorates global variables, so when they are used from .cpp files, an undecorated version is needed. Also remove related, unneeded code from swr_screen.cpp. Reviewed-by: Ilia Mirkin <[email protected]>
*	radeonsi: update clip_regs if clip_disable changes to fix a hang	Marek Olšák	2017-01-05	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This seems to fix the GPU hangs caused by: commit ed3190b3f3a776fc8c75b1e6130a88079166d115 Author: Marek Olšák <[email protected]> Date: Sun Nov 13 18:41:43 2016 +0100 radeonsi: don't export ClipVertex and ClipDistance[] if clipping is disabled Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99219 Tested-by: Samuel Pitoiset <[email protected]>
*	gallium: add PIPE_CAP_GLSL_OPTIMIZE_CONSERVATIVELY	Marek Olšák	2017-01-05	17	-0/+19
\| \| \| \| \| \|	Drivers with good compilers don't need aggressive optimizations before TGSI. Reviewed-by: Eric Anholt <[email protected]>
*	va: call texture_get_handle while the mutex is being held	Marek Olšák	2017-01-04	1	-2/+5
\| \| \| \| \| \| \|	The context may be used by texture_get_handle. Reviewed-by: Christian König <[email protected]> Cc: 13.0 <[email protected]>
*	vdpau: call texture_get_handle while the mutex is being held	Marek Olšák	2017-01-04	2	-6/+13
\| \| \| \| \| \| \| \| \|	The context may be used by texture_get_handle. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99158 Reviewed-by: Christian König <[email protected]> Cc: 13.0 <[email protected]>
*	radeonsi: capitalize VM hex addr when dumping buffer list	Samuel Pitoiset	2017-01-04	1	-1/+1
\| \| \| \| \| \| \| \|	Useful when debugging with R600_DEBUG=vm,check_vm to match addr in both outputs. Signed-off-by: Samuel Pitoiset <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
*	gallium/hud: add a path separator between dump directory and filename	Edmondo Tommasina	2017-01-03	1	-1/+2
\| \| \| \| \| \| \|	It's more user friendly and it avoids writing files in unexpected places. Signed-off-by: Marek Olšák <[email protected]>
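The commit above does not show its code, so the following is only a minimal sketch of the idea: join a dump directory and a data-source filename with exactly one separator. The helper name `join_dump_path` and the use of `/` as the separator are assumptions for illustration, not the HUD's actual implementation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from Mesa): build "<dir>/<name>" in out.
 * A separator is added only when dir is non-empty and does not already
 * end in '/'. Returns 0 on success, -1 if the result did not fit. */
static int
join_dump_path(char *out, size_t out_size, const char *dir, const char *name)
{
   size_t len = strlen(dir);
   const char *sep = (len == 0 || dir[len - 1] == '/') ? "" : "/";
   int n = snprintf(out, out_size, "%s%s%s", dir, sep, name);

   /* snprintf reports the length it wanted to write; treat truncation
    * as an error so a caller never opens a partial path. */
   return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Without the separator, a directory such as `/tmp/hud` and a source named `fps` would concatenate to `/tmp/hudfps`, which is the kind of unexpected location the commit message refers to.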
*	r600/sb: Fix loop optimization related hangs on eg	Heiko Przybyl	2017-01-03	6	-30/+68
\|	Make sure unused ops and their references are removed, prior to entering the GCM (global code motion) pass, to stop GCM from breaking the loop logic and thus hanging the GPU.

	Turns out that sb has problems with loops and node optimizations regarding associative folding:

	- the global code motion (gcm) pass moves ops up a loop level/basic block until they've fulfilled their total usage count
	- if there are ops folded into others, the usage count won't be fulfilled and thus the op is moved way up to the top
	- within GCM the op would be visited and its deps would be moved alongside it, to fulfill the src constraints
	- in a loop, an unused op is moved out of the loop and GCM would move the src value ops up as well
	- now here arises the problem: if the loop counter is one of the src values it would get moved up as well, the loop break condition would never get hit and the shader would turn into an endless loop, resulting in the GPU hanging and being reset

	A reduced (albeit nonsense) piglit example would be:

	    [require]
	    GLSL >= 1.20

	    [fragment shader]
	    uniform int SIZE;
	    uniform vec4 lights[512];

	    void main()
	    {
	        float x = 0;
	        for(int i = 0; i < SIZE; i++)
	            x += lights[2*i+1].x;
	    }

	    [test]
	    uniform int SIZE 1
	    draw rect -1 -1 2 2

	Which gets optimized to:

	    ===== SHADER #12 OPT ================================== PS/BARTS/EVERGREEN =====
	    ===== 42 dw ===== 1 gprs ===== 2 stack =========================================
	    ALU 3 @24
	         1   y: MOV R0.y, 0
	             t: MULLO_UINT R0.w, [0x00000002 2.8026e-45].x, R0.z
	    LOOP_START_DX10 @22
	        PUSH @6
	        ALU 1 @30 KC0[CB0:0-15]
	             2 M x: PRED_SETGE_INT __.x, R0.z, KC0[0].x
	        JUMP @14 POP:1
	        LOOP_BREAK @20
	        POP @14 POP:1
	        ALU 2 @32
	             3   x: ADD_INT R0.x, R0.w, [0x00000002 2.8026e-45].x
	        TEX 1 @36
	             VFETCH R0.x___, R0.x, RID:0 MFC:16 UCF:0 FMT[..]
	        ALU 1 @40
	             4   y: ADD R0.y, R0.y, R0.x
	    LOOP_END @4
	    EXPORT_DONE PIXEL 0 R0.____ EOP
	    ===== SHADER_END ===============================================================

	Notice that R0.z, the register relevant to the loop counter/break condition, is never incremented at all. Also some of the loop content has been moved out of it, to fulfill the requirements for the one unused op.

	With a debug build of mesa this would produce an error like

	    error at : PRED_SETGE_INT __, __, EM.2, R1.x.2\|\|[email protected], C0.x : operand value R1.x.2\|\|[email protected] was not previously written to its gpr

	and the compilation would fail due to this. On a release build it gets passed to the GPU.

	When using this patch, the loop remains intact:

	    ===== SHADER #12 OPT ================================== PS/BARTS/EVERGREEN =====
	    ===== 48 dw ===== 1 gprs ===== 2 stack =========================================
	    ALU 2 @24
	         1   y: MOV R0.y, 0
	             z: MOV R0.z, 0
	    LOOP_START_DX10 @22
	        PUSH @6
	        ALU 1 @28 KC0[CB0:0-15]
	             2 M x: PRED_SETGE_INT __.x, R0.z, KC0[0].x
	        JUMP @14 POP:1
	        LOOP_BREAK @20
	        POP @14 POP:1
	        ALU 4 @30
	             3   t: MULLO_UINT T0.x, [0x00000002 2.8026e-45].x, R0.z
	             4   x: ADD_INT R0.x, T0.x, [0x00000002 2.8026e-45].x
	        TEX 1 @40
	             VFETCH R0.x___, R0.x, RID:0 MFC:16 UCF:0 FMT[..]
	        ALU 2 @44
	             5   y: ADD R0.y, R0.y, R0.x
	                 z: ADD_INT R0.z, R0.z, 1
	    LOOP_END @4
	    EXPORT_DONE PIXEL 0 R0.____ EOP
	    ===== SHADER_END ===============================================================

	Piglit: ./piglit summary console -d results/_gpu_noglx

	    name:         unpatched_gpu_noglx  patched_gpu_noglx
	    ----          -------------------  -----------------
	    pass:                       18016              18021
	    fail:                         748                743
	    crash:                          7                  7
	    skip:                        1124               1124
	    timeout:                        0                  0
	    warn:                          13                 13
	    incomplete:                     0                  0
	    dmesg-warn:                     0                  0
	    dmesg-fail:                     0                  0
	    changes:                        0                  5
	    fixes:                          0                  5
	    regressions:                    0                  0
	    total:                      19908              19908

	Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94900
	Tested-by: Heiko Przybyl <[email protected]>
	Tested-on: Barts PRO HD6850
	Signed-off-by: Heiko Przybyl <[email protected]>
	Signed-off-by: Marek Olšák <[email protected]>
*	vl/zscan: fix "Fix trivial sign compare warnings"	Christian König	2017-01-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	The variable actually needs to be signed, otherwise converting it to a float doesn't work as expected. Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=98914 Signed-off-by: Christian König <[email protected]> Reviewed-by: Nayan Deshmukh <[email protected]> Cc: "13.0" <[email protected]> Fixes: 1fb4179f927 ("vl: Fix trivial sign compare warnings")
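The commit message does not include the affected code. As a generic illustration of why signedness matters before an int-to-float conversion, here is a small sketch; both function names are hypothetical and unrelated to the vl/zscan source, and the comment about the wrapped value assumes a 32-bit `unsigned`.

```c
/* Hypothetical illustration: the same subtraction, stored first in an
 * unsigned and then in a signed variable, before conversion to float. */
static float
offset_unsigned(int a, int b)
{
   /* A negative difference wraps around to a very large positive value
    * (close to 2^32 with a 32-bit unsigned). */
   unsigned diff = (unsigned)a - (unsigned)b;
   return (float)diff;
}

static float
offset_signed(int a, int b)
{
   /* The signed variable keeps the sign, so the float is negative. */
   int diff = a - b;
   return (float)diff;
}
```

Silencing a sign-compare warning by switching such a variable to unsigned therefore changes the float result whenever the value can be negative.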
*	st/va: error handling	Nayan Deshmukh	2017-01-03	1	-3/+15
\| \| \| \| \| \| \| \|	handle the cases when vl_compositor_set_csc_matrix(), vl_compositor_init_state() and vl_compositor_init() fail Signed-off-by: Nayan Deshmukh <[email protected]> Reviewed-by: Christian König <[email protected]>
*	st/vdpau: error handling	Nayan Deshmukh	2017-01-03	3	-15/+50
\| \| \| \| \| \| \| \|	handle the cases when vl_compositor_set_csc_matrix(), vl_compositor_init_state() and vl_compositor_init() fail Signed-off-by: Nayan Deshmukh <[email protected]> Reviewed-by: Christian König <[email protected]>
*	vl/compositor: implement error handling	Nayan Deshmukh	2017-01-03	2	-3/+12
\| \| \| \| \| \| \|	pipe_buffer_map and pipe_buffer_create may return NULL Signed-off-by: Nayan Deshmukh <[email protected]> Reviewed-by: Christian König <[email protected]>
*	gallium/hud: fix the windows build by disabling file dumping	Marek Olšák	2017-01-02	1	-0/+2
\|
*	gallium/hud: set filedescriptor for fps graph	Edmondo Tommasina	2017-01-01	1	-0/+2
\| \| \| \|	Signed-off-by: Marek Olšák <[email protected]>
*	gallium/hud: set filedescriptor for cpu graph	Edmondo Tommasina	2017-01-01	1	-0/+2
\| \| \| \|	Signed-off-by: Marek Olšák <[email protected]>
*	gallium/hud: move file initialization to a function	Edmondo Tommasina	2017-01-01	3	-11/+20
\| \| \| \| \| \| \|	The function will be used later to create the filedescriptor for other metrics. Signed-off-by: Marek Olšák <[email protected]>
*	gallium/hud: dump hud_driver_query values to files	Edmondo Tommasina	2017-01-01	3	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Dump values for every selected data source in GALLIUM_HUD. Every data source has its own file and the filename is equal to the data source identifier. Set GALLIUM_HUD_DUMP_DIR to dump values to files in this directory. No values are dumped if the environment variable is not set, the directory doesn't exist or the user doesn't have write access. Signed-off-by: Marek Olšák <[email protected]>
*	freedreno/ir3: rework varying slots (maybe??)	Rob Clark	2016-12-30	1	-4/+9
\| \| \| \| \| \| \| \| \| \|	See: dEQP-GLES2.functional.shaders.swizzles.vector_swizzles.mediump_vec2_yyyy_fragment if we only access (in FS) varying.y then it ends up in slot zero.. I'm not sure the hw likes that.. Signed-off-by: Rob Clark <[email protected]>
*	nir: Rename convert_to_ssa to lower_regs_to_ssa	Jason Ekstrand	2016-12-29	2	-2/+2
\| \| \| \|	This matches the naming of nir_lower_vars_to_ssa, the other to-SSA pass.
*	vc4: Rework scheduling of thread switch to cut one more NOP.	Eric Anholt	2016-12-29	1	-46/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Jonas's patch got us most of the benefit of scheduling instructions into the delay slots of thread switch, but if there had been nothing to pair the thrsw with, it would move the thrsw up and leave a NOP where the thrsw was. Instead, don't pair anything with thrsw through the normal scheduling path, and have a separate helper function that inserts the thrsw earlier if possible and inserts any necessary NOPs. total instructions in shared programs: 93027 -> 92643 (-0.41%) instructions in affected programs: 14952 -> 14568 (-2.57%)
*	vc4: Fill thread switching delay slots	Jonas Pfeil	2016-12-29	1	-7/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Scan for instructions without a signal set in front of the switching instruction and move the signal up there. shader-db results: total instructions in shared programs: 94494 -> 93027 (-1.55%) instructions in affected programs: 23545 -> 22078 (-6.23%) v2: Fix re-emitting of the instruction in the loop trying to emit NOPs, drop a scheduling change from branch delay slots. (by anholt) Signed-off-by: Jonas Pfeil <[email protected]>
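The shader-db percentages quoted in the vc4 scheduling commits are plain relative changes in instruction count. A minimal sketch of that arithmetic follows; `percent_change` is a hypothetical helper written for this note, not part of shader-db.

```c
/* Hypothetical helper: relative change from before to after, in percent.
 * Negative values mean the instruction count went down. */
static double
percent_change(long before, long after)
{
   return 100.0 * (double)(after - before) / (double)before;
}
```

For example, `percent_change(94494, 93027)` is about -1.55 and `percent_change(23545, 22078)` is about -6.23, which matches the figures reported for filling the thread switch delay slots.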
*	vc4: Enable NIR-based loop unrolling.	Eric Anholt	2016-12-29	1	-0/+5
\| \| \| \| \|	This successfully unrolls a new shader in GLB2.7, which also gets that shader to successfully compile in multithreaded mode.
*	freedreno/ir3: fix linkage::var size	Rob Clark	2016-12-27	1	-1/+1
\| \| \| \| \| \| \|	It should actually be 32 for a4xx/a5xx.. we still only advertise 16 but for a5xx the linkage map includes position/psize. Signed-off-by: Rob Clark <[email protected]>
*	freedreno/ir3: treat clipvertex like a normal varying	Rob Clark	2016-12-27	1	-3/+1
\| \| \| \| \| \| \| \|	We need this in case it is streamed out. Not sure why we were treating it specially before. Having it as a VS out is harmless if FS doesn't have a matching input. Signed-off-by: Rob Clark <[email protected]>
*	freedreno/a5xx: transform-feedback support	Rob Clark	2016-12-27	7	-38/+209
\| \| \| \| \| \| \| \| \| \| \|	We'll need to revisit, when adding hw binning pass support, whether we can still do this in the main draw step, as we do w/ a3xx/a4xx, or if we need to move it to the binning stage. Still some failing piglits but most tests pass and the common cases seem to work. Signed-off-by: Rob Clark <[email protected]>
*	freedreno: update generated headers	Rob Clark	2016-12-27	7	-43/+81
\| \| \| \| \| \| \|	Pull in a5xx streamout related regs. Also fixes a couple incorrect register definitions. Signed-off-by: Rob Clark <[email protected]>
*	freedreno/ir3: UBO support for 64b GPUs (a5xx)	Rob Clark	2016-12-27	1	-3/+24
\| \| \| \| \| \|	Update address calculation to support 64b addresses. Signed-off-by: Rob Clark <[email protected]>
*	freedreno/ir3: rework location of driver constants	Rob Clark	2016-12-27	6	-53/+75
\| \| \| \| \| \| \| \| \| \| \|	Rework how we lay out driver constants (driver-params, UBO/TFBO buffer addresses, immediates) for more flexibility. For a5xx+ we need to deal with the fact that gpu ptrs are 64b instead of 32b, which makes the fixed offset scheme not work so well. While we are dealing with that we might also make the layout more dynamic to account for varying # of UBOs, etc. Signed-off-by: Rob Clark <[email protected]>
*	freedreno/a5xx: fix emit for bo addresses	Rob Clark	2016-12-27	1	-3/+9
\| \| \| \| \| \|	Reloc for the buffer address is two dwords on 64b devices (a5xx+) Signed-off-by: Rob Clark <[email protected]>
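To make the "two dwords" point concrete, here is a minimal sketch of writing a GPU address into a 32-bit command stream. The function `emit_address`, its parameters, and the low-dword-first ordering are assumptions for illustration; this is not the freedreno reloc code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write a buffer address into a dword stream.
 * On 32b devices the address fits in one dword; on 64b devices the
 * high half needs a second dword. Returns the number of dwords written. */
static size_t
emit_address(uint32_t *cs, uint64_t iova, int is_64bit)
{
   size_t n = 0;

   cs[n++] = (uint32_t)(iova & 0xffffffffu);   /* low 32 bits */
   if (is_64bit)
      cs[n++] = (uint32_t)(iova >> 32);        /* high 32 bits */

   return n;
}
```

Any packet whose size was computed assuming one dword per address has to grow by one dword per reloc when targeting a 64b device.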
*	freedreno/a5xx: texture layout	Rob Clark	2016-12-27	2	-2/+2
\| \| \| \| \| \| \|	Seems to be similar to a4xx, and sampler state "array-pitch" needs to be aligned to page size. Signed-off-by: Rob Clark <[email protected]>
*	ttn: set ->info->num_ubos	Rob Clark	2016-12-27	1	-1/+4
\| \| \| \| \| \| \| \| \|	For dealing w/ 32b vs 64b gpu addresses, I need to rework how we pass UBO buffer addresses to shader, and knowing up front the # of UBOs is useful. But I noticed ttn wasn't setting this. Signed-off-by: Rob Clark <[email protected]> Reviewed-by: Eric Anholt <[email protected]>
*	clover: Use Clang's diagnostics	Vedran Miletić	2016-12-24	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Presently errors from frontend are handled only if they occur in clang::CompilerInvocation::CreateFromArgs(). This patch uses clang::DiagnosticsEngine to detect errors such as invalid values for Clang frontend arguments. Fixes Piglit's cl/program/build/fail/invalid-version-declaration.cl test. v2: fix inconsistent code formatting Signed-off-by: Vedran Miletić <[email protected]> Reviewed-by: Francisco Jerez <[email protected]> Tested-by: Aaron Watry <[email protected]>
*	swr: fix icc compile error	Bruce Cherniak	2016-12-23	1	-1/+1
\| \| \| \| \| \| \| \|	ICC doesn't like the use of nullptr (std::nullptr_t) argument in p_atomic_set. GCC and clang don't complain. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99119 Reviewed-by: Tim Rowley <[email protected]>
*	radeonsi: Bugfix needed for hashcat	Christian Inci	2016-12-22	1	-5/+7
\| \| \| \| \| \| \| \| \|	Hashcat needs MAX_GLOBAL_BUFFERS to be 21 or even 22 for some modes. It'll crash otherwise. I'm adding an assert to see if programs need it to be even higher. Signed-off-by: Christian Inci <[email protected]> [Handle first properly; should be NFC, since clover always uses first == 0.] Signed-off-by: Nicolai Hähnle <[email protected]>
*	radeonsi: fix gl_ClipDistance and gl_ClipVertex for points	Nicolai Hähnle	2016-12-22	1	-2/+10
\| \| \| \| \| \| \| \| \| \|	The clipper hardware doesn't consider points as primitives that can be clipped. Simply setting the corresponding cull bits works, and should not have an adverse effect on other primitive types according to the hardware team. Reviewed-by: Marek Olšák <[email protected]> Reviewed-by: Edward O'Callaghan <[email protected]>
*	radeonsi: only set VS_OUT_MISC_SIDE_BUS_ENA when the misc vector is used	Nicolai Hähnle	2016-12-22	1	-5/+6
\| \| \| \| \| \| \| \|	Should have no effect (other than perhaps on power consumption), but Vulkan does this. Reviewed-by: Marek Olšák <[email protected]> Reviewed-by: Edward O'Callaghan <[email protected]>
*	llvmpipe: Link tests with CLOCK_LIB.	Vinson Lee	2016-12-21	1	-1/+2
\| \| \| \| \| \| \| \| \| \|	Fix linking error with 'make check'. CXXLD lp_test_format ../../../../src/gallium/auxiliary/.libs/libgallium.a(os_time.o): In function `os_time_get_nano': src/gallium/auxiliary/os/os_time.c:59: undefined reference to `clock_gettime' Signed-off-by: Vinson Lee <[email protected]>
*	radeonsi: add Polaris12 support (v3)	Junwei Zhang	2016-12-21	5	-1/+11
\| \| \| \| \| \| \| \| \| \| \|	v2: use gfxip names for llvm 4.0+ v3: use tonga for llvm <= 3.8, drop gfxip name, we can just change that when we change the other asics. Reviewed-by: Marek Olšák <[email protected]> Signed-off-by: Junwei Zhang <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Acked-by: Christian König <[email protected]>
*	ttn: handle GLSL_SAMPLER_DIM_SUBPASS_MS case	Juan A. Suarez Romero	2016-12-21	1	-0/+1
\| \| \| \| \| \|	Fixes a warning. Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]>
*	svga: Fix a strict-aliasing violation in shader dumper	Edward O'Callaghan	2016-12-21	1	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As per the C spec, it is illegal to alias pointers to different types. This results in undefined behaviour after optimization passes, resulting in very subtle bugs that happen only on a full moon.. Use a memcpy() as a well defined coercion between the isomorphic bit-field interpretations of memory. V.2: Use C99 compat STATIC_ASSERT() over C11 static_assert(). Signed-off-by: Edward O'Callaghan <[email protected]> Reviewed-by: Charmaine Lee <[email protected]>
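The technique named in this commit, using memcpy() instead of a pointer cast to reinterpret bits, can be sketched as follows. The struct `token_fields` and both function names are hypothetical and do not reflect the real svga token layout; the size check uses a C99-compatible negative-array-size trick as a stand-in for Mesa's STATIC_ASSERT().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical bit-field view of a 32-bit token (not the svga layout). */
struct token_fields {
   uint32_t opcode : 16;
   uint32_t flags  : 16;
};

/* C99-compatible compile-time check that both views have the same size;
 * the array size becomes -1, and compilation fails, if they differ. */
typedef char token_size_check[
   (sizeof(struct token_fields) == sizeof(uint32_t)) ? 1 : -1];

/* Reinterpret the raw dword by copying its bytes. Casting &raw to
 * struct token_fields * and dereferencing it would violate the
 * strict-aliasing rule; memcpy is well defined. */
static struct token_fields
decode_token(uint32_t raw)
{
   struct token_fields f;
   memcpy(&f, &raw, sizeof f);
   return f;
}

/* The inverse copy, from the bit-field view back to the raw dword. */
static uint32_t
encode_token(struct token_fields f)
{
   uint32_t raw;
   memcpy(&raw, &f, sizeof raw);
   return raw;
}
```

Compilers typically turn a fixed-size memcpy like this into a plain register move, so the well-defined form usually costs nothing at run time. Which bits land in `opcode` versus `flags` remains implementation-defined, as with any bit-field.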
*	draw: use SoA fetch, not AoS one	Roland Scheidegger	2016-12-21	1	-48/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that there's some SoA fetch which never falls back, we should always get results which are better or at least not worse (something like rgba32f will stay the same). For cases which get way better, think something like R16_UNORM with 8-wide vectors: this was 8 sign-extend fetches, 8 cvt, 8 muls, followed by a couple of shuffles to stitch things together (if it is smart enough, 6 unpacks) and then a (8-wide) transpose (not sure if llvm could even optimize the shuffles + transpose, since the 16bit values were actually sign-extended to 128bit before being cast to a float vec, so that would be another 8 unpacks). Now that is just 8 fetches (directly inserted into vector, albeit there's one 128bit insert needed), 1 cvt, 1 mul. v2: ditch the old AoS code instead of just disabling it. Reviewed-by: Jose Fonseca <[email protected]>
*	gallivm: generalize the compressed format soa fetch a bit	Roland Scheidegger	2016-12-21	1	-37/+49
\| \| \| \| \| \| \| \| \|	This can now handle rgtc (unorm) too - this path no longer handles plain formats, but that's unnecessary since they now all have their proper SoA unpack (this will still be dog-slow though due to the actual fetch being per-pixel util fallbacks). Reviewed-by: Jose Fonseca <[email protected]>
*	gallivm: provide soa fetch path handling formats with more than 32bit	Roland Scheidegger	2016-12-21	1	-154/+375
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This previously always fell back to AoS conversion. Even for 4-float formats (which is the optimal case by far for that fallback case) this was suboptimal, since it meant the conversion couldn't be done with 256bit vectors. While this may still only be partly possible for some formats, (unless there's AVX2 support) at least the transpose can be done with half the unpacks (and before using the transpose for AoS fallbacks, it was worse still). With less than 4 channels, things got way worse with the AoS fallback quickly even with 128bit vectors. The strategy is pretty much the same as the existing one for formats which fit into 32 bits, except there's now multiple vectors to be fetched (2 or 4 to be exact), which need to be shuffled first (if it's 4 vectors, this amounts to a transpose, for 2 it's a bit different), then the unpack is done the same (with the exception that the shift of the channels is now modulo 32, and we need to select the right vector). In fact the most complex part about it is to get the shuffles right for separating into lo/hi parts for AVX/AVX2... This also makes use of the new ability of gather to use provided type information, which we abuse to outsmart llvm so we get decent shuffles, and to fetch 3x32bit vectors without having to ZExt the scalar. And just because we can, we handle double formats too, albeit they are a bit different (draw sometimes needs to handle that). v2: fix typo float/int bug (generating inefficient code). Reviewed-by: Jose Fonseca <[email protected]>
*	gallivm: optimize gather a bit, by using supplied destination type	Roland Scheidegger	2016-12-21	8	-79/+333
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	By using a dst_type in the gather interface, gather has some more knowledge about how values should be fetched. E.g. if this is a 3x32bit fetch and dst_type is 4x32bit vector gather will no longer do a ZExt with a 96bit scalar value to 128bit, but just fetch the 96bit as 3x32bit vector (this is still going to be 2 loads of course, but the loads can be done directly to simd vector that way). Also, we can now try to use the right int/float type. This should make no difference really since there's typically no domain transition penalties for such simd loads, however it actually makes a difference since llvm will use different shuffle lowering afterwards so the caller can use this to trick llvm into using sane shuffle afterwards (and yes llvm is really stupid there - nothing against using the shuffle instruction from the correct domain, but not at the cost of doing 3 times more shuffles, the case which actually matters is refusal to use shufps for integer values). Also do some attempt to avoid things which look great on paper but llvm doesn't really handle (e.g. fetching 3-element 8 bit and 16 bit vectors which is simply disastrous - I suspect type legalizer is to blame trying to extend these vectors to 128bit types somehow, so fetching these with scalars like before which is suboptimal due to the ZExt). Remove the ability for truncation (no point, this is gather, not conversion) as it is complex enough already. While here also implement not just the float, but also the 64bit avx2 gathers (disabled though since based on the theoretical numbers the benefit just isn't there at all until Skylake at least). Reviewed-by: Jose Fonseca <[email protected]>