aboutsummaryrefslogtreecommitdiffstats
path: root/src
Commit message (Collapse)AuthorAgeFilesLines
* i965/sklgt4: Implement depth/timestamp write w/aBen Widawsky2016-05-261-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | The stated bug describes a scenario in which a post sync write operation for depth or timestamp can be ignored. There are two workarounds suggested, the first and easier is to simply do a cs stall when we do these type of writes. The second option is to do a PIPE_CONTROL flush after the post sync but before the data is required. Generally, I believe the data written out is consumed by the application on the CPU side and so doing the easier of the two is ideal. Furthermore, these queries aren't tremendously common in the perf sensitive apps I have looked at. However, there could be cases where a shader stage might directly consume the data, and as a result option 2 may be desirable. This patch goes with the easier solution for now. gen9lp bug_de_id=2137196 By itself, this does *not* fix any of the GT4 hangs we're currently experiencing. Cc: Mika Kuoppala <[email protected]> Signed-off-by: Ben Widawsky <[email protected]> Reviewed-by: Anuj Phogat <[email protected]>
* i965/bxt: Add 2x6 variantBen Widawsky2016-05-261-0/+22
| | | | | | Cc: [email protected] Signed-off-by: Ben Widawsky <[email protected]> Reviewed-by: Kristian Høgsberg <[email protected]>
* radeonsi: Allow TES distribution between shader engines.Bas Nieuwenhuizen2016-05-264-15/+40
| | | | | | | | | | | | | The R_028B50_VGT_TESS_DISTRIBUTION value is copied from amdgpu-pro. Smaller values in the ACCUM fields seem to decrease the performance advantage from this patch, higher values don't seem to matter. v2: Add distribution mode field enums. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Process multiple patches per threadgroup.Bas Nieuwenhuizen2016-05-261-15/+35
| | | | | | | | | | | | | | | | | | | | | | | Using more than 1 wave per threadgroup does increase performance generally. Not using too many patches per threadgroup also increases performance. Both catalyst and amdgpu-pro seem to use 40 patches as their maximum, but I haven't really seen any performance increase from limiting the number of patches to 40 instead of 64. Note that the trick where we overlap the input and output LDS does not work anymore as the insertion of the tess factors changes the patch stride. v2: - Add comment about LDS assumptions. - Add constant for buffer size. - Fix code style. v3: - Correct limits for not splitting patches between waves. - Set max num_patches to 40 as in the proprietary driver. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Add barrier before writing the tess factors.Bas Nieuwenhuizen2016-05-261-0/+6
| | | | | | | | The factors may be stored to LDs by another invocation than the invocation for vertex 0. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Enable dynamic HS.Bas Nieuwenhuizen2016-05-262-5/+16
| | | | | | | | | | This allows running the TES on different CU's than the TCS which results in performance improvements. v2: Only write the control word from one invocation. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Remove LDS layout user SGPR's from TES.Bas Nieuwenhuizen2016-05-263-13/+10
| | | | | | | | They are unused. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Use buffer loads and stores for passing data from TCS to TES.Bas Nieuwenhuizen2016-05-261-16/+50
| | | | | | | | | | | | | | | | We always try to use 4-component loads, as LLVM does not combine loads and they bypass the L1 cache. We can't use a similar strategy for stores and this is especially notable with the tess factors, as they are often set with separate MOV's per component in the TGSI. We keep storing to LDS and the LDS space, so we can load the outputs later, either due to the shader, of for wrting the tess factors. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Store inputs to memory when not using a TCS.Bas Nieuwenhuizen2016-05-263-0/+49
| | | | | | | | | | | | | | | | | We need to copy the VS outputs to memory. I decided to do this using a shader key, as the value depends on other shaders. I also switch the fixed function TCS over to monolithic, as otherwisze many of the user SGPR's need to be passed to the epilog, which increases register pressure, or complexity to avoid that. The main body of the fixed function TCS is not that interesting to precompile anyway, since we do it on demand and it is very small. v2: Use u_bit_scan64. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Add offchip buffer address calculation.Bas Nieuwenhuizen2016-05-261-0/+124
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of creating a memory area per patch and per vertex, we put the same attribute of every vertex & patch together. Most loads and stores access the same attribute across all lanes, only for different patches and vertices. For the TCS this results in tightly packed data for 4-component stores. For the TES this is not the case as within a patch the loads often also access the same vertex. However if there are < 4 vertices/patch, this still results in a reduction of the number of cache lines. In the LDS situation we only do better than worst case if the data per patch < 64 bytes, which due to the tessellation factors is pretty much never. We do not use hardware swizzling for this. It would slightly reduce the number of executed VALU instructions, but I had issues with increased wait times that I haven't been able to solve yet. Furthermore, the tbuffer_store intrinsic does not support both VGPR offset and an index, so we have a problem storing indirectly indexed outputs. This can be solved by temporarily storing arrays in LDS and then copying them, but I don't think that is worth the effort. The difference in VALU cycles hardware swizzling gives is about 0.2% of total busy cycles. That is without handling the array case. I chose for attributes instead of components as they are often accessed together, and the software swizzling takes VALU cycles for calculating offsets. v2: - Rename functions to get_tcs_tes_buffer_address. - multiply by 16 as late as possible. - Use tgsi_full_src_register_from_dst. - Remove some bad comments. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Add user SGPR for the layout of the offchip buffer.Bas Nieuwenhuizen2016-05-263-4/+20
| | | | | | Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Use correct parameter index for LS_OUT_LAYOUT.Bas Nieuwenhuizen2016-05-261-3/+4
| | | | | | | | | This happens to be in the right position, but that changes when TCS/TES get new parameters. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Add buffer load functions.Bas Nieuwenhuizen2016-05-261-0/+114
| | | | | | | | | | v2: - Use llvm.admgcn.buffer.load instrinsics for new LLVM. - Code style fixes. v3: - Code style fix. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Define build_tbuffer_store_dwords earlier to support new users.Bas Nieuwenhuizen2016-05-261-69/+69
| | | | | | Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Add offchip tessellation parameters.Bas Nieuwenhuizen2016-05-263-6/+34
| | | | | | Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Nicolai Hähnle <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* radeonsi: Add buffer for offchip storage between TCS and TES.Bas Nieuwenhuizen2016-05-264-0/+23
| | | | | | | | | | | The buffer is quite large, but should only be allocated if the application uses tessellation. Most non-games don't. v2: - Use the correct register for SI. - Add define for block size. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Marek Olšák <[email protected]>
* tgsi: fix coverity out-of-bounds warningRob Clark2016-05-261-0/+3
| | | | | | | | | CID 1271532 (#1 of 1): Out-of-bounds read (OVERRUN)34. overrun-local: Overrunning array of 2 16-byte elements at element index 2 (byte offset 32) by dereferencing pointer &inst.Dst[i]. Signed-off-by: Rob Clark <[email protected]> Reviewed-by: Brian Paul <[email protected]>
* tgsi: fix out of bounds accessRob Clark2016-05-261-1/+1
| | | | | | | | | | | | Not sure why coverity calls this an out-of-bounds read vs out-of-bounds write. CID 1358920 (#1 of 1): Out-of-bounds read (OVERRUN)9. overrun-local: Overrunning array r of 3 16-byte elements at element index 3 (byte offset 48) using index chan (which evaluates to 3). Signed-off-by: Rob Clark <[email protected]> Reviewed-by: Brian Paul <[email protected]>
* i965: Don't use fast copy blit in case of logical operations other than GL_COPYAnuj Phogat2016-05-261-2/+7
| | | | | | | | | | | XY_FAST_COPY_BLT command doesn't have a field for raster operation. So, fall back to using XY_SRC_COPY_BLT to handle those cases. Fixes piglit test gl-1.1-xor-copypixels when fast copy blit is enabled for all tiling formats. Signed-off-by: Anuj Phogat <[email protected]> Reviewed-by: Matt Turner <[email protected]>
* i965/gen9: Remove the halign/valign field setup code in fast copy blitAnuj Phogat2016-05-261-65/+0
| | | | | | | | | | | | | Experimentation with different values of src/dst horizontal/vertical alignment showed that these fileds are not used on gen9 hardware. A recent update in graphics specs has removed these fields from XY_FAST_COPY_BLT command. Cc: Ben Widawsky <[email protected]> Cc: Chad Versace <[email protected]> Signed-off-by: Anuj Phogat <[email protected]> Reviewed-by: Ben Widawsky <[email protected]>
* nvc0: allow to monitor MP perf counters with compute shadersSamuel Pitoiset2016-05-262-19/+55
| | | | | | | | | | | | | | | | | | To read out MP perf counters we use a compute shader and need to upload input data like a 64-bits addr used to store the values and a sequence ID for synchronization. Currently, this input data is uploaded as user uniforms which means that it's sticked to c0[], but if a compute shader from a real application is used, monitoring those performance counters will just overwrite some data and miserably crash. Instead, sticking the 64-bits addr and the sequence into the driver constant buffer seems like much better and will allow to monitor counters with GL 4.3 apps. Tested on GF119 and GK110, but should not hurt anything on GK104. Signed-off-by: Samuel Pitoiset <[email protected]> Reviewed-by: Ilia Mirkin <[email protected]>
* mesa: Move robustness code to main/robustness.cKristian Høgsberg Kristensen2016-05-263-136/+166
| | | | | | Signed-off-by: Kristian Høgsberg Kristensen <[email protected]> Reviewed-by: Jason Ekstrand <[email protected]> Reviewed-by: Brian Paul <[email protected]>
* egl: Additional attribute validation for eglCreatePbufferSurfacePlamena Manolova2016-05-261-0/+13
| | | | | | | | | | | | | eglCreatePbufferSurface should generate an EGL_BAD_MATCH error if: 1: The EGL_TEXTURE_FORMAT attribute is EGL_NO_TEXTURE and EGL_TEXTURE_TARGET is something other than EGL_NO_TEXTURE 2: EGL_TEXTURE_FORMAT is something other than EGL_NO_TEXTURE and EGL_TEXTURE_TARGET is EGL_NO_TEXTURE. This fixes the dEQP-EGL.functional.negative_api.create_pbuffer_surface test. Signed-off-by: Plamena Manolova <[email protected]> Reviewed-by: Ben Widawsky <[email protected]>
* gallium/radeon: add the kernel version into the renderer stringMarek Olšák2016-05-261-3/+9
| | | | | | | | | | | Example: Gallium 0.4 on AMD TONGA (DRM 3.2.0 / 4.5.0, LLVM 3.9.0) My kernel version is pretty long already (4.5.0-amd-01025-g32791c1) and adding "kernel" into the string would make too it long for glxinfo to display. Reviewed-by: Michel Dänzer <[email protected]>
* winsys/amdgpu: add back multithreaded command submissionMarek Olšák2016-05-266-131/+341
| | | | | | | | | | | | | | | Ported from the initial amdgpu winsys from the private AMD branch. The thread creates the buffer list, submits IBs, and cleans up the submission context, which can also destroy buffers. 3-5% reduction in CPU overhead is expected for apps submitting a lot of IBs per frame. This is most visible with DMA IBs. v2: use a semaphore instead of a busy loop in amdgpu_ws_queue_cs add another amdgpu_cs_sync_flush call into amdgpu_bo_map Reviewed-by: Nicolai Hähnle <[email protected]>
* gallium/tgsi: use _mesa_roundevenf in micro_rndLars Hamre2016-05-261-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | Fixes the following piglit tests (for softpipe): /spec/glsl-1.30/execution/built-in-functions/... fs-roundeven-float fs-roundeven-vec2 fs-roundeven-vec3 fs-roundeven-vec4 vs-roundeven-float vs-roundeven-vec2 vs-roundeven-vec3 vs-roundeven-vec4 /spec/glsl-1.50/execution/built-in-functions/... gs-roundeven-float gs-roundeven-vec2 gs-roundeven-vec3 gs-roundeven-vec4 Signed-off-by: Lars Hamre <[email protected]> Reviewed-by: Matt Turner <[email protected]> Reviewed-by: Brian Paul <[email protected]>
* nvc0: add note about where the viewport mask would goIlia Mirkin2016-05-261-0/+1
| | | | | | | Not piping this all the way through yet, but no better place to note this down. This will can be used with NV_viewport_array2. Signed-off-by: Ilia Mirkin <[email protected]>
* nvc0: enable 32 textures on kepler+Ilia Mirkin2016-05-262-3/+3
| | | | | | | | | For fermi, this likely will require use of linked tsc mode. However on bindless architectures, we can have as many as we want. As it stands, the AUX_TEX_INFO has 32 teture handles reserved. Signed-off-by: Ilia Mirkin <[email protected]> Reviewed-by: Samuel Pitoiset <[email protected]>
* glsl: add unit tests data vertex/expected outcome for uninitialized warningAlejandro Piñeiro2016-05-2662-0/+573
| | | | | | v2: fix 025 test. Add three more tests (Ian Romanick) Reviewed-by: Ian Romanick <[email protected]>
* glsl: add warning-testAlejandro Piñeiro2016-05-262-1/+33
| | | | | | | | | | | It executes compiler-glsl on all the available shaders, and it checks that the outcome is the expected. Bash code based on the already existing optimization-test v2: rebasing: use --version option Reviewed-by: Ian Romanick <[email protected]>
* glsl: add just-log option for the standalone compiler.Alejandro Piñeiro2016-05-263-4/+18
| | | | | | | | | | | Add an option in order to ask to just print the InfoLog, without any header or separator. Useful if we want to use the standalone compiler to track only the warning/error messages. v2: all printfs goes on its own line (Ian Romanick) v3: rebasing: move just_log to standalone.h/cpp Reviewed-by: Ian Romanick <[email protected]>
* glsl: do not raise uninitialized warning with out function parametersAlejandro Piñeiro2016-05-261-0/+28
| | | | | | | | | | | | | | | It silence by default warnings with function parameters, as the parameters need to be processed in order to have the actual and the formal parameter, and the function signature. Then it raises the warning if needed at verify_parameter_modes where other in/out/inout modes checks are done. v2: fix comment style, multi-line condition style, simplify check, remove extra blank (Ian Romanick) v3: inout function parameters can raise the warning too (Ian Romanick) Reviewed-by: Ian Romanick <[email protected]>
* glsl: add a empty set_is_lhs on ast_nodeAlejandro Piñeiro2016-05-262-0/+7
| | | | | | | | | Just to allow to call set_is_lhs on any ast_node without a casting. Useful when processing a ast_node list that we know it contain ast_expression. v2: comment out new_value to avoid unused parameter warning (Ian Romanick) Reviewed-by: Ian Romanick <[email protected]>
* glsl: handle implicit sized arrays in ssboDave Airlie2016-05-266-89/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code disallows unsized arrays except at the end of an SSBO but it is a bit overzealous in doing so. struct a { int b[]; int f[4]; }; is valid as long as b is implicitly sized within the shader, i.e. it is accessed only by integer indices. I've submitted some piglit tests to test for this. This also has no regressions on piglit on my Haswell. This fixes: GL45-CTS.shader_storage_buffer_object.basic-syntax GL45-CTS.shader_storage_buffer_object.basic-syntaxSSO This patch moves a chunk of the linker code down, so that we don't link the uniform blocks until after we've merged all the variables. The logic went something like: Removing the checks for last ssbo member unsized from the compiler and into the linker, meant doing the check in the link_uniform_blocks code. However to do that the array sizing had to happen first, so we knew that the only unsized arrays were in the last block. But array sizing required the variable to be merged, otherwise you'd get two different array sizes in different version of two variables, and one would get lost when merged. So the solution was to move array sizing up, after variable merging, but before uniform block visiting. Reviewed-by: Samuel Iglesias Gonsálvez <[email protected]> Signed-off-by: Dave Airlie <[email protected]>
* glsl: fix error message on uniform block mismatchDave Airlie2016-05-261-1/+1
| | | | | | | This looks like a cut-paste from above. Reviewed-by: Anuj Phogat <[email protected]> Signed-off-by: Dave Airlie <[email protected]>
* glsl/ast: assign explicit_xfb_buffer from correct placeDave Airlie2016-05-261-1/+1
| | | | | | | | | | | This fixes: GL44-CTS.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through As the OUT_TC interface structures weren't matching because one of them had explicit_xfb_buffer set when it shouldn't. Reviewed-by: Timothy Arceri <[email protected]> Signed-off-by: Dave Airlie <[email protected]>
* swr: [rasterizer] Correctly select optimized primitive assembly.Bruce Cherniak2016-05-257-9/+17
| | | | | | | | Indexed primitives were always using cut-aware primitive assembly, whether primitive_restart was enabled or not. Correctly pass down primitive_restart and select optimized PA when possible. Reviewed-by: Tim Rowley <[email protected]>
* i965: Enable OES_copy_image (and EXT) on Gen8+ and Baytrail.Kenneth Graunke2016-05-251-0/+8
| | | | | | | | | | | | | | | | | | | | | | | For now, only enable it on platforms that actually support ETC2. At this point, Broadwell is only failing 5 (out of 8358) dEQP tests: dEQP-GLES31.functional.copy_image.non_compressed.viewclass_32_bits. srgb8_alpha8_r11f_g11f_b10f.renderbuffer_to_texture3d srgb8_alpha8_rgb10_a2ui.renderbuffer_to_cubemap srgb8_alpha8_rgb10_a2ui.renderbuffer_to_renderbuffer srgb8_alpha8_rgb10_a2.renderbuffer_to_texture2d srgb8_alpha8_rgb9_e5.renderbuffer_to_texture3d These fail with all methods (meta, blorp, blitter, memcpy). All are blacklisted from the Android mustpass list, which makes me wonder whether there's an issue with the tests. The formats in question work with other targets, and the targets in question work with other formats... Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Chris Forbes <[email protected]>
* i965: Implement a BLORP path for CopyImage and prefer it over Meta.Kenneth Graunke2016-05-251-6/+28
| | | | | | | | | | | | We're dropping Meta in favor of BLORP everywhere we can. This also fixes bugs when copying cubemaps to 2D, which is currently broken in the meta pass. BLORP just works. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94198 Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Chris Forbes <[email protected]>
* i965: Make the CopyImage BLT path bail for stencil images.Kenneth Graunke2016-05-251-0/+3
| | | | | | | | | | The BLT can't handle S8 because it's W-tiled (at least without additional funny business, and I'm not sure we care). Disallow it so it falls back to the CPU path, which works. Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Chris Forbes <[email protected]>
* i965: Also copy stencil miptree data.Kenneth Graunke2016-05-251-0/+15
| | | | | | | | The Meta path handles this, but the CPU/BLT fallbacks did not. Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Chris Forbes <[email protected]>
* i965: Make a helper function for CopyImage of a miptree.Kenneth Graunke2016-05-251-41/+54
| | | | | | | | | | | | Currently, it only contains the BLT/CPU fallbacks, so the name is a bit too generic. But eventually this will use BLORP as well, at which point the name will make more sense. The next patch will introduce a second call. Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Chris Forbes <[email protected]>
* i965: Combine src/dest tex vs. rb checks in intel_copy_image_sub_data.Kenneth Graunke2016-05-251-20/+13
| | | | | | | | | This simplifies things a little - now we only have one (tex or rb?) if-ladder for src, and a second for dst, rather than four. Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Chris Forbes <[email protected]>
* i965: Account for MinLayer in CopyImageSubData's blitter/CPU paths.Kenneth Graunke2016-05-251-0/+4
| | | | | | | | | | | Fixes Piglit's arb_copy_image-texview test with the Meta path disabled (so we hit the blitter/CPU fallback paths). v2: Add MinLayer even for cube maps (suggested by Ilia). Signed-off-by: Kenneth Graunke <[email protected]> Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Chris Forbes <[email protected]>
* freedreno/ir3: cmdline compiler for glslRob Clark2016-05-252-14/+77
| | | | | | | | Use glsl/libstandalone.la to add support for taking glsl src files (in addition to .tgsi) as input. Then glsl->nir and feed the result into the ir3 backend as normal. Signed-off-by: Rob Clark <[email protected]>
* glsl: split out libstandaloneRob Clark2016-05-256-371/+514
| | | | | | | | | Split standalone glsl_compiler into a libstandalone.la and a thin main.cpp. This way drivers can re-use the glsl standalone frontend in their own standalone compilers. Signed-off-by: Rob Clark <[email protected]> Reviewed-by: Emil Velikov <[email protected]>
* android: drop build of standalone glsl_compilerRob Clark2016-05-251-22/+0
| | | | | | | | | It's only a tool for debugging the glsl compiler, and should not be installed. Signed-off-by: Rob Clark <[email protected]> Tested-by: Rob Herring <[email protected]> Acked-by: Emil Velikov <[email protected]>
* i965: Mark fallthrough in switch statement.Matt Turner2016-05-251-0/+1
| | | | | Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Eric Engestrom <[email protected]>
* i965: Assert that a depth_mt exists when using HiZ.Matt Turner2016-05-254-0/+4
| | | | | Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Eric Engestrom <[email protected]>
* nir: Strengthen assertion that 'out' is nonnull.Matt Turner2016-05-251-1/+1
| | | | | Reviewed-by: Anuj Phogat <[email protected]> Reviewed-by: Eric Engestrom <[email protected]>