| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
Like constant buffers, samplers and textures are aliased on Fermi and
we need to invalidate the state when switching from 3D to CP and vice
versa.
This fixes rendering issues in the UE4 demos.
Signed-off-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
| |
Right now libglsl.la depends on libnir.la so putting it in libnir.la
adds a dependency on libglsl.la that goes the wrong direction.
Reviewed-by: Emil Velikov <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The R_028B50_VGT_TESS_DISTRIBUTION value is copied from
amdgpu-pro. Smaller values in the ACCUM fields seem to
decrease the performance advantage from this patch, higher
values don't seem to matter.
v2: Add distribution mode field enums.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using more than 1 wave per threadgroup does increase performance
generally. Not using too many patches per threadgroup also
increases performance. Both catalyst and amdgpu-pro seem to
use 40 patches as their maximum, but I haven't really seen
any performance increase from limiting the number of patches
to 40 instead of 64.
Note that the trick where we overlap the input and output LDS
does not work anymore as the insertion of the tess factors
changes the patch stride.
v2: - Add comment about LDS assumptions.
- Add constant for buffer size.
- Fix code style.
v3: - Correct limits for not splitting patches between waves.
- Set max num_patches to 40 as in the proprietary driver.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
| |
The factors may be stored to LDs by another invocation than
the invocation for vertex 0.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This allows running the TES on different CU's than the
TCS which results in performance improvements.
v2: Only write the control word from one invocation.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
| |
They are unused.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We always try to use 4-component loads, as LLVM does not combine loads
and they bypass the L1 cache.
We can't use a similar strategy for stores and this is especially
notable with the tess factors, as they are often set with separate
MOV's per component in the TGSI.
We keep storing to LDS and the LDS space, so we can load the outputs
later, either due to the shader, of for wrting the tess factors.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We need to copy the VS outputs to memory. I decided to do this
using a shader key, as the value depends on other shaders.
I also switch the fixed function TCS over to monolithic, as
otherwisze many of the user SGPR's need to be passed to the
epilog, which increases register pressure, or complexity to
avoid that. The main body of the fixed function TCS is not
that interesting to precompile anyway, since we do it on
demand and it is very small.
v2: Use u_bit_scan64.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of creating a memory area per patch and per vertex, we put
the same attribute of every vertex & patch together. Most loads
and stores access the same attribute across all lanes, only for
different patches and vertices.
For the TCS this results in tightly packed data for 4-component
stores.
For the TES this is not the case as within a patch the loads
often also access the same vertex. However if there are < 4
vertices/patch, this still results in a reduction of the number
of cache lines. In the LDS situation we only do better than worst
case if the data per patch < 64 bytes, which due to the
tessellation factors is pretty much never.
We do not use hardware swizzling for this. It would slightly reduce
the number of executed VALU instructions, but I had issues with
increased wait times that I haven't been able to solve yet.
Furthermore, the tbuffer_store intrinsic does not support both
VGPR offset and an index, so we have a problem storing
indirectly indexed outputs. This can be solved by temporarily
storing arrays in LDS and then copying them, but I don't think
that is worth the effort. The difference in VALU cycles
hardware swizzling gives is about 0.2% of total busy cycles.
That is without handling the array case.
I chose for attributes instead of components as they are often
accessed together, and the software swizzling takes VALU cycles
for calculating offsets.
v2: - Rename functions to get_tcs_tes_buffer_address.
- multiply by 16 as late as possible.
- Use tgsi_full_src_register_from_dst.
- Remove some bad comments.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This happens to be in the right position, but that changes
when TCS/TES get new parameters.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
v2: - Use llvm.admgcn.buffer.load instrinsics for new LLVM.
- Code style fixes.
v3: - Code style fix.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Nicolai Hähnle <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
The buffer is quite large, but should only be allocated if the
application uses tessellation. Most non-games don't.
v2: - Use the correct register for SI.
- Add define for block size.
Signed-off-by: Bas Nieuwenhuizen <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
| |
CID 1271532 (#1 of 1): Out-of-bounds read (OVERRUN)34. overrun-local:
Overrunning array of 2 16-byte elements at element index 2 (byte offset
32) by dereferencing pointer &inst.Dst[i].
Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Not sure why coverity calls this an out-of-bounds read vs out-of-bounds
write.
CID 1358920 (#1 of 1): Out-of-bounds read (OVERRUN)9. overrun-local:
Overrunning array r of 3 16-byte elements at element index 3 (byte
offset 48) using index chan (which evaluates to 3).
Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To read out MP perf counters we use a compute shader and need to upload
input data like a 64-bits addr used to store the values and a sequence
ID for synchronization. Currently, this input data is uploaded as user
uniforms which means that it's sticked to c0[], but if a compute shader
from a real application is used, monitoring those performance counters
will just overwrite some data and miserably crash.
Instead, sticking the 64-bits addr and the sequence into the driver
constant buffer seems like much better and will allow to monitor
counters with GL 4.3 apps.
Tested on GF119 and GK110, but should not hurt anything on GK104.
Signed-off-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Example:
Gallium 0.4 on AMD TONGA (DRM 3.2.0 / 4.5.0, LLVM 3.9.0)
My kernel version is pretty long already (4.5.0-amd-01025-g32791c1)
and adding "kernel" into the string would make too it long for glxinfo
to display.
Reviewed-by: Michel Dänzer <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ported from the initial amdgpu winsys from the private AMD branch.
The thread creates the buffer list, submits IBs, and cleans up
the submission context, which can also destroy buffers.
3-5% reduction in CPU overhead is expected for apps submitting a lot
of IBs per frame. This is most visible with DMA IBs.
v2: use a semaphore instead of a busy loop in amdgpu_ws_queue_cs
add another amdgpu_cs_sync_flush call into amdgpu_bo_map
Reviewed-by: Nicolai Hähnle <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes the following piglit tests (for softpipe):
/spec/glsl-1.30/execution/built-in-functions/...
fs-roundeven-float
fs-roundeven-vec2
fs-roundeven-vec3
fs-roundeven-vec4
vs-roundeven-float
vs-roundeven-vec2
vs-roundeven-vec3
vs-roundeven-vec4
/spec/glsl-1.50/execution/built-in-functions/...
gs-roundeven-float
gs-roundeven-vec2
gs-roundeven-vec3
gs-roundeven-vec4
Signed-off-by: Lars Hamre <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
| |
Not piping this all the way through yet, but no better place to note
this down. This will can be used with NV_viewport_array2.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
| |
For fermi, this likely will require use of linked tsc mode. However on
bindless architectures, we can have as many as we want. As it stands,
the AUX_TEX_INFO has 32 teture handles reserved.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Samuel Pitoiset <[email protected]>
|
|
|
|
|
|
|
|
| |
Indexed primitives were always using cut-aware primitive assembly,
whether primitive_restart was enabled or not. Correctly pass down
primitive_restart and select optimized PA when possible.
Reviewed-by: Tim Rowley <[email protected]>
|
|
|
|
|
|
|
|
| |
Use glsl/libstandalone.la to add support for taking glsl src files (in
addition to .tgsi) as input. Then glsl->nir and feed the result into
the ir3 backend as normal.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
|
| |
The GALLIUM_HUD does not yet expose a description for each events, but
this might be useful for developers who want to have a long description
of hw perf counters directly in the source code.
Signed-off-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This text transformation was done automatically via the following shell
command:
$ find -name SCons\* -exec sed -i s/\\s\\+$// '{}' \;
Signed-off-by: Giuseppe Bilotta <[email protected]>
Reviewed-by: Brian Paul <[email protected]>
|
|
|
|
|
|
|
|
|
| |
Print "GEOM" instead of "2", for example.
v2: also update the text parsing code, per Ilia.
Reviewed-by: Ilia Mirkin <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Ilia Mirkin <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows clear and easy communication between the two.
Caller: Requesting information (struct vN)
Callee: I know how to deal with older version (vN-1) only. Here is your
data and the version I support.
Caller: Older version ? Sure I'll cap all access to the fields provided
by the older version (vN-1)
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Tested-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
One cannot use a single version to control both export_in and export_out
versions. Using this forces us to always extend/bump both structs at the
same time.
An alternative scheme is coming with next patch.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Tested-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
|
| |
... and make them more explicit.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Tested-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
|
| |
Be more explicit what it actually does.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Tested-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
|
| |
OCD polish for consistency with other mesa interfaces.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Tested-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
| |
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
Tested-by: Tom Stellard <[email protected]>
|
|
|
|
| |
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
| |
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Don't clear bucket descriptions to fix issues with sim level
buckets getting out of sync.
2. Close out threadviz file descriptors in ClearThreads().
3. Skip buckets for jitter based buckets when multithreaded. We need
thread local storage through llvm jit functions to be fixed before
we can enable this.
4. Fix buckets StopCapture to correctly detect capture complete.
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
| |
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
| |
Reviewed-by: Bruce Cherniak <[email protected]>
|
|
|
|
|
|
|
| |
We apparently pass all the relevant CTS tests. There are probably some
shortcomings, but they can be addressed down the line.
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some hardware supports primitive restart on patch primitives, and other
hardware does not. Modern GL and ES include a query for this feature;
adding a capability bit will allow us to answer it.
As far as I know, AMD hardware does not support this feature, while
NVIDIA and Intel hardware does. However, most Gallium drivers do not
appear to support tessellation shaders yet. So, I've enabled it for
nvc0 and disabled it everywhere else.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
The variable-indexing tests always had a few random fails, which I
usually couldn't reproduce when running tests manually. Somehow
recently this got a lot worse. I ported a couple of the shaders to
GLES to see what blob does, and it also seems to be avoiding to cp
indirect srcs. So I guess indirect w/ instructions other than cat1
(mov) are not totally reliable. Let's just switch that off until
this is better understood.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
| |
Constbufs are only aliased on Fermi and this will reduce the number of
flushes when we switch between 3d and compute.
Signed-off-by: Samuel Pitoiset <[email protected]>
Reviewed-by: Ilia Mirkin <[email protected]>
|
|
|
|
|
|
|
| |
Analogous to previous commits.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Leo Liu <[email protected]>
|
|
|
|
|
|
|
| |
Analogous to previous commit.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Leo Liu <[email protected]>
|
|
|
|
|
|
|
| |
Add separate labels and jump to the correct one as needed.
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Leo Liu <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Implement support for mapImage/unmapImage functions in version 12 of the
DRIimage extension.
Signed-off-by: Rob Herring <[email protected]>
[Emil Velikov: align/indent the map/unmap vfuncs]
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This happens with dEQP tests. The code doesn't at all protect against
this condition, so while unhandled, this is an expected situation.
Also avoid using more than the first 16 registers for nv3x vertex
programs.
Signed-off-by: Ilia Mirkin <[email protected]>
|