author | Eric Anholt <[email protected]> | 2017-01-05 15:11:30 -0800 |
---|---|---|
committer | Eric Anholt <[email protected]> | 2017-01-26 12:48:10 -0800 |
commit | 4d30024238efa829cabc72c1601beeee18c3dbf2 (patch) | |
tree | 09f04f006eb015b3cc5940eddde461519114f77a /doxygen/tnl_dd.doxy | |
parent | 347b69e7d74f61f3b08853ccdfad72bdae683e12 (diff) |
vc4: Use NEON to speed up utile loads on Pi2.

We had a lot of memcpy call overhead because gpu_stride wasn't being
inlined. However, splitting out the stride==8 and stride==16 cases as
this code does, while still using memcpy, would mean glibc's NEON
memcpy is no longer applied. At that point we'd be doing 16 uncached
reads instead of 64/(NEON memcpy granularity), for about a 30%
performance hit. By hand-writing the assembly, we can load a whole
cacheline at a time.

Unfortunately, NEON intrinsics turned out to be unusable -- they
didn't have the vldm instruction available.

Note that, for now, the NEON code is only enabled when building for
ARMv7 (Pi 2+). In the future, we may want to do runtime detection for
the Raspbian case.

Improves 1024x1024 GetTexImage by 208.256% +/- 7.07029% (n=10).
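
For readers without the diff at hand, the split into stride==8 and stride==16 cases mentioned in the message can be sketched in portable C roughly as follows. This is only an illustration of the data movement, not the code from this commit: the function name `load_utile_sketch`, its signature, and the `UTILE_SIZE` constant are hypothetical, and the 64-byte utile size is assumed from the "64" in the message. It uses fixed-size memcpy calls, which is the variant the message describes as about 30% slower than the hand-written NEON path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: one utile is 64 bytes, taken from the "64" in the message. */
#define UTILE_SIZE 64

/* Hypothetical helper: copy one utile from GPU memory, where its rows are
 * packed back to back, into a CPU image whose rows are cpu_stride bytes
 * apart. gpu_stride is the number of bytes in one utile row.
 */
static inline void
load_utile_sketch(uint8_t *cpu, const uint8_t *gpu,
                  uint32_t cpu_stride, uint32_t gpu_stride)
{
        /* Only the two strides named in the commit message are handled. */
        assert(gpu_stride == 8 || gpu_stride == 16);

        if (gpu_stride == 8) {
                /* 8 rows of 8 bytes. The copy size is a compile-time
                 * constant, so compilers typically expand it inline
                 * instead of calling the library memcpy.
                 */
                for (uint32_t row = 0; row < UTILE_SIZE / 8; row++)
                        memcpy(cpu + row * cpu_stride, gpu + row * 8, 8);
        } else {
                /* 4 rows of 16 bytes, same idea. */
                for (uint32_t row = 0; row < UTILE_SIZE / 16; row++)
                        memcpy(cpu + row * cpu_stride, gpu + row * 16, 16);
        }
}
```

The NEON version described above keeps the same two cases but, per the message, replaces the per-row copies with assembly that loads the whole utile at once.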