llvmpipe: do transpose/untwiddle after conversion for 8bit formats - mesa.git


author	Roland Scheidegger <[email protected]>	2016-12-22 03:55:17 +0100
committer	Roland Scheidegger <[email protected]>	2017-01-06 23:13:34 +0100
commit	f4821daed1ec185d16f9ee3cc0951d306ce6e2b9 (patch)
tree	60758afbc66072c7c18d71364061361718da8073 /include/c99_alloca.h
parent	6e7ce1ef55b138ed2cdedb40cbc010b523de8743 (diff)

llvmpipe: do transpose/untwiddle after conversion for 8bit formats

Generally we should do the transpose after conversion if the format has
fewer than 32 bits per channel (if it has 32 bits, conversion is going to
be a no-op anyway). This is because there are fewer vectors to deal with.
The advantage for 16 bit formats isn't that big though, and with AVX there
isn't really any (the 32bit unpacks can be done with 256bit vectors but the
smaller ones cannot, although that would change again with proper AVX2
support).

This only makes sense for 2d and not 1d cases. And to keep things easy,
only handle 1, 2 and 4 channels (rgbx is just fine).

For rgba unorm8 format the backend conversion sums up to these instruction
totals (not counting the movs for SSE2 due to 2-op syntax - generally every
2 unpacks need an additional mov):

                      SSE2               AVX
transpose:            32 unpack          16 unpack
untwiddle:            0                  8 (128bit low/high permutes)
convert:              16 mul + 16 cvt    8 mul + 8 cvt
32->8bit:             12 pack            8 (128bit extract) + 12 pack

When doing transpose/untwiddle afterwards we get:

                      SSE2               AVX
convert:              16 mul + 16 cvt    8 mul + 8 cvt
32->8bit:             12 pack            8 (128bit extract) + 12 pack
transpose/untwiddle:  12 unpack          12 unpack

So for SSE2, this drops 20 unpacks (total instruction count 76->56),
whereas for AVX it replaces the 16 256bit unpacks with 12 128bit ones and
drops the 8 lo/hi permutes (in total 60->48).

(To be fair, the permutes could be dropped even when doing the transpose
first. They are extremely pointless, but we'd need to be able to tell
lp_build_conv to reorder the vectors. For AVX2 we're going to need to be
able to tell lp_build_conv about ordering in any case.)

(With different ordering going into conversion, it would be possible to do
4 unpacks + 4 pshufbs instead of 12 unpacks, but that might not be better,
and not all cpus can do it. Proper AVX2 support should eliminate the 8
128bit extracts, reduce these 12 packs to 6 and the 12 unpacks to
2 pshufb + 2 permq ideally (+ 2 final 128bit extracts).)

Reviewed-by: Jose Fonseca <[email protected]>
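The totals quoted in the message (76->56 for SSE2, 60->48 for AVX) can be re-added from the per-stage counts. The following is only a bookkeeping sketch in C: the struct and function names are hypothetical and are not part of Mesa, and the numbers are copied from the tables in the commit message rather than measured.

```c
#include <assert.h>
#include <stdio.h>

/* Per-stage instruction counts, copied from the commit message tables.
 * All names here are made up for this sketch; nothing is Mesa API. */
struct stage_counts {
   int transpose;  /* unpacks, plus the lo/hi permutes where AVX needs them */
   int convert;    /* mul + cvt */
   int pack;       /* 32->8bit packs, plus 128bit extracts on AVX */
};

/* Sum the three stages into one instruction total. */
int total_count(struct stage_counts c)
{
   assert(c.transpose >= 0 && c.convert >= 0 && c.pack >= 0);
   return c.transpose + c.convert + c.pack;
}

/* Transpose/untwiddle before conversion (old order). */
struct stage_counts sse2_transpose_first(void)
{
   struct stage_counts c = { 32 + 0, 16 + 16, 12 };
   return c;
}

struct stage_counts avx_transpose_first(void)
{
   struct stage_counts c = { 16 + 8, 8 + 8, 8 + 12 };
   return c;
}

/* Transpose/untwiddle after conversion (new order for 8bit formats). */
struct stage_counts sse2_transpose_last(void)
{
   struct stage_counts c = { 12, 16 + 16, 12 };
   return c;
}

struct stage_counts avx_transpose_last(void)
{
   struct stage_counts c = { 12, 8 + 8, 8 + 12 };
   return c;
}

/* Print the before/after totals for both instruction sets. */
void print_totals(void)
{
   printf("SSE2: %d -> %d\n",
          total_count(sse2_transpose_first()),
          total_count(sse2_transpose_last()));
   printf("AVX:  %d -> %d\n",
          total_count(avx_transpose_first()),
          total_count(avx_transpose_last()));
}
```

Calling `print_totals()` from a small `main` prints the two before/after pairs. Only the transpose stage changes between the two orderings; the convert and pack counts are the same in both.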

Diffstat (limited to 'include/c99_alloca.h')

0 files changed, 0 insertions, 0 deletions

