author    Vadim Girlin <[email protected]>    2013-04-30 20:51:36 +0400
committer Vadim Girlin <[email protected]>    2013-04-30 21:50:47 +0400
commit    2cd769179345799d383f92dd615991755ec24be1 (patch)
tree      9863c9e92e645cad35a861b7de76f0c98d64b0d0 /src/gallium/drivers/r600/sb/notes.markdown
parent    fbb065d629d2f79a6224fc3e5e89d5acc275e3b4 (diff)
r600g/sb: initial commit of the optimizing shader backend
Diffstat (limited to 'src/gallium/drivers/r600/sb/notes.markdown')
-rw-r--r--    src/gallium/drivers/r600/sb/notes.markdown    413
1 file changed, 413 insertions(+), 0 deletions(-)
diff --git a/src/gallium/drivers/r600/sb/notes.markdown b/src/gallium/drivers/r600/sb/notes.markdown
new file mode 100644
index 00000000000..056497754a2
--- /dev/null
+++ b/src/gallium/drivers/r600/sb/notes.markdown
@@ -0,0 +1,413 @@
+r600-sb
+=======
+
+* * * * *
+
+Debugging
+---------
+
+### Environment variables
+
+- **R600\_DEBUG**
+
+ There are new flags:
+
+ - **sb** - Enable optimization of graphics shaders
+ - **sbcl** - Enable optimization of compute shaders (experimental)
+    - **sbdry** - Dry run: optimize, but use the source bytecode -
+      useful if you only want to check the shader dumps
+      without the risk of lockups and other problems
+    - **sbstat** - Print optimization statistics (currently only timing
+      information)
+    - **sbdump** - Print the IR after some of the passes; e.g. running an
+      application with "R600_DEBUG=sb,sbstat,sbdump" enables the
+      optimization for graphics shaders and prints both the statistics
+      and the IR dumps
+
+### Regression debugging
+
+If there are any regressions compared to the default backend (i.e. with
+the **sb** optimization disabled), the following environment variables
+can be used to find the incorrectly optimized shader that causes the
+regression.
+
+- **R600\_SB\_DSKIP\_MODE** - allows skipping optimization for some
+  shaders
+ - 0 - disabled (default)
+ - 1 - skip optimization for the shaders in the range
+ [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
+ optimize only the shaders that are not in this range
+ - 2 - optimize only the shaders in the range
+ [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
+
+- **R600\_SB\_DSKIP\_START** - start of the range (1-based)
+
+- **R600\_SB\_DSKIP\_END** - end of the range (1-based)
+
+Example - optimize only the shaders 5, 6, and 7:
+
+ R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
+
+All shaders compiled by the application are numbered starting from 1.
+The total number of shaders used by the application can be obtained by
+running it with "R600_DEBUG=sb,sbstat" - it prints "sb: shader \#index\#"
+for each compiled shader.
+
+After figuring out the total number of shaders used by the application,
+the variables above make it possible to bisect to the shader that causes
+the regression. E.g. if the application uses 100 shaders, we can divide
+the range [1; 100] and run the application with the optimization enabled
+only for the first half of the shaders:
+
+ R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
+
+If the regression is reproduced with these parameters, the failing
+shader is in the range [1; 50]; if it's not reproduced, it's in the
+range [51; 100]. We can then divide the new range again and repeat the
+testing until the range is reduced to a single failing shader.
+
+*NOTE: This method relies on the assumption that the application
+produces the same sequence of shaders on each run. This is not always
+true - some applications may produce different shader sequences. In such
+cases a tool like apitrace can be used to record a trace of the
+application, and this method can then be applied while replaying the
+trace. This may also be faster and/or more convenient than testing the
+application itself.*
+
+* * * * *
+
+Intermediate Representation
+---------------------------
+
+### Values
+
+All kinds of operands (literal constants, references to kcache
+constants, references to GPRs, etc) are currently represented by the
+**value** class (it may make sense to switch to a hierarchy of classes
+derived from **value** instead, to save some memory).
+
+All values (except some pseudo values such as the exec\_mask or the
+predicate registers) represent 32-bit scalar values - there are no
+vector values; CF/FETCH instructions use groups of 4 values for their
+src and dst operands.
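+
+A minimal C++ sketch of this representation (illustrative only: apart
+from VLK\_RELREG, which is described under "Indirect addressing" below,
+the enumerator and field names here are hypothetical, and the real
+**value** class carries much more information):
+
+    // Illustrative operand kinds; only VLK_RELREG is named in this document.
+    enum value_kind {
+        VLK_CONST,   // literal constant
+        VLK_KCACHE,  // kcache (constant buffer) reference
+        VLK_REG,     // GPR component
+        VLK_RELREG   // indirectly addressed operand, see "Indirect addressing"
+    };
+
+    struct value {
+        value_kind kind;
+        unsigned   sel;      // register / constant selector
+        unsigned   chan;     // component (x/y/z/w) - each value is a 32-bit scalar
+        unsigned   version;  // SSA version, assigned by the ssa_rename pass
+    };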
+
+### Nodes
+
+Shader programs are represented using a tree data structure; some nodes
+contain a list of subnodes.
+
+#### Control flow nodes
+
+Control flow information is represented using four special node types
+(based on the ideas from [[1]](#references)):
+
+- **region\_node** - single-entry, single-exit region.
+
+    All loops and ifs in the program are enclosed in region nodes.
+    Region nodes have two containers for phi nodes:
+    region\_node::loop\_phi contains the phi expressions to be executed
+    at the region entry, and region\_node::phi contains the phi
+    expressions to be executed at the region exit. It's the only node
+    type that has associated phi expressions.
+
+- **depart\_node** - "depart region \$id after { ... }"
+
+    Depart the target region (jump to its exit point) after executing
+    the contained code.
+
+- **repeat\_node** - "repeat region \$id after { ... }"
+
+    Repeat the target region (jump to its entry point) after executing
+    the contained code.
+
+- **if\_node** - "if (cond) { ... }"
+
+    Execute the contained code if the condition is true. The difference
+    from [[1]](#references) is that we don't associate phi expressions
+    with the **if\_node**; instead, the **if\_node** is enclosed in a
+    **region\_node** and the corresponding phi's are stored in that
+    **region\_node**, which allows more uniform handling.
+
+The target region of a depart or repeat node is always a region that
+encloses it (possibly through nested regions); there are no arbitrary
+jumps/gotos - control flow in the program is always structured.
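+
+The sketch below illustrates in C++ how these four node types could be
+modeled (the type names follow the text above, but the fields are
+hypothetical and simplified compared to the actual sb classes):
+
+    #include <vector>
+
+    struct node;   // instruction or phi node, see "Instruction nodes" below
+    struct value;  // scalar operand, see "Values" above
+
+    // Base for nodes that contain a list of subnodes.
+    struct container_node { std::vector<node*> subnodes; };
+
+    struct region_node : container_node {
+        std::vector<node*> loop_phi;  // phi's executed at the region entry
+        std::vector<node*> phi;       // phi's executed at the region exit
+    };
+
+    // "depart region $id after { ... }": execute the body, then jump to
+    // the exit of the target region; the node's index selects the source
+    // operand of the target region's phi nodes.
+    struct depart_node : container_node { region_node *target; unsigned dep_id; };
+
+    // "repeat region $id after { ... }": execute the body, then jump back
+    // to the entry of the target region; the node's index selects a source
+    // operand (>= 1, src[0] is the incoming value) of the target's loop_phi's.
+    struct repeat_node : container_node { region_node *target; unsigned rep_id; };
+
+    // "if (cond) { ... }": execute the body if the condition value (an
+    // exec_mask pseudo value) is true; the associated phi's live in the
+    // enclosing region_node.
+    struct if_node : container_node { value *cond; };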
+
+Typical control flow constructs can be represented as in the following
+examples:
+
+GLSL:
+
+ if (cond) {
+ < 1 >
+ } else {
+ < 2 >
+ }
+
+IR:
+
+ region #0 {
+ depart region #0 after {
+ if (cond) {
+ depart region #0 after {
+ < 1 >
+ }
+ }
+ < 2 >
+ }
+ <region #0 phi nodes >
+ }
+
+GLSL:
+
+ while (cond) {
+ < 1 >
+ }
+
+IR:
+
+ region #0 {
+ <region #0 loop_phi nodes>
+ repeat region #0 after {
+ region #1 {
+ depart region #1 after {
+ if (!cond) {
+ depart region #0
+ }
+ }
+ }
+ < 1 >
+ }
+ <region #0 phi nodes>
+ }
+
+'Break' and 'continue' inside the loops are directly translated to the
+depart and repeat nodes for the corresponding loop region.
+
+This may look a bit too complicated, but in fact it allows simpler and
+more uniform handling of the control flow.
+
+All loop\_phi and phi nodes of a given region always have the same
+number of source operands. The number of source operands for
+region\_node::loop\_phi nodes is 1 + the number of repeat nodes that
+reference this region as a target. The number of source operands for
+region\_node::phi nodes is equal to the number of depart nodes that
+reference this region as a target. Each depart/repeat node of a region
+has a unique index, which is the index of the source operand it supplies
+to the region's phi/loop\_phi nodes.
+
+The first source operand of region\_node::loop\_phi nodes (src[0]) is
+the incoming value that enters the region from the outside. Each
+remaining source operand comes from the corresponding repeat node.
+
+More complex example:
+
+GLSL:
+
+ a = 1;
+ while (a < 5) {
+ a = a * 2;
+ if (b == 3) {
+ continue;
+ } else {
+ a = 6;
+ }
+ if (c == 4)
+ break;
+ a = a + 1;
+ }
+
+IR with SSA form (the check of the loop condition "a < 5" is omitted
+for brevity):
+
+ a.1 = 1;
+ region #0 {
+ // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
+        region #0 loop_phi: a.2 = phi a.1, a.6, a.3
+
+ repeat_1 region #0 after {
+ a.3 = a.2 * 2;
+ cond1 = (b == 3);
+ region #1 {
+ depart_0 region #1 after {
+ if (cond1) {
+ repeat_2 region #0;
+ }
+                    a.4 = 6;
+                }
+
+                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
+ }
+ cond2 = (c == 4);
+ region #2 {
+ depart_0 region #2 after {
+ if (cond2) {
+ depart_0 region #0;
+ }
+ }
+ }
+ a.6 = a.5 + 1;
+ }
+
+ region #0 phi: a.7 = phi a.5 // src[0] from depart_0
+ }
+
+Phi nodes with a single source operand are just copies; they are not
+strictly necessary, but they allow handling all **depart\_node**s in a
+uniform way.
+
+#### Instruction nodes
+
+Instruction nodes represent different kinds of instructions -
+**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
+a "bc" structure where all fields of the bytecode are stored (the type
+is **bc\_alu** for **alu\_node**, etc). The operands are represented
+using vectors of pointers to the **value** class (node::src, node::dst).
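+
+A rough C++ sketch of this layout (illustrative only - the real node and
+bytecode structures in the sb sources carry many more fields):
+
+    #include <vector>
+
+    struct value;                       // scalar operand, see "Values" above
+    typedef std::vector<value*> vvec;   // vector of operand pointers
+
+    struct bc_alu { /* raw ALU bytecode fields: opcode, modifiers, ... */ };
+
+    struct node {
+        vvec dst;  // destination values
+        vvec src;  // source values
+    };
+
+    struct alu_node : node {
+        bc_alu bc;  // bytecode fields of this ALU instruction
+    };
+
+    // cf_node / fetch_node are analogous, with bc_cf / bc_fetch payloads.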
+
+#### SSA-specific nodes
+
+Phi nodes currently don't have a special node class; they are stored as
+plain **node**s. The destination vector contains a single destination
+value, and the source vector contains 1 or more source values.
+
+Psi nodes [[5], [6]](#references) also don't have a special node class
+and are stored as plain **node**s. The source vector contains 3 values
+for each source operand: the predicate **value**, the **value** of the
+corresponding PRED\_SEL field, and the source **value** itself (i.e.
+src = [pred0, psel0, val0, pred1, psel1, val1, ...]).
+
+### Indirect addressing
+
+A special kind of value (VLK\_RELREG) is used to represent indirect
+operands. These values don't have SSA versions. The representation is
+mostly based on [[2]](#references). An indirect operand contains the
+"offset/address" value (value::rel) - e.g. some SSA version of the AR
+register value, though after some passes it may be any value (constant,
+register, etc). It also contains the maydef and mayuse vectors of
+pointers to **value**s (similar to the dst/src vectors in the **node**)
+to represent the effects of aliasing in the SSA form.
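+
+Extending the earlier **value** sketch from the "Values" section, an
+indirect operand could be modeled roughly like this (the field names
+follow the text above; the details are hypothetical):
+
+    #include <vector>
+
+    struct value {
+        // ... kind (VLK_RELREG), sel, chan, version as in the earlier sketch ...
+        value*              rel;     // offset/address value, e.g. an SSA version of AR
+        std::vector<value*> maydef;  // versions possibly defined through this operand
+        std::vector<value*> mayuse;  // versions possibly used through this operand
+    };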
+
+E.g. if we have the array R5.x ... R8.x and the following instruction:
+
+ MOV R0.x, R[5 + AR].x
+
+then the source indirect operand is represented with a VLK\_RELREG
+value: value::rel is AR, value::maydef is empty (in fact it always
+contains the same number of elements as mayuse to simplify the handling,
+but they are NULLs), and value::mayuse contains [R5.x, R6.x, R7.x, R8.x]
+(or the corresponding SSA versions after ssa\_rename).
+
+Additional "virtual variables" as in [HSSA [2]](#references) are not
+used, and there is no special handling for "zero versions". Typical
+programs in our case are small, indirect addressing is rare, and array
+sizes are limited by the maximum GPR number, so we don't really need
+special tricks to avoid an explosion of value versions. This also allows
+more precise liveness computation for array elements without
+modifications to the algorithms.
+
+With the following instruction:
+
+ MOV R[5+AR].x, R0.x
+
+we'll have both the maydef and mayuse vectors of the dst operand
+initially filled with the array values [R5.x, R6.x, R7.x, R8.x]. After
+the ssa\_rename pass, mayuse contains the previous versions and maydef
+contains the new potentially-defined versions.
+
+* * * * *
+
+Passes
+------
+
+- **bc\_parser** - creates the IR from the source bytecode and
+  initializes the src and dst value vectors for instruction nodes. Most
+  ALU nodes have one dst operand, and the number of source operands is
+  equal to the number of source operands of the ISA instruction.
+  Nodes for PREDSETxx instructions have 3 dst operands: dst[0] is the
+  dst gpr as in the original instruction, the other two are
+  pseudo-operands that represent the possibly updated predicate and
+  exec\_mask. Predicate values are used in predicated alu instructions
+  (node::pred), exec\_mask values are used in the if\_nodes
+  (if\_node::cond). Each vector operand in the CF/TEX/VTX instructions
+  is represented with 4 values - the components of the vector.
+
+- **ssa\_prepare** - creates phi expressions.
+
+- **ssa\_rename** - renames the values (assigns versions).
+
+- **liveness** - liveness computation, sets 'dead' flag for unused
+ nodes and values, optionally computes interference information for
+ the values.
+
+- **dce\_cleanup** - eliminates 'dead' nodes and removes some
+  unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
+  instructions in the source bytecode and the containers for ALU groups
+  (they were only needed for the ssa\_rename pass).
+
+- **if\_conversion** - converts control flow with if\_nodes to data
+  flow in cases where this can improve performance (small alu-only
+  branches). Both branches are executed speculatively, and the phi
+  expressions are replaced with conditional moves (CNDxx) that select
+  the final value using the same condition predicate as was used by the
+  original if\_node. E.g. where the **if\_node** used dst[2] of a
+  PREDSETxx instruction, the CNDxx now uses dst[0] of the same
+  PREDSETxx instruction.
+
+- **peephole** - peephole optimizations
+
+- **gvn** - Global Value Numbering [[2]](#references),
+ [[3]](#references)
+
+- **gcm** - Global Code Motion [[3]](#references). Also performs
+ grouping of the instructions of the same kind (CF/FETCH/ALU).
+
+- register allocation passes - some ideas are taken from
+  [[4]](#references), but the implementation is simplified to make it
+  more efficient in terms of compilation speed (e.g. no recursive
+  recoloring) while still achieving good enough results.
+
+    - **ra\_split** - prepares the program for register allocation.
+      Splits the live ranges of constrained values by inserting copies
+      to/from temporary values, so that the live ranges of the
+      constrained values become minimal.
+
+    - **ra\_coalesce** - performs global allocation of the registers
+      used in CF/FETCH instructions. It's performed first to make sure
+      the components of vector operands end up in the same GPR. It also
+      tries to allocate all values involved in copies (inserted by the
+      ra\_split pass) to the same register, so that the copies can be
+      eliminated.
+
+ - **ra\_init** - allocates gpr arrays (if indirect addressing is
+ used), and remaining values.
+
+- **post\_scheduler** - ALU scheduler; handles VLIW packing and
+  performs the final register allocation for local values inside ALU
+  clauses. Also eliminates all coalesced copies (where the src and dst
+  of the copy are allocated to the same register).
+
+- **ra\_checker** - optional debugging pass that tries to catch basic
+  errors of the scheduler or register allocator.
+
+- **bc\_finalize** - propagates the regalloc information from values
+ in node::src and node::dst vectors to the bytecode fields, converts
+ control flow structure (region/depart/repeat) to the target
+ instructions (JUMP/ELSE/POP,
+ LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
+
+- **bc\_builder** - builds the final bytecode.
+
+* * * * *
+
+References
+----------
+
+[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
+McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
+
+[2] ["Effective Representation of Aliases and Indirect Memory Operations
+in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
+Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
+
+[3] ["Global Code Motion. Global Value Numbering.", Cliff
+Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
+
+[4] ["Register Allocation for Programs in SSA Form", Sebastian
+Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
+
+[5] ["An extension to the SSA representation for predicated code",
+Francois de
+Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
+
+[6] ["Improvements to the Psi-SSA Representation", F. de
+Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)