path: root/src/gallium/drivers/lima
Commit message | Author | Age | Files | Lines
* lima: set uniforms_address lower bits properly | Vasily Khoruzhick | 2019-09-28 | 1 | -0/+8
  It looks like the blob uses the following values for the uniforms buffer:
    0 for 8 bytes
    1 for 16 bytes
    2 for 24 bytes
    2 for 32 bytes
    3 for 40 bytes
    3 for 48 bytes
    3 for 56 bytes
    3 for 64 bytes
    4 for 72 bytes
  It all looks like log2(size / 8) rounded up, so let's do the same.

  Fixes: 931fc2a7b3f9 ("lima: do not set the PP uniforms address lowest bits")
  Reviewed-by: Icenowy Zheng <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
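A minimal sketch of the rule stated above, log2(size / 8) rounded up. The helper name `uniforms_size_field` is hypothetical and this is not the code from the commit; it only reproduces the table of observed values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the driver's actual code: returns
 * ceil(log2(size_bytes / 8)) for a non-zero multiple of 8 bytes. */
static inline uint32_t uniforms_size_field(uint32_t size_bytes)
{
    assert(size_bytes >= 8 && size_bytes % 8 == 0);

    uint32_t units = size_bytes / 8;  /* size in 8-byte units */
    uint32_t bits = 0;

    /* Find the smallest 'bits' such that 2^bits >= units. */
    while ((UINT32_C(1) << bits) < units)
        bits++;

    return bits;
}
```

For the sizes listed in the commit message this gives 0 for 8 bytes, 2 for both 24 and 32 bytes, and 4 for 72 bytes.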
* lima: do not set the PP uniforms address lowest bits | Icenowy Zheng | 2019-09-28 | 1 | -1/+0
  The PP uniforms address register in render state is not a direct pointer to the uniforms storage -- instead, it points to a one-item array, and the array item is the real pointer to the uniforms storage.

  This register reuses some of its LSBs as a size field. Currently the size is set according to the length of the real uniforms storage. However, as the register itself contains only a pointer to the one-item array, the size field should be set to the length of the one-item array minus 1, which means a fixed value of 0. That means we can just omit it now.

  Testing shows this should be the correct approach to set this register.

  Signed-off-by: Icenowy Zheng <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: add NIR pass to split varying loads | Vasily Khoruzhick | 2019-09-26 | 5 | -0/+127
  NIR may emit a single intrinsic to load several packed varyings, but that's suboptimal for Utgard PP for several reasons:
  - varyings that are used as sampler inputs can be passed using a pipeline register with increased precision
  - we have a small number of regs, so using a vec4 reg for storing two vec2 varyings increases reg pressure.

  Add a NIR pass to split a single load into several loads and utilize it in lima.

  Reviewed-by: Qiang Yu <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima: support rectangle texture | Icenowy Zheng | 2019-09-26 | 4 | -3/+9
  As Vasily discovered, bit 7 of word 1 of the texture descriptor is set when reloading the framebuffer, to use a framebuffer-based offset rather than a normalized one. This bit also works for regular textures to enable accessing with a non-normalized offset.

  Add support for rectangle textures by setting this bit for PIPE_TEXTURE_RECT.

  Suggested-by: Vasily Khoruzhick <[email protected]>
  Signed-off-by: Icenowy Zheng <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
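As an illustration of the bit described above, here is a sketch that sets bit 7 of word 1. It assumes the descriptor can be viewed as an array of 32-bit words; the macro and function names are hypothetical, and the driver's real descriptor type is not shown in this log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed name for the bit described in the commit message. */
#define TEX_DESC_W1_UNNORM_COORDS (UINT32_C(1) << 7)

/* 'desc' points at the words of a texture descriptor. Treating the
 * descriptor as 32-bit words is an assumption made for illustration. */
static inline void tex_desc_set_unnormalized(uint32_t *desc, bool is_rect)
{
    if (is_rect)
        desc[1] |= TEX_DESC_W1_UNNORM_COORDS;   /* non-normalized coords */
    else
        desc[1] &= ~TEX_DESC_W1_UNNORM_COORDS;  /* normalized coords */
}
```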
* lima/ppir: Add various varying fetch sources to disassembler | Andreas Baierl | 2019-09-25 | 1 | -23/+50
  Signed-off-by: Andreas Baierl <[email protected]>
  Reviewed-by: Connor Abbott <[email protected]>
* lima/ppir: add support for indirect load of uniforms and varyings | Vasily Khoruzhick | 2019-09-24 | 6 | -12/+60
  Utgard PP supports indirect load of uniforms and varyings, so let's enable it.

  Reviewed-by: Qiang Yu <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: add node dependency types | Vasily Khoruzhick | 2019-09-24 | 6 | -28/+58
  Currently we add dependencies in 3 cases:
  1) One node consumes a value produced by another node
  2) Sequence dependencies
  3) Write after read dependencies

  2) and 3) only affect scheduler decisions, since we can still use a pipeline register if we have only 1 dependency of type 1).

  Add 3 dependency types and mark dependencies as we add them.

  Reviewed-by: Qiang Yu <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
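The pipeline-register rule above can be sketched as follows. The enum and function names are hypothetical stand-ins, not the identifiers used in ppir; the sketch only shows why tagging each dependency with a type helps.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three dependency kinds listed above. */
enum dep_type {
    DEP_SRC,              /* 1) successor consumes the produced value */
    DEP_SEQUENCE,         /* 2) ordering only */
    DEP_WRITE_AFTER_READ  /* 3) ordering only */
};

/* A value can go through a pipeline register when exactly one
 * successor consumes it; the other kinds only constrain scheduling. */
static bool can_use_pipeline_reg(const enum dep_type *succ_deps, size_t n)
{
    size_t consumers = 0;

    for (size_t i = 0; i < n; i++) {
        if (succ_deps[i] == DEP_SRC)
            consumers++;
    }
    return consumers == 1;
}
```

Without the type tag, a node with one consumer and one write-after-read successor would look like it has two users and would be denied the pipeline register.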
* lima/ppir: don't attempt to clone tex coords if it's not varying | Vasily Khoruzhick | 2019-09-25 | 1 | -3/+10
  It makes no sense to clone texture coords if they are not a varying; moreover, we don't support cloning ALU nodes.

  Fixes: 1c1890fa7077 ("lima/ppir: clone uniforms and load_coords into each successor")
  Reviewed-by: Andreas Baierl <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Fix 64-bit shift in scheduler spilling | Connor Abbott | 2019-09-24 | 1 | -2/+2
  There are 64 physical registers so the shift must be 64 bits.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
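The class of bug fixed here can be illustrated with a small sketch. The helper name `physreg_bit` is hypothetical; the only detail taken from the commit message is that there are 64 physical registers.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PHYSREGS 64  /* per the commit message */

/* Hypothetical helper: bit for one physical register in a 64-bit mask.
 * Writing '1 << reg' would shift a plain int, which is undefined once
 * the result no longer fits in int, so the constant must be 64 bits. */
static inline uint64_t physreg_bit(unsigned reg)
{
    assert(reg < NUM_PHYSREGS);
    return UINT64_C(1) << reg;
}
```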
* lima/gpir: Don't emit movs when translating from NIR | Connor Abbott | 2019-09-24 | 1 | -36/+50
  The scheduler doesn't expect them. To do this, I had to refactor the registration part of gpir_node_create_dest() to be separate from creating and inserting the node, since the last two now aren't done when handling moves. This adds more code but creates the possibility of automatically inserting input dependencies when inserting nodes, similar to what's done in NIR with the use-def lists (this isn't done yet).

  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Fix postlog2 fixup handling | Connor Abbott | 2019-09-24 | 1 | -11/+12
  We guarantee that a complex1 op is always used by postlog2 directly by rewriting the postlog2 op to be a move when there would be a move inserted between them. But we weren't doing this in all circumstances where there might be a move. Move the logic to place_move() so that it always happens. Fixes a few log tests that happened to start failing due to changes in the register allocator leading to a different scheduling order.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Use registers for values live in multiple blocks | Connor Abbott | 2019-09-24 | 7 | -156/+648
  This commit adds the framework for cross-basic-block register allocation. Like ARM's compiler, we assume that the value registers aren't usable across branches, which means we have to use physical registers to store any value that crosses a basic block. There are three parts to this:

  1. When translating from NIR, we rely on the NIR out-of-ssa pass to coalesce values into registers. We insert store_reg instructions for values used in more than one basic block, and load_reg instructions for values not defined in the same basic block (or defined after their use, for loops). So by the time we've translated out of NIR we've already split things into values (which are only used in the same basic block) and registers (which are only used in different basic blocks than where they're defined).

  2. We allocate the registers at the same time that we allocate the values, before the final scheduler. Unlike the values, where the assigned color is fake, we assign the actual physical index & component to physregs at this stage. load_reg and store_reg are treated as moves in the allocator and when creating write-after-read dependencies.

  3. Finally, in the main scheduler we have to avoid overwriting existing live physregs when spilling. First, we have to tell the scheduler which physical registers are live at the end of each block, to avoid overwriting those. If a register is only live at the beginning, we can reuse it for spilling after the last original use in the final program happens, i.e. before any original use is scheduled, but we have to be careful to add the proper dependencies so that the spill write is scheduled before the original reads. To handle this we repurpose reg_link for uses to be used by the scheduler.

  A few register-related things copied over from NIR or from other drivers can be dropped.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Support branch instructions | Connor Abbott | 2019-09-24 | 6 | -78/+102
  Because branch conditions have to be in the pass slot, there is no unconditional branch, and realistically the pass slot has to contain a move when branching (there's nothing it does that would be useful for operating on booleans, so we can't use it for anything when computing the branch condition), we put the branch instruction in the pass slot and at codegen time turn it into a move of the branch condition. This means that it doesn't have to be special-cased like store instructions are in the scheduler. Because of this decision we can remove the half-implemented BRANCH codegen slot. Finally, we (ab)use the existing schedule_first mechanism to make sure that branches are always last in the basic block.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Only try to place actual children | Connor Abbott | 2019-09-24 | 1 | -1/+1
  When picking a node to be scheduled, we try to schedule its children as well. But we shouldn't try to schedule nodes which only have a fake dependency on the original node, since this isn't the point of scheduling children at the same time and can break some expectations of the rest of the code.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Fix compiler warning | Connor Abbott | 2019-09-24 | 1 | -1/+1
  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima: remove partial clear support from pipe->clear() | Erico Nunes | 2019-09-23 | 1 | -93/+5
  pipe->clear() is not called for partial clears, which mesa emulates by drawing a quad. Furthermore, drivers should not use rasterizer state information for scissor information (which was being used to handle the partial clears). So, remove the partial clear support since it was not supposed to be handled by pipe->clear() anyway.

  This fixes issues with clearing after switching to different sized framebuffers.

  Signed-off-by: Erico Nunes <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Reviewed-by: Qiang Yu <[email protected]>
* lima: implement BO cache | Vasily Khoruzhick | 2019-09-22 | 8 | -30/+212
  Allocating BOs is expensive, so we should avoid doing that by caching freed BOs.

  The BO cache is modelled after the one in the v3d driver and works as follows:
  - in lima_bo_create(), check if we have a matching BO in the cache and return it if there is one; allocate a new BO otherwise.
  - in lima_bo_unreference() (renamed from lima_bo_free()), put the BO in the cache instead of freeing it and remove all stale BOs from the cache.

  Reviewed-by: Qiang Yu <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
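The two steps above can be sketched with a deliberately simplified cache. Everything here is an assumption for illustration: the struct layout, the fixed slot count, the exact-size match and the age limit are hypothetical, and neither the lima nor the v3d implementation is reproduced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_SLOTS 8   /* hypothetical capacity */
#define STALE_AFTER 2   /* hypothetical age limit, in seconds */

struct bo {
    uint32_t size;
    int64_t cached_at;  /* time the BO entered the cache */
};

struct bo_cache {
    struct bo *slots[CACHE_SLOTS];
};

/* Stand-in for the real (expensive) allocation. */
static struct bo *bo_alloc(uint32_t size)
{
    struct bo *bo = calloc(1, sizeof(*bo));
    if (bo)
        bo->size = size;
    return bo;
}

/* Create path: reuse a cached BO of matching size if there is one. */
static struct bo *bo_create(struct bo_cache *cache, uint32_t size)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct bo *bo = cache->slots[i];
        if (bo && bo->size == size) {
            cache->slots[i] = NULL;
            return bo;
        }
    }
    return bo_alloc(size);
}

/* Unreference path: free stale entries, then cache 'bo' if a slot is
 * available; otherwise free it. */
static void bo_unreference(struct bo_cache *cache, struct bo *bo, int64_t now)
{
    size_t free_slot = CACHE_SLOTS;

    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct bo *old = cache->slots[i];
        if (old && now - old->cached_at > STALE_AFTER) {
            free(old);
            cache->slots[i] = NULL;
        }
        if (!cache->slots[i] && free_slot == CACHE_SLOTS)
            free_slot = i;
    }

    if (free_slot == CACHE_SLOTS) {
        free(bo);
        return;
    }
    bo->cached_at = now;
    cache->slots[free_slot] = bo;
}
```

Evicting stale entries on the unreference path keeps the cache from holding memory indefinitely without needing a separate timer.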
* lima: use 0 to poll if BO is busy in lima_bo_wait() | Vasily Khoruzhick | 2019-09-22 | 1 | -1/+7
  os_time_get_absolute_timeout(0) returns the current time, while the kernel driver expects 0 as the value to poll BO status and return immediately.

  Fix it by setting abs_timeout to 0 if timeout_ns is 0.

  Reviewed-by: Qiang Yu <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
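A sketch of the special case described above. The function name is hypothetical and `now_ns` stands in for the current time that the real code would obtain itself; the saturation on overflow is an added assumption, not something the commit message states.

```c
#include <stdint.h>

/* Hypothetical helper: absolute timeout to hand to the kernel.
 * 'now_ns' is assumed to be a non-negative monotonic timestamp. */
static inline int64_t bo_wait_abs_timeout(uint64_t timeout_ns, int64_t now_ns)
{
    /* 0 means "poll and return immediately", so it must be passed
     * through as 0 rather than converted to the current time. */
    if (timeout_ns == 0)
        return 0;

    /* Saturate instead of overflowing for very long timeouts. */
    if (timeout_ns > (uint64_t)(INT64_MAX - now_ns))
        return INT64_MAX;

    return now_ns + (int64_t)timeout_ns;
}
```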
* lima: move damage bound build to resource | Qiang Yu | 2019-09-23 | 3 | -22/+41
  Reviewed-and-Tested-by: Vasily Khoruzhick <[email protected]>
  Signed-off-by: Qiang Yu <[email protected]>
* lima: don't use damage system when full damage | Qiang Yu | 2019-09-23 | 1 | -0/+14
  Sometimes weston sets a full damage region. It is more efficient to use the cached PP stream instead of dynamically creating one.

  Reviewed-and-Tested-by: Vasily Khoruzhick <[email protected]>
  Signed-off-by: Qiang Yu <[email protected]>
* lima: implement EGL_KHR_partial_update | Qiang Yu | 2019-09-23 | 5 | -65/+86
  This extension sets a damage region for each buffer swap, which can be used to reduce buffer reload cost by feeding only the damage region's tile buffer addresses to the PP.

  Reviewed-and-Tested-by: Vasily Khoruzhick <[email protected]>
  Signed-off-by: Qiang Yu <[email protected]>
* lima: fix PLBU viewport configuration | Icenowy Zheng | 2019-09-22 | 3 | -21/+21
  The PLBU expects the coordinates of the viewport's 4 borders; however, currently we're feeding it the coordinate of the left-bottom point and the size, which leads to misrendering when the left-bottom point is not (0,0).

  Change the macros for the viewport PLBU command, and the data fed to it.

  The code to calculate the 4 borders is ported from Panfrost.

  Signed-off-by: Icenowy Zheng <[email protected]>
  Reviewed-by: Qiang Yu <[email protected]>
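A sketch of deriving the 4 borders, assuming the viewport is available in scale/translate form (as Gallium's viewport state provides). The struct and function names are hypothetical, and this is not the code ported from Panfrost.

```c
/* Hypothetical types: a 2D viewport transform and the 4 borders. */
struct viewport_xform {
    float scale[2];
    float translate[2];
};

struct viewport_bounds {
    float left, right, bottom, top;
};

static inline float absf(float v)
{
    return v < 0.0f ? -v : v;
}

/* The viewport is centered on 'translate' and extends |scale| in each
 * direction; the absolute value handles a Y-flipped viewport. */
static inline struct viewport_bounds
viewport_to_bounds(const struct viewport_xform *vp)
{
    struct viewport_bounds b;

    b.left   = vp->translate[0] - absf(vp->scale[0]);
    b.right  = vp->translate[0] + absf(vp->scale[0]);
    b.bottom = vp->translate[1] - absf(vp->scale[1]);
    b.top    = vp->translate[1] + absf(vp->scale[1]);
    return b;
}
```

For a 200x100 viewport whose left-bottom point is (100, 50), the transform is scale (100, 50) and translate (200, 100), which yields left 100, right 300, bottom 50 and top 150; passing (100, 50, 200, 100) as if it were borders is what caused the misrendering.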
* lima: reset scissor state if scissor test is disabled | Icenowy Zheng | 2019-09-17 | 1 | -0/+4
  The PLBU seems to preserve scissor state between draws, and since lima doesn't emit PLBU_CMD_SCISSORS() if the scissor test is disabled, it uses state from the previous draw.

  Fix it by emitting PLBU_CMD_SCISSORS() for the full fb if the scissor test is disabled.

  Signed-off-by: Icenowy Zheng <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Reviewed-by: Qiang Yu <[email protected]>
* lima: add standalone disassembler with primitive MBS parser | Vasily Khoruzhick | 2019-09-16 | 2 | -0/+219
  It's useful for analyzing shader binaries produced by the ARM Mali offline compiler, which outputs files in MBS format.

  MBS stands for Mali binary shader. Currently the parser just extracts the shader binary and ignores everything else.

  Reviewed-and-tested-by: Connor Abbott <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: Add undef handling | Andreas Baierl | 2019-09-13 | 4 | -4/+24
  Add a ppir dummy node for nir_ssa_undef_instr, create a reg for it and mark it as undefined, so that regalloc can set it non-interfering to avoid register pressure.

  Signed-off-by: Andreas Baierl <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Reviewed-by: Erico Nunes <[email protected]>
* lima/ppir: Rename ppir_op_dummy to ppir_op_undef | Andreas Baierl | 2019-09-13 | 3 | -5/+5
  Signed-off-by: Andreas Baierl <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Reviewed-by: Erico Nunes <[email protected]>
* meson: don't generate file into subdirs | Dylan Baker | 2019-09-11 | 1 | -1/+1
  This is unsupported by meson and may become a hard error in the future.

  Fixes: 5adfc8602c639827af0ba9a1059bd165a3ae49e7 ("lima/ppir: move sin/cos input scaling into NIR")
  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima: set .out_sync field of req in lima_submit_start() | Vasily Khoruzhick | 2019-09-10 | 1 | -0/+1
  It looks like .out_sync wasn't set in lima_submit_start(); as a result, the submit completion fence was never signalled.

  Reviewed-by: Qiang Yu <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: drop fge/flt/feq/fne options | Vasily Khoruzhick | 2019-09-09 | 1 | -4/+0
  These are supposed to be lowered into sge/slt/seq/sne equivalents.

  Reviewed-by: Connor Abbott <[email protected]>
  Reviewed-by: Erico Nunes <[email protected]>
  Reviewed-by: Andreas Baierl <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima: run opt_algebraic between int_to_float and bool_to_float for vs | Vasily Khoruzhick | 2019-09-09 | 1 | -4/+5
  int_to_float emits ftrunc, and ftrunc lowering generates bool ops.

  Reviewed-by: Connor Abbott <[email protected]>
  Reviewed-by: Erico Nunes <[email protected]>
  Reviewed-by: Andreas Baierl <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: fix warning in gpir disassembler | Vasily Khoruzhick | 2019-09-09 | 1 | -1/+1
  Fixes the following warning:

    ../src/gallium/drivers/lima/ir/gp/disasm.c: In function ‘print_src’:
    ../src/gallium/drivers/lima/ir/gp/disasm.c:241:20: warning: array subscript 28 is above array bounds of ‘char[5]’ [-Warray-bounds]
      241 | "xyzw"[src - gpir_codegen_src_attrib_x]);

  Reviewed-by: Connor Abbott <[email protected]>
  Reviewed-by: Erico Nunes <[email protected]>
  Reviewed-by: Andreas Baierl <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: lower fceil | Vasily Khoruzhick | 2019-09-09 | 1 | -0/+1
  GP doesn't support fceil so we need to lower it.

  Reviewed-by: Connor Abbott <[email protected]>
  Reviewed-by: Erico Nunes <[email protected]>
  Reviewed-by: Andreas Baierl <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Disallow moves for schedule_first nodes | Connor Abbott | 2019-09-09 | 1 | -1/+5
  The entire point of schedule_first is that the node has to be scheduled as soon as possible without any moves because it doesn't produce a proper floating-point value, or its value changes depending on where you read it. We were still introducing a move for preexp2 in some cases though, even if it got scheduled as soon as possible, which broke some exp() tests. Fix that.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Tested-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Fix fake dep handling for schedule_first nodes | Connor Abbott | 2019-09-09 | 2 | -10/+30
  The whole point of schedule_first nodes is that they need to be scheduled as soon as possible, so if a schedule_first node is the successor in a fake dependency that prevents it from being scheduled after its parent, that can cause problems. We need to add these fake dependencies to the parent as well, and we need to guarantee that the pre-RA scheduler puts schedule_first nodes right before their parents in order to prevent this from adding cycles to the dependency graph.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Tested-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Fix schedule_first insertion logic | Connor Abbott | 2019-09-09 | 1 | -2/+3
  The idea was to make sure schedule_first nodes were always first in the ready list. I made sure they were inserted first, but not that other nodes wouldn't later be scheduled ahead of them.

  Fixes [email protected]@execution@built-in-functions@vs-exp-float and probably others.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Tested-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Ignore unscheduled successors in can_use_complex() | Connor Abbott | 2019-09-09 | 1 | -1/+2
  The point of the function is to avoid creating a complex move which is used by certain slots in the next instruction, but unscheduled successors will never be in the next instruction. Found while debugging a crash that the previous commit fixed.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Tested-by: Vasily Khoruzhick <[email protected]>
* lima/gpir: Do all lowerings before rsched | Connor Abbott | 2019-09-09 | 3 | -23/+2
  The scheduler assumes that load nodes are always duplicated so that they can always be scheduled eventually and therefore they never need to be spilled. But some lowerings were running after the pre-RA scheduler, whereas duplication has to happen before then since it's needed for the scheduler to do a better job reducing register pressure. This meant that lowerings were introducing multiple uses of a load instruction, which broke the scheduler's expectation and resulted in infinite loops in situations where the only nodes available to spill were load nodes. Spilling load nodes would be silly, so we want to fix the lowerings rather than the scheduler. Just do all lowerings before the pre-RA scheduler, which also helps with reducing pressure since the scheduler can more accurately compute the pressure.

  Fixes lima/mesa#104.

  Reviewed-by: Vasily Khoruzhick <[email protected]>
  Tested-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: don't lower phis to scalar | Vasily Khoruzhick | 2019-09-05 | 1 | -1/+0
  Utgard PP is a vec4 architecture, so lowering phis to scalars increases instruction count and potentially interferes with spilling.

  Tested-by: Andreas Baierl <[email protected]>
  Reviewed-by: Eric Anholt <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: don't lower vector {b,f}csel to scalar if condition is scalar | Vasily Khoruzhick | 2019-09-06 | 1 | -5/+21
  Utgard PP has a vector fcsel operation, but its condition is scalar. Add a filtering callback that checks whether the {b,f}csel condition is not scalar, so that {b,f}csel is lowered to scalar only in this case.

  Reviewed-by: Qiang Yu <[email protected]>
  Reviewed-by: Eric Anholt <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* nir: allow specifying filter callback in lower_alu_to_scalar | Vasily Khoruzhick | 2019-09-06 | 1 | -16/+30
  A set of opcodes doesn't have enough flexibility in certain cases. E.g. Utgard PP has a vector conditional select operation, but the condition is always scalar. Lowering all the vector selects to scalar increases the instruction count, so we need a way to filter only those ops that can't be handled in hardware.

  Reviewed-by: Qiang Yu <[email protected]>
  Reviewed-by: Eric Anholt <[email protected]>
  Reviewed-by: Jason Ekstrand <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
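The filter-callback idea from this commit and the {b,f}csel commit listed just above it can be sketched with stand-in types. Nothing below is NIR API: `struct alu_op`, `lower_filter_cb` and both functions are hypothetical, and non-select ops are treated as always lowerable purely to keep the example small.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ALU instruction; not a NIR type. */
struct alu_op {
    bool is_select;            /* a {b,f}csel-style op */
    unsigned cond_components;  /* width of the select condition */
    unsigned dest_components;  /* width of the result */
};

/* Hypothetical callback type: return true to lower this op. */
typedef bool (*lower_filter_cb)(const struct alu_op *op, const void *data);

/* Counts how many vector ops a lowering pass would scalarize. A NULL
 * callback means "lower everything", as with a plain opcode set. */
static size_t count_ops_to_lower(const struct alu_op *ops, size_t n,
                                 lower_filter_cb cb, const void *data)
{
    size_t lowered = 0;

    for (size_t i = 0; i < n; i++) {
        if (ops[i].dest_components <= 1)
            continue;  /* already scalar */
        if (cb && !cb(&ops[i], data))
            continue;  /* the hardware can handle it as a vector */
        lowered++;
    }
    return lowered;
}

/* Keep vector selects whose condition is scalar; lower the rest. */
static bool lower_unless_scalar_cond_select(const struct alu_op *op,
                                            const void *data)
{
    (void)data;
    if (op->is_select)
        return op->cond_components > 1;
    return true;
}
```

A per-instruction callback can inspect operand widths, which a fixed set of opcodes cannot express.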
* lima/ppir: improve regalloc spill cost calculation | Erico Nunes | 2019-09-05 | 1 | -5/+49
  Now that spilling ops can be inserted into existing instructions, it makes sense to increase cost to spill registers that would cause the creation of a new instruction. Experimental results showed that penalizing too much due to this caused worse results, however it is beneficial as a tie resolver between registers with the same number of components.

  Signed-off-by: Erico Nunes <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: optimizations in regalloc spilling code | Erico Nunes | 2019-09-05 | 1 | -90/+88
  Avoid creating unnecessary instructions for the load/store temp nodes when not required, to further reduce register pressure.

  The store_temp operation seems to be unable to do any spilling. At least the offline shader seems to never output instructions accessing swizzled components, and attempting to output that in ppir results in errors. So, force spilled registers to allocate a full vec4 register. This seems to be the optimal way as it is possible to always keep stores and temps in a single instruction that can be pipelined.

  Signed-off-by: Erico Nunes <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: mark regalloc created ssa unspillable | Erico Nunes | 2019-09-05 | 1 | -0/+1
  One ssa created in the spilling code in ppir_update_spilled_src was not properly being marked 'spilled', which made it a candidate for future spilling attempts. Since it was being inserted by the spilling code itself, let's mark it unspillable to avoid an infinite spilling loop.

  Signed-off-by: Erico Nunes <[email protected]>
  Reviewed-by: Vasily Khoruzhick <[email protected]>
* lima: Return fence unconditionally | Roman Stratiienko | 2019-09-04 | 1 | -4/+2
  Based on the vc4 implementation.

  Fixes Android RenderEngine::flush() routine: android.googlesource.com/platform/frameworks/native/+/refs/tags/android-o-mr1-iot-release-smart-clock-fcs/services/surfaceflinger/RenderEngine/RenderEngine.cpp#225

  Signed-off-by: Roman Stratiienko <[email protected]>
  Reviewed-by: Qiang Yu <[email protected]>
* lima/ppir: clone uniforms and load_coords into each successor | Vasily Khoruzhick | 2019-09-04 | 4 | -41/+155
  Try a more aggressive approach with cloning uniform and coord loads.

  A uniform load can be inserted into any instruction, so let's do that. ARM's site claims that the penalty for a cache miss is one clock, so we don't lose anything if we merge it into the instruction that uses the result. As a side effect we can also pipeline it and thus decrease reg pressure.

  Do the same for varyings that hold texture coords, but for a different reason: it looks like there's a special path for coords that increases precision if the varying that holds them is pipelined. If we don't pipeline it and load coords from a register, the precision is fp16 and thus only 10 bits, which is not enough to accurately sample textures of size 1024 or larger.

  Since an instruction can hold only one uniform load and one varying load, node_to_instr now creates a move using the helper introduced in the previous commit if the slot is already taken.

  As a side effect of this change we can also try to pipeline texture loads and create a move if the attempt fails.

  Reviewed-by: Erico Nunes <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
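The fp16 precision claim above can be checked with a little arithmetic. fp16 stores 10 explicit mantissa bits, so for a normalized coordinate in [0.5, 1) adjacent representable values are 2^-11 apart. The helper below (hypothetical name) counts how many such steps fit in one texel; it is a back-of-the-envelope sketch, not driver code.

```c
#include <assert.h>

/* Number of distinct fp16 values per texel for a normalized
 * coordinate in [0.5, 1), where the fp16 spacing is 2^-11. */
static inline double fp16_steps_per_texel(unsigned tex_size)
{
    assert(tex_size > 0);

    const double fp16_spacing = 1.0 / 2048.0;        /* 2^-11 */
    const double texel = 1.0 / (double)tex_size;     /* width of one texel */

    return texel / fp16_spacing;
}
```

At size 1024 only 2 coordinate values fall inside each texel in the upper half of the texture, which leaves almost no sub-texel precision for filtering; at size 4096 the spacing is coarser than a texel.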
* lima/ppir: don't assume that load coords gets value from register | Vasily Khoruzhick | 2019-09-04 | 3 | -9/+13
  It can load the value from a varying directly as well.

  Also, load_regs is the only op that has a source, so add a src_num field to the load node and set it accordingly.

  Reviewed-by: Erico Nunes <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: add common helper for creating movs | Vasily Khoruzhick | 2019-09-04 | 3 | -49/+41
  Introduce a common helper for creating movs to avoid code duplication.

  Reviewed-by: Erico Nunes <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima: fix texture descriptor issues | Vasily Khoruzhick | 2019-08-28 | 2 | -17/+13
  It looks like the initial RE was wrong and some fields have a different purpose. I.e. there's no "disable_mipmap" field; it's actually part of another field that selects mipmap filtering.

  Also fix the layout position.

  Reviewed-by: Qiang Yu <[email protected]>
  Signed-off-by: Vasily Khoruzhick <[email protected]>
* lima/ppir: enable vectorize optimization | Erico Nunes | 2019-08-25 | 1 | -0/+5
  PP has vector units and some operations can be optimized when bundled together. Benchmarking this with piglit shaders shows that the instruction count can be greatly reduced in many examples with vectorize.

  Signed-off-by: Erico Nunes <[email protected]>
  Reviewed-by: Qiang Yu <[email protected]>
* lima/ppir: lower selects to scalars | Erico Nunes | 2019-08-25 | 1 | -0/+5
  NIR vec4 fcsel assumes that each component of the condition will be used to select the same component from the options, but PP can't implement that since it only has 1 component for the condition.

  Signed-off-by: Erico Nunes <[email protected]>
  Reviewed-by: Qiang Yu <[email protected]>