iris: Mark render batches as non-recoverable.

Adapted from Chris Wilson's patch. The comment is largely his.

Currently, when iris hangs the GPU, it continues sending batches that incrementally update the state, assuming the state is preserved across batches. However, the kernel's GPU reset support reinitializes the guilty context to the default GPU state (reasonably not wanting to trust the current state). This resets critical things like STATE_BASE_ADDRESS, turning memory accesses in all subsequent batches into garbage and almost certainly causing more hangs until we're banned or we kill the machine.

We now ask the kernel to ban our render context immediately, so we notice we've gone off the rails as fast as possible. Eventually, we'll attempt to recover and continue. For now, we just avoid torching the GPU over and over.
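
The failure mode described in this message can be pictured with a small toy model. This is only a sketch and not real i915 or iris code: `toy_context`, `toy_gpu_reset` and `toy_submit_incremental_batch` are hypothetical names, a base address of 0 stands in for "default GPU state", and reporting a lost context as -EIO is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; none of these names exist in i915 or iris. */
struct toy_context {
   uint64_t state_base_address; /* 0 models the default (reset) state */
   bool recoverable;            /* models I915_CONTEXT_PARAM_RECOVERABLE */
   bool banned;
};

/* Models the kernel's reset after a hang: the guilty context goes back
 * to the default state, and a non-recoverable context is banned.
 */
static void
toy_gpu_reset(struct toy_context *ctx)
{
   ctx->state_base_address = 0;
   if (!ctx->recoverable)
      ctx->banned = true;
}

/* Models a batch that inherits STATE_BASE_ADDRESS from earlier batches.
 * Returns 0 if the batch runs, 1 if it would hang again because the
 * inherited state is gone, or -EIO (assumed) if the context is banned.
 */
static int
toy_submit_incremental_batch(const struct toy_context *ctx)
{
   if (ctx->banned)
      return -EIO;
   return ctx->state_base_address == 0 ? 1 : 0;
}
```

In this model a recoverable context keeps hanging after every reset, while a non-recoverable one fails fast on the next submission, which is the behavior this patch opts into.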
author: Kenneth Graunke <[email protected]> 2019-05-07 23:03:46 -0700
committer: Kenneth Graunke <[email protected]> 2019-05-09 16:49:07 -0700
commit: c3701e90707805d622cf51a85b02c53b141f945c (patch)
tree: 937ea2e6e7c55c134c7d94fa46b4e623bda92588 /src
parent: 9faf218b8cdda81b5813e935d5ba6e0d57706a03 (diff)
1 files changed, 22 insertions, 0 deletions
diff --git a/src/gallium/drivers/iris/iris_bufmgr.c b/src/gallium/drivers/iris/iris_bufmgr.c
index 5b807e0fbc8..808f20d537d 100644
--- a/src/gallium/drivers/iris/iris_bufmgr.c
+++ b/src/gallium/drivers/iris/iris_bufmgr.c
@@ -1412,6 +1412,28 @@ iris_create_hw_context(struct iris_bufmgr *bufmgr)
       return 0;
    }
 
+   /* Upon declaring a GPU hang, the kernel will zap the guilty context
+    * back to the default logical HW state and attempt to continue on to
+    * our next submitted batchbuffer.  However, our render batches assume
+    * the previous GPU state is preserved, and only emit commands needed
+    * to incrementally change that state.  In particular, we inherit the
+    * STATE_BASE_ADDRESS and PIPELINE_SELECT settings, which are critical.
+    * With default base addresses, our next batches will almost certainly
+    * cause more GPU hangs, leading to repeated hangs until we're banned
+    * or the machine is dead.
+    *
+    * Here we tell the kernel not to attempt to recover our context but
+    * immediately (on the next batchbuffer submission) report that the
+    * context is lost, and we will do the recovery ourselves.  Ideally,
+    * we'll have two lost batches instead of a continual stream of hangs.
+    */
+   struct drm_i915_gem_context_param p = {
+      .ctx_id = create.ctx_id,
+      .param = I915_CONTEXT_PARAM_RECOVERABLE,
+      .value = false,
+   };
+   drmIoctl(bufmgr->fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
+
    return create.ctx_id;
 }
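
The patch ignores the return value of the SETPARAM ioctl. For readers who want to detect kernels that reject the parameter, a variant with error reporting might look like the sketch below. `retry_ioctl` and `mark_context_non_recoverable` are hypothetical helpers, not part of iris; the retry loop mirrors what libdrm's `drmIoctl` does, and the kernel uapi header is assumed to be installed as `<drm/i915_drm.h>`.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Retry on EINTR/EAGAIN, as libdrm's drmIoctl() does. */
static int
retry_ioctl(int fd, unsigned long request, void *arg)
{
   int ret;

   do {
      ret = ioctl(fd, request, arg);
   } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

   return ret;
}

/* Hypothetical helper: ask the kernel not to recover this context
 * after a hang.  Returns 0 on success or a negative errno value.
 */
static int
mark_context_non_recoverable(int fd, uint32_t ctx_id)
{
   struct drm_i915_gem_context_param p = {
      .ctx_id = ctx_id,
      .param = I915_CONTEXT_PARAM_RECOVERABLE,
      .value = 0, /* false: do not attempt recovery */
   };

   if (retry_ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p) == -1)
      return -errno;

   return 0;
}
```

A caller could log a nonzero result and carry on, since failing to set the parameter leaves the context in the kernel's default recoverable mode rather than breaking context creation.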