From 27c9272c3d72894ac7675cb542fdaccf1b89fd41 Mon Sep 17 00:00:00 2001 From: Eric Anholt Date: Tue, 16 Jun 2020 15:51:52 -0700 Subject: docs: Move the gitlab-ci docs to RST. I tried not to edit too much meaning in the process, but I did shuffle some stuff around to work as structured documentation. Reviewed-by: Eric Engestrom Reviewed-by: Erik Faye-Lund Part-of: --- .gitlab-ci/README.md | 212 ---------------------------------------- .gitlab-ci/bare-metal/README.md | 124 ----------------------- docs/ci/LAVA.rst | 86 ++++++++++++++++ docs/ci/bare-metal.rst | 123 +++++++++++++++++++++++ docs/ci/docker.rst | 75 ++++++++++++++ docs/ci/index.rst | 54 +++++++++- 6 files changed, 337 insertions(+), 337 deletions(-) delete mode 100644 .gitlab-ci/README.md delete mode 100644 .gitlab-ci/bare-metal/README.md create mode 100644 docs/ci/LAVA.rst create mode 100644 docs/ci/bare-metal.rst create mode 100644 docs/ci/docker.rst diff --git a/.gitlab-ci/README.md b/.gitlab-ci/README.md deleted file mode 100644 index 1150e182d12..00000000000 --- a/.gitlab-ci/README.md +++ /dev/null @@ -1,212 +0,0 @@ -# Mesa testing - -The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge -testing of Mesa drivers on various platforms, so that we can ensure no -regressions are merged, as long as developers are merging code using -marge-bot. - -There are currently 4 automated testing systems deployed for Mesa. -LAVA and gitlab-runner on the DUTs are used in pre-merge testing and -are described in this document. Managing bare metal using -gitlab-runner is described under [bare-metal/README.md]. Intel also -has a jenkins-based CI system with restricted access that isn't -connected to gitlab. - -## Mesa testing using LAVA - -[LAVA](https://lavasoftware.org/) is a system for functional testing -of boards including deploying custom bootloaders and kernels. This is -particularly relevant to testing Mesa because we often need to change -kernels for UAPI changes (and this lets us do full testing of a new -kernel during development), and our workloads can easily take down -boards when mistakes are made (kernel oopses, OOMs that take out -critical system services). - -### Mesa-LAVA software architecture - -The gitlab-runner will run on some host that has access to the LAVA -lab, with tags like "lava-mesa-boardname" to control only taking in -jobs for the hardware that the LAVA lab contains. The gitlab-runner -spawns a docker container with lava-cli in it, and connects to the -LAVA lab using a predefined token to submit jobs under a specific -device type. - -The LAVA instance manages scheduling those jobs to the boards present. -For a job, it will deploy the kernel, device tree, and the ramdisk -containing the CTS. - -### Deploying a new Mesa-LAVA lab - -You'll want to start with setting up your LAVA instance and getting -some boards booting using test jobs. Start with the stock QEMU -examples to make sure your instance works at all. Then, you'll need -to define your actual boards. - -The device type in lava-gitlab-ci.yml is the device type you create in -your LAVA instance, which doesn't have to match the board's name in -`/etc/lava-dispatcher/device-types`. You create your boards under -that device type and the Mesa jobs will be scheduled to any of them. -Instantiate your boards by creating them in the UI or at the command -line attached to that device type, then populate their dictionary -(using an "extends" line probably referencing the board's template in -`/etc/lava-dispatcher/device-types`). 
Now, go find a relevant -healthcheck job for your board as a test job definition, or cobble -something together from a board that boots using the same boot_method -and some public images, and figure out how to get your boards booting. - -Once you can boot your board using a custom job definition, it's time -to connect Mesa CI to it. Install gitlab-runner and register as a -shared runner (you'll need a gitlab admin for help with this). The -runner *must* have a tag (like "mesa-lava-db410c") to restrict the -jobs it takes or it will grab random jobs from tasks across fd.o, and -your runner isn't ready for that. - -The runner will be running an ARM docker image (we haven't done any -x86 LAVA yet, so that isn't documented). If your host for the -gitlab-runner is x86, then you'll need to install qemu-user-static and -the binfmt support. - -The docker image will need access to the lava instance. If it's on a -public network it should be fine. If you're running the LAVA instance -on localhost, you'll need to set `network_mode="host"` in -`/etc/gitlab-runner/config.toml` so it can access localhost. Create a -gitlab-runner user in your LAVA instance, log in under that user on -the web interface, and create an API token. Copy that into a -`lavacli.yaml`: - -``` -default: - token: - uri: - username: gitlab-runner -``` - -Add a volume mount of that `lavacli.yaml` to -`/etc/gitlab-runner/config.toml` so that the docker container can -access it. You probably have a `volumes = ["/cache"]` already, so now it would be - -``` - volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"] -``` - -Note that this token is visible to anybody that can submit MRs to -Mesa! It is not an actual secret. We could just bake it into the -gitlab CI yml, but this way the current method of connecting to the -LAVA instance is separated from the Mesa branches (particularly -relevant as we have many stable branches all using CI). - -Now it's time to define your test runner in -`.gitlab-ci/lava-gitlab-ci.yml`. - -## Mesa testing using gitlab-runner on DUTs - -### Software architecture - -For freedreno and llvmpipe CI, we're using gitlab-runner on the test -devices (DUTs), cached docker containers with VK-GL-CTS, and the -normal shared x86_64 runners to build the Mesa drivers to be run -inside of those containers on the DUTs. - -The docker containers are rebuilt from the debian-install.sh script -when DEBIAN\_TAG is changed in .gitlab-ci.yml, and -debian-test-install.sh when DEBIAN\_ARM64\_TAG is changed in -.gitlab-ci.yml. The resulting images are around 500MB, and are -expected to change approximately weekly (though an individual -developer working on them may produce many more images while trying to -come up with a working MR!). - -gitlab-runner is a client that polls gitlab.freedesktop.org for -available jobs, with no inbound networking requirements. Jobs can -have tags, so we can have DUT-specific jobs that only run on runners -with that tag marked in the gitlab UI. - -Since dEQP takes a long time to run, we mark the job as "parallel" at -some level, which spawns multiple jobs from one definition, and then -deqp-runner.sh takes the corresponding fraction of the test list for -that job. - -To reduce dEQP runtime (or avoid tests with unreliable results), a -deqp-runner.sh invocation can provide a list of tests to skip. 
If -your driver is not yet conformant, you can pass a list of expected -failures, and the job will only fail on tests that aren't listed (look -at the job's log for which specific tests failed). - -### DUT requirements - -#### DUTs must have a stable kernel and GPU reset. - -If the system goes down during a test run, that job will eventually -time out and fail (default 1 hour). However, if the kernel can't -reliably reset the GPU on failure, bugs in one MR may leak into -spurious failures in another MR. This would be an unacceptable impact -on Mesa developers working on other drivers. - -#### DUTs must be able to run docker - -The Mesa gitlab-runner based test architecture is built around docker, -so that we can cache the debian package installation and CTS build -step across multiple test runs. Since the images are large and change -approximately weekly, the DUTs also need to be running some script to -prune stale docker images periodically in order to not run out of disk -space as we rev those containers (perhaps [this -script](https://gitlab.com/gitlab-org/gitlab-runner/issues/2980#note_169233611)). - -Note that docker doesn't allow containers to be stored on NFS, and -doesn't allow multiple docker daemons to interact with the same -network block device, so you will probably need some sort of physical -storage on your DUTs. - -#### DUTs must be public - -By including your device in .gitlab-ci.yml, you're effectively letting -anyone on the internet run code on your device. docker containers may -provide some limited protection, but how much you trust that and what -you do to mitigate hostile access is up to you. - -#### DUTs must expose the dri device nodes to the containers. - -Obviously, to get access to the HW, we need to pass the render node -through. This is done by adding `devices = ["/dev/dri"]` to the -`runners.docker` section of /etc/gitlab-runner/config.toml. - -### HW CI farm expectations - -To make sure that testing of one vendor's drivers doesn't block -unrelated work by other vendors, we require that a given driver's test -farm produces a spurious failure no more than once a week. If every -driver had CI and failed once a week, we would be seeing someone's -code getting blocked on a spurious failure daily, which is an -unacceptable cost to the project. - -Additionally, the test farm needs to be able to provide a short enough -turnaround time that people can regularly use the "Merge when pipeline -succeeds" button successfully (until we get -[marge-bot](https://github.com/smarkets/marge-bot) in place on -freedesktop.org). As a result, we require that the test farm be able -to handle a whole pipeline's worth of jobs in less than 5 minutes (to -compare, the build stage is about 10 minutes, if you could get all -your jobs scheduled on the shared runners in time.). - -If a test farm is short the HW to provide these guarantees, consider -dropping tests to reduce runtime. -`VK-GL-CTS/scripts/log/bottleneck_report.py` can help you find what -tests were slow in a `results.qpa` file. Or, you can have a job with -no `parallel` field set and: - -``` - variables: - CI_NODE_INDEX: 1 - CI_NODE_TOTAL: 10 -``` - -to just run 1/10th of the test list. - -If a HW CI farm goes offline (network dies and all CI pipelines end up -stalled) or its runners are consistenly spuriously failing (disk -full?), and the maintainer is not immediately available to fix the -issue, please push through an MR disabling that farm's jobs by adding -'.' 
to the front of the jobs names until the maintainer can bring -things back up. If this happens, the farm maintainer should provide a -report to mesa-dev@lists.freedesktop.org after the fact explaining -what happened and what the mitigation plan is for that failure next -time. diff --git a/.gitlab-ci/bare-metal/README.md b/.gitlab-ci/bare-metal/README.md deleted file mode 100644 index 843d168c455..00000000000 --- a/.gitlab-ci/bare-metal/README.md +++ /dev/null @@ -1,124 +0,0 @@ -# bare-metal Mesa testing - -Testing Mesa with gitlab-runner running on the devices being tested -(DUTs) proved to be too unstable, so this set of scripts is for -running Mesa testing on bare-metal boards connected to a separate -system using gitlab-runner. Currently only "fastboot" and "ChromeOS -Servo" devices are supported. - -In comparison with LAVA, this doesn't involve maintaining a separate -webservice with its own job scheduler and replicating jobs between the -two. It also places more of the board support in git, instead of -webservice configuration. On the other hand, the serial interactions -and bootloader support are more primitive. - -## Requirements (fastboot) - -This testing requires power control of the DUTs by the gitlab-runner -machine, since this is what we use to reset the system and get back to -a pristine state at the start of testing. - -We require access to the console output from the gitlab-runner system, -since that is how we get the final results back from the tests. You -should probably have the console on a serial connection, so that you -can see bootloader progress. - -The boards need to be able to have a kernel/initramfs supplied by the -gitlab-runner system, since the initramfs is what contains the Mesa -testing payload. - -The boards should have networking, so that (in a future iteration of -this code) we can extract the dEQP .xml results to artifacts on -gitlab. - -## Requirements (servo) - -For servo-connected boards, we can use the EC connection for power -control to reboot the board. However, loading a kernel is not as easy -as fastboot, so we assume your bootloader can do TFTP, and that your -gitlab-runner mounts the runner's tftp directory specific to the board -at /tftp in the container. - -Since we're going the TFTP route, we also use NFS root. This avoids -packing the rootfs and sending it to the board as a ramdisk, which -means we can support larger rootfses (for piglit or tracie testing), -at the cost of needing more storage on the runner. - -Telling the board about where its TFTP and NFS should come from is -done using dnsmasq on the runner host. For example, this snippet in -the dnsmasq.conf.d in the google farm, with the gitlab-runner host we -call "servo". - -``` -dhcp-host=1c:69:7a:0d:a3:d3,10.42.0.10,set:servo - -# Fixed dhcp addresses for my sanity, and setting a tag for -# specializing other DHCP options -dhcp-host=a0:ce:c8:c8:d9:5d,10.42.0.11,set:cheza1 -dhcp-host=a0:ce:c8:c8:d8:81,10.42.0.12,set:cheza2 - -# Specify the next server, watch out for the double ',,'. The -# filename didn't seem to get picked up by the bootloader, so we use -# tftp-unique-root and mount directories like -# /srv/tftp/10.42.0.11/jwerner/cheza as /tftp in the job containers. -tftp-unique-root -dhcp-boot=tag:cheza1,cheza1/vmlinuz,,10.42.0.10 -dhcp-boot=tag:cheza2,cheza2/vmlinuz,,10.42.0.10 - -dhcp-option=tag:cheza1,option:root-path,/srv/nfs/cheza1 -dhcp-option=tag:cheza2,option:root-path,/srv/nfs/cheza2 -``` - -## Setup - -Each board will be registered in fd.o gitlab. 
You'll want something -like this to register a fastboot board: - -``` -sudo gitlab-runner register \ - --url https://gitlab.freedesktop.org \ - --registration-token $1 \ - --name MY_BOARD_NAME \ - --tag-list MY_BOARD_TAG \ - --executor docker \ - --docker-image "alpine:latest" \ - --docker-volumes "/dev:/dev" \ - --docker-network-mode "host" \ - --docker-privileged \ - --non-interactive -``` - -For a servo board, you'll need to also volume mount the board's NFS -root dir at /nfs and TFTP kernel directory at /tftp. - -The registration token has to come from a fd.o gitlab admin going to -https://gitlab.freedesktop.org/admin/runners - -The name scheme for Google's lab is google-freedreno-boardname-n, and -our tag is something like google-freedreno-db410c. The tag is what -identifies a board type so that board-specific jobs can be dispatched -into that pool. - -We need privileged mode and the /dev bind mount in order to get at the -serial console and fastboot USB devices (--device arguments don't -apply to devices that show up after container start, which is the case -with fastboot, and the servo serial devices are acctually links to -/dev/pts). We use host network mode so that we can (in the future) -spin up a server to collect XML results for fastboot. - -Once you've added your boards, you're going to need to add a little -more customization in `/etc/gitlab-runner/config.toml`. First, add -`concurrent = ` at the top ("we should have up to -this many jobs running managed by this gitlab-runner"). Then for each -board's runner, set `limit = 1` ("only 1 job served by this board at a -time"). Finally, add the board-specific environment variables -required by your bare-metal script, something like: - -``` -[[runners]] - name = "google-freedreno-db410c-1" - environment = ["BM_SERIAL=/dev/ttyDB410c8", "BM_POWERUP=google-power-up.sh 8", "BM_FASTBOOT_SERIAL=15e9e390"] -``` - -Once you've updated your runners' configs, restart with `sudo service -gitlab-runner restart` diff --git a/docs/ci/LAVA.rst b/docs/ci/LAVA.rst new file mode 100644 index 00000000000..6ccce795a1b --- /dev/null +++ b/docs/ci/LAVA.rst @@ -0,0 +1,86 @@ +LAVA CI +======= + +`LAVA `_ is a system for functional testing +of boards including deploying custom bootloaders and kernels. This is +particularly relevant to testing Mesa because we often need to change +kernels for UAPI changes (and this lets us do full testing of a new +kernel during development), and our workloads can easily take down +boards when mistakes are made (kernel oopses, OOMs that take out +critical system services). + +Mesa-LAVA software architecture +------------------------------- + +The gitlab-runner will run on some host that has access to the LAVA +lab, with tags like "lava-mesa-boardname" to control only taking in +jobs for the hardware that the LAVA lab contains. The gitlab-runner +spawns a docker container with lava-cli in it, and connects to the +LAVA lab using a predefined token to submit jobs under a specific +device type. + +The LAVA instance manages scheduling those jobs to the boards present. +For a job, it will deploy the kernel, device tree, and the ramdisk +containing the CTS. + +Deploying a new Mesa-LAVA lab +----------------------------- + +You'll want to start with setting up your LAVA instance and getting +some boards booting using test jobs. Start with the stock QEMU +examples to make sure your instance works at all. Then, you'll need +to define your actual boards. 
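+
+If you haven't written LAVA job definitions before, they are plain
+YAML files submitted with lavacli (``lavacli jobs submit job.yaml``).
+A rough sketch along the lines of the stock QEMU examples follows; the
+image URL, test repository, prompts and timeouts are placeholders, not
+values used by Mesa CI:
+
+.. code-block:: yaml
+
+    device_type: qemu
+    job_name: qemu health check (sketch)
+    visibility: public
+    priority: medium
+    timeouts:
+      job:
+        minutes: 15
+      action:
+        minutes: 5
+    context:
+      arch: amd64
+    actions:
+      - deploy:
+          to: tmpfs
+          images:
+            rootfs:
+              image_arg: -drive format=raw,file={rootfs}
+              url: https://example.com/images/debian-minimal.img.gz
+              compression: gz
+      - boot:
+          method: qemu
+          media: tmpfs
+          prompts:
+            - 'root@debian:'
+      - test:
+          definitions:
+            - from: git
+              repository: https://example.com/lava-functional-tests.git
+              path: lava-test-shell/smoke-tests-basic.yaml
+              name: smoke-tests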
+ +The device type in lava-gitlab-ci.yml is the device type you create in +your LAVA instance, which doesn't have to match the board's name in +``/etc/lava-dispatcher/device-types``. You create your boards under +that device type and the Mesa jobs will be scheduled to any of them. +Instantiate your boards by creating them in the UI or at the command +line attached to that device type, then populate their dictionary +(using an "extends" line probably referencing the board's template in +``/etc/lava-dispatcher/device-types``). Now, go find a relevant +healthcheck job for your board as a test job definition, or cobble +something together from a board that boots using the same boot_method +and some public images, and figure out how to get your boards booting. + +Once you can boot your board using a custom job definition, it's time +to connect Mesa CI to it. Install gitlab-runner and register as a +shared runner (you'll need a gitlab admin for help with this). The +runner *must* have a tag (like "mesa-lava-db410c") to restrict the +jobs it takes or it will grab random jobs from tasks across fd.o, and +your runner isn't ready for that. + +The runner will be running an ARM docker image (we haven't done any +x86 LAVA yet, so that isn't documented). If your host for the +gitlab-runner is x86, then you'll need to install qemu-user-static and +the binfmt support. + +The docker image will need access to the lava instance. If it's on a +public network it should be fine. If you're running the LAVA instance +on localhost, you'll need to set ``network_mode="host"`` in +``/etc/gitlab-runner/config.toml`` so it can access localhost. Create a +gitlab-runner user in your LAVA instance, log in under that user on +the web interface, and create an API token. Copy that into a +``lavacli.yaml``: + +.. code-block:: yaml + + default: + token: + uri: + username: gitlab-runner + +Add a volume mount of that ``lavacli.yaml`` to +``/etc/gitlab-runner/config.toml`` so that the docker container can +access it. You probably have a ``volumes = ["/cache"]`` already, so now it would be:: + + volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"] + +Note that this token is visible to anybody that can submit MRs to +Mesa! It is not an actual secret. We could just bake it into the +gitlab CI yml, but this way the current method of connecting to the +LAVA instance is separated from the Mesa branches (particularly +relevant as we have many stable branches all using CI). + +Now it's time to define your test runner in +``.gitlab-ci/lava-gitlab-ci.yml``. diff --git a/docs/ci/bare-metal.rst b/docs/ci/bare-metal.rst new file mode 100644 index 00000000000..cfe4e0d55e5 --- /dev/null +++ b/docs/ci/bare-metal.rst @@ -0,0 +1,123 @@ +Bare-metal CI +============= + +The bare-metal scripts run on a system with gitlab-runner and docker, +connected to potentially multiple bare-metal boards that run tests of +Mesa. Currently only "fastboot" and "ChromeOS Servo" devices are +supported. + +In comparison with LAVA, this doesn't involve maintaining a separate +webservice with its own job scheduler and replicating jobs between the +two. It also places more of the board support in git, instead of +webservice configuration. On the other hand, the serial interactions +and bootloader support are more primitive. 
+ +Requirements (fastboot) +----------------------- + +This testing requires power control of the DUTs by the gitlab-runner +machine, since this is what we use to reset the system and get back to +a pristine state at the start of testing. + +We require access to the console output from the gitlab-runner system, +since that is how we get the final results back from the tests. You +should probably have the console on a serial connection, so that you +can see bootloader progress. + +The boards need to be able to have a kernel/initramfs supplied by the +gitlab-runner system, since the initramfs is what contains the Mesa +testing payload. + +The boards should have networking, so that (in a future iteration of +this code) we can extract the dEQP .xml results to artifacts on +gitlab. + +Requirements (servo) +-------------------- + +For servo-connected boards, we can use the EC connection for power +control to reboot the board. However, loading a kernel is not as easy +as fastboot, so we assume your bootloader can do TFTP, and that your +gitlab-runner mounts the runner's tftp directory specific to the board +at /tftp in the container. + +Since we're going the TFTP route, we also use NFS root. This avoids +packing the rootfs and sending it to the board as a ramdisk, which +means we can support larger rootfses (for piglit or tracie testing), +at the cost of needing more storage on the runner. + +Telling the board about where its TFTP and NFS should come from is +done using dnsmasq on the runner host. For example, this snippet in +the dnsmasq.conf.d in the google farm, with the gitlab-runner host we +call "servo":: + + dhcp-host=1c:69:7a:0d:a3:d3,10.42.0.10,set:servo + + # Fixed dhcp addresses for my sanity, and setting a tag for + # specializing other DHCP options + dhcp-host=a0:ce:c8:c8:d9:5d,10.42.0.11,set:cheza1 + dhcp-host=a0:ce:c8:c8:d8:81,10.42.0.12,set:cheza2 + + # Specify the next server, watch out for the double ',,'. The + # filename didn't seem to get picked up by the bootloader, so we use + # tftp-unique-root and mount directories like + # /srv/tftp/10.42.0.11/jwerner/cheza as /tftp in the job containers. + tftp-unique-root + dhcp-boot=tag:cheza1,cheza1/vmlinuz,,10.42.0.10 + dhcp-boot=tag:cheza2,cheza2/vmlinuz,,10.42.0.10 + + dhcp-option=tag:cheza1,option:root-path,/srv/nfs/cheza1 + dhcp-option=tag:cheza2,option:root-path,/srv/nfs/cheza2 + +Setup +----- + +Each board will be registered in fd.o gitlab. You'll want something +like this to register a fastboot board: + +.. code-block:: console + + sudo gitlab-runner register \ + --url https://gitlab.freedesktop.org \ + --registration-token $1 \ + --name MY_BOARD_NAME \ + --tag-list MY_BOARD_TAG \ + --executor docker \ + --docker-image "alpine:latest" \ + --docker-volumes "/dev:/dev" \ + --docker-network-mode "host" \ + --docker-privileged \ + --non-interactive + +For a servo board, you'll need to also volume mount the board's NFS +root dir at /nfs and TFTP kernel directory at /tftp. + +The registration token has to come from a fd.o gitlab admin going to +https://gitlab.freedesktop.org/admin/runners + +The name scheme for Google's lab is google-freedreno-boardname-n, and +our tag is something like google-freedreno-db410c. The tag is what +identifies a board type so that board-specific jobs can be dispatched +into that pool. 
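+
+For a servo board the registration is the same apart from the extra
+volume mounts described above.  A hypothetical cheza registration,
+reusing the paths from the dnsmasq example (adjust them to your own
+layout), might look like:
+
+.. code-block:: console
+
+    sudo gitlab-runner register \
+         --url https://gitlab.freedesktop.org \
+         --registration-token $1 \
+         --name google-freedreno-cheza-1 \
+         --tag-list google-freedreno-cheza \
+         --executor docker \
+         --docker-image "alpine:latest" \
+         --docker-volumes "/dev:/dev" \
+         --docker-volumes "/srv/nfs/cheza1:/nfs" \
+         --docker-volumes "/srv/tftp/10.42.0.11/jwerner/cheza:/tftp" \
+         --docker-network-mode "host" \
+         --docker-privileged \
+         --non-interactive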
+ +We need privileged mode and the /dev bind mount in order to get at the +serial console and fastboot USB devices (--device arguments don't +apply to devices that show up after container start, which is the case +with fastboot, and the servo serial devices are acctually links to +/dev/pts). We use host network mode so that we can (in the future) +spin up a server to collect XML results for fastboot. + +Once you've added your boards, you're going to need to add a little +more customization in ``/etc/gitlab-runner/config.toml``. First, add +``concurrent = `` at the top ("we should have up to +this many jobs running managed by this gitlab-runner"). Then for each +board's runner, set ``limit = 1`` ("only 1 job served by this board at a +time"). Finally, add the board-specific environment variables +required by your bare-metal script, something like:: + + [[runners]] + name = "google-freedreno-db410c-1" + environment = ["BM_SERIAL=/dev/ttyDB410c8", "BM_POWERUP=google-power-up.sh 8", "BM_FASTBOOT_SERIAL=15e9e390"] + +Once you've updated your runners' configs, restart with ``sudo service +gitlab-runner restart`` diff --git a/docs/ci/docker.rst b/docs/ci/docker.rst new file mode 100644 index 00000000000..3bdfe48f0fe --- /dev/null +++ b/docs/ci/docker.rst @@ -0,0 +1,75 @@ +Docker CI +========= + +For llvmpipe and swrast CI, we run tests in a container containing +VK-GL-CTS, on the shared gitlab runners provided by `freedesktop +`_ + +Software architecture +--------------------- + +The docker containers are rebuilt from the debian-install.sh script +when DEBIAN\_TAG is changed in .gitlab-ci.yml, and +debian-test-install.sh when DEBIAN\_ARM64\_TAG is changed in +.gitlab-ci.yml. The resulting images are around 500MB, and are +expected to change approximately weekly (though an individual +developer working on them may produce many more images while trying to +come up with a working MR!). + +gitlab-runner is a client that polls gitlab.freedesktop.org for +available jobs, with no inbound networking requirements. Jobs can +have tags, so we can have DUT-specific jobs that only run on runners +with that tag marked in the gitlab UI. + +Since dEQP takes a long time to run, we mark the job as "parallel" at +some level, which spawns multiple jobs from one definition, and then +deqp-runner.sh takes the corresponding fraction of the test list for +that job. + +To reduce dEQP runtime (or avoid tests with unreliable results), a +deqp-runner.sh invocation can provide a list of tests to skip. If +your driver is not yet conformant, you can pass a list of expected +failures, and the job will only fail on tests that aren't listed (look +at the job's log for which specific tests failed). + +DUT requirements +---------------- + +In addition to the general :ref:`CI-farm-expectations`, using +docker requiers: + +* DUTs must have a stable kernel and GPU reset (if applicable). + +If the system goes down during a test run, that job will eventually +time out and fail (default 1 hour). However, if the kernel can't +reliably reset the GPU on failure, bugs in one MR may leak into +spurious failures in another MR. This would be an unacceptable impact +on Mesa developers working on other drivers. + +* DUTs must be able to run docker + +The Mesa gitlab-runner based test architecture is built around docker, +so that we can cache the debian package installation and CTS build +step across multiple test runs. 
Since the images are large and change +approximately weekly, the DUTs also need to be running some script to +prune stale docker images periodically in order to not run out of disk +space as we rev those containers (perhaps `this script +`_). + +Note that docker doesn't allow containers to be stored on NFS, and +doesn't allow multiple docker daemons to interact with the same +network block device, so you will probably need some sort of physical +storage on your DUTs. + +* DUTs must be public + +By including your device in .gitlab-ci.yml, you're effectively letting +anyone on the internet run code on your device. docker containers may +provide some limited protection, but how much you trust that and what +you do to mitigate hostile access is up to you. + +* DUTs must expose the dri device nodes to the containers. + +Obviously, to get access to the HW, we need to pass the render node +through. This is done by adding ``devices = ["/dev/dri"]`` to the +``runners.docker`` section of /etc/gitlab-runner/config.toml. diff --git a/docs/ci/index.rst b/docs/ci/index.rst index 9c34f5090d8..4055f876f91 100644 --- a/docs/ci/index.rst +++ b/docs/ci/index.rst @@ -1,7 +1,6 @@ Continuous Integration ====================== - GitLab CI --------- @@ -18,6 +17,7 @@ The CI runs a number of tests, from trivial build-testing to complex GPU renderi - Sanity checks (``meson test`` & ``scons check``) - Some drivers (softpipe, llvmpipe, freedreno and panfrost) are also tested using `VK-GL-CTS `__ +- Replay of application traces A typical run takes between 20 and 30 minutes, although it can go up very quickly if the GitLab runners are overwhelmed, which happens sometimes. When it does happen, @@ -42,6 +42,15 @@ about it on ``#freedesktop`` on Freenode and tag `Daniel Stone `Eric Anholt `__ (``anholt`` on IRC). +The three gitlab CI systems currently integrated are: + + +.. toctree:: + :maxdepth: 1 + + bare-metal + LAVA + docker Intel CI -------- @@ -74,3 +83,46 @@ it on ``#dri-devel`` on Freenode and tag `Clayton Craft `__ (``craftyguy`` on IRC) or `Nico Cortes `__ (``ngcortes`` on IRC). + +.. _CI-farm-expectations: + +CI farm expectations +-------------------- + +To make sure that testing of one vendor's drivers doesn't block +unrelated work by other vendors, we require that a given driver's test +farm produces a spurious failure no more than once a week. If every +driver had CI and failed once a week, we would be seeing someone's +code getting blocked on a spurious failure daily, which is an +unacceptable cost to the project. + +Additionally, the test farm needs to be able to provide a short enough +turnaround time that we can get our MRs through marge-bot without the +pipeline backing up. As a result, we require that the test farm be +able to handle a whole pipeline's worth of jobs in less than 5 minutes +(to compare, the build stage is about 10 minutes, if you could get all +your jobs scheduled on the shared runners in time.). + +If a test farm is short the HW to provide these guarantees, consider +dropping tests to reduce runtime. +``VK-GL-CTS/scripts/log/bottleneck_report.py`` can help you find what +tests were slow in a ``results.qpa`` file. Or, you can have a job with +no ``parallel`` field set and: + +.. code-block:: yaml + + variables: + CI_NODE_INDEX: 1 + CI_NODE_TOTAL: 10 + +to just run 1/10th of the test list. 
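+
+Under the hood that fraction is just a deterministic slice of the case
+list.  As a sketch of the idea (not the actual deqp-runner.sh
+implementation), you could keep every CI_NODE_TOTAL-th case, offset by
+CI_NODE_INDEX:
+
+.. code-block:: console
+
+    # Keep every CI_NODE_TOTAL-th case, starting at CI_NODE_INDEX (1-based).
+    awk -v i="$CI_NODE_INDEX" -v n="$CI_NODE_TOTAL" \
+        'NR % n == i % n' caselist.txt > caselist-fraction.txt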
+
+If a HW CI farm goes offline (network dies and all CI pipelines end up
+stalled) or its runners are consistently spuriously failing (disk
+full?), and the maintainer is not immediately available to fix the
+issue, please push through an MR disabling that farm's jobs by adding
+'.' to the front of the job names until the maintainer can bring
+things back up. If this happens, the farm maintainer should provide a
+report to mesa-dev@lists.freedesktop.org after the fact explaining
+what happened and what the mitigation plan is for that failure next
+time.
--
cgit v1.2.3