Parallel CI-agnostic testing using Python and Github Actions

Prerequisites and recommendations


Our team’s API project in Django contains about 5,000 unit and integration tests run with pytest. Executing all of these checks – either locally or in the Github Actions (GA) CI – takes us roughly half an hour. Once the tests are all done, we get generated reports in the form of coverage.xml and unit.xml files, which we then feed into GA plugins to get a better visualisation of the results. We wanted a solution that would get all of this done faster.

Introduction to the project’s environment

In order to set up the dev environment, we use docker-compose as demonstrated below:

version: '3.8'
services:
  db:
    image: postgres:13
    environment:
      POSTGRES_DB: default
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432"

  redis:
    image: redis
    ports:
      - "6379"

  app:
    depends_on:
      - db
      - redis
    build: .

The tests on our project are run with what amounts to a single command in Github Actions:

docker-compose run -v $(pwd):/code -u 0 app pytest --cov=. --junit-xml=unit.xml .

Some of the main advantages of such an approach: 

  • Tests that were configured to be performed inside our CI environment can also be run on a local machine; 
  • Sidecar containers (e.g. PostgreSQL or Redis) can be reused: in our dev environment as well as in our CI environment (when the tests are performed); 
  • The code for running our tests inside the CI environment can be easily transferred to any other CI tool.

As for its disadvantages, we can name the following couple of points: 

  • One needs to make sure that the CI job runs on a runner with docker-compose available – and with the DOCKER_HOST environment variable pointing to an active Docker daemon;
  • There’s a possibility that sidecar containers will take a while to start; in that case, I’d recommend using a readiness-checking tool such as wait-for-it – if there are no other dependencies that require installation, it will literally wait until the necessary container has become available and the corresponding port has been opened for incoming traffic (in our example: wait-for-it db:5432 -t 60 && wait-for-it redis:6379 -t 60 && pytest).
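As a sketch, the readiness wait could be wired into the test command through a compose override – assuming the wait-for-it script is available inside the app image, and using the db/redis service names from the compose file above:

```yaml
# docker-compose.override.yml – sketch only: gate pytest on sidecar readiness
services:
  app:
    command: >
      sh -c "wait-for-it db:5432 -t 60 &&
             wait-for-it redis:6379 -t 60 &&
             pytest --cov=. --junit-xml=unit.xml ."
```

With this override in place, the usual docker-compose run invocation stays unchanged; the wait simply happens before pytest starts.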

Important note

Following a test run, our CI pipeline produces the following artifacts: unit.xml (test results) and coverage.xml (test coverage data), which we need in order to display the information through GA’s graphical interface. Combined, these data not only help us find and view the logs of specific broken tests quickly, but also spare us the trouble of digging through raw output and manually monitoring changes in test coverage metrics.

      - name: Publish Unit Test Results
        uses: EnricoMi/publish-unit-test-result-action@v1
        if: always()
        continue-on-error: true
        with:
          files: unit.xml
          check_name: Pytest tests

      - name: Display coverage
        if: always()
        continue-on-error: true
        uses: ewjoachim/coverage-comment-action@v1
        env:
          GITHUB_TOKEN: ${{ github.token }}
          COVERAGE_FILE: coverage.xml

    (Screenshot: passed tests displayed in GA’s graphical interface)

Task definition 

Usually, when a project requires parallel execution of tests, most teams in the Python ecosystem will reach for pytest-xdist as the fitting solution.

My coworkers suggested I collect and analyse the Datadog data about flaky tests (tests that return successful results in some cases and failures in others, even though the code stays the same for each consecutive run) on our project, so that our team could get a better idea as to why they kept failing on us so randomly.

At first, we planned to follow up on my investigation with fixing the detected issues. That way, we wanted to improve our code and make it possible to run the tests via pytest-xdist without any worries, as well as to make the parallel execution mode safer to use. We needed to speed up our test runs, and do it as quickly as possible :)

Investigating the matter

To start with, I singled out all tests that had been breaking due to malfunctioning Django translations – which, in turn, depend on gettext. According to the information I found, gettext is not considered thread-safe:

The GNU gettext runtime supports only one set of locale settings (i.e. one locale) per process. Using more than one locale and/or encoding within a single process is therefore not multithread-safe. This is perfectly fine for applications that interact directly with only one user at a time (as is the case with most GUI apps) – however, for servers and services the situation becomes quite problematic.

A couple of interesting comments can also be found in this thread: based on them, one can assume that the default way of running tests most likely amounts to multithreading. But it’s clearly also possible to opt for multiprocessing (as shown in the examples below) – which could rid us of at least a portion of the testing-related problems.

pip install pytest-xdist

# The most primitive case: send tests to several CPUs at once
pytest -n NUM

# Run tests in 3 different subprocesses
pytest --dist=each --tx 3*popen//python=python3.6

# Run tests in 3 forked subprocesses (doesn’t work on Windows)
pytest --dist=each --tx 3*popen//python=python3.6 --boxed

# Send tests to SSH slaves
pytest --dist=each --tx ssh=first_slave --tx ssh=second_slave --rsyncdir package package

# Send tests to socket servers (socketserver.py ships with pytest-xdist’s examples)
python socketserver.py :8889 &
python socketserver.py :8890 &
pytest --dist=each --tx socket=localhost:8889 --tx socket=localhost:8890

Continuing my investigation of the Datadog data, I realised that some of the tests had broken because they shared the same cache storage in a common Redis. Others had failed because assertions on the number of SQL queries issued by the ORM didn’t hold – in fact, some other process had been running queries against the very same db container at the exact same time as the test(s) in question.

That’s how I came to the conclusion that the malfunctioning of our tests was related to the use of a shared db sidecar. In other words, our tests kept breaking simply because they all used the same database instance.

My teammates suggested I fix the identified caching issues with an in-memory solution that would make some processes run in RAM instead of a container. But by that point it had become pretty obvious to me that, even if I followed their advice, we would keep running into one issue after another, each producing a fresh batch of broken tests.

Considering all of the things said and discussed so far, we can conclude the following: using pytest-xdist means that

  • we would need to increase our development costs in order to fix all of the currently existing issues with parallel execution of tests, as well as to maintain the ‘healthy’ state of our system in the future;
  • our test environment would suffer as well, since replacing sidecar containers (e.g. Redis) with RAM-only alternatives would degrade the quality of our tests.


Instead of switching over to pytest-xdist, I decided to simply split our pytest tests into groups – luckily, there already exists a library called pytest-split which does exactly that. By using this tool on our project, I managed to achieve the following:

Each set, or group, of tests is run in its own separate set of containers configured and managed via docker-compose. On top of that, each process has its own database and Redis instances, as well as any other sidecar container dependency required. 

As a result, we get a neat illusion: looking at a single group of tests, the tests within its thread run consecutively; looking at all the groups and their respective threads as a whole, the tests are effectively executed in parallel (or at least concurrently).
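To make the grouping idea concrete: pytest-split assigns each test to exactly one of N groups (in reality it balances groups by recorded test durations; the round-robin rule below is only a simplified stand-in I wrote for illustration):

```python
def split_into_groups(tests, num_groups):
    """Partition a list of test ids into num_groups groups, round-robin.

    Illustrative only: pytest-split itself balances groups by recorded
    test durations (stored via --store-durations), not round-robin.
    """
    groups = [[] for _ in range(num_groups)]
    for index, test in enumerate(tests):
        groups[index % num_groups].append(test)
    return groups


# Hypothetical test ids, split across 3 workers
tests = [f"test_module_{i}.py::test_case" for i in range(7)]
groups = split_into_groups(tests, 3)
```

In CI, process k then simply runs its own slice via pytest-split’s actual CLI: pytest --splits N --group k.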

This let us sidestep the multithreading troubles entirely – which was just what we wanted!

The only small problem left to solve was merging the test coverage results and the JUnit output data – as I mentioned in the very beginning, we need them in order to display information about our test runs via GA’s graphical interface. Since I was working in the Python ecosystem, the solution relied on that language’s capabilities – including the subprocess and argparse libraries, which let me manage multiple processes and get a self-documented command-line interface, respectively.

1. Self-hosted runner in Github Actions 

Let’s begin the task by creating a self-hosted runner with docker-compose inside, and an available Docker daemon for it. 

Github Actions documentation only covers installing self-hosted runners on Linux, and only manually at that. In this respect, GA is somewhat subpar compared to Gitlab CI: the latter offers a number of ready-to-use solutions that take care of all the necessary details, making it easy to install runners on Docker or even Kubernetes.

This is why we’re going to develop our own solution: one that automates GA’s standard self-hosted-runner installation procedure and containerises it along with all of the required OS-level dependencies.

As a result, we get a simple command for quickly creating and starting our runner in Github Actions: TOKEN=your_github_token_to_register_runner docker-compose up (here’s the code for reference).

To set everything up on a fresh machine:

  • Install Docker Engine – for example, on Ubuntu;
  • Install docker-compose (here’s the guide for Ubuntu);
  • Clone the Git repository using git clone;
  • Create and launch the necessary containers using TOKEN=your_github_token_to_register_runner docker-compose up -d.
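For reference, the compose file for such a runner might look roughly like this – a sketch under the assumption that the image’s entrypoint registers the runner using Github’s standard config.sh/run.sh scripts (service and variable names here are illustrative):

```yaml
version: '3.8'
services:
  app:
    build: .
    environment:
      # Registration token from the repository’s Settings → Actions → Runners page
      TOKEN: ${TOKEN}
    volumes:
      # Give the runner access to the host’s Docker daemon so CI jobs
      # can run docker-compose themselves
      - /var/run/docker.sock:/var/run/docker.sock
```

Mounting the Docker socket is what satisfies the DOCKER_HOST requirement mentioned earlier: jobs inside the runner talk to the host’s daemon.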

Check the logs using the docker-compose logs command to make sure that everything is working as intended.

app_1   | python3: running listener
app_1   | 
app_1   | √ Connected to GitHub
app_1   | 
app_1   | Current runner version: '2.294.0'
app_1   | 2022-09-16 21:39:58Z: Listening for Jobs


2. Triggering CI pipeline

Finally, we need to trigger our pipeline. To do so, we either:

  • push any piece of code to the master branch of our repository with parallel_pytest installed, or
  • open a pull request to merge our commits to master, or
  • request the launch of a new CI workflow through GA’s graphical interface :)

You can find the full code of the solution here.

Here’s what happens under the hood:

  • When a new CI workflow is launched, a copy of the code from the repository gets downloaded into the current running job (in other words, we check out the code);
  • If needed, a Docker image with an assigned name tag gets built;
  • Several subprocesses for running our tests get started, and every one of those subprocesses begins to deploy its own ‘fleet’ of containers with a unique -p project_name using docker-compose; like so, we avoid the previously encountered issue where one container gets used by multiple concurrent tests;
  • Test run results are generated as reports named reports/junit_{process number}.xml and reports/.coverage_{process number}.xml; once the runs are done, a special script merges the numerous JUnit and coverage files into a single report;
  • Testing results get published in Github Actions and can be displayed through its GUI. 
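The fan-out step above can be sketched as follows – the command layout and report paths are assumptions mirroring the list above, not the project’s actual script; the key point is that docker-compose -p gives each group its own isolated fleet of containers:

```python
import subprocess


def build_group_command(group, total_groups):
    """Compose the docker-compose invocation for one test group.

    Each group gets a unique -p project name, so it deploys its own
    isolated db/redis/app containers and writes its own report files.
    """
    return [
        "docker-compose", "-p", f"tests_{group}",
        "run", "app",
        "pytest",
        f"--splits={total_groups}", f"--group={group}",
        f"--junitxml=reports/junit_{group}.xml",
        "--cov=.",
    ]


def run_groups(total_groups):
    """Start one subprocess per group and wait for all of them."""
    processes = [
        subprocess.Popen(build_group_command(group, total_groups))
        for group in range(1, total_groups + 1)
    ]
    return [process.wait() for process in processes]
```

Because the subprocesses are started before any of them is awaited, the groups genuinely run concurrently.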

Debugging tips 

The python3 -m make parallel_pytest --dry command simulates launching the compiled commands in the shell, letting you see what would happen without actually running anything.

Moreover, the same python3 -m make parallel_pytest command can be configured to run a smaller subset of the tests.
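A minimal sketch of how such a --dry flag can be implemented with argparse – the interface below is a hypothetical reconstruction for illustration, not the project’s actual script:

```python
import argparse
import shlex
import subprocess


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run pytest groups in parallel")
    parser.add_argument("--groups", type=int, default=2)
    parser.add_argument("--dry", action="store_true",
                        help="print the compiled shell commands instead of running them")
    args = parser.parse_args(argv)

    commands = [
        ["docker-compose", "-p", f"tests_{group}", "run", "app", "pytest",
         f"--splits={args.groups}", f"--group={group}"]
        for group in range(1, args.groups + 1)
    ]
    if args.dry:
        # Dry run: show what would be executed, touch nothing
        for command in commands:
            print(shlex.join(command))
        return commands
    return [subprocess.run(command).returncode for command in commands]
```

Since argparse generates --help output automatically, the script stays self-documented, as mentioned above.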


Results

The tests on our project can now be run in parallel, and the whole testing process takes a lot less time than it did before:

  • 1 core = 24 minutes 40 seconds (without parallelism),
  • 2 cores = 15 minutes 31 seconds,
  • 6 cores = 8 minutes (tested on a PC with an Intel i5-10400 processor with 6 total cores and 12 threads).

Plus, while working on the solution we:

  • solved the problem of parallel execution of tests without changing anything within the existing code of our project – instead, we dealt with the matter at a higher, more abstract level, which meant that we successfully avoided any possible issues related to multithreading and/or multiprocessing safety;
  • continued to use the code of our sidecar containers (including PostgreSQL, Redis and other required dependencies) for running the tests – as a result, our development, CI, and production environments remained as similar as possible, thus complying with the 10th rule of The Twelve-Factor App;
  • managed to develop a rather universal solution that could be reused to speed up any other Git repository;
  • stayed true to our principle of keeping the testing CI-agnostic: our solution can now be run locally, or even migrated to any other CI tool.

What could be done in the future 

A couple of ideas that I might return to later:

  • Refactor the code to make it even more universal, and create a library based on my solution so that it could be used in many different kinds of repositories and/or programming languages. 
  • Perhaps write a parallel_pytest script available in the form of a compiled binary – for example, in Golang. Such a file would be quite lightweight, installation-free, and could be used in any repository without having much of an effect on the size of Docker containers created and run within a project. (An even simpler solution would be to make a compiled binary from my code in Python.)
  • Try out this new solution on other projects – for example, use it to perform testing of some apps that were written in other programming languages (that is, if there appears a tool for splitting tests into groups that would be compatible with each specific language).