On (not) making Wikimedia CI faster: part 1, the problem

2021-11-29 00:00:00 +0000 UTC


The problem

Every now and then I find myself staring at phab:T225730, “Reduce runtime of MW shared gate Jenkins jobs to 5 min”.

Usually I end up back at this task after a frustrating experience with backporting a patch to production. In the optimistic scenario, you are looking at 25-30 minutes of waiting time once it’s your turn to deploy a patch. But when things go wrong, say with a flaky test, you need to recheck your patch, and 25-30 minutes becomes 50-60. Add in some other deployers waiting their turn, and what should be quick and painless ends up becoming a messy ordeal of an hour or more. 🖥️⏳💥🤦

Flaky tests aside, the problem is that the test and merge phases take too long. That’s not just an issue for backport windows, which are after all not the normal way things are deployed at Wikimedia.

A bigger issue is getting timely feedback on your patch, because as we know[1], context switching sucks. You want feedback quickly so you can close the loop on a patch and move on to other things. If instead you get pinged 30 minutes later that something failed, you have to stop whatever else you moved on to and go back to your previous context.

The same goes for reviewers of the patch, who want to see how the tests look as part of the code review process. And it’s not just CI – as we’ll see later in this series of posts, some of the problems that would be nice to solve in CI affect the local development process too.

At the same time, we also need to balance comprehensiveness and consistency against the desire for faster feedback. A testsuite that is fast but flaky and doesn’t cover enough code paths is worse than a slow testsuite that is comprehensive and consistent.

So, can we have a fast(er) testsuite that doesn’t sacrifice comprehensiveness and consistency? Over the last few weeks I tried a few different things to achieve this.[2] This series is a catalogue of my mostly failed attempts. Like the scientists say, progress is built on failure, so hopefully what I write here points the way forward for a future reader.

⚠️ Disclaimer

Before I go any further, I’d like to say that many people have been down this path before me, including the amazing members of the Release Engineering team, who have worked on all of this in the past and present and who do the largely unseen work of keeping our CI systems running and reliable. I want to give recognition to them 💛 and also note that this series of posts should not be interpreted as a critique of their work. Spoiler alert: as we’ll see later, some core issues stem from decades of technical debt.

tl;dr: doing something about T225730 is a lot easier said than done.

Status quo

First, let’s review the status quo[3] (h/t to @Krinkle and @Jdforrester-WMF for documenting and updating this):

As of 11 May 2021, the gate usually takes around 25 minutes.

The slowest job typically takes 20-25 minutes per run. The gate as a whole can never be faster than its slowest job, and it can be slower: although the other jobs run in parallel, they don’t always start immediately, because CI execution slots are limited.

Below are the timing results from a sample MediaWiki commit (master branch):

[Snipped: Jobs faster than 5 minutes]

  • 9m 43s: mediawiki-quibble-vendor-mysql-php74-docker/5873/console
  • 9m 47s: mediawiki-quibble-vendor-mysql-php73-docker/8799/console
  • 10m 03s: mediawiki-quibble-vendor-sqlite-php72-docker/10345/console
  • 10m 13s: mediawiki-quibble-composer-mysql-php72-docker/19129/console
  • 10m 28s: mediawiki-quibble-vendor-mysql-php72-docker/46482/console
  • 13m 11s: mediawiki-quibble-vendor-postgres-php72-docker/10259/console
  • 16m 44s: wmf-quibble-core-vendor-mysql-php72-docker/53990/console
  • 22m 26s: wmf-quibble-selenium-php72-docker/94038/console
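To make the arithmetic behind these numbers concrete, here is a minimal Python sketch. It takes the eight timings listed above, finds the slowest job (the floor for the gate as a whole), and then estimates the wall-clock time when the jobs have to share a limited number of execution slots. The slot counts and the first-come, first-served queueing are assumptions for illustration only; the real CI scheduler is more involved, and the jobs faster than 5 minutes are left out because they were snipped above.

```python
import heapq
import re

# Timings copied from the sample commit above (build numbers dropped).
SAMPLE_TIMINGS = {
    "mediawiki-quibble-vendor-mysql-php74-docker": "9m 43s",
    "mediawiki-quibble-vendor-mysql-php73-docker": "9m 47s",
    "mediawiki-quibble-vendor-sqlite-php72-docker": "10m 03s",
    "mediawiki-quibble-composer-mysql-php72-docker": "10m 13s",
    "mediawiki-quibble-vendor-mysql-php72-docker": "10m 28s",
    "mediawiki-quibble-vendor-postgres-php72-docker": "13m 11s",
    "wmf-quibble-core-vendor-mysql-php72-docker": "16m 44s",
    "wmf-quibble-selenium-php72-docker": "22m 26s",
}


def to_seconds(duration: str) -> int:
    """Convert a string such as '9m 43s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def gate_lower_bound(timings: dict) -> tuple:
    """Return (job name, seconds) for the slowest job: the gate cannot beat it."""
    slowest = max(timings, key=lambda job: to_seconds(timings[job]))
    return slowest, to_seconds(timings[slowest])


def makespan(durations: list, slots: int) -> int:
    """Wall-clock time if jobs start in list order on `slots` parallel executors.

    Hypothetical model: each job takes whichever slot frees up first.
    """
    if slots < 1:
        raise ValueError("need at least one slot")
    free_at = [0] * slots  # min-heap of times at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


def fmt(seconds: int) -> str:
    return f"{seconds // 60}m {seconds % 60:02d}s"


if __name__ == "__main__":
    durations = [to_seconds(t) for t in SAMPLE_TIMINGS.values()]
    job, floor = gate_lower_bound(SAMPLE_TIMINGS)
    print(f"slowest job: {job} ({fmt(floor)})")
    print(f"total executor time: {fmt(sum(durations))}")
    for slots in (8, 4):  # hypothetical slot counts
        print(f"{slots} slots: gate takes {fmt(makespan(durations, slots))}")
```

With one slot per job, the gate takes exactly as long as the Selenium job (22m 26s). In this toy model, halving the slots to four stretches it to 32m 39s even though no individual job got slower, which is the queueing effect described above.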

Clearly the last two jobs are dominant in the timing:

  • wmf-quibble: This job installs MW with the gated extensions, and then runs all PHPUnit and QUnit tests.
  • wmf-quibble-selenium: This job installs MW with the gated extensions, and then runs all the Selenium tests.

Note that the mediawiki-quibble jobs each install just the MW bundled extensions, and then run PHPUnit, Selenium and QUnit tests.

Stats from wmf-quibble-core-vendor-mysql-php72-docker:

  • 13-18 minutes (wmf-gated, extensions-only)
  • Select times:
      • PHPUnit (unit tests): 9 seconds / 13,170 tests.
      • PHPUnit (DB-less integration tests): 3.31 minutes / 21,067 tests.
      • PHPUnit (DB-heavy): 7.91 minutes / 4,257 tests.
      • QUnit: 31 seconds / 1,421 tests.
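For a rough sense of where the time goes in that job, the figures above can be turned into a per-test cost and a share of the listed time. This is a small Python sketch, assuming that “3.31 minutes” and “7.91 minutes” are decimal minutes (the source does not say) and ignoring any setup time that is not listed above.

```python
# (suite name, duration in seconds, number of tests), taken from the stats above.
# Assumption: "3.31 minutes" and "7.91 minutes" are decimal minutes.
SUITES = [
    ("PHPUnit (unit tests)", 9.0, 13170),
    ("PHPUnit (DB-less integration tests)", 3.31 * 60, 21067),
    ("PHPUnit (DB-heavy)", 7.91 * 60, 4257),
    ("QUnit", 31.0, 1421),
]


def ms_per_test(seconds: float, tests: int) -> float:
    """Average cost of a single test, in milliseconds."""
    return seconds * 1000 / tests


def share_of_listed_time(suites: list) -> dict:
    """Fraction of the summed suite time that each suite accounts for."""
    total = sum(seconds for _, seconds, _ in suites)
    return {name: seconds / total for name, seconds, _ in suites}


if __name__ == "__main__":
    shares = share_of_listed_time(SUITES)
    for name, seconds, tests in SUITES:
        print(
            f"{name}: {ms_per_test(seconds, tests):.1f} ms/test, "
            f"{shares[name]:.0%} of listed time"
        )
```

Under those assumptions, the DB-heavy PHPUnit suite accounts for roughly two thirds of the listed time while running the second-fewest tests, at over 100 ms per test compared with under 1 ms for the pure unit tests.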

Stats from wmf-quibble-selenium-php72-docker:

  • 20-25 minutes

Starting with the obvious

The obvious candidate appears to be Selenium, since it’s the longest-running job at ~22 minutes.

Can we do better? 🧐 Onwards to part 2!

  1. citation needed

  2. also known as “throw things at the wall and see what sticks”.

  3. T225730#7080442