Answer.
Flaky tests are tests that can pass or fail on the same code, without any changes, due to non-deterministic factors. They undermine trust in automation and complicate CI/CD processes.
Background: The problem of flaky tests became acute with the advent of mass E2E, integration, and UI testing, where the stability of the environment and dependent services is not guaranteed. Initially, such failures were ignored or the tests were simply restarted manually.
Problem:
- Flaky tests make automated test runs unreliable.
- Developers start to miss real failures, dismissing them as false alarms.
- The time spent on maintaining tests increases and manual investigation of unstable results is required.
Solution:
- Separate tracking of flaky tests. Label them (e.g., @FlakyTest) or put them in a separate category.
- Automatic re-execution. When a test fails, rerun it N times; if it passes on some runs and fails on others, mark it as unstable.
- Analysis of instability causes: using logs, state snapshots, resource monitoring (e.g., unstable network, queues, or GC operations).
- Gradual fixes: working with the testing environment, simplifying scenarios, mocking unstable dependencies.
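The labeling step above can be sketched with a custom pytest marker; the marker name flaky_known and the test function are illustrative assumptions, not standard pytest names:

```python
import pytest

# Hypothetical label for tests under flakiness investigation;
# register it under [pytest] markers in pytest.ini to avoid warnings.
@pytest.mark.flaky_known
def test_cache_eviction():
    ...  # test body omitted
```

Running pytest -m "not flaky_known" then deselects the labeled tests, keeping the main run clean while the unstable tests stay tracked in a separate job.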
Example code for auto-retry (using the pytest-rerunfailures plugin):
import pytest

# The flaky marker below is provided by the pytest-rerunfailures plugin.
@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_random_fail():
    ...  # test code
Key features:
- It is important to separate flaky tests from real failures
- One should not blindly retry everything; real bugs must not be masked by reruns.
- The status of tests is regularly analyzed and documented, not silenced
Gotcha questions.
Are flaky tests always an infrastructure problem?
No, flaky tests can be caused by business logic errors, race conditions in the code, asynchronicity, or improper time handling.
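A minimal sketch of how a race condition in the test itself produces flakiness; the function names and timings are invented for illustration:

```python
import threading
import time

results = []

def background_write():
    time.sleep(0.01)  # simulates work of variable duration
    results.append("done")

def test_flaky_version():
    t = threading.Thread(target=background_write)
    t.start()
    # Anti-pattern: a fixed sleep races against the worker thread;
    # on a loaded machine the assert can run before the append.
    time.sleep(0.02)
    assert results == ["done"]

def test_stable_version():
    results.clear()
    t = threading.Thread(target=background_write)
    t.start()
    t.join()  # deterministic: wait until the thread has finished
    assert results == ["done"]
```

The flaky variant may pass for months and then fail under load; the stable variant synchronizes explicitly instead of guessing at timing.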
Is just restarting failed tests enough?
No, retries merely mask the problem. It is necessary to search for and eliminate the cause.
Should we delete all unstable tests?
No, they should be temporarily isolated and the causes fixed, rather than simply excluded forever.
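Temporary isolation can be sketched with pytest's xfail marker, so the test keeps running and reporting without failing the build; the test name and reason text are placeholders:

```python
import pytest

# Quarantined, not deleted: the test still runs on every build,
# and its pass/fail history stays visible in reports.
@pytest.mark.xfail(reason="flaky: intermittent timeout, cause under investigation",
                   strict=False)
def test_payment_webhook():
    ...  # test body omitted
```

With strict=False an unexpected pass is reported as XPASS rather than an error, so the marker can be removed once the test has been stable for a while.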
Common mistakes and anti-patterns
- Ignoring flaky tests, which leads to missing serious bugs.
- Mass retries without analyzing causes.
- Labeling important tests as flaky without taking steps to fix them.
Real-life example
Negative case
In one project, flaky tests were simply labeled and left alone, the problem being considered "unsolvable".
Pros:
- Reduction in false "red" builds
Cons:
- Real bugs go into production
- Constant manual checks of results
Positive case
For each unstable test, automatic retries and internal tracking of instability were introduced. Causes were reviewed regularly as a team, and the underlying bugs were fixed over time.
Pros:
- Gradual improvement in the stability of the test system
- Rapid detection and fixing of new flaky tests
Cons:
- Ongoing resources are required for instability analysis