Answer.
Flaky tests are tests that can pass or fail on the same code, without any changes, due to non-deterministic factors. They undermine trust in automation and complicate CI/CD processes.
Background: The problem of flaky tests became acute with the advent of mass E2E, integration, and UI testing, where the stability of the environment and dependent services is not guaranteed. Initially, such failures were ignored or the tests were simply restarted manually.
Problem:
- Flaky tests make automated test runs unreliable.
- Developers start to miss real failures, dismissing them as false alarms.
- The time spent on maintaining tests increases and manual investigation of unstable results is required.
Solution:
- Separate tracking of flaky tests. Label them (e.g., @FlakyTest) or put them in a separate category.
- Automatic re-execution. When a test fails, rerun it N times; if it passes on some runs and fails on others, mark it as unstable.
- Analysis of instability causes: using logs, state snapshots, resource monitoring (e.g., unstable network, queues, or GC operations).
- Gradual fixes: working with the testing environment, simplifying scenarios, mocking unstable dependencies.
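The labeling step above can be sketched with a custom pytest marker; the marker name flaky_known and the test function are illustrative assumptions, not standard pytest names:

```python
import pytest

# Hypothetical label for tests under flakiness investigation;
# register it under [pytest] markers in pytest.ini to avoid warnings.
@pytest.mark.flaky_known
def test_cache_eviction():
    ...  # test body omitted
```

Running pytest -m "not flaky_known" then deselects the labeled tests, keeping the main run clean while the unstable tests stay tracked in a separate job.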
Example code for auto-retry (using the pytest-rerunfailures plugin):
import pytest

# The flaky marker below is provided by the pytest-rerunfailures plugin.
@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_random_fail():
    ...  # test code
Key features:
- It is important to separate flaky tests from real failures
- One should not blindly retry everything; real bugs must not be masked by reruns.
- The status of tests is regularly analyzed and documented, not silenced
Gotcha questions.
Are flaky tests always an infrastructure problem?
No, flaky tests can be caused by business logic errors, race conditions in the code, asynchronicity, or improper time handling.
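A minimal sketch of how a race condition in the test itself produces flakiness; the function names and timings are invented for illustration:

```python
import threading
import time

results = []

def background_write():
    time.sleep(0.01)  # simulates work of variable duration
    results.append("done")

def test_flaky_version():
    t = threading.Thread(target=background_write)
    t.start()
    # Anti-pattern: a fixed sleep races against the worker thread;
    # on a loaded machine the assert can run before the append.
    time.sleep(0.02)
    assert results == ["done"]

def test_stable_version():
    results.clear()
    t = threading.Thread(target=background_write)
    t.start()
    t.join()  # deterministic: wait until the thread has finished
    assert results == ["done"]
```

The flaky variant may pass for months and then fail under load; the stable variant synchronizes explicitly instead of guessing at timing.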
Is just restarting failed tests enough?
No, retries merely mask the problem. It is necessary to search for and eliminate the cause.
Should we delete all unstable tests?
No, they should be temporarily isolated and the causes fixed, rather than simply excluded forever.
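Temporary isolation can be sketched with pytest's xfail marker, so the test keeps running and reporting without failing the build; the test name and reason text are placeholders:

```python
import pytest

# Quarantined, not deleted: the test still runs on every build,
# and its pass/fail history stays visible in reports.
@pytest.mark.xfail(reason="flaky: intermittent timeout, cause under investigation",
                   strict=False)
def test_payment_webhook():
    ...  # test body omitted
```

With strict=False an unexpected pass is reported as XPASS rather than an error, so the marker can be removed once the test has been stable for a while.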
Common mistakes and anti-patterns
- Ignoring flaky tests, which leads to missing serious bugs.
- Mass retries without analyzing causes.
- Labeling important tests as flaky without taking steps to fix them.
Real-life example
Negative case
In one project, flaky tests were simply labeled and left alone, the problem being considered "unsolvable".
Pros:
- Reduction in false "red" builds
Cons:
- Real bugs go into production
- Constant manual checks of results
Positive case
For each unstable test, automatic retries and internal tracking of instability were introduced. Causes were reviewed regularly as a team, and the underlying bugs were fixed over time.
Pros:
- Gradual improvement in the stability of the test system
- Rapid detection and fixing of new flaky tests
Cons:
- Ongoing resources are required for instability analysis