
How to properly implement retry for automated tests: when to apply it and when not to?


Answer.

A retry mechanism for tests is introduced to improve the stability of the test pipeline and to combat flaky automated tests, but it requires a deliberate, controlled approach.

Background

Retry emerged as a workaround for transient infrastructure or flaky errors unrelated to application bugs (unstable environments, network issues, random API timeouts). Many CI/CD systems now support built-in retry.

Problem

Excessive or uncontrolled retry can hide real bugs by reframing failures as "just temporary glitches". False positives occur: the test appears to pass, but the bug is still present.

Solution

Enable retry only for unstable or externally dependent tests, selecting them with tags or markers. Set a fixed retry limit (1-2 attempts). Log every failure and success of retry runs so the causes of failures can be analyzed.

import pytest

# Requires the pytest-rerunfailures plugin
@pytest.mark.flaky(reruns=2, reruns_delay=5)
def test_external_api():
    # Test that may fail due to API instability
    ...

Key features:

  • Apply selectively and consciously
  • Always analyze the causes of retries
  • Mask only infrastructural/flaky failures, never application bugs
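The principles above (a fixed retry limit, logging every attempt, surfacing the real failure once retries are exhausted) can be sketched as a plain retry decorator. This is a minimal illustration not tied to any framework; the names `retry` and `max_retries` are illustrative, not a real library API:

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retry")

def retry(max_retries=2, delay=0.0):
    """Rerun a test function up to max_retries extra times, logging every attempt."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 2):  # first run + retries
                try:
                    result = fn(*args, **kwargs)
                    if attempt > 1:
                        # Passing only on a rerun is itself a signal of flakiness
                        log.warning("%s passed on attempt %d: flaky!", fn.__name__, attempt)
                    return result
                except AssertionError:
                    log.warning("%s failed on attempt %d", fn.__name__, attempt)
                    if attempt > max_retries:
                        raise  # retries exhausted: surface the real failure
                    time.sleep(delay)
        return wrapper
    return decorator
```

Note that the decorator logs both outcomes: a pass on a rerun is recorded as a warning, which supports the analysis of failure causes described above.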

Trick Questions.

Is it acceptable to retry absolutely all automated tests in the pipeline?

This is a bad practice: it masks both real bugs and architectural problems of the tests. Retry is a last resort for certain types of tests.

What to do if a test still fails after 3 retries?

Stop the run and file a bug to investigate the instability; this is how tricky errors that require refactoring are identified.

If the test passes on the second try — does that mean everything is fine?

No, the only correct state is stable passing on the first attempt. Passing on the second indicates a problem (flakiness or infrastructure).

Common Mistakes and Anti-Patterns

  • Uncontrolled use of retry, hiding bugs
  • Lack of analysis of failure causes
  • Using retry instead of fixing real problems

Real-life Example

Negative Case

A global retry of two runs was enabled across the entire Jenkins pipeline. Within a month, race-condition bugs had spread through the code, but the team believed quality was fine. The product then started crashing in production due to those same bugs.

Pros:

  • Pipelines stayed green

Cons:

  • Application bugs were not identified in a timely manner
  • During investigations, it was difficult to pinpoint the source of unreliability

Positive Case

Retry was enabled only for the category of tests that hit external APIs, selected by tags. All other tests must pass strictly on the first run. The team investigates and fixes recurring failures based on the retry logs.
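One way to wire up tag-based retry with pytest is a small conftest hook. This is a sketch that assumes the pytest-rerunfailures plugin is installed and that external-API tests carry a project-specific "external" marker (both names are assumptions, not from the original):

```python
# conftest.py -- sketch assuming the pytest-rerunfailures plugin is installed
# and that tests hitting external APIs are tagged with a custom "external" marker.
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        # Grant reruns only to explicitly tagged external-API tests;
        # all other tests must pass on the first attempt.
        if item.get_closest_marker("external"):
            item.add_marker(pytest.mark.flaky(reruns=2, reruns_delay=5))
```

Centralizing the rule in conftest.py keeps the retry policy in one reviewable place instead of scattering flaky markers across test files.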

Pros:

  • Real bugs are immediately visible
  • Retry absorbs random failures of external systems
  • Can analyze and eliminate root causes of problems

Cons:

  • Need to maintain test tags and reasons for retry
  • Slightly more complex pipeline maintenance