Flaky Tests
What Are Flaky Tests?
Flaky tests are automated tests that produce inconsistent results — passing and failing on the same code without any source changes. The failure is caused not by a code defect but by non-deterministic factors: timing and race conditions, shared mutable state, non-deterministic data, environment dependencies, or resource contention.
Every engineering team encounters flaky tests. Google has publicly reported that approximately 1.5% of its test runs produce a flaky result. For most organizations, the actual rate is higher, often 2-5%.
A flaky test is worse than no test at all. A missing test creates a known gap. A flaky test creates false signals — sometimes indicating a bug when there is none, and sometimes passing when there is a real bug. Over time, teams learn to distrust the test suite, and the entire value of test automation degrades.
Common Causes of Flaky Tests
Understanding the root cause is the first step to fixing flakiness. The six causes below are ordered by frequency.
Timing and Race Conditions
Tests assume operations complete within a fixed time window. When CI runners are slow, these assumptions break. Race conditions between threads or tests accessing the same resource produce non-deterministic outcomes.
Non-Deterministic Data
Tests asserting on exact timestamps, UUIDs, random tokens, or auto-incrementing IDs fail when those values differ between runs. This is the leading cause of flakiness in API and integration tests.
Shared Mutable State
Tests sharing a database, cache, or global singletons are vulnerable. When execution order changes due to parallelization, one test sees state left by another.
Environment Dependencies
Tests depending on external services, network conditions, or file system state are inherently non-deterministic. A DNS delay, rate-limited API, or full disk causes intermittent failures.
Test Order Dependency
Tests relying on a specific execution order (test B expects data from test A) fail when the order changes. This violates the principle that each test should set up its own preconditions.
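As a rough illustration of order dependency, the sketch below runs the same two tests in both orders. All names here (registry, the test functions, all_pass) are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List

registry: Dict[str, dict] = {}  # shared state that both tests touch


def test_a_creates_user() -> None:
    registry["alice"] = {"active": True}
    assert "alice" in registry


def test_b_reads_user() -> None:
    # Hidden dependency: only passes if test_a_creates_user ran first.
    assert registry.get("alice", {}).get("active") is True


def test_b_isolated() -> None:
    # Sets up its own precondition, so execution order no longer matters.
    registry["alice"] = {"active": True}
    assert registry["alice"]["active"] is True


def all_pass(tests: List[Callable[[], None]]) -> bool:
    """Run tests in the given order, starting from empty shared state."""
    registry.clear()
    try:
        for test in tests:
            test()
    except AssertionError:
        return False
    return True


original_order = all_pass([test_a_creates_user, test_b_reads_user])
reversed_order = all_pass([test_b_reads_user, test_a_creates_user])
isolated_first = all_pass([test_b_isolated, test_a_creates_user])
```

Only the reversed order fails, which is the pattern a randomized or parallel test runner would expose.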
Resource Contention
Port conflicts, database connection pool exhaustion, memory pressure on CI runners, and file lock conflicts cause intermittent failures that depend on current load.
The Real Cost of Flaky Tests
Flaky tests are not just an annoyance: they carry measurable costs that compound over time.
- 1.5%: Google's reported flakiness rate, considered a significant problem
- 5-15 min: investigation time per flaky failure, before realizing it's not a real bug
- 15-30 min: added to merge times by retries, reducing deployment frequency
- 60%+: E2E tests' share of flaky failures, despite being 5-10% of the suite
Wasted Investigation
Each flaky failure triggers 5-15 minutes of investigation. With a 3% flakiness rate on a 2,000-test suite, 60 tests can fail spuriously on any given CI run. Across a team of 10 engineers, this adds up to hours of wasted time weekly.
Eroded Trust
Engineers develop the habit of dismissing failures as "probably flaky." Real bugs get merged because failures are assumed to be flakes. The test suite stops serving its primary purpose.
Slower Pipelines
Automatic retries add 15-30 minutes to merge times. Manual re-runs block developers. Either way, flakiness directly increases time-to-merge and reduces deployment frequency.
How to Fix Flaky Tests
Each cause has specific fix patterns. Apply the right fix for the right cause.
Fixing Timing Issues
- Replace sleep(2000) with explicit waits that poll for a condition
- Use retry mechanisms with exponential backoff for eventual consistency
- Use Playwright's waitForSelector instead of fixed waits
- Wait on events (message consumed, job completed) not time delays
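The first two points can be sketched as a small polling helper. This is a minimal illustration using only the standard library; wait_until and FakeJob are hypothetical names, and the timeout and delay values are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               initial_delay: float = 0.01, max_delay: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)  # exponential backoff, capped


class FakeJob:
    """Hypothetical stand-in for an async job that finishes after a few polls."""

    def __init__(self, ready_after_polls: int) -> None:
        self._polls_left = ready_after_polls

    def is_done(self) -> bool:
        self._polls_left -= 1
        return self._polls_left <= 0


job = FakeJob(ready_after_polls=3)
job_finished = wait_until(job.is_done, timeout=2.0)
```

Unlike a fixed sleep, the helper returns as soon as the condition holds on a fast machine and keeps waiting, up to the timeout, on a slow CI runner.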
Fixing Non-Deterministic Data
- Use field-level matchers: assert timestamps are valid ISO 8601, not exact values
- Mock time functions (Date.now(), time.Now()) for deterministic values
- Use fixed seeds for random number generators in test config
- Use Keploy's automatic AI noise detection and time-freezing
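A field-level matcher can be sketched as follows. This is an assumption-laden example, not a specific library's API: is_iso8601, is_uuid, and assert_matches are hypothetical helpers, and the timestamp check accepts only the formats that Python's datetime.fromisoformat can parse.

```python
import uuid
from datetime import datetime, timezone


def is_iso8601(value) -> bool:
    """True if `value` is a string that datetime.fromisoformat can parse."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def is_uuid(value) -> bool:
    """True if `value` is a string that parses as a UUID."""
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def assert_matches(actual: dict, expected: dict) -> None:
    """Compare field by field; callables in `expected` act as matchers."""
    assert actual.keys() == expected.keys(), "field sets differ"
    for key, want in expected.items():
        got = actual[key]
        if callable(want):
            assert want(got), f"{key}: {got!r} failed {want.__name__}"
        else:
            assert got == want, f"{key}: {got!r} != {want!r}"


# Hypothetical API response whose id and timestamp differ on every run.
response = {
    "id": str(uuid.uuid4()),
    "created_at": datetime.now(timezone.utc).isoformat(),
    "status": "active",
}
assert_matches(response, {"id": is_uuid, "created_at": is_iso8601, "status": "active"})
```

The stable field is still compared exactly, so the test keeps its value as a regression check while ignoring the values that legitimately change.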
Fixing Shared State
- Wrap each test in a database transaction that rolls back on teardown
- Use separate database schemas or containerized databases per worker
- Reset global singletons, caches, and env vars in beforeEach hooks
- Generate unique resource names per test run to avoid collisions
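The rollback and unique-name patterns can be sketched with Python's built-in sqlite3 module. The users table and the helper names are assumptions made for this example; a real suite would usually wire the same idea into its test framework's fixtures.

```python
import sqlite3
import uuid
from contextlib import contextmanager

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
conn.commit()


@contextmanager
def rolled_back(connection):
    """Yield the connection, then discard everything the test wrote."""
    try:
        yield connection
    finally:
        connection.rollback()


def unique_name(prefix: str) -> str:
    """Generate a per-test resource name to avoid collisions."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def count_users(connection) -> int:
    return connection.execute("SELECT COUNT(*) FROM users").fetchone()[0]


with rolled_back(conn) as db:
    db.execute("INSERT INTO users (name) VALUES (?)", (unique_name("alice"),))
    rows_inside_test = count_users(db)

rows_after_test = count_users(conn)
```

The row is visible inside the test and gone afterwards, so the next test starts from the same state regardless of execution order.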
Fixing Environment Dependencies
- Mock external services with WireMock, Keploy auto-mocks, or VCR libraries
- Use containerized dependencies (Testcontainers) instead of shared infra
- Pin dependency versions and container images
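The tools named above each have their own setup. As a tool-agnostic sketch of the mocking idea, the example below replaces a network call with a canned response using Python's unittest.mock. The fetch_exchange_rate function, its URL, and the response shape are all hypothetical.

```python
import io
import json
from unittest.mock import patch
from urllib import request


def fetch_exchange_rate(currency: str) -> float:
    """Hypothetical client code that calls an external HTTP API."""
    url = f"https://rates.example.invalid/latest?currency={currency}"
    with request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["rate"]


def fake_urlopen(url, timeout=None):
    """Stand-in for urlopen that returns a canned JSON body, no network."""
    return io.BytesIO(b'{"rate": 1.25}')


with patch.object(request, "urlopen", side_effect=fake_urlopen) as mocked:
    rate = fetch_exchange_rate("EUR")
```

The test now exercises the parsing logic without depending on DNS, rate limits, or the availability of the remote service.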
Preventing Flakiness by Design
The best approach is to prevent flaky tests from entering the codebase. These architectural practices reduce the surface area for non-determinism.
Follow the Test Pyramid
E2E tests are inherently more flaky than unit tests. Minimize E2E tests and maximize unit tests. Fewer E2E tests means fewer sources of flakiness.
Use Deterministic Test Generation
Keploy generates tests from recorded traffic with built-in noise detection and time-freezing. Tests are deterministic by design.
Isolate Test Environments
Each test run should operate in a clean, isolated environment. Use ephemeral databases, namespace resources, and mock external dependencies.
Enforce Flakiness SLAs
Set a target of < 0.5% flakiness rate. Quarantine tests exceeding 2%. Assign fixes to the owning team. Review metrics weekly.

How Keploy Eliminates Flaky Tests
Keploy addresses flakiness at the root cause — not with band-aids like retries and quarantines.
Deterministic Replay
Recorded traffic is replayed with exactly the same inputs every time. Inbound requests are identical, and all outbound dependency responses are served from recorded mocks. No network variability, no database inconsistency, no external service flakiness.
Time-Freezing
During replay, Keploy sets the system clock to the exact timestamp of the original recording. This eliminates an entire category of flakiness: tests that fail because timestamps differ, tokens expire, cache TTLs change, or scheduled jobs trigger at different times.
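The effect of a frozen clock can be illustrated in plain Python. This is not Keploy's implementation, only a sketch of the idea using unittest.mock; RECORDED_AT and is_token_valid are hypothetical.

```python
import time
from unittest.mock import patch

RECORDED_AT = 1_700_000_000.0  # hypothetical timestamp of the original recording


def is_token_valid(issued_at: float, ttl_seconds: float) -> bool:
    """Time-dependent logic: the result changes as the wall clock advances."""
    return time.time() - issued_at < ttl_seconds


# With the clock pinned to the recording time, the outcome is the same
# whether the test is replayed a second later or a year later.
with patch("time.time", return_value=RECORDED_AT):
    valid_during_replay = is_token_valid(issued_at=RECORDED_AT - 30, ttl_seconds=60)
    expired_during_replay = is_token_valid(issued_at=RECORDED_AT - 120, ttl_seconds=60)
```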
AI Noise Detection
Even with time-freezing, some fields are inherently non-deterministic: UUIDs, session tokens, auto-incrementing IDs. AI analyzes multiple recordings, identifies varying fields, and automatically applies flexible matchers. No manual configuration needed.
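The general idea of comparing recordings to find varying fields can be sketched as below. This is a deliberately simple illustration and should not be read as Keploy's actual algorithm: it only looks at top-level fields, and the function names and sample data are assumptions.

```python
from typing import Any, Dict, List, Set


def find_noisy_fields(recordings: List[Dict[str, Any]]) -> Set[str]:
    """Return the top-level keys whose values differ across recordings."""
    keys: Set[str] = set()
    for recording in recordings:
        keys.update(recording.keys())
    noisy = set()
    for key in keys:
        values = {repr(recording.get(key)) for recording in recordings}
        if len(values) > 1:
            noisy.add(key)
    return noisy


def responses_match(expected: Dict[str, Any], actual: Dict[str, Any],
                    noisy: Set[str]) -> bool:
    """Compare two responses while ignoring fields flagged as noisy."""
    def stable(d: Dict[str, Any]) -> Dict[str, Any]:
        return {k: v for k, v in d.items() if k not in noisy}
    return stable(expected) == stable(actual)


# Two hypothetical recordings of the same request.
recordings = [
    {"id": "a1f3", "status": "created", "ts": "2024-01-01T10:00:00Z"},
    {"id": "9c2e", "status": "created", "ts": "2024-01-01T10:00:07Z"},
]
noisy_fields = find_noisy_fields(recordings)
same_response = responses_match(recordings[0], recordings[1], noisy_fields)
```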
Key Takeaway: Keploy does not just detect or tolerate flakiness; it prevents it architecturally. Recorded traffic with mocked dependencies, frozen time, and AI-filtered non-deterministic fields means every test run produces the same result.
How to Detect Flaky Tests
You cannot fix what you cannot see. Detection is the first step.
Multi-Run Detection
Run the entire suite N times (3-5) on the same commit. Any test producing both a pass and a fail is definitively flaky. This is the most reliable method but expensive in CI time, so use it periodically or when introducing new tests.
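Multi-run detection can be sketched for a single test function as follows. The classify helper and the sample tests are hypothetical, and only AssertionError is treated as a failure here; a real runner would handle other exceptions too.

```python
import itertools
from typing import Callable


def classify(test: Callable[[], None], runs: int = 5) -> str:
    """Run `test` several times and report 'pass', 'fail', or 'flaky'."""
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    if outcomes == {"pass", "fail"}:
        return "flaky"
    return outcomes.pop()


def test_stable() -> None:
    assert 1 + 1 == 2


_calls = itertools.count(1)


def test_depends_on_hidden_state() -> None:
    # Simulated flakiness: passes on odd-numbered calls, fails on even ones.
    assert next(_calls) % 2 == 1


stable_result = classify(test_stable)
flaky_result = classify(test_depends_on_hidden_state, runs=4)
```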
Historical Analysis
Track results across CI runs over time. If a test has both outcomes on commits where its code was not modified, it is likely flaky. Most CI systems store history that can be analyzed.
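A simple form of this analysis is sketched below. It uses a stricter proxy than the text: a test is flagged only when it shows both outcomes on the same commit. The record shape (test name, commit SHA, passed) and the sample data are assumptions.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Record = Tuple[str, str, bool]  # (test name, commit SHA, passed)


def find_flaky(history: Iterable[Record]) -> List[str]:
    """Return tests that both passed and failed on at least one commit."""
    outcomes = defaultdict(set)
    for test, commit, passed in history:
        outcomes[(test, commit)].add(passed)
    flaky = {test for (test, _), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


# Hypothetical CI history.
history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", True),
    ("test_checkout", "abc123", False),
    ("test_checkout", "abc123", True),   # same commit, different outcome
    ("test_search", "abc123", False),
    ("test_search", "def456", True),     # different commit: may be a real fix
]
flaky_tests = find_flaky(history)
```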
Quarantine and Retry
Implement retry on failed tests. If a test fails then passes on retry, flag it as flaky. Accumulate logs to build a flakiness score. Tests exceeding 2% flake rate over 30 days are automatically quarantined.
Flakiness Rate Calculation
Calculate: (flaky runs / total runs) x 100. Track weekly with a target of < 0.5%. Break down by test type — E2E tests typically account for 60-80% of flaky failures despite being 5-10% of the suite.
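The formula and the per-type breakdown can be sketched as below. The function names and sample numbers are assumptions chosen for illustration, not measured data.

```python
from collections import Counter
from typing import Dict, List


def flakiness_rate(flaky_runs: int, total_runs: int) -> float:
    """(flaky runs / total runs) x 100, as a percentage."""
    if total_runs == 0:
        return 0.0
    return 100 * flaky_runs / total_runs


def share_by_type(flaky_failures: List[str]) -> Dict[str, float]:
    """Percentage of flaky failures attributed to each test type."""
    counts = Counter(flaky_failures)
    total = sum(counts.values())
    return {test_type: 100 * n / total for test_type, n in counts.items()}


# Hypothetical week: 12 flaky runs out of 2,400 total runs.
weekly_rate = flakiness_rate(12, 2400)

# Hypothetical breakdown of ten flaky failures by test type.
breakdown = share_by_type(["e2e"] * 7 + ["integration"] * 2 + ["unit"])
```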
FAQs
What are flaky tests?
Flaky tests are automated tests that produce inconsistent results, passing and failing on the same code without any changes. A test is considered flaky when it fails non-deterministically, meaning the failure is not caused by a code defect but by environmental factors, timing issues, shared state, or non-deterministic data. Flaky tests erode team confidence in the test suite and slow down CI/CD pipelines.
What causes flaky tests?
The most common causes are: timing and race conditions (tests assume async operations complete within a fixed time window), shared mutable state (tests modify global state that affects other tests), non-deterministic data (timestamps, UUIDs, random values that differ between runs), environment dependencies (network calls, file system state, external services), test order dependency (tests that only pass when run in a specific sequence), and resource contention (database locks, port conflicts, memory pressure).
How do you detect flaky tests?
Detection strategies include: running the test suite multiple times on the same commit and flagging tests with inconsistent results, tracking test results over time in CI and identifying tests with both pass and fail outcomes on unchanged code, using quarantine systems that automatically isolate tests exceeding a flakiness threshold, and implementing retry-with-report mechanisms that distinguish genuine failures from flaky failures.
Why are flaky tests harmful?
Flaky tests cause three measurable harms: they waste developer time (engineers spend 5-15 minutes investigating each flaky failure before realizing it is not a real bug), they erode trust in the test suite (teams start ignoring failures, which means real bugs get merged), and they slow CI/CD pipelines (retries and manual re-runs add 15-30 minutes to merge times). Google has reported that about 1.5% of its test runs produce a flaky result, costing significant engineering resources.
How do you fix timing-related flakiness?
Replace fixed sleeps with explicit waits that poll for a condition (element visible, API returns 200, database row exists). Use retry mechanisms with exponential backoff for operations that depend on eventual consistency. Mock time-dependent functions to return deterministic values. For async operations, use synchronization primitives (semaphores, channels) rather than arbitrary delays.
How do you fix flakiness caused by shared state?
Isolate each test's state by using database transactions that roll back after each test, creating fresh test data fixtures instead of sharing mutable data between tests, running tests in parallel with separate database schemas or containers, and resetting global state (environment variables, singletons, caches) in test setup/teardown hooks.
How does Keploy eliminate flaky tests?
Keploy eliminates the root causes of flakiness through three mechanisms: deterministic replay (recorded traffic is replayed with the exact same inputs every time), time-freezing (the system clock is set to the recording timestamp so time-dependent logic produces identical results), and AI noise detection (statistical analysis identifies non-deterministic fields like timestamps and UUIDs and automatically excludes them from strict assertions).
What is an acceptable flakiness rate?
Industry benchmarks suggest targeting less than 0.5% flakiness rate (percentage of test runs that produce flaky results). Google reports approximately 1.5% flakiness across its test runs and considers it a significant problem. Teams should track flakiness rate as a key metric, quarantine tests exceeding 2% flake rate, and fix or remove tests exceeding 5% flake rate to maintain CI/CD pipeline reliability.