You Don't Need Netflix's Infrastructure Team to Do Chaos Testing
In 2011, Netflix published a blog post about Chaos Monkey — software that randomly terminated EC2 instances in production to force engineers to build resilient systems. It was brilliant, and it spawned an entire discipline called chaos engineering.
It also accidentally convinced most developers that chaos testing requires Netflix's infrastructure, a dedicated reliability team, and the willingness to break production on purpose.
None of that is true for the problem most development teams actually face.
The Problem Most Teams Actually Have
We shipped a checkout feature in October. Everything worked perfectly in staging against our mock APIs and our internal payment service test environment. We deployed on a Tuesday afternoon.
Wednesday morning: three support tickets. "Checkout page shows a blank white screen." About 8% of checkout attempts were failing.
The root cause: our payment processor's EU datacenter was having intermittent issues. About 8% of requests to their API were returning 503 Service Unavailable. Our frontend received the 503, threw an unhandled promise rejection, and re-rendered as a blank screen.
This is a textbook resilience failure. Not a code bug. Not a logic error. A missing error state for a scenario we'd never simulated.
What We Should Have Done
The fix took 45 minutes: add a try/catch around the payment initiation call, display an error message, offer a retry button. Simple. We just never wrote it because our test data always succeeded.
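The shape of that fix is simple enough to sketch. This is a hedged illustration, not our actual code: `initiatePayment` and the injected `post` client are hypothetical names, and a real implementation would feed the error result into a component that renders the message and a retry button.

```typescript
type PaymentResult =
  | { ok: true; redirectUrl: string }
  | { ok: false; message: string };

// Wrap the payment initiation call so a 503 becomes a renderable
// error state instead of an unhandled promise rejection.
async function initiatePayment(
  post: (url: string) => Promise<{ redirectUrl: string }>
): Promise<PaymentResult> {
  try {
    const res = await post("/api/payments");
    return { ok: true, redirectUrl: res.redirectUrl };
  } catch {
    // The upstream 503 lands here; the UI shows this message
    // plus a retry button instead of a blank screen.
    return { ok: false, message: "Payment service is unavailable. Please retry." };
  }
}
```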
Chaos testing is how you force yourself to build these states before they hit production users.
Chaos Testing Without Netflix
The practical version of chaos testing for a normal development team looks like this: inject error responses into your mock API at a configurable rate, build your frontend against that mock, and find the broken states before users do.
moqapi.dev has a chaos panel built in. For any mock API, you can configure:
- Error codes to inject (500, 503, 429, 404, 422 — whatever is realistic for your upstream)
- Injection rate (percentage of requests that return an error instead of a 200)
- Latency injection (add artificial delay to simulate slow responses)
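If your mock layer doesn't have a chaos panel, you can approximate those three knobs in a few lines. This is a minimal sketch, not moqapi's implementation; `ChaosConfig` and `withChaos` are names I made up for illustration.

```typescript
interface ChaosConfig {
  errorCodes: number[]; // e.g. [500, 503, 429]
  errorRate: number;    // 0..1, the fraction of requests that fail
  latencyMs: number;    // artificial delay added to every response
}

// Wrap any mock handler so it delays, then either throws an
// injected error or returns the real mock response.
function withChaos<T>(
  handler: () => T,
  config: ChaosConfig,
  random: () => number = Math.random
): () => Promise<T> {
  return async () => {
    await new Promise((r) => setTimeout(r, config.latencyMs));
    if (random() < config.errorRate) {
      const code = config.errorCodes[
        Math.floor(random() * config.errorCodes.length)
      ];
      throw Object.assign(new Error(`Injected ${code}`), { status: code });
    }
    return handler();
  };
}
```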
Here's what I do now at the start of any feature that touches an external API:
1. Build the happy path against the mock (no chaos). Get it working.
2. Enable 20% error injection (mix of 500, 503, 429).
3. Use the feature manually for 5 minutes. Every interaction has a 1-in-5 chance of failing.
4. Fix every blank screen, unhandled error, and missing loading state I find.
5. After fixes: enable 50% error rate and repeat until I'm satisfied.
Steps 3 and 4 take 20–30 minutes. The blank-screen bug from October would have taken 10 minutes to find and 45 minutes to fix, during development instead of in production at 2 AM.
The Error States You'll Find
When you run chaos testing against a real frontend for the first time, here's what you typically discover:
- Blank screens from unhandled promise rejections or uncaught render errors.
- Infinite spinners where the loading state shows forever because the error branch wasn't implemented.
- Missing retry logic — user hits an error with no way to try again.
- Cascading UI failures — one failed API call crashes a parent component that has nothing to do with that endpoint.
- Silent data staleness — the fetch failed but the UI shows the last successful data without indicating it might be stale.
Finding all five of these in 30 minutes of chaos testing is a better return on investment than most sprint activities.
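Two of those, infinite spinners and missing retry logic, often fall to the same fix: a helper that retries only transient failures with backoff. Here's a sketch; the transient-status list (429, 500, 503) is an assumption you'd tune for your upstream.

```typescript
// Retry a flaky call with exponential backoff. Non-transient
// errors (e.g. 404, 422) are rethrown immediately so the UI can
// show a real error state instead of spinning forever.
async function fetchWithRetry<T>(
  call: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      const status = (err as { status?: number }).status;
      if (![429, 500, 503].includes(status ?? 0)) throw err; // not transient
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```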
What Makes a Resilient Frontend
After running chaos tests on several products, I've settled on a pattern that works:
// Every data-fetching component follows this structure:
if (isLoading) return <Skeleton />
if (error) return <ErrorState message={error.message} onRetry={refetch} />
if (!data || data.length === 0) return <EmptyState />
return <ActualContent data={data} />
This is the Four States pattern: loading, error, empty, content. Every component that fetches data should handle all four. Chaos testing forces you to implement them because it makes errors a regular part of your development workflow instead of a rare surprise.
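One way to make the four states hard to skip is to model them as a discriminated union, so the type checker flags any renderer that forgets a branch. A sketch under that assumption; `ViewState` and `toViewState` are illustrative names, not a library API.

```typescript
// The four states as a discriminated union: a switch over `kind`
// that misses a case fails exhaustiveness checking.
type ViewState<T> =
  | { kind: "loading" }
  | { kind: "error"; message: string }
  | { kind: "empty" }
  | { kind: "content"; data: T[] };

// Map raw fetch results to exactly one of the four states,
// in the same priority order as the component above.
function toViewState<T>(args: {
  isLoading: boolean;
  error?: { message: string };
  data?: T[];
}): ViewState<T> {
  if (args.isLoading) return { kind: "loading" };
  if (args.error) return { kind: "error", message: args.error.message };
  if (!args.data || args.data.length === 0) return { kind: "empty" };
  return { kind: "content", data: args.data };
}
```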
The chaos testing docs cover the full configuration options, and the rate limiting guide is useful context for why 429 errors are the most common real-world chaos scenario.
You don't need to randomly terminate EC2 instances. You need to inject a 503 into your checkout flow and see what happens. That's it. That's the part that matters for most teams.
About the Author
Founder and sole developer of moqapi.dev. Full-stack engineer with deep experience in API platforms, serverless runtimes, and developer tooling. Built moqapi to solve the mock data and deployment friction she experienced firsthand building production APIs.
Related Articles
What Is Mock Data and Why It Matters for Modern Development
Understand mock data, its role in frontend and backend testing, and how moqapi.dev automates the creation of realistic test payloads for every API endpoint.
Building Serverless APIs: 10 Best Practices You Should Follow
From cold-start optimisation to function composition, learn battle-tested patterns for shipping production-grade serverless APIs at scale.
API Mocking vs Stubbing vs Faking: The Developer's Definitive Guide
These three terms are used interchangeably but mean very different things. Understand when to use each technique and how they affect your test quality.