Golden Testing
The biggest pitfall with golden tests is that renders differ across platforms (macOS vs. Linux CI, font versions, OS-level text rendering), so you must generate and compare them in one fixed environment (a Docker container or a single designated platform) or use the golden_toolkit package to load fonts consistently.
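A fixed environment can also be enforced from the test itself. The following is a minimal sketch, assuming Linux is the platform on which the golden files were generated; the `Card` widget and the file name `goldens/card.png` are illustrative placeholders, not part of the original example.

```dart
import 'dart:io' show Platform;

import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets(
    'Card golden (compared on Linux only)',
    (tester) async {
      await tester.pumpWidget(const MaterialApp(
        home: Scaffold(body: Card(child: Text('Hello'))),
      ));
      await expectLater(
        find.byType(Card),
        matchesGoldenFile('goldens/card.png'),
      );
    },
    // Assumption: goldens were generated on Linux, so other hosts skip
    // the comparison instead of failing on rendering differences.
    skip: !Platform.isLinux,
  );
}
```

The trade-off is that developers on other platforms get no local signal from this test; they rely on CI or on running the tests inside the same container.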
| Goldens catch | Goldens miss |
|---|---|
| Layout shifts (1px movement of a button) | Behavior changes (button does the wrong thing) |
| Color regressions (#1A73E8 vs #1B75EA) | Logic bugs |
| Font / spacing changes | Async state errors |
| Theme migrations breaking specific screens | Anything that doesn't render visually |
They complement, not replace, regular widget tests. Think of them as snapshot tests for UI.
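For contrast, here is a sketch of the kind of behavior check that a golden cannot express and a regular widget test handles directly. The `Save` button and the `saved` flag are made up for illustration.

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets('tapping Save invokes the callback', (tester) async {
    var saved = false; // hypothetical state touched by the button

    await tester.pumpWidget(MaterialApp(
      home: Scaffold(
        body: ElevatedButton(
          onPressed: () => saved = true,
          child: const Text('Save'),
        ),
      ),
    ));

    // A golden would look identical whether or not the callback is wired
    // correctly; only an interaction test can tell the difference.
    await tester.tap(find.text('Save'));
    await tester.pump();

    expect(saved, isTrue);
  });
}
```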
Code in action
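```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// LoginScreen is the app's own widget; import it from wherever it lives.

void main() {
  testWidgets('LoginScreen renders correctly', (tester) async {
    await tester.pumpWidget(const MaterialApp(home: LoginScreen()));
    await tester.pumpAndSettle(); // wait for pending frames and animations

    // Compare the rendered screen against the stored PNG.
    await expectLater(
      find.byType(LoginScreen),
      matchesGoldenFile('goldens/login_screen.png'),
    );
  });

  testWidgets('Button states', (tester) async {
    // Render the enabled and disabled variants side by side in one image.
    await tester.pumpWidget(MaterialApp(
      home: Scaffold(body: Column(children: [
        ElevatedButton(onPressed: () {}, child: const Text('Enabled')),
        const ElevatedButton(onPressed: null, child: Text('Disabled')),
      ])),
    ));
    await expectLater(
      find.byType(Scaffold),
      matchesGoldenFile('goldens/button_states.png'),
    );
  });
}
```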
```shell
# Generate/update golden files (run once after intentional changes)
flutter test --update-goldens

# Compare against saved goldens (CI default)
flutter test
```
The golden_toolkit pattern (recommended for teams)
golden_toolkit (or its successor packages) handles font loading for tests, multi-device captures (phone + tablet, light + dark), and a single API for theme/locale matrices. Without consistent font loading and a consistent host platform, CI goldens tend to diverge from goldens generated on a laptop.
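```dart
import 'package:flutter/material.dart';
import 'package:golden_toolkit/golden_toolkit.dart';

// LoginScreen and appTheme come from the app under test.

void main() {
  // This test captures a single variant: phone size, light theme.
  testGoldens('LoginScreen, phone, light theme', (tester) async {
    await loadAppFonts(); // load the app's real fonts for the test run
    await tester.pumpWidgetBuilder(
      const LoginScreen(),
      wrapper: materialAppWrapper(theme: appTheme),
      surfaceSize: const Size(375, 812), // phone-sized surface, logical pixels
    );
    await screenMatchesGolden(tester, 'login_screen_phone_light');
  });
}
```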
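The multi-device capture mentioned above can be sketched with golden_toolkit's `multiScreenGolden`. This is an illustration under assumptions: `_DemoLoginScreen` is a stand-in for the app's real `LoginScreen`, and `ThemeData.light()` stands in for the app theme.

```dart
import 'package:flutter/material.dart';
import 'package:golden_toolkit/golden_toolkit.dart';

// Hypothetical stand-in for the app's real LoginScreen.
class _DemoLoginScreen extends StatelessWidget {
  const _DemoLoginScreen();

  @override
  Widget build(BuildContext context) {
    return const Scaffold(body: Center(child: Text('Login')));
  }
}

void main() {
  testGoldens('Login screen on phone and tablet', (tester) async {
    await loadAppFonts();
    await tester.pumpWidgetBuilder(
      const _DemoLoginScreen(),
      // Assumption: a plain light theme instead of the app's own theme.
      wrapper: materialAppWrapper(theme: ThemeData.light()),
    );

    // One golden image is written per device in the list.
    await multiScreenGolden(
      tester,
      'login_screen',
      devices: const [Device.phone, Device.tabletPortrait],
    );
  });
}
```

Each device in the list gets its own image, so a failure points at the specific form factor that changed.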
Strategy — keeping goldens maintainable
| Practice | Why |
|---|---|
| One golden per component variant | When it breaks you know what changed |
| Run on a fixed platform in CI (Linux container, pinned font version) | Cross-platform diffs = noise |
| Commit the PNGs to git (yes, binary files) | Reviewers see the diff in the PR |
| Use a CI step that uploads diffs as artifacts on failure | Reviewers can see expected vs actual without running locally |
| Pair with a visual review tool (Chromatic-like, Percy) for complex apps | Easier diff review at scale |
| Regenerate intentionally with --update-goldens | Don't let "just rerun with --update-goldens" become muscle memory |
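The "one golden per component variant" practice can be sketched with plain flutter_test by looping over a map of variants. The variant names and file names below are illustrative assumptions.

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  // Hypothetical variants of one component; each gets its own golden file.
  final variants = <String, Widget>{
    'enabled': ElevatedButton(onPressed: () {}, child: const Text('Save')),
    'disabled': const ElevatedButton(onPressed: null, child: Text('Save')),
  };

  for (final entry in variants.entries) {
    testWidgets('ElevatedButton golden: ${entry.key}', (tester) async {
      await tester.pumpWidget(MaterialApp(
        home: Scaffold(
          body: Center(
            // matchesGoldenFile captures the nearest RepaintBoundary
            // ancestor of the matched widget, so wrapping the component
            // keeps the image cropped to the component itself.
            child: RepaintBoundary(child: entry.value),
          ),
        ),
      ));

      await expectLater(
        find.byType(ElevatedButton),
        matchesGoldenFile('goldens/button_${entry.key}.png'),
      );
    });
  }
}
```

When the disabled style regresses, only `button_disabled.png` fails, which narrows the search immediately.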
Common mistakes to avoid
❌ Generating goldens on macOS, running CI on Linux
Antialiasing differs → every test fails on CI day 1.
✅ Either pick one platform and stick to it, or use golden_toolkit + loadAppFonts.
❌ One huge golden of an entire screen
When it breaks, "something changed" — you can't tell what.
✅ Smaller goldens per component variant.
❌ Not committing goldens to version control
PR reviewers can't see expected images. Tests pass locally, fail on a new contributor's machine.
✅ Commit them. Yes, even though they're binary.
❌ Updating goldens with --update-goldens without reviewing the diff
You just shipped a visual regression and erased the evidence.
✅ Open the diff in a viewer; require code review on golden updates.
❌ Trying to make goldens cover async / loading states
Loading spinners animate → flaky goldens.
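✅ Pump the widget to a known frame (tester.pump with a fixed duration instead of pumpAndSettle, which never settles while an animation repeats), or render a static variant of the loading state.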
❌ Pixel-perfect goldens for every screen
Maintenance cost > value.
✅ Pick critical components: design system widgets, typography, theme application, complex layouts.
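For the loading-state mistake above, one possible approach is to advance the test clock by a fixed amount and capture that single frame. This is a sketch; the 500 ms offset and the file name are arbitrary choices for illustration.

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets('loading spinner at a fixed frame', (tester) async {
    await tester.pumpWidget(const MaterialApp(
      home: Scaffold(body: Center(child: CircularProgressIndicator())),
    ));

    // Do not call pumpAndSettle here: an indeterminate spinner animates
    // forever, so the call would time out. Advance the test clock by a
    // fixed duration and render exactly one frame at that point.
    await tester.pump(const Duration(milliseconds: 500));

    await expectLater(
      find.byType(Scaffold),
      matchesGoldenFile('goldens/loading_500ms.png'),
    );
  });
}
```

An alternative with the same effect is to render a determinate indicator, for example `CircularProgressIndicator(value: 0.5)`, which does not animate on its own.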
Interview follow-ups
- How do you handle golden tests when CI and dev machines differ? Pin the rendering environment. Options: (1) a Linux Docker container that both CI and devs use, (2) golden_toolkit with loadAppFonts() so fonts are loaded the same way everywhere, (3) tolerate small differences via a custom goldenFileComparator with a fuzzy diff. The last is a slippery slope; it is better to fix the environment.
- When would you choose a golden test over a widget test? For design-system components, theme migrations, and complex layouts where "the right rendering" is hard to assert procedurally. For behavior (tap → state changes), widget tests are clearer. Many apps run both: behavior tests via widget tests, visual regression via goldens for shared components.
- What's a goldenFileComparator and when would you customize it? It's the strategy Flutter uses to compare candidate vs golden images. The default is an exact pixel match. You can subclass it to tolerate small differences (e.g., a small fraction of pixels that differ because of antialiasing). Use this sparingly, to handle minor cross-platform rendering quirks.
- How do you stop goldens from becoming maintenance debt? Treat them as design specs: small, focused, owned by the team that owns the component. Auto-publish diffs as PR comments so reviewers can see what changed. Refuse goldens that try to capture too much.
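A custom comparator with a tolerance might look like the sketch below. The class name `TolerantComparator`, the 0.1% threshold, and the placeholder file name are assumptions chosen for illustration; the file is assumed to live at `test/flutter_test_config.dart` so that `flutter test` picks up `testExecutable` for the tests in that directory.

```dart
// test/flutter_test_config.dart
import 'dart:async';
import 'dart:typed_data';

import 'package:flutter/foundation.dart';
import 'package:flutter_test/flutter_test.dart';

/// Hypothetical comparator that accepts a small fraction of differing pixels.
class TolerantComparator extends LocalFileComparator {
  TolerantComparator(super.testFile, {required this.tolerance});

  /// Maximum accepted fraction of differing pixels (0.0 to 1.0).
  final double tolerance;

  @override
  Future<bool> compare(Uint8List imageBytes, Uri golden) async {
    final ComparisonResult result = await GoldenFileComparator.compareLists(
      imageBytes,
      await getGoldenBytes(golden),
    );
    if (result.passed || result.diffPercent <= tolerance) {
      return true;
    }
    // Same failure path as the default comparator: write diff images
    // and report the mismatch.
    final String error = await generateFailureOutput(result, golden, basedir);
    throw FlutterError(error);
  }
}

Future<void> testExecutable(FutureOr<void> Function() testMain) async {
  final GoldenFileComparator current = goldenFileComparator;
  if (current is LocalFileComparator) {
    // Reuse the directory of the running test so golden paths still
    // resolve relative to it; the file name itself is only a placeholder.
    goldenFileComparator = TolerantComparator(
      current.basedir.resolve('comparator_placeholder.dart'),
      tolerance: 0.001, // assumption: accept up to 0.1% differing pixels
    );
  }
  await testMain();
}
```

Keeping the threshold very small preserves most of the value of the golden; a large tolerance would let real regressions through, which is the slippery slope described above.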