Feature Flags & A/B Testing
| Type | Purpose | Examples |
|---|---|---|
| Release flag | Hide work-in-progress until ready | "new checkout flow" |
| Kill switch | Disable a feature in production if broken | "disable image picker on Android 14" |
| A/B variant | Compare two implementations | "control vs new onboarding" |
| Config flag | Tweak parameter values remotely | max_upload_size_mb: 10 |
A good flag system supports all four with the same API.
Code in action — Firebase Remote Config + Riverpod

    import 'package:firebase_remote_config/firebase_remote_config.dart';
    import 'package:flutter/widgets.dart';
    import 'package:flutter_hooks/flutter_hooks.dart';
    import 'package:hooks_riverpod/hooks_riverpod.dart';

    class FeatureFlagService {
      FeatureFlagService(this._rc);

      final FirebaseRemoteConfig _rc;

      Future<void> init() async {
        // In-app defaults are set first so reads work before any fetch succeeds.
        await _rc.setDefaults(const {
          'new_checkout_flow': false,
          'max_upload_size_mb': 10,
          'onboarding_variant': 'control',
        });
        await _rc.setConfigSettings(RemoteConfigSettings(
          fetchTimeout: const Duration(seconds: 10),
          minimumFetchInterval: const Duration(hours: 1),
        ));
        try {
          await _rc.fetchAndActivate();
        } on Exception {
          // Fetch failed (offline, timeout): keep cached values or the defaults.
        }
      }

      bool isEnabled(String key) => _rc.getBool(key);
      int getInt(String key) => _rc.getInt(key);
      String getString(String key) => _rc.getString(key);
    }

    // DI
    final featureFlagsProvider = Provider<FeatureFlagService>(
      (ref) => FeatureFlagService(FirebaseRemoteConfig.instance),
    );

    // Release flag: branch on it
    class CheckoutScreen extends ConsumerWidget {
      @override
      Widget build(BuildContext context, WidgetRef ref) {
        final flags = ref.watch(featureFlagsProvider);
        return flags.isEnabled('new_checkout_flow')
            ? const NewCheckoutFlow()
            : const LegacyCheckoutFlow();
      }
    }

    // A/B variant: assign + track
    // useEffect is a hook, so this must be a HookConsumerWidget,
    // not a plain ConsumerWidget.
    class OnboardingScreen extends HookConsumerWidget {
      @override
      Widget build(BuildContext context, WidgetRef ref) {
        final variant =
            ref.watch(featureFlagsProvider).getString('onboarding_variant');

        // Track exposure once per mount (or variant change), not every rebuild.
        // Observability is the app's own analytics wrapper (not shown).
        useEffect(() {
          Observability.track('experiment_viewed', props: {
            'experiment': 'onboarding',
            'variant': variant,
          });
          return null;
        }, [variant]);

        return switch (variant) {
          'variant_a' => const OnboardingVariantA(),
          'variant_b' => const OnboardingVariantB(),
          _ => const OnboardingControl(),
        };
      }
    }

Best practices
| Practice | Why |
|---|---|
| Set hard-coded defaults BEFORE fetch | First launch / offline still works |
| Cache last known values on disk | Survives app restarts before remote fetch |
| Track variant exposure in analytics | Needed for valid A/B analysis |
| Fetch on app start + periodically | New users get fresh; existing get updates |
| Kill switches should fail closed | If config can't load, default to "feature off" |
| Provider abstraction (FeatureFlagService) | Swap Firebase ↔ LaunchDarkly ↔ in-memory for tests |
| Document each flag's purpose, owner, removal date | Otherwise you accrue zombie flags |
| Clean up flags after rollouts | Dead flag code = bug surface |
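
The "provider abstraction" and "fail closed" rows can be illustrated without any SDK. The following is a minimal sketch in plain Dart; `FlagSource`, `InMemoryFlagSource`, and `Flags` are hypothetical names introduced here, not part of Firebase or LaunchDarkly.

```dart
/// Hypothetical abstraction: anything that can answer flag lookups.
abstract class FlagSource {
  /// Returns null when the source has no value for [key].
  Object? read(String key);
}

/// In-memory source for unit tests; no network or SDK involved.
class InMemoryFlagSource implements FlagSource {
  InMemoryFlagSource([Map<String, Object> values = const {}])
      : _values = Map.of(values);

  final Map<String, Object> _values;

  @override
  Object? read(String key) => _values[key];
}

class Flags {
  Flags(this._source, {required Map<String, Object> defaults})
      : _defaults = Map.unmodifiable(defaults);

  final FlagSource _source;
  final Map<String, Object> _defaults;

  /// Remote bool if present, else the hard-coded default, else false.
  /// Unknown or unloadable flags therefore resolve to "feature off".
  bool isEnabled(String key) {
    final remote = _source.read(key);
    if (remote is bool) return remote;
    final fallback = _defaults[key];
    return fallback is bool ? fallback : false;
  }
}
```

A production `FlagSource` would wrap the real SDK behind the same interface, so tests can pass an `InMemoryFlagSource` without touching the network.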
A/B testing pitfalls
| Mistake | Effect |
|---|---|
| Re-sampling the assignment on every screen view | Variant flips within a session → results meaningless |
| Tracking variant by inferring from UI | Drift between assignment and tracking |
| Comparing groups on the primary metric without a statistical test | False-positive winners |
| Running experiments without a stop date | Decisions drift; resources wasted |
| Not pre-registering the hypothesis | Post-hoc "p-hacking" |
| Testing too many things at once | Hard to attribute lift |
The platform (Firebase A/B Testing, GrowthBook, Statsig) handles assignment + statistics if you let it.
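
To see why assignment must be sticky, here is a rough sketch of deterministic bucketing: hashing the experiment name and user ID means the same user always lands in the same variant. This is not how any of the platforms above necessarily implement it. `fnv1a32` and `assignVariant` are illustrative names, and the integer arithmetic assumes a native Dart target (64-bit ints), not the web.

```dart
import 'dart:convert';

/// 32-bit FNV-1a hash; stable across runs, unlike relying on hashCode.
int fnv1a32(String input) {
  var hash = 0x811c9dc5; // FNV offset basis
  for (final byte in utf8.encode(input)) {
    hash ^= byte;
    hash = (hash * 0x01000193) & 0xFFFFFFFF; // FNV prime, kept to 32 bits
  }
  return hash;
}

/// Maps a user to one of [variants]; same inputs always give the same result.
String assignVariant({
  required String userId,
  required String experiment,
  required List<String> variants,
}) {
  final bucket = fnv1a32('$experiment:$userId') % variants.length;
  return variants[bucket];
}
```

Including the experiment name in the hashed string keeps a user's bucket in one experiment independent of their bucket in another.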
Common mistakes to avoid
❌ Reading flags from network on every screen build
Slows the UI and makes every build depend on the network being available.
✅ Read from a local FlagService that's already hydrated.
❌ No defaults
First launch / no network → flags return zero values → app broken.
❌ Treating flags as permanent
Six months later you have 50 flags, half are zombies.
✅ Assign an owner + removal date to every flag.
❌ Branching deeply on flags throughout the codebase
Two flags × two flags × two flags = 8 paths to test.
✅ Branch at one well-defined seam (the screen, the service).
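
One way to keep the branch at a single seam is to resolve the flag once into an implementation and hand that to the rest of the code. A minimal sketch, with hypothetical `CheckoutFlow` types standing in for whatever the feature really is:

```dart
/// Hypothetical checkout abstraction; callers never see the flag.
abstract class CheckoutFlow {
  String get name;
}

class LegacyCheckout implements CheckoutFlow {
  @override
  String get name => 'legacy';
}

class NewCheckout implements CheckoutFlow {
  @override
  String get name => 'new';
}

/// The only place that reads 'new_checkout_flow'.
/// [isEnabled] is any flag lookup, e.g. FeatureFlagService.isEnabled.
CheckoutFlow resolveCheckout(bool Function(String key) isEnabled) =>
    isEnabled('new_checkout_flow') ? NewCheckout() : LegacyCheckout();
```

When the rollout finishes, removing the flag means deleting `resolveCheckout` and `LegacyCheckout`, with no other call sites to hunt down.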
❌ Forgetting to track variant in analytics
You can't measure the A/B test without recording who saw what.
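
A small guard can make "record who saw what" hard to get wrong. The sketch below deduplicates exposure events in memory; `ExposureTracker` and its callback are hypothetical, and a real app would likely persist the seen set so the event stays once-per-user across launches.

```dart
/// Sends each (experiment, variant) exposure at most once per tracker instance.
class ExposureTracker {
  ExposureTracker(this._send);

  /// Hypothetical analytics callback, e.g. a thin wrapper around your SDK.
  final void Function(String event, Map<String, String> props) _send;
  final Set<String> _seen = {};

  /// Returns true if an event was sent, false if it was a duplicate.
  bool trackExposure(String experiment, String variant) {
    // Set.add returns false when the element was already present.
    final isNew = _seen.add('$experiment:$variant');
    if (isNew) {
      _send('experiment_viewed', {
        'experiment': experiment,
        'variant': variant,
      });
    }
    return isNew;
  }
}
```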
❌ A/B testing without sufficient sample size
N=100 is rarely enough to detect a realistic effect. Estimate the required sample size up front.
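
As a rough guide to what "up front" means, the sketch below applies the standard normal-approximation formula for comparing two conversion rates. The function name and defaults are assumptions chosen for illustration (two-sided alpha = 0.05, 80% power); an experimentation platform's own calculator should take precedence.

```dart
/// Approximate users needed per variant for a two-sided two-proportion test.
/// Defaults assume alpha = 0.05 (z = 1.96) and 80% power (z = 0.8416).
int sampleSizePerVariant({
  required double baselineRate,
  required double expectedRate,
  double zAlpha = 1.96,
  double zBeta = 0.8416,
}) {
  final delta = expectedRate - baselineRate;
  if (delta == 0) {
    throw ArgumentError('expectedRate must differ from baselineRate');
  }
  // Sum of the Bernoulli variances of the two groups.
  final variance = baselineRate * (1 - baselineRate) +
      expectedRate * (1 - expectedRate);
  final z = zAlpha + zBeta;
  return (z * z * variance / (delta * delta)).ceil();
}
```

Under these assumptions, detecting a lift from a 10% to a 12% conversion rate needs roughly 3,800 users per variant, far more than 100.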
❌ Mixing release flags with kill switches
A kill switch should be obvious, not "well it's also a release flag."
✅ Distinguish in naming and tooling.
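
One lightweight way to distinguish flag types in naming is a key prefix per type. The prefixes below are a hypothetical convention, not something Firebase enforces:

```dart
/// The four flag types from the table at the top of this article.
enum FlagKind { release, killSwitch, experiment, config }

/// Hypothetical naming convention: the key prefix states the kind.
const Map<String, FlagKind> prefixToKind = {
  'release_': FlagKind.release,
  'kill_': FlagKind.killSwitch,
  'exp_': FlagKind.experiment,
  'cfg_': FlagKind.config,
};

/// Returns the kind encoded in [key], or null if it matches no known prefix.
FlagKind? kindOf(String key) {
  for (final entry in prefixToKind.entries) {
    if (key.startsWith(entry.key)) return entry.value;
  }
  return null;
}
```

A lint or unit test can then reject keys for which `kindOf` returns null, so every new flag has to declare what it is.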
Interview follow-ups
- What's the difference between a feature flag, an A/B test, and a kill switch? They're all variants of remote config. A feature flag hides incomplete work until ready (boolean). An A/B test randomly assigns variants and measures impact (group label). A kill switch disables a feature remotely without a release (boolean, but with urgency). The implementation is similar; the lifecycle and ownership differ.
- How do you ensure A/B test results are attributable? Fire an `experiment_viewed` event with `experiment_name` + `variant` exactly once per user when they first see the variant. Tools like Firebase A/B Testing handle this automatically when you read flags via their SDK. Without exposure tracking, you can't analyse outcomes; you only know "this flag exists."
- What happens if Remote Config fetch fails? Firebase falls back to the last cached values; if there are none, to your `setDefaults`. Your code should never throw on missing flags: `isEnabled('unknown_key')` returns `false`. That's why hard-coded defaults are non-negotiable.
- How do you clean up dead flags? Track ownership and intended removal date in code or a flag registry. Set CI rules that warn on flags older than N days. Periodically audit: "is anyone reading this flag?" → delete the flag + the dead branch. Skipping cleanup ends up costing more than the engineering time it took to add the flags in the first place.
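
The registry-plus-CI idea from the last answer could look roughly like this. `FlagDefinition` and `overdueFlags` are hypothetical; the point is only that ownership and removal dates live next to the flag key, where a script can check them.

```dart
/// Hypothetical registry entry; field names are illustrative.
class FlagDefinition {
  FlagDefinition({
    required this.key,
    required this.owner,
    required this.purpose,
    required this.removeBy,
  });

  final String key;
  final String owner;
  final String purpose;
  final DateTime removeBy;
}

/// Returns the flags whose removal date has passed as of [now].
List<FlagDefinition> overdueFlags(
  List<FlagDefinition> registry,
  DateTime now,
) =>
    registry.where((flag) => flag.removeBy.isBefore(now)).toList();
```

A CI step could call `overdueFlags(registry, DateTime.now())` and print a warning, or fail the build, when the list is not empty.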