Feature Flags & A/B Testing

Type	Purpose	Examples
Release flag	Hide work-in-progress until ready	"new checkout flow"
Kill switch	Disable a feature in production if broken	"disable image picker on Android 14"
A/B variant	Compare two implementations	"control vs new onboarding"
Config flag	Tweak parameter values remotely	`max_upload_size_mb: 10`

A good flag system supports all four with the same API.
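
One way to read "the same API" is a single read interface that every flag type goes through. The sketch below is an assumption about how that could look, not part of any SDK: `FlagSource`, `InMemoryFlagSource`, `exampleFlags` and the key `image_picker_enabled` are hypothetical names chosen for illustration.

```dart
/// Hypothetical read interface shared by all four flag types.
abstract class FlagSource {
  bool getBool(String key);
  int getInt(String key);
  String getString(String key);
}

/// In-memory implementation, handy for tests or as an offline fallback.
class InMemoryFlagSource implements FlagSource {
  InMemoryFlagSource(Map<String, Object> values) : _values = Map.of(values);

  final Map<String, Object> _values;

  // Unknown keys or mismatched types fall back to "zero" values
  // instead of throwing.
  @override
  bool getBool(String key) {
    final value = _values[key];
    return value is bool ? value : false;
  }

  @override
  int getInt(String key) {
    final value = _values[key];
    return value is int ? value : 0;
  }

  @override
  String getString(String key) {
    final value = _values[key];
    return value is String ? value : '';
  }
}

/// One source serves all four flag types from the table above.
final exampleFlags = InMemoryFlagSource({
  'new_checkout_flow': true, // release flag
  'image_picker_enabled': false, // kill switch (hypothetical key)
  'onboarding_variant': 'variant_a', // A/B variant
  'max_upload_size_mb': 10, // config flag
});
```

Callers only ever see `getBool`, `getInt` and `getString`, so a remote-backed implementation could replace the in-memory one without touching call sites.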

Code in action — Firebase Remote Config + Riverpod

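import 'package:firebase_remote_config/firebase_remote_config.dart';
import 'package:flutter/widgets.dart';
import 'package:flutter_hooks/flutter_hooks.dart';
import 'package:hooks_riverpod/hooks_riverpod.dart';

// Thin wrapper so the rest of the app never talks to Firebase directly.
class FeatureFlagService {
  FeatureFlagService(this._rc);
  final FirebaseRemoteConfig _rc;

  // Call once at startup, before the first screen reads a flag.
  Future<void> init() async {
    // Defaults go in first so first launch and offline launches still work.
    await _rc.setDefaults(const {
      'new_checkout_flow': false,
      'max_upload_size_mb': 10,
      'onboarding_variant': 'control',
    });
    await _rc.setConfigSettings(RemoteConfigSettings(
      fetchTimeout: const Duration(seconds: 10),
      minimumFetchInterval: const Duration(hours: 1),
    ));
    try {
      await _rc.fetchAndActivate();
    } on Exception catch (_) {
      // Fetch failed (offline, throttled): keep cached values or the defaults.
    }
  }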
 
  bool   isEnabled(String key) => _rc.getBool(key);
  int    getInt(String key)    => _rc.getInt(key);
  String getString(String key) => _rc.getString(key);
}
 
// DI: expose the service through Riverpod; await init() at startup before reading flags
final featureFlagsProvider = Provider<FeatureFlagService>((ref) =>
    FeatureFlagService(FirebaseRemoteConfig.instance));

// Release flag — branch on it
class CheckoutScreen extends ConsumerWidget {
  @override
  Widget build(BuildContext context, WidgetRef ref) {
    final flags = ref.watch(featureFlagsProvider);
    return flags.isEnabled('new_checkout_flow')
        ? const NewCheckoutFlow()
        : const LegacyCheckoutFlow();
  }
}
 
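// A/B variant: read the assignment and track exposure.
// useEffect only works in a hook-enabled widget, hence HookConsumerWidget (hooks_riverpod).
class OnboardingScreen extends HookConsumerWidget {
  @override
  Widget build(BuildContext context, WidgetRef ref) {
    final variant = ref.watch(featureFlagsProvider).getString('onboarding_variant');

    // Track exposure once per mount (or when the variant changes), not on every rebuild
    useEffect(() {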
      Observability.track('experiment_viewed', props: {
        'experiment': 'onboarding',
        'variant': variant,
      });
      return null;
    }, [variant]);
 
    return switch (variant) {
      'variant_a' => const OnboardingVariantA(),
      'variant_b' => const OnboardingVariantB(),
      _           => const OnboardingControl(),
    };
  }
}
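
The exposure event above fires each time the screen mounts. If the analysis needs at most one exposure per user and variant, a small deduplicating wrapper is one option. This is a minimal sketch under assumptions: `ExposureTracker` is a hypothetical helper, the `_send` callback stands in for whatever analytics call you use, and persisting the seen set across restarts is left to the caller.

```dart
/// Hypothetical helper: forwards each (experiment, variant) exposure at most once.
class ExposureTracker {
  ExposureTracker(this._send, {Set<String>? alreadySeen})
      : _seen = <String>{...?alreadySeen};

  /// Delivers the event, e.g. a thin wrapper around your analytics SDK.
  final void Function(String event, Map<String, String> props) _send;

  /// Keys already reported. Persist and restore this set to survive restarts.
  final Set<String> _seen;

  /// Returns true if an event was sent, false if it was a duplicate.
  bool trackExposure(String experiment, String variant) {
    final key = '$experiment:$variant';
    // Set.add returns false when the key is already present.
    if (!_seen.add(key)) return false;
    _send('experiment_viewed', {'experiment': experiment, 'variant': variant});
    return true;
  }

  /// Read-only view of the reported keys, for persistence.
  Set<String> get seen => Set.unmodifiable(_seen);
}
```

Calling `trackExposure('onboarding', 'variant_a')` twice sends one event; the second call returns false and sends nothing.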

Best practices

Practice	Why
Set hard-coded defaults BEFORE fetch	First launch / offline still works
Cache last known values on disk	Survives app restarts before remote fetch
Track variant exposure in analytics	Needed for valid A/B analysis
Fetch on app start + periodically	New users get fresh values; existing users pick up changes
Kill switches should fail closed	If config can't load, default to "feature off"
Provider abstraction (`FeatureFlagService`)	Swap Firebase ↔ LaunchDarkly ↔ in-memory for tests
Document each flag's purpose, owner, removal date	Otherwise you accrue zombie flags
Clean up flags after rollouts	Dead flag code = bug surface

A/B testing pitfalls

Mistake	Effect
Re-randomizing assignment on every screen view	Variant flips between views, the same user lands in both groups → results meaningless
Tracking variant by inferring from UI	Drift between assignment and tracking
Comparing groups by primary metric without statistical test	False-positive winners
Running experiments without a stop date	Decisions drift; resources wasted
Not pre-registering the hypothesis	Post-hoc "p-hacking"
Testing too many things at once	Hard to attribute lift

The platform (Firebase A/B Testing, GrowthBook, Statsig) handles assignment + statistics if you let it.
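
To illustrate what stable assignment means when a platform does it for you, here is a minimal sketch of deterministic bucketing. It is not how any of the named platforms implement assignment; `fnv1a32` and `assignVariant` are hypothetical helpers, and the code assumes native Dart (64-bit integers), not Dart compiled to JavaScript.

```dart
import 'dart:convert';

/// 32-bit FNV-1a hash. Unlike String.hashCode, which is not guaranteed to be
/// stable across runs, the result here is fully defined by the algorithm.
int fnv1a32(String input) {
  var hash = 0x811c9dc5; // FNV offset basis
  for (final byte in utf8.encode(input)) {
    hash ^= byte;
    hash = (hash * 0x01000193) & 0xffffffff; // FNV prime, truncated to 32 bits
  }
  return hash;
}

/// Maps a user to one of [variants]. The same inputs always give the same
/// variant, so a user never flips between groups.
String assignVariant(String userId, String experiment, List<String> variants) {
  if (variants.isEmpty) {
    throw ArgumentError.value(variants, 'variants', 'must not be empty');
  }
  // Salting with the experiment name decorrelates buckets across experiments.
  final bucket = fnv1a32('$experiment:$userId') % variants.length;
  return variants[bucket];
}
```

Because the variant is a pure function of the user ID and experiment name, nothing needs to be stored on the device to keep the assignment sticky.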

Common mistakes to avoid

❌ Reading flags from network on every screen build
   Slows UI; runs the risk of network failures.
   ✅ Read from a local FlagService that's already hydrated.

❌ No defaults
   First launch / no network → flags return zero values (false, 0, empty string) → app broken.
   ✅ Register safe defaults with setDefaults before the first fetch.

❌ Treating flags as permanent
   Six months later you have 50 flags, half are zombies.
   ✅ Assign an owner + removal date to every flag.

❌ Branching deeply on flags throughout the codebase
   Three boolean flags = 2 × 2 × 2 = 8 paths to test.
   ✅ Branch at one well-defined seam (the screen, the service).

❌ Forgetting to track variant in analytics
   You can't measure the A/B test without recording who saw what.

❌ A/B testing without sufficient sample size
   N=100 is too small to detect anything but a very large effect. Estimate the required sample size up front.

❌ Mixing release flags with kill switches
   A kill switch should be obvious, not "well it's also a release flag."
   ✅ Distinguish in naming and tooling.
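
For the sample-size point above, a rough estimate can be computed before the experiment starts. The sketch below uses the standard normal-approximation formula for comparing two conversion rates; `sampleSizePerGroup` is a hypothetical helper, and the default z-values assume a two-sided test at alpha = 0.05 with 80% power. Treat the result as a planning number, not a substitute for your platform's statistics.

```dart
import 'dart:math';

/// Approximate users needed per group to detect a change in conversion rate
/// from [baseline] to [target] with a two-proportion z-test.
/// Defaults: zAlpha = 1.96 (two-sided alpha 0.05), zBeta = 0.8416 (80% power).
int sampleSizePerGroup(
  double baseline,
  double target, {
  double zAlpha = 1.96,
  double zBeta = 0.8416,
}) {
  if (baseline == target) {
    throw ArgumentError('baseline and target must differ');
  }
  // Sum of the Bernoulli variances of the two groups.
  final variance = baseline * (1 - baseline) + target * (1 - target);
  final effect = target - baseline;
  return (pow(zAlpha + zBeta, 2) * variance / (effect * effect)).ceil();
}
```

Under these assumptions, `sampleSizePerGroup(0.10, 0.12)` comes out in the high thousands per group, far beyond N=100.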

Interview follow-ups

How do you ensure A/B test results are attributable? Fire an `experiment_viewed` event with the experiment name and variant exactly once per user when they first see the variant. Tools like Firebase A/B Testing handle this automatically when you read flags via their SDK. Without exposure tracking, you can't analyse outcomes: you only know "this flag exists."
What happens if Remote Config fetch fails? The fetch call can throw, so catch it. Reads then fall back to the last cached values; if there are none, to the values you passed to `setDefaults`. Your code should never throw on missing flags: `isEnabled('unknown_key')` returns false. That's why hard-coded defaults are non-negotiable.
How do you clean up dead flags? Track ownership and intended removal date in code or a flag registry. Set CI rules that warn on flags older than N days. Periodically audit: "is anyone reading this flag?" → delete the flag + the dead branch. Leaving dead flags in place usually costs more than the engineering time needed to remove them.
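
The registry idea from the last answer could be sketched as plain data plus an audit function that a CI job runs. Everything here is hypothetical (`FlagDefinition`, `overdueFlags`, and the field names); it only shows the shape of the check, not a particular tool.

```dart
/// Hypothetical registry entry: each flag records why it exists and when it
/// is supposed to be removed.
class FlagDefinition {
  const FlagDefinition({
    required this.key,
    required this.owner,
    required this.purpose,
    required this.removeBy,
  });

  final String key;
  final String owner;
  final String purpose;
  final DateTime removeBy;
}

/// Returns the flags whose removal date has passed as of [now], oldest first.
/// A CI step could print these as warnings or fail the build.
List<FlagDefinition> overdueFlags(List<FlagDefinition> registry, DateTime now) {
  return registry.where((flag) => flag.removeBy.isBefore(now)).toList()
    ..sort((a, b) => a.removeBy.compareTo(b.removeBy));
}
```

Passing `now` in as a parameter, instead of calling `DateTime.now()` inside, keeps the audit deterministic and easy to test.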
