Module 18: Failure Is the Default (Error & Resilience)
Chapter 18 • Advanced
The Real Problem
Scenario: It's 2 AM. A deployment of the Products app fails and the app goes down while users are trying to check out.
The Question: What happens to those users?
This is where most courses chicken out. Don't.
In production, failure is not an exception. Failure is the default.
The Production Reality
Failure Modes
- Remote Unavailable - Products app is down
- Partial Failure - Products fails, Checkout works
- Network Failure - Slow network, timeouts
- Version Mismatch - Products updated, Checkout hasn't
- Load Failure - remoteEntry.js fails to load
The Question: How do we handle each failure mode gracefully?
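One way to keep these five modes visible in code is to model them as a union type and map each one to a user-facing response. The sketch below is only an illustration: `FailureMode`, `FailureResponse`, `describeFailure`, and the messages are hypothetical names chosen here, not part of Angular or Module Federation.

```typescript
// Hypothetical names throughout: a sketch, not a library API.
type FailureMode =
  | "remote-unavailable"
  | "partial-failure"
  | "network-failure"
  | "version-mismatch"
  | "load-failure";

interface FailureResponse {
  userMessage: string;
  retryable: boolean; // whether offering a Retry button makes sense
}

function describeFailure(mode: FailureMode): FailureResponse {
  switch (mode) {
    case "remote-unavailable":
      return { userMessage: "This section is temporarily unavailable.", retryable: true };
    case "partial-failure":
      return { userMessage: "Some content could not be loaded.", retryable: true };
    case "network-failure":
      return { userMessage: "The network is slow. Please try again.", retryable: true };
    case "version-mismatch":
      // Retrying the same request will not help; the page needs a refresh.
      return { userMessage: "A new version is available. Please refresh.", retryable: false };
    case "load-failure":
      return { userMessage: "This section failed to load.", retryable: true };
  }
}
```

Because the switch is exhaustive over the union, adding a sixth failure mode later produces a compile error until a response is defined for it.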
Strategy 1: Remote Unavailable at Runtime
The Problem
The Products app is down and a user navigates to /products. What happens?
The Solution: Graceful Degradation
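// Host App - Route with error handling
import { Routes } from '@angular/router'

const routes: Routes = [
  {
    path: 'products',
    loadChildren: () => import('products/ProductsModule')
      .then(m => m.ProductsModule)
      .catch(err => {
        console.error('Products app unavailable', err)
        // Load the local fallback module instead of the remote
        return import('./fallback/products-fallback.module')
          .then(m => m.ProductsFallbackModule)
      })
  }
]

// fallback/products-fallback.module.ts - Shows an error state
import { Component, NgModule } from '@angular/core'
import { CommonModule } from '@angular/common'
import { RouterModule } from '@angular/router'

// On Angular 19+, add `standalone: false` to components declared in an NgModule
@Component({
  template: `
    <div class="error-state">
      <h2>Products temporarily unavailable</h2>
      <p>We're working on it. Please try again later.</p>
      <button (click)="retry()">Retry</button>
    </div>
  `
})
export class ProductsFallbackComponent {
  retry() {
    // A full page reload is the simplest retry: it requests the remote again
    window.location.reload()
  }
}

@NgModule({
  declarations: [ProductsFallbackComponent],
  imports: [
    CommonModule,
    // A lazy-loaded module needs its own route, otherwise nothing renders
    RouterModule.forChild([{ path: '', component: ProductsFallbackComponent }])
  ]
})
export class ProductsFallbackModule {}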
Why This Works
✅ App Doesn't Crash - Host continues working
✅ User Sees Message - Not a blank screen
✅ Retry Option - User can try again
✅ Other Apps Work - Checkout still functional
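The same try-remote-then-fallback pattern will be needed for every remote, so it can be pulled into a small helper. This is a minimal sketch under the assumption that both loaders return promises; `withFallback` and its `onError` parameter are hypothetical names, not an Angular API.

```typescript
// Hypothetical helper: try the remote loader, use the local loader if it fails.
function withFallback<T>(
  loadRemote: () => Promise<T>,
  loadFallback: () => Promise<T>,
  onError: (err: unknown) => void = err => console.error("Remote unavailable", err)
): () => Promise<T> {
  return () =>
    // Starting from Promise.resolve() means a synchronous throw inside
    // loadRemote is caught as well as an asynchronous rejection.
    Promise.resolve()
      .then(loadRemote)
      .catch(err => {
        onError(err);
        return loadFallback();
      });
}
```

In a route definition the result would be passed directly as `loadChildren`, with the remote `import('products/ProductsModule')` as the first loader and the local fallback import as the second.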
Strategy 2: Partial App Failure
The Problem
The Products app loads, but its API calls fail. The Products area should show an error while the rest of the app keeps working.
The Solution: Error Boundaries Per Remote
// Host App - Error boundary component
import { Component, OnDestroy, OnInit } from '@angular/core'
import { Subscription } from 'rxjs'
// ErrorHandlerService is an app-level service (not shown here) that exposes
// errors$ and reportError(). Host and remotes must share one instance of it.

@Component({
  selector: 'app-remote-container',
  template: `
    <ng-container *ngIf="!hasError; else errorTemplate">
      <router-outlet></router-outlet>
    </ng-container>
    <ng-template #errorTemplate>
      <app-error-fallback [error]="error"></app-error-fallback>
    </ng-template>
  `
})
export class RemoteContainerComponent implements OnInit, OnDestroy {
  hasError = false
  error: unknown = null
  private subscription?: Subscription

  constructor(private errorHandler: ErrorHandlerService) {}

  ngOnInit() {
    // Listen for errors reported by remotes
    this.subscription = this.errorHandler.errors$.subscribe(error => {
      if (error.source === 'products') {
        this.hasError = true
        this.error = error
      }
    })
  }

  ngOnDestroy() {
    // Avoid leaking the subscription when the container is destroyed
    this.subscription?.unsubscribe()
  }
}

// Products App - Error handling (@Component decorator omitted)
export class ProductsComponent implements OnInit {
  products: Product[] = []

  constructor(
    private productsService: ProductsService,
    private errorHandler: ErrorHandlerService
  ) {}

  ngOnInit() {
    this.productsService.getProducts().subscribe({
      next: products => {
        this.products = products
      },
      error: err => {
        // Report the error to the host instead of letting it propagate
        this.errorHandler.reportError({
          source: 'products',
          message: 'Failed to load products',
          error: err
        })
      }
    })
  }
}
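The code above relies on an `ErrorHandlerService` with an `errors$` stream and a `reportError()` method, which is never shown. In an Angular app that role would typically be played by an injectable service wrapping an RxJS `Subject`. The sketch below shows the same publish/subscribe idea without either dependency so that it runs standalone; `RemoteErrorBus` and `RemoteError` are hypothetical names.

```typescript
// Hypothetical shape of an error report sent from a remote to the host.
interface RemoteError {
  source: string; // which remote reported it, e.g. 'products'
  message: string;
  error?: unknown;
}

type RemoteErrorListener = (error: RemoteError) => void;

class RemoteErrorBus {
  private listeners: RemoteErrorListener[] = [];

  // Register a listener; the returned function unsubscribes it.
  subscribe(listener: RemoteErrorListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter(l => l !== listener);
    };
  }

  // Called by a remote to report a failure to whoever is listening.
  reportError(error: RemoteError): void {
    // Iterate over a copy so listeners may unsubscribe while being notified.
    this.listeners.slice().forEach(listener => {
      try {
        listener(error);
      } catch (listenerError) {
        // One broken listener must not stop the others from being notified.
        console.error("Error listener failed", listenerError);
      }
    });
  }
}
```

Whatever form the service takes, host and remote need to receive the same instance, otherwise errors reported in the remote never reach the host's container.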
Strategy 3: Timeouts and Retries
The Problem
The network is slow. The Products remoteEntry.js takes 30 seconds to load, and the user gives up.
The Solution: Timeout with Retry
// Host App - Timeout wrapper
// maxAttempts counts total attempts, including the first one.
// Note: depending on the bundler and browser, a failed import() may be cached,
// in which case a retry returns the same failure.
export function loadRemoteWithTimeout<T>(
  loadFn: () => Promise<T>,
  timeout: number = 10000,
  maxAttempts: number = 3
): Promise<T> {
  return new Promise((resolve, reject) => {
    let attempts = 0
    let done = false // set once the load has succeeded
    const attempt = () => {
      attempts++
      let handled = false // each attempt may fail only once (timeout or rejection)
      const fail = (err: unknown, reason: string, delay: number) => {
        if (handled || done) return
        handled = true
        clearTimeout(timeoutId)
        if (attempts < maxAttempts) {
          console.warn(`Remote load ${reason}, retrying... (${attempts}/${maxAttempts})`)
          setTimeout(attempt, delay)
        } else {
          reject(err)
        }
      }
      const timeoutId = setTimeout(
        () => fail(new Error('Remote load timeout after retries'), 'timeout', 0),
        timeout
      )
      loadFn().then(
        module => {
          done = true
          clearTimeout(timeoutId)
          resolve(module) // a late success from a timed-out attempt still counts
        },
        // Exponential backoff: 1s, 2s, 4s, ...
        err => fail(err, 'failed', 1000 * 2 ** (attempts - 1))
      )
    }
    attempt()
  })
}

// Use in routes
const routes: Routes = [
  {
    path: 'products',
    loadChildren: () => loadRemoteWithTimeout(
      () => import('products/ProductsModule').then(m => m.ProductsModule),
      10000, // 10 second timeout per attempt
      3      // up to 3 attempts in total
    ).catch(err => {
      console.error('Products app unavailable after retries', err)
      return import('./fallback/products-fallback.module')
        .then(m => m.ProductsFallbackModule)
    })
  }
]
Retry Strategy
- Exponential Backoff - Wait longer between retries
- Max Retries - Don't retry forever
- User Feedback - Show loading/retry status
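The first two points can be captured in one small function that computes the wait before each retry. This is a sketch with hypothetical names and default values (`backoffDelay`, `baseMs`, `maxMs`). It also adds random jitter, which the list above does not mention, so that many clients retrying at once do not hit a recovering server in lockstep. The random source is injectable to keep the function testable.

```typescript
// Hypothetical helper: delay in milliseconds before retry number `attempt` (1-based).
function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 30000,
  random: () => number = Math.random
): number {
  // Double the delay on each attempt, but never exceed maxMs.
  const exponential = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
  // "Full jitter": pick a random delay between 0 and the exponential value.
  return Math.floor(random() * exponential);
}
```

With the random source fixed at 0.5, attempts 1 to 3 give 500, 1000, and 2000 ms, and the cap limits later attempts to at most 15000 ms.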
Strategy 4: Fallback UI Strategies
Option 1: Cached Content
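// Show cached products if remote fails
export class ProductsFallbackComponent implements OnInit {
  cachedProducts: Product[] = []

  ngOnInit() {
    // Load from cache; a missing or corrupted entry leaves the list empty
    try {
      const cached = localStorage.getItem('products-cache')
      if (cached) {
        this.cachedProducts = JSON.parse(cached)
      }
    } catch (err) {
      // JSON.parse throws on invalid data, and storage access can be blocked
      console.warn('Products cache unreadable', err)
    }
  }
}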
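The fallback only helps if something wrote the cache while the remote was healthy, and if stale data is not shown forever. Below is a minimal sketch of the write and read sides with an age limit. All names are assumptions (`KeyValueStore`, `CachedProduct`, `saveProductsCache`, `readProductsCache`), and the stored value is a wrapper object with a timestamp rather than the bare array read above, so the two formats should not be mixed.

```typescript
// Minimal storage shape so the sketch runs outside a browser;
// window.localStorage satisfies this interface.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface CachedProduct { id: number; name: string } // hypothetical product shape

interface ProductsCacheEntry {
  savedAt: number; // epoch milliseconds
  products: CachedProduct[];
}

const PRODUCTS_CACHE_KEY = "products-cache";

// Call after every successful products load.
function saveProductsCache(
  store: KeyValueStore,
  products: CachedProduct[],
  now: number = Date.now()
): void {
  try {
    const entry: ProductsCacheEntry = { savedAt: now, products };
    store.setItem(PRODUCTS_CACHE_KEY, JSON.stringify(entry));
  } catch {
    // Storage may be full or disabled; caching is best-effort.
  }
}

// Returns the cached products, or null if missing, corrupted, or too old.
function readProductsCache(
  store: KeyValueStore,
  maxAgeMs: number,
  now: number = Date.now()
): CachedProduct[] | null {
  try {
    const raw = store.getItem(PRODUCTS_CACHE_KEY);
    if (!raw) return null;
    const entry = JSON.parse(raw) as ProductsCacheEntry;
    if (!Array.isArray(entry.products) || now - entry.savedAt > maxAgeMs) return null;
    return entry.products;
  } catch {
    return null;
  }
}
```

How old is too old depends on the data: a product catalogue may tolerate hours, while prices or stock levels may not, so `maxAgeMs` is left to the caller.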
Option 2: Disable Features
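// Disable checkout if payment service down
export class CheckoutComponent implements OnInit {
  canCheckout = true
  message = '' // shown by the template when checkout is disabled

  // PaymentService (not shown) is assumed to expose healthCheck() as an Observable
  constructor(private paymentService: PaymentService) {}

  ngOnInit() {
    this.paymentService.healthCheck().subscribe({
      next: () => this.canCheckout = true,
      error: () => {
        this.canCheckout = false
        this.message = 'Checkout temporarily unavailable'
      }
    })
  }
}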
Option 3: Feature Kill-Switches
// Kill switch for features
@Injectable({ providedIn: 'root' })
export class FeatureFlagService {
  private flags: Map<string, boolean> = new Map()

  constructor(private http: HttpClient) {
    this.loadFlags()
  }

  // Features are enabled by default, including before the flags have loaded
  isEnabled(feature: string): boolean {
    return this.flags.get(feature) ?? true
  }

  private loadFlags() {
    // Load from config service
    this.http.get<Record<string, boolean>>('/api/feature-flags').subscribe({
      next: flags => {
        Object.entries(flags).forEach(([key, value]) => {
          this.flags.set(key, value)
        })
      },
      // If the flag service itself is down, keep the defaults
      error: err => console.warn('Feature flags unavailable', err)
    })
  }
}

// Use kill switch
export class ProductsComponent implements OnInit {
  disabled = false // the template shows a "feature disabled" message when true

  constructor(private featureFlags: FeatureFlagService) {}

  ngOnInit() {
    if (!this.featureFlags.isEnabled('products')) {
      this.disabled = true
    }
  }
}
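A service that loads flags only once, at construction, cannot switch a feature off for users who already have the app open. One possible refinement is to refresh the flags periodically and keep the last known values when a refresh fails. The sketch below is framework-free and uses hypothetical names (`KillSwitchStore`, `FlagMap`, `fetchFlags`); in the Angular service the fetch function would wrap the `/api/feature-flags` request.

```typescript
type FlagMap = Record<string, boolean>;

class KillSwitchStore {
  private flags: FlagMap = {};
  private fetchFlags: () => Promise<FlagMap>;

  constructor(fetchFlags: () => Promise<FlagMap>) {
    this.fetchFlags = fetchFlags;
  }

  // Unknown flags default to enabled, matching isEnabled() above.
  isEnabled(feature: string): boolean {
    const value = this.flags[feature];
    return value === undefined ? true : value;
  }

  // Fetch the latest flags; on failure keep the last known values.
  async refresh(): Promise<void> {
    try {
      this.flags = await this.fetchFlags();
    } catch (err) {
      console.warn("Flag refresh failed, keeping last known flags", err);
    }
  }

  // Poll so that a switch flipped on the server reaches open sessions.
  // The returned function stops the polling.
  startPolling(intervalMs: number): () => void {
    const id = setInterval(() => { void this.refresh(); }, intervalMs);
    return () => clearInterval(id);
  }
}
```

Defaulting to enabled means a broken flag service cannot take the whole app down, at the cost that a kill switch has no effect until the first successful fetch.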
Strategy 5: Health Checks
The Problem
How do we know if a remote is available before trying to load it?
The Solution: Pre-flight Health Check
// Health check service
@Injectable({ providedIn: 'root' })
export class RemoteHealthService {
  private healthCache: Map<string, boolean> = new Map()

  async checkHealth(remoteUrl: string): Promise<boolean> {
    // Check cache first
    const cached = this.healthCache.get(remoteUrl)
    if (cached !== undefined) {
      return cached
    }
    let isHealthy = false
    try {
      // Assumes the remote serves a /health endpoint that allows cross-origin requests
      const response = await fetch(`${remoteUrl}/health`, {
        method: 'HEAD',
        signal: AbortSignal.timeout(5000) // 5 second timeout
      })
      isHealthy = response.ok
    } catch {
      isHealthy = false // network error, CORS failure, or timeout
    }
    this.healthCache.set(remoteUrl, isHealthy)
    // Cache both outcomes for 30 seconds so a recovered remote gets re-checked
    setTimeout(() => {
      this.healthCache.delete(remoteUrl)
    }, 30000)
    return isHealthy
  }
}

// Use in route loading
const routes: Routes = [
  {
    path: 'products',
    loadChildren: async () => {
      // inject() must run synchronously, before the first await, and only works
      // if the router calls loadChildren inside an injection context
      // (support depends on the Angular version)
      const healthService = inject(RemoteHealthService)
      const isHealthy = await healthService.checkHealth('https://products.example.com')
      if (isHealthy) {
        return import('products/ProductsModule').then(m => m.ProductsModule)
      } else {
        return import('./fallback/products-fallback.module')
          .then(m => m.ProductsFallbackModule)
      }
    }
  }
]
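Timer-based cache expiry is awkward to unit test. A possible alternative is to store an expiry time with each result and pass in both the clock and the probe, so a test can control them. The sketch below uses hypothetical names (`TtlHealthCache`, `probe`, `now`); in the browser the probe would be the `fetch` call to the remote's health endpoint.

```typescript
interface HealthEntry {
  healthy: boolean;
  expiresAt: number; // epoch milliseconds
}

class TtlHealthCache {
  private entries: Record<string, HealthEntry> = {};
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // `probe` performs the real check; a rejection counts as unhealthy.
  async check(url: string, probe: (url: string) => Promise<boolean>): Promise<boolean> {
    const cached = this.entries[url];
    if (cached && cached.expiresAt > this.now()) {
      return cached.healthy;
    }
    let healthy = false;
    try {
      healthy = await probe(url);
    } catch {
      healthy = false;
    }
    // Healthy and unhealthy results both expire, so a recovered
    // remote is probed again after ttlMs.
    this.entries[url] = { healthy, expiresAt: this.now() + this.ttlMs };
    return healthy;
  }
}
```

A short TTL keeps the extra request off most navigations while still noticing recovery quickly. A pre-flight check is only a hint: the remote can still fail between the check and the actual load, so the fallback on load failure remains necessary.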
Error Tracking
The Problem
How do we know when failures happen in production?
The Solution: Centralized Error Tracking
// Error tracking service
export interface ErrorContext {
  source: string // 'products', 'checkout', 'host'
}

@Injectable({ providedIn: 'root' })
export class ErrorTrackingService {
  constructor(private http: HttpClient) {}

  trackError(error: Error, context: ErrorContext) {
    const errorReport = {
      message: error.message,
      stack: error.stack,
      source: context.source,
      url: window.location.href,
      userAgent: navigator.userAgent,
      timestamp: new Date().toISOString()
    }
    // Send to error tracking service
    this.http.post('/api/errors', errorReport).subscribe({
      // Swallow reporting failures: an unhandled error here could reach the
      // global error handler and trigger another report, in a loop
      error: () => {}
    })
    // Also log to console in development
    if (!environment.production) {
      console.error('Error tracked:', errorReport)
    }
  }
}

// Global error handler
// Register with: { provide: ErrorHandler, useClass: GlobalErrorHandler }
@Injectable()
export class GlobalErrorHandler implements ErrorHandler {
  constructor(private errorTracking: ErrorTrackingService) {}

  handleError(error: unknown) {
    // Thrown values are not always Error objects
    const normalized = error instanceof Error ? error : new Error(String(error))
    this.errorTracking.trackError(normalized, {
      source: 'unknown'
    })
  }
}
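A global handler receives whatever was thrown, which is not always an `Error`: code can throw strings, plain objects, or `undefined`. Before building a report it helps to normalize the value into a predictable shape. This is a minimal sketch; `normalizeError` and `NormalizedError` are hypothetical names.

```typescript
interface NormalizedError {
  message: string;
  stack?: string | undefined;
}

// Turn any thrown value into something safe to put in an error report.
function normalizeError(thrown: unknown): NormalizedError {
  if (thrown instanceof Error) {
    return { message: thrown.message, stack: thrown.stack };
  }
  if (typeof thrown === "string") {
    return { message: thrown };
  }
  try {
    // JSON.stringify returns undefined for values such as undefined or functions.
    const json: string | undefined = JSON.stringify(thrown);
    return { message: json ?? String(thrown) };
  } catch {
    // JSON.stringify throws on circular structures.
    return { message: String(thrown) };
  }
}
```

The result can be fed into the report built in `trackError`, so that a thrown plain object is recorded with its contents rather than as an empty message.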
Production Checklist
- [ ] Error boundaries for each remote
- [ ] Timeout strategies implemented
- [ ] Retry logic with backoff
- [ ] Fallback UI for each remote
- [ ] Health checks for remotes
- [ ] Error tracking configured
- [ ] Feature kill-switches implemented
- [ ] Graceful degradation tested
Key Takeaways
- Failure is the default - Design for it
- Graceful degradation - App should never crash
- User feedback - Show what's happening
- Retry strategies - But don't retry forever
- Monitor failures - Track what breaks
Remember: In production, the question isn't "Will it fail?" It's "What happens when it fails?"
The next module: How to test this architecture.