Module 8: Kafka Monitoring & Optimization
Chapter 8 • Advanced
Kafka Monitoring & Optimization
Running Kafka in production is not just about sending messages successfully. You also need to observe, tune, and scale the cluster over time.
In this module, you'll learn how to:
- Monitor Kafka using tools like Kafka UI, Prometheus, and Grafana
- Track key metrics for brokers, topics, and consumers
- Collect metrics via JMX and export them to Prometheus
- Apply performance tuning on producers, consumers, and brokers
- Do basic capacity planning and handle common production issues
What You Will Learn
By the end of this module, you will be able to:
- Set up monitoring infrastructure using Kafka UI, Prometheus, and Grafana
- Identify and track critical Kafka metrics (throughput, latency, consumer lag, resource usage)
- Configure JMX exporters to collect broker and consumer metrics
- Build custom Grafana dashboards for Kafka cluster health
- Apply performance tuning strategies for producers, consumers, and brokers
- Perform capacity planning and scale Kafka clusters effectively
- Troubleshoot common production issues (consumer lag, hot partitions, under-replication)
- Implement alerting rules for proactive issue detection
- Optimize Kafka configuration for your specific workload patterns
Monitoring Tools Overview
You usually want multiple layers:
- A UI to browse topics and consumer groups
- A metrics system to monitor performance over time
- Alerts so you don't discover issues from angry users
1. Kafka Manager (CMAK)
- Web-based UI for managing Kafka clusters
- Real-time view of:
- Topics, partitions, and replication
- Consumer groups and consumer lag
- Broker status and configurations
- Useful for:
- Quick inspection
- Partition reallocation
- "What's broken?" debugging
2. Confluent Control Center
- Enterprise-grade monitoring and management (part of Confluent Platform)
- Features:
- Detailed metrics and alerting
- Schema Registry integration
- End-to-end stream monitoring
- Governance and audit features
- Best suited for teams invested in the Confluent ecosystem.
3. Kafka UI (Open Source)
- Modern, lightweight web UI
- Features:
- Browse topics and messages
- Inspect consumer groups and lag
- Manage schemas (if using Schema Registry)
- Easy to run via Docker, great for dev and staging.
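As a minimal sketch (the image tag, host port, and `host.docker.internal` bootstrap address are assumptions that depend on your Docker setup), Kafka UI can be pointed at an existing local broker like this:
# Run Kafka UI against a broker listening on the host at port 9092
docker run -d -p 8080:8080 \
  -e KAFKA_CLUSTERS_0_NAME=local \
  -e KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=host.docker.internal:9092 \
  provectuslabs/kafka-ui:latest
The full dev stack later in this module wires the same image into Docker Compose.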
4. Prometheus + Grafana
- De facto standard for metrics + dashboards:
- Prometheus pulls metrics from Kafka (via JMX exporter)
- Grafana visualizes graphs and panels
- Benefits:
- Flexible dashboards
- Alerting rules (Prometheus Alertmanager)
- Works with the rest of your infra (DBs, apps, etc.)
Key Metrics to Monitor
You don't need all metrics. You need the right ones.
1. Broker Metrics
Throughput
- Messages per second – how many records you are ingesting/serving
- Bytes in/out per second – data volume
- Requests per second – produce + fetch requests
- Network I/O – broker network utilization
These help answer: _"Are we hitting capacity?"_
Latency
- Produce latency – time to write messages
- Fetch latency – time to read messages
- Request latency – overall request round-trip
- Replication latency – delay between leader and replicas
These help answer: _"Are producers/consumers seeing slow responses?"_
Resource Usage
- CPU usage – per broker
- Memory usage – JVM heap + OS page cache
- Disk I/O – read/write ops on log directories
- Network I/O – per-broker bandwidth
These help answer: _"Is Kafka the bottleneck or the hardware?"_
2. Topic & Partition Metrics
Partition-Level
- Partition count – enough partitions for parallelism?
- Leader distribution – evenly spread across brokers?
- ISR size (In-Sync Replicas) – replication health
- Under-replicated partitions – indicator of replication issues
Message-Level
- Messages per second per topic
- Bytes per second per topic
- Compression ratio – how effective compression is
- Retention behavior – how data is being deleted/compacted
These help answer: _"Is this topic healthy and well-balanced?"_
3. Consumer Metrics
Consumer Group Metrics
- Consumer lag – how far behind consumers are
- Active consumers per group
- Partition assignment – balanced or skewed?
- Offset commit rate – how often offsets are committed
Processing Metrics
- Messages processed per second
- Error rate – failed processing attempts
- Processing time per message
- Batch size – how many messages are processed per poll/batch
These help answer: _"Is our processing keeping up with the input?"_
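A quick way to sanity-check lag from the command line is to sum the LAG column of the `kafka-consumer-groups.sh` output; the group name `my-group` and the column position are assumptions based on the tool's standard output format:
# Total lag across all partitions of one consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-group --describe \
  | awk '$6 ~ /^[0-9]+$/ { lag += $6 } END { print "total lag:", lag }'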
JMX Metrics Collection
Kafka exposes metrics via JMX (Java Management Extensions). Prometheus can scrape those using a JMX exporter.
Enable JMX (Conceptual Example)
Typically enabled via JVM options:
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Djava.rmi.server.hostname=localhost"
Then Kafka will expose JMX on port 9999.
Useful JMX Metric Names
# Broker metrics
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
# Consumer metrics
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1
kafka.consumer:type=consumer-coordinator-metrics,client-id=consumer-1
# Producer metrics
kafka.producer:type=producer-metrics,client-id=producer-1
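To peek at one of these MBeans without a full monitoring stack, you can use the `JmxTool` class shipped with Apache Kafka (the class name and flags below reflect recent releases and may differ between Kafka versions; port 9999 matches the JMX example above):
# Poll the broker's MessagesInPerSec meter every 5 seconds
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec \
  --reporting-interval 5000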
Prometheus JMX Exporter Rules (Example)
# jmx_prometheus_javaagent configuration
---
startDelaySeconds: 0
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
  # Topic-level rule first, so the topic label is kept before the broker-wide rule matches
  - pattern: kafka.server<type=(.+), name=(.+)PerSec, topic=(.+)><>Count
    name: kafka_server_$1_$2_per_sec
    type: COUNTER
    labels:
      topic: "$3"
  - pattern: kafka.server<type=(.+), name=(.+)PerSec><>Count
    name: kafka_server_$1_$2_per_sec
    type: COUNTER
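To expose these metrics to Prometheus, the usual approach is to attach the exporter JAR as a Java agent when starting the broker. A minimal sketch, assuming the JAR and rules file live under /opt and port 7071 is free:
# Attach the Prometheus JMX exporter as a Java agent (paths and port are placeholders)
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-exporter.yml"
bin/kafka-server-start.sh config/server.properties
# Metrics are then served over HTTP at http://<broker-host>:7071/metrics for Prometheus to scrape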
Performance Optimization
Now let's look at the knobs you can tune for producers, consumers, and brokers.
Producer Optimization
Batching & Compression
# producer.properties
# Larger batches = better throughput (but slightly higher latency)
batch.size=65536
linger.ms=10
compression.type=lz4
# Buffer capacity
buffer.memory=33554432
max.block.ms=60000
- `batch.size` – maximum size of a batch per partition, in bytes
- `linger.ms` – how long to wait to accumulate a batch
- `compression.type` – reduces bandwidth and disk usage, and often improves throughput
Acknowledgment Strategy
Trade-off between throughput and durability:
# Higher throughput, lower durability
acks=1
# Higher durability, lower throughput
acks=all
min.insync.replicas=2
- For critical data: use `acks=all` + `min.insync.replicas`
- For less critical, high-volume telemetry: `acks=1` may be acceptable
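To see how these settings behave under load without writing any code, Kafka's bundled producer perf test can replay them (topic name, record count, and record size below are arbitrary):
# Measure produce throughput and latency with the tuned settings
bin/kafka-producer-perf-test.sh --topic perf-test \
  --num-records 1000000 --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 \
    acks=all batch.size=65536 linger.ms=10 compression.type=lz4
Comparing runs with different `batch.size`, `linger.ms`, and `compression.type` values gives you measured numbers instead of guesses.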
Consumer Optimization
Fetch & Poll Configuration
# consumer.properties
# Fetch configuration
fetch.min.bytes=1
fetch.max.wait.ms=500
max.partition.fetch.bytes=1048576
# Session management
session.timeout.ms=30000
heartbeat.interval.ms=3000
max.poll.records=500
- Increase `max.poll.records` for batch processing
- Use `fetch.max.wait.ms` and `fetch.min.bytes` to trade latency vs throughput
Offset Management
# Disable auto-commit to control exactly when offsets are committed
enable.auto.commit=false
# Only applies when enable.auto.commit=true
auto.commit.interval.ms=5000
# Where to start if no committed offset
auto.offset.reset=latest
- For serious systems, you usually want manual commits after successful processing.
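The matching consumer perf test gives a rough read on whether your fetch settings can keep up (topic and message count are placeholders; older Kafka versions use `--broker-list` instead of `--bootstrap-server`):
# Measure raw consume throughput with the tuned fetch settings
bin/kafka-consumer-perf-test.sh --bootstrap-server localhost:9092 \
  --topic perf-test --messages 1000000 \
  --consumer.config consumer.properties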
Broker Optimization
JVM Tuning
# Example heap size for a 32GB RAM server
export KAFKA_HEAP_OPTS="-Xmx6G -Xms6G"
# G1GC tuning (good default for Kafka)
export KAFKA_JVM_PERFORMANCE_OPTS=" -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
Goal: consistent low GC pause times, not necessarily minimal memory usage.
Log Configuration
# broker log settings (server.properties)
# Segment size (1GB)
log.segment.bytes=1073741824
# Retention (7 days or 1GB per partition, whichever limit is reached first)
log.retention.hours=168
log.retention.bytes=1073741824
# Cleanup policy
log.cleanup.policy=delete
log.cleaner.enable=true
- For compacted topics, use `log.cleanup.policy=compact` or `compact,delete`.
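These broker-wide defaults can also be overridden per topic at runtime with `kafka-configs.sh`, which avoids broker restarts; the topic name and values below are just examples:
# Override retention (1 day) and cleanup policy for a single topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config retention.ms=86400000,cleanup.policy=compact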
Monitoring Setup Example (Dev Stack)
Docker Compose with Kafka + UI + Monitoring
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9999:9999"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: one for containers on the compose network, one for clients on the host
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_JMX_PORT: 9999
      KAFKA_JMX_HOSTNAME: localhost
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    depends_on:
      - kafka
    ports:
      - "8080:8080"
    environment:
      KAFKA_CLUSTERS_0_NAME: local
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
Prometheus Scrape Config (Example)
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
  # Assumes the Prometheus JMX exporter java agent is attached to the broker and
  # serving HTTP metrics on port 9999 (a raw JMX port cannot be scraped directly)
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:9999']
    metrics_path: /metrics
    scrape_interval: 5s
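Before pointing Prometheus at this file, it is worth validating it with `promtool`, which ships alongside Prometheus:
# Validate the scrape configuration before (re)starting Prometheus
promtool check config prometheus.yml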
Alerting Rules (Conceptual)
Alerts help catch problems early.
Example Prometheus Alerts
groups:
- name: kafka-critical
rules:
- alert: KafkaDown
expr: up{job="kafka"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Kafka broker is down"
- alert: HighConsumerLag
expr: kafka_consumer_lag_sum > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "High consumer lag detected"
- alert: UnderReplicatedPartitions
expr: kafka_cluster_partition_under_replicated_partitions > 0
for: 2m
labels:
severity: critical
annotations:
summary: "Under-replicated partitions detected"
Typical critical alerts:
- Broker down
- Under-replicated partitions
- Very high consumer lag
- Disk at critical usage
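Because the exact metric names depend on your exporter and its rules, it helps to spot-check an alert expression against a running Prometheus before relying on it (the metric name below is the one used in the example rules):
# Query an alert expression directly via the Prometheus HTTP API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kafka_cluster_partition_under_replicated_partitions'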
Capacity Planning
You don't want to discover "we ran out of disk" during a traffic spike.
Hardware Guidelines (Rule-of-Thumb)
CPU
- Minimum: 2 cores per broker
- Recommended: 4–8 cores per broker
- High throughput: 8+ cores per broker
Memory
- Minimum: 4GB RAM
- Recommended: 8–16GB
- High throughput: 16–32GB
- JVM heap is typically 30–50% of total RAM; the rest is left to the OS page cache.
Disk
- Prefer SSD for production
- Size based on:
- Write rate (MB/s)
- Retention period (days)
- Replication factor
- Watch disk I/O metrics and retention settings.
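A back-of-the-envelope sizing sketch (every number below is a made-up assumption; substitute your own measured write rate and retention):
# storage ≈ write rate (MB/s) x retention (s) x replication factor x overhead
awk 'BEGIN {
  write_mb_s = 20;  retention_days = 7;  replication = 3;  overhead = 1.3;
  gb = write_mb_s * 86400 * retention_days * replication * overhead / 1024;
  printf "~%.0f GB total, ~%.0f GB per broker across 3 brokers\n", gb, gb / 3;
}'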
Network
- Bandwidth: 1 Gbps minimum for serious clusters
- Latency: Ideally < 1ms for brokers in the same DC
- Plan for peak writes + reads + replication.
Scaling Guidelines
Horizontal Scaling
- Add more brokers to increase capacity and resilience
- Increase partitions to enable more parallelism
- Rebalance topics so leaders are distributed evenly
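Rebalancing after adding brokers is typically done with `kafka-reassign-partitions.sh`: generate a plan, review it, then apply it with `--execute`. The JSON file and broker IDs below are placeholders, and older versions of the tool connect via ZooKeeper instead of `--bootstrap-server`:
# Generate a reassignment plan that spreads the listed topics over brokers 1-4
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1,2,3,4" --generate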
Vertical Scaling
- Add more CPU for throughput-heavy workloads
- Add more memory for caches and buffers
- Add more disk for longer retention or higher volume
Troubleshooting Common Issues
1. High Consumer Lag
# List consumer groups
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# Describe a specific group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --describe
Possible fixes:
- Add more consumers to the group
- Increase partitions on the topic
- Optimize processing logic (DB calls, external APIs)
- Tune consumer configs (`max.poll.records`, fetch sizes)
2. Memory Issues / Large GC Pauses
# Check JVM GC and heap
jstat -gc <kafka_pid>
Potential fixes:
- Increase heap size (but not too large)
- Tune GC (e.g., G1GC options)
- Reduce batch sizes / buffer sizes
- Avoid too many active topics/partitions per broker
3. Disk Space Problems
# Check disk usage
df -h /kafka-logs
# Inspect log directories
bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe
Possible fixes:
- Reduce retention time or retention bytes
- Enable or tune log compaction where applicable
- Add more disk or move logs to larger volumes
- Clean up unused topics
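For the last point, unused topics can be removed with the standard topics tool (the topic name is a placeholder, and `delete.topic.enable` must be true on the brokers):
# Delete a topic that is no longer needed to reclaim disk space
bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic old-events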
Best Practices Summary
Monitoring
- Set up monitoring and alerting from day one
- Track both Kafka metrics and app-level metrics
- Regularly review dashboards and tune alerts
- Periodically test your alerting (simulate failures)
Performance
- Tune producers, consumers, and brokers based on actual workloads
- Run load tests before big launches
- Monitor consumer lag, throughput, and latency continuously
- Keep an eye on hot partitions and imbalanced leaders
Reliability
- Use replication factor ≥ 3 for critical topics
- Monitor under-replicated partitions and ISR size
- Plan for broker failures and test failover
- Have a disaster recovery and backup strategy for critical data
Key Takeaways
- Monitoring is essential for production Kafka deployments - you can't optimize what you can't measure
- Key metrics to track: throughput, latency, consumer lag, resource usage (CPU, memory, disk, network)
- Kafka UI provides quick visual inspection, while Prometheus + Grafana offer powerful metrics and alerting
- JMX is the primary way to expose Kafka metrics for collection
- Performance tuning requires understanding your workload: batch size, compression, partition strategy, consumer parallelism
- Consumer lag is a critical metric - high lag indicates processing bottlenecks
- Hot partitions and imbalanced leaders can cause performance issues - monitor and rebalance as needed
- Capacity planning involves estimating throughput, retention, and replication requirements
- Proper monitoring and alerting help you catch issues before they impact users
- The difference between "Kafka is running" and "Kafka is production-ready" is observability and optimization
Next Steps
After this module, you should be able to:
- Deploy Kafka with proper monitoring and dashboards
- Keep an eye on performance, lag, and resource usage
- Tune configuration for your specific workload
- Scale brokers, topics, and consumers intentionally, not reactively
- Debug the most common production issues with confidence
This is the difference between "Kafka is running" and "Kafka is a reliable backbone of our system".
Continue Learning
- Practice: Set up a monitoring stack for your local Kafka cluster and create custom dashboards
- Quiz: Test your understanding of Kafka monitoring and optimization concepts
- Next Module: Move on to Module 9 - Final Project to build a complete real-time analytics platform
- Related Resources:
- [Kafka Monitoring Best Practices](https://kafka.apache.org/documentation/#monitoring)
- [Prometheus Kafka Exporter](https://github.com/danielqsj/kafka_exporter)
- [Grafana Kafka Dashboards](https://grafana.com/grafana/dashboards/?search=kafka)
Hands-on Examples
Complete Monitoring Setup
# Complete Monitoring Setup
## docker-compose.monitoring.yml
version: '3.8'
services:
zookeeper:
image: confluentinc/cp-zookeeper:latest
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000
kafka:
image: confluentinc/cp-kafka:latest
depends_on:
- zookeeper
ports:
- "9092:9092"
- "9999:9999"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners so host clients (localhost:9092) and other containers (kafka:29092) can both connect
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_JMX_PORT: 9999
      # Advertise the service name so the jmx-exporter container can reach JMX over the compose network
      KAFKA_JMX_HOSTNAME: kafka
      KAFKA_JMX_OPTS: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 -Djava.rmi.server.hostname=kafka
kafka-ui:
image: provectuslabs/kafka-ui:latest
depends_on:
- kafka
ports:
- "8080:8080"
environment:
KAFKA_CLUSTERS_0_NAME: local
KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092
jmx-exporter:
image: solsson/kafka-prometheus-jmx-exporter@sha256:6f82e2b0464f50da8104acd7363a9ddd122f5f6e2d78a8b1bfe0f7d3e90e7c0a
ports:
- "5555:5555"
environment:
KAFKA_JMX_HOSTNAME: kafka
KAFKA_JMX_PORT: 9999
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerts.yml:/etc/prometheus/alerts.yml
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-simple-json-datasource
volumes:
- grafana-storage:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
- ./grafana/datasources:/etc/grafana/provisioning/datasources
volumes:
grafana-storage:
## prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alerts.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'kafka-jmx'
static_configs:
- targets: ['jmx-exporter:5555']
scrape_interval: 5s
- job_name: 'kafka-ui'
static_configs:
- targets: ['kafka-ui:8080']
## alerts.yml
groups:
- name: kafka-alerts
rules:
- alert: KafkaDown
expr: up{job="kafka-jmx"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Kafka broker is down"
description: "Kafka broker has been down for more than 1 minute."
- alert: HighConsumerLag
expr: kafka_consumer_lag_sum > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "High consumer lag detected"
description: "Consumer lag is {{ $value }} messages."
- alert: UnderReplicatedPartitions
expr: kafka_cluster_partition_under_replicated_partitions > 0
for: 2m
labels:
severity: critical
annotations:
summary: "Under-replicated partitions detected"
description: "{{ $value }} partitions are under-replicated."
## Start monitoring stack:
docker-compose -f docker-compose.monitoring.yml up -d
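# Quick sanity checks after startup (service names match the compose file above):
docker-compose -f docker-compose.monitoring.yml ps
# Confirm Prometheus can reach its scrape targets via the standard targets API
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'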
## Access points:
# Kafka UI: http://localhost:8080
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin)
This setup provides comprehensive monitoring with real-time metrics, alerting, and visualization for Kafka clusters.