
Module 8: Kafka Monitoring & Optimization

Chapter 8 • Advanced

55 min


Running Kafka in production is not just about sending messages successfully. You also need to observe, tune, and scale the cluster over time.

In this module, you'll learn how to:

  • Monitor Kafka using tools like Kafka UI, Prometheus, and Grafana
  • Track key metrics for brokers, topics, and consumers
  • Collect metrics via JMX and export them to Prometheus
  • Apply performance tuning on producers, consumers, and brokers
  • Do basic capacity planning and handle common production issues

🎯 What You Will Learn

By the end of this module, you will be able to:

  • Set up monitoring infrastructure using Kafka UI, Prometheus, and Grafana
  • Identify and track critical Kafka metrics (throughput, latency, consumer lag, resource usage)
  • Configure JMX exporters to collect broker and consumer metrics
  • Build custom Grafana dashboards for Kafka cluster health
  • Apply performance tuning strategies for producers, consumers, and brokers
  • Perform capacity planning and scale Kafka clusters effectively
  • Troubleshoot common production issues (consumer lag, hot partitions, under-replication)
  • Implement alerting rules for proactive issue detection
  • Optimize Kafka configuration for your specific workload patterns

🛠 Monitoring Tools Overview

You usually want multiple layers:

  • A UI to browse topics and consumer groups
  • A metrics system to monitor performance over time
  • Alerts so you don't discover issues from angry users

1. CMAK (Cluster Manager for Apache Kafka, formerly Kafka Manager)

  • Web-based UI for managing Kafka clusters
  • Real-time view of:
      • Topics, partitions, and replication
      • Consumer groups and consumer lag
      • Broker status and configurations
  • Useful for:
      • Quick inspection
      • Partition reassignment
      • "What's broken?" debugging

2. Confluent Control Center

  • Enterprise-grade monitoring and management (part of Confluent Platform)
  • Features:
      • Detailed metrics and alerting
      • Schema Registry integration
      • End-to-end stream monitoring
      • Governance and audit features
  • Best suited for teams invested in the Confluent ecosystem.

3. Kafka UI (Open Source)

  • Modern, lightweight web UI
  • Features:
      • Browse topics and messages
      • Inspect consumer groups and lag
      • Manage schemas (if using Schema Registry)
  • Easy to run via Docker; great for dev and staging.

4. Prometheus + Grafana

  • De facto standard for metrics + dashboards:
      • Prometheus pulls metrics from Kafka (via a JMX exporter)
      • Grafana visualizes them as graphs and panels
  • Benefits:
      • Flexible dashboards
      • Alerting rules (via Prometheus Alertmanager)
      • Works with the rest of your infra (DBs, apps, etc.)

📊 Key Metrics to Monitor

You don't need all metrics. You need the right ones.

1️⃣ Broker Metrics

Throughput

  • Messages per second – how many records you are ingesting/serving
  • Bytes in/out per second – data volume
  • Requests per second – produce + fetch requests
  • Network I/O – broker network utilization

These help answer: _"Are we hitting capacity?"_
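
Once these counters land in Prometheus, capacity questions become rate queries. A sketch, assuming the metric name produced by the JMX exporter rules shown later in this module:

promql
    # Per-broker message ingest rate, averaged over the last 5 minutes
    sum by (instance) (rate(kafka_server_BrokerTopicMetrics_MessagesIn_per_sec[5m]))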

Latency

  • Produce latency – time to write messages
  • Fetch latency – time to read messages
  • Request latency – overall request round-trip
  • Replication latency – delay between leader and replicas

These help answer: _"Are producers/consumers seeing slow responses?"_

Resource Usage

  • CPU usage – per broker
  • Memory usage – JVM heap + OS page cache
  • Disk I/O – read/write ops on log directories
  • Network I/O – per-broker bandwidth

These help answer: _"Is Kafka the bottleneck or the hardware?"_


2️⃣ Topic & Partition Metrics

Partition-Level

  • Partition count – enough partitions for parallelism?
  • Leader distribution – evenly spread across brokers?
  • ISR size (In-Sync Replicas) – replication health
  • Under-replicated partitions – indicator of replication issues

Message-Level

  • Messages per second per topic
  • Bytes per second per topic
  • Compression ratio – how effective compression is
  • Retention behavior – how data is being deleted/compacted

These help answer: _"Is this topic healthy and well-balanced?"_


3️⃣ Consumer Metrics

Consumer Group Metrics

  • Consumer lag – how far behind consumers are
  • Active consumers per group
  • Partition assignment – balanced or skewed?
  • Offset commit rate – how often offsets are committed

Processing Metrics

  • Messages processed per second
  • Error rate – failed processing attempts
  • Processing time per message
  • Batch size – how many messages processed per poll/batch

These help answer: _"Is our processing keeping up with the input?"_
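
Dashboards usually surface lag for you, but it's worth knowing it is just log-end offset minus committed offset. A minimal sketch with Kafka's Java AdminClient (bootstrap address and group ID are placeholders):

java
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Committed offsets for the group ("my-group" is a placeholder)
                Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group")
                         .partitionsToOffsetAndMetadata().get();

                // Log-end offsets for the same partitions
                Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();

                // Lag per partition = log-end offset minus committed offset
                committed.forEach((tp, meta) -> System.out.printf(
                    "%s lag=%d%n", tp, ends.get(tp).offset() - meta.offset()));
            }
        }
    }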


📡 JMX Metrics Collection

Kafka exposes metrics via JMX (Java Management Extensions). Prometheus can scrape those using a JMX exporter.

Enable JMX (Conceptual Example)

Typically enabled via JVM options:

bash
    export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false \
      -Dcom.sun.management.jmxremote.port=9999 \
      -Djava.rmi.server.hostname=localhost"

Then Kafka will expose JMX on port 9999.
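
Remote JMX is mostly useful for ad-hoc inspection (e.g. with jconsole). For Prometheus, a common alternative is to attach the JMX exporter as a Java agent so metrics are served over plain HTTP. A sketch, where the jar path, port 7071, and rules filename are assumptions:

bash
    # Attach the Prometheus JMX exporter agent at broker startup
    export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-rules.yml"

    # Metrics then appear over HTTP
    curl http://localhost:7071/metrics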

Useful JMX Metric Names

text
    # Broker metrics
    kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
    kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
    
    # Consumer metrics
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1
    kafka.consumer:type=consumer-coordinator-metrics,client-id=consumer-1
    
    # Producer metrics
    kafka.producer:type=producer-metrics,client-id=producer-1
    

Prometheus JMX Exporter Rules (Example)

yaml
    # jmx_prometheus_javaagent configuration
    ---
    startDelaySeconds: 0
    ssl: false
    lowercaseOutputName: false
    lowercaseOutputLabelNames: false
    rules:
      - pattern: kafka.server<type=(.+), name=(.+)PerSec><>Value
        name: kafka_server_$1_$2_per_sec
        type: COUNTER
      - pattern: kafka.server<type=(.+), name=(.+)PerSec, topic=(.+)><>Value
        name: kafka_server_$1_$2_per_sec
        type: COUNTER
        labels:
          topic: "$3"
    
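For example, with `lowercaseOutputName: false` the first rule maps the MBean `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec` to a counter named `kafka_server_BrokerTopicMetrics_MessagesIn_per_sec`, and the second rule adds a `topic` label for the per-topic variants.
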

🚀 Performance Optimization

Now let's look at knobs you can tune for producers, consumers, and brokers.

🧪 Producer Optimization

Batching & Compression

properties
    # producer.properties
    
    # Larger batches = better throughput (but slightly higher latency)
    batch.size=65536
    linger.ms=10
    compression.type=lz4
    
    # Buffer capacity
    buffer.memory=33554432
    max.block.ms=60000
    
  • `batch.size` – maximum size of a batch per partition, in bytes
  • `linger.ms` – how long to wait to accumulate a batch
  • `compression.type` – reduces bandwidth & disk usage, often improves throughput

Acknowledgment Strategy

Trade-off between throughput and durability:

properties
    # Higher throughput, lower durability
    acks=1
    
    # Higher durability, lower throughput
    acks=all
    min.insync.replicas=2
    
  • For critical data: use `acks=all` + `min.insync.replicas`
  • For less critical, high-volume telemetry: `acks=1` may be acceptable
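
As a concrete illustration, here is a minimal Java producer combining both tables above: batched, compressed, and durable. The bootstrap address, topic name, and payload are placeholders:

java
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            // Batching & compression: throughput over per-record latency
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

            // Durability: wait for all in-sync replicas to acknowledge
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key-1", "hello"),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                    });
            }
        }
    }

Here `linger.ms=10` deliberately trades up to 10 ms of extra latency per send for fewer, larger requests.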

🧪 Consumer Optimization

Fetch & Poll Configuration

properties
    # consumer.properties
    
    # Fetch configuration
    fetch.min.bytes=1
    fetch.max.wait.ms=500
    max.partition.fetch.bytes=1048576
    
    # Session management
    session.timeout.ms=30000
    heartbeat.interval.ms=3000
    max.poll.records=500
    
  • Increase `max.poll.records` for batch processing
  • Use `fetch.max.wait.ms` and `fetch.min.bytes` to trade off latency against throughput

Offset Management

properties
    # Take manual control over exactly when offsets are committed
    enable.auto.commit=false
    # Only used when auto-commit is enabled
    auto.commit.interval.ms=5000
    
    # Where to start if no committed offset exists
    auto.offset.reset=latest
    
    
  • For serious systems, you usually want manual commits after successful processing.
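
A minimal Java sketch of that manual-commit pattern (bootstrap address, group, and topic are placeholders); offsets are committed only after the whole polled batch has been processed:

java
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

            try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // your business logic
                    }
                    // Commit only after the whole batch succeeded
                    if (!records.isEmpty()) consumer.commitSync();
                }
            }
        }

        static void process(ConsumerRecord<String, String> record) {
            System.out.printf("%s -> %s%n", record.key(), record.value());
        }
    }

If processing throws before the commit, the batch is redelivered after a restart or rebalance, which gives you at-least-once semantics.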

🧪 Broker Optimization

JVM Tuning

bash
    # Example heap size for a 32GB RAM server
    export KAFKA_HEAP_OPTS="-Xmx6G -Xms6G"
    
    # G1GC tuning (good defaults for Kafka)
    export KAFKA_JVM_PERFORMANCE_OPTS="-server \
      -XX:+UseG1GC \
      -XX:MaxGCPauseMillis=20 \
      -XX:InitiatingHeapOccupancyPercent=35"
    

Goal: consistent low GC pause times, not necessarily minimal memory usage.

Log Configuration

properties
    # broker log settings (server.properties)
    
    # Segment size (1GB)
    log.segment.bytes=1073741824
    
    # Retention: 7 days or 1GB per partition, whichever limit is hit first
    log.retention.hours=168
    log.retention.bytes=1073741824
    
    # Cleanup policy
    log.cleanup.policy=delete
    log.cleaner.enable=true
    
    
  • For compacted topics, use `log.cleanup.policy=compact` or `compact,delete` (see the per-topic example below).
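
For example, switching an existing topic to compaction with the stock admin CLI (`my-topic` is a placeholder):

bash
    # Set cleanup.policy on a single topic instead of broker-wide
    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
      --alter --entity-type topics --entity-name my-topic \
      --add-config cleanup.policy=compact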

📦 Monitoring Setup Example (Dev Stack)

Docker Compose with Kafka + UI + Monitoring

yaml
    version: '3.8'
    services:
      zookeeper:
        image: confluentinc/cp-zookeeper:latest
        environment:
          ZOOKEEPER_CLIENT_PORT: 2181
          ZOOKEEPER_TICK_TIME: 2000
    
      kafka:
        image: confluentinc/cp-kafka:latest
        depends_on:
          - zookeeper
        ports:
          - "9092:9092"
          - "9999:9999"
        environment:
          KAFKA_BROKER_ID: 1
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
          # Two listeners: one for other containers (kafka:29092), one for the host
          KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
          KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
          KAFKA_JMX_PORT: 9999
          KAFKA_JMX_HOSTNAME: localhost
    
      kafka-ui:
        image: provectuslabs/kafka-ui:latest
        depends_on:
          - kafka
        ports:
          - "8080:8080"
        environment:
          KAFKA_CLUSTERS_0_NAME: local
          KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092
    
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
    
      grafana:
        image: grafana/grafana:latest
        ports:
          - "3000:3000"
        environment:
          GF_SECURITY_ADMIN_PASSWORD: admin
    

Prometheus Scrape Config (Example)

yaml
    # prometheus.yml
    global:
      scrape_interval: 15s
    
    scrape_configs:
      - job_name: 'kafka'
        static_configs:
          # Assumes a JMX exporter (e.g. the javaagent shown earlier) is serving
          # HTTP metrics on this port; Prometheus cannot scrape raw JMX directly.
          - targets: ['kafka:9999']
        metrics_path: /metrics
        scrape_interval: 5s
    

🚨 Alerting Rules (Conceptual)

Alerts help catch problems early.

Example Prometheus Alerts

yaml
    groups:
      - name: kafka-critical
        rules:
          - alert: KafkaDown
            expr: up{job="kafka"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Kafka broker is down"
    
          - alert: HighConsumerLag
            expr: kafka_consumer_lag_sum > 10000
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High consumer lag detected"
    
          - alert: UnderReplicatedPartitions
            expr: kafka_cluster_partition_under_replicated_partitions > 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Under-replicated partitions detected"
    

Typical critical alerts:

  • Broker down
  • Under-replicated partitions
  • Very high consumer lag
  • Disk at critical usage

📏 Capacity Planning

You don't want to discover "we ran out of disk" during a traffic spike.

Hardware Guidelines (Rule-of-Thumb)

CPU

  • Minimum: 2 cores per broker
  • Recommended: 4–8 cores per broker
  • High throughput: 8+ cores per broker

Memory

  • Minimum: 4GB RAM
  • Recommended: 8–16GB
  • High throughput: 16–32GB
  • Keep the JVM heap modest (often 4–8GB, as in the JVM tuning example above); leave the rest of RAM for the OS page cache.

Disk

  • Prefer SSD for production
  • Size based on:
      • Write rate (MB/s)
      • Retention period (days)
      • Replication factor
  • Watch disk I/O metrics and retention settings.
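
As a rough worked example: 10 MB/s of writes with 7-day retention and replication factor 3 needs about 10 MB/s × 604,800 s × 3 ≈ 18 TB across the cluster, before the 20–30% headroom you should add for growth and rebalancing.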

Network

  • Bandwidth: 1 Gbps minimum for serious clusters
  • Latency: Ideally < 1ms for brokers in the same DC
  • Plan for peak writes + reads + replication.

Scaling Guidelines

Horizontal Scaling

  • Add more brokers to increase capacity and resilience
  • Increase partitions to enable more parallelism (see the example after this list)
  • Rebalance topics so leaders are distributed evenly
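
For example, growing a topic's partition count with the stock CLI (`my-topic` and the target count are placeholders; note that adding partitions changes the key-to-partition mapping for keyed topics):

bash
    # Raise the partition count (it can only grow, never shrink)
    bin/kafka-topics.sh --bootstrap-server localhost:9092 \
      --alter --topic my-topic --partitions 12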

Vertical Scaling

  • Add more CPU for throughput-heavy workloads
  • Add more memory for caches and buffers
  • Add more disk for longer retention or higher volume

🧯 Troubleshooting Common Issues

1. High Consumer Lag

bash
    # List consumer groups
    bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
    
    # Describe a specific group
    bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --describe
    
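In the `--describe` output, LAG is the per-partition difference between LOG-END-OFFSET and CURRENT-OFFSET; a lag that grows steadily over time means consumers are falling behind.
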

Possible fixes:

  1. Add more consumers to the group
  2. Increase partitions on the topic
  3. Optimize processing logic (DB calls, external APIs)
  4. Tune consumer configs (max.poll.records, fetch sizes)

2. Memory Issues / Large GC Pauses

bash
    # Check JVM GC and heap usage
    jstat -gc <kafka_pid>
    
    # Or sample GC utilization every second
    jstat -gcutil <kafka_pid> 1000
    

Potential fixes:

  1. Increase heap size (but not too large)
  2. Tune GC (e.g., G1GC options)
  3. Reduce batch sizes / buffer sizes
  4. Avoid too many active topics/partitions per broker

3. Disk Space Problems

bash
    # Check disk usage
    df -h /kafka-logs
    
    # Inspect log directories
    bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe
    

Possible fixes:

  1. Reduce retention time or retention bytes
  2. Enable or tune log compaction where applicable
  3. Add more disk or move logs to larger volumes
  4. Clean up unused topics

✅ Best Practices Summary

Monitoring

  1. Set up monitoring and alerting from day one
  2. Track both Kafka metrics and app-level metrics
  3. Regularly review dashboards and tune alerts
  4. Periodically test your alerting (simulate failures)

Performance

  1. Tune producers, consumers, and brokers based on actual workloads
  2. Run load tests before big launches
  3. Monitor consumer lag, throughput, and latency continuously
  4. Keep an eye on hot partitions and imbalanced leaders

Reliability

  1. Use replication factor ≥ 3 for critical topics
  2. Monitor under-replicated partitions and ISR size
  3. Plan for broker failures and test failover
  4. Have a disaster recovery and backup strategy for critical data

✅ Key Takeaways

  • Monitoring is essential for production Kafka deployments: you can't optimize what you can't measure
  • Key metrics to track: throughput, latency, consumer lag, resource usage (CPU, memory, disk, network)
  • Kafka UI provides quick visual inspection, while Prometheus + Grafana offer powerful metrics and alerting
  • JMX is the primary way to expose Kafka metrics for collection
  • Performance tuning requires understanding your workload: batch size, compression, partition strategy, consumer parallelism
  • Consumer lag is a critical metric: high lag indicates processing bottlenecks
  • Hot partitions and imbalanced leaders can cause performance issues; monitor and rebalance as needed
  • Capacity planning involves estimating throughput, retention, and replication requirements
  • Proper monitoring and alerting help you catch issues before they impact users
  • The difference between "Kafka is running" and "Kafka is production-ready" is observability and optimization

🚀 Next Steps

After this module, you should be able to:

  1. Deploy Kafka with proper monitoring and dashboards
  2. Keep an eye on performance, lag, and resource usage
  3. Tune configuration for your specific workload
  4. Scale brokers, topics, and consumers intentionally, not reactively
  5. Debug the most common production issues with confidence

This is the difference between "Kafka is running" and "Kafka is a reliable backbone of our system".

📚 Continue Learning

  • Practice: Set up a monitoring stack for your local Kafka cluster and create custom dashboards
  • Quiz: Test your understanding of Kafka monitoring and optimization concepts
  • Next Module: Move on to Module 9 - Final Project to build a complete real-time analytics platform
  • Related Resources:
      • [Kafka Monitoring Best Practices](https://kafka.apache.org/documentation/#monitoring)
      • [Prometheus Kafka Exporter](https://github.com/danielqsj/kafka_exporter)
      • [Grafana Kafka Dashboards](https://grafana.com/grafana/dashboards/?search=kafka)

Hands-on Examples

Complete Monitoring Setup

    
      ## docker-compose.monitoring.yml
      version: '3.8'
    
      services:
        zookeeper:
          image: confluentinc/cp-zookeeper:latest
          environment:
            ZOOKEEPER_CLIENT_PORT: 2181
            ZOOKEEPER_TICK_TIME: 2000
    
        kafka:
          image: confluentinc/cp-kafka:latest
          depends_on:
            - zookeeper
          ports:
            - "9092:9092"
            - "9999:9999"
          environment:
            KAFKA_BROKER_ID: 1
            KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
            # Two listeners: one for other containers (kafka:29092), one for the host
            KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
            KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
            KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
            KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
            KAFKA_JMX_PORT: 9999
            KAFKA_JMX_HOSTNAME: localhost
            KAFKA_JMX_OPTS: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 -Djava.rmi.server.hostname=localhost
    
        kafka-ui:
          image: provectuslabs/kafka-ui:latest
          depends_on:
            - kafka
          ports:
            - "8080:8080"
          environment:
            KAFKA_CLUSTERS_0_NAME: local
            KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092
    
        jmx-exporter:
          image: solsson/kafka-prometheus-jmx-exporter@sha256:6f82e2b0464f50da8104acd7363a9ddd122f5f6e2d78a8b1bfe0f7d3e90e7c0a
          ports:
            - "5555:5555"
          environment:
            KAFKA_JMX_HOSTNAME: kafka
            KAFKA_JMX_PORT: 9999
    
        prometheus:
          image: prom/prometheus:latest
          ports:
            - "9090:9090"
          volumes:
            - ./prometheus.yml:/etc/prometheus/prometheus.yml
            - ./alerts.yml:/etc/prometheus/alerts.yml
          command:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
    
        grafana:
          image: grafana/grafana:latest
          ports:
            - "3000:3000"
          environment:
            GF_SECURITY_ADMIN_PASSWORD: admin
            GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-simple-json-datasource
          volumes:
            - grafana-storage:/var/lib/grafana
            - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
            - ./grafana/datasources:/etc/grafana/provisioning/datasources
    
      volumes:
        grafana-storage:
    
      ## prometheus.yml
      global:
        scrape_interval: 15s
        evaluation_interval: 15s
    
      rule_files:
        - "alerts.yml"
    
      scrape_configs:
        - job_name: 'prometheus'
          static_configs:
            - targets: ['localhost:9090']
    
        - job_name: 'kafka-jmx'
          static_configs:
            - targets: ['jmx-exporter:5555']
          scrape_interval: 5s
    
        - job_name: 'kafka-ui'
          static_configs:
            - targets: ['kafka-ui:8080']
    
      ## alerts.yml
      groups:
        - name: kafka-alerts
          rules:
            - alert: KafkaDown
              expr: up{job="kafka-jmx"} == 0
              for: 1m
              labels:
                severity: critical
              annotations:
                summary: "Kafka broker is down"
                description: "Kafka broker has been down for more than 1 minute."
    
            - alert: HighConsumerLag
              expr: kafka_consumer_lag_sum > 10000
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High consumer lag detected"
                description: "Consumer lag is {{ $value }} messages."
    
            - alert: UnderReplicatedPartitions
              expr: kafka_cluster_partition_under_replicated_partitions > 0
              for: 2m
              labels:
                severity: critical
              annotations:
                summary: "Under-replicated partitions detected"
                description: "{{ $value }} partitions are under-replicated."
    
      ## Start monitoring stack:
      docker-compose -f docker-compose.monitoring.yml up -d
    
      ## Access points:
      # Kafka UI: http://localhost:8080
      # Prometheus: http://localhost:9090
      # Grafana: http://localhost:3000 (admin/admin)

This setup provides comprehensive monitoring with real-time metrics, alerting, and visualization for Kafka clusters.