
Module 8: Kafka Monitoring & Optimization

Chapter 8 • Advanced

55 min


Running Kafka in production is not just about sending messages successfully. You also need to observe, tune, and scale the cluster over time.

In this module, you'll learn how to:

  • Monitor Kafka using tools like Kafka UI, Prometheus, and Grafana
  • Track key metrics for brokers, topics, and consumers
  • Collect metrics via JMX and export them to Prometheus
  • Apply performance tuning on producers, consumers, and brokers
  • Do basic capacity planning and handle common production issues

🎯 What You Will Learn

By the end of this module, you will be able to:

  • Set up monitoring infrastructure using Kafka UI, Prometheus, and Grafana
  • Identify and track critical Kafka metrics (throughput, latency, consumer lag, resource usage)
  • Configure JMX exporters to collect broker and consumer metrics
  • Build custom Grafana dashboards for Kafka cluster health
  • Apply performance tuning strategies for producers, consumers, and brokers
  • Perform capacity planning and scale Kafka clusters effectively
  • Troubleshoot common production issues (consumer lag, hot partitions, under-replication)
  • Implement alerting rules for proactive issue detection
  • Optimize Kafka configuration for your specific workload patterns

🛠 Monitoring Tools Overview

You usually want multiple layers:

  • A UI to browse topics and consumer groups
  • A metrics system to monitor performance over time
  • Alerts so you don't discover issues from angry users

1. CMAK (Cluster Manager for Apache Kafka, formerly Kafka Manager)

  • Web-based UI for managing Kafka clusters
  • Real-time view of:
      • Topics, partitions, and replication
      • Consumer groups and consumer lag
      • Broker status and configurations
  • Useful for:
      • Quick inspection
      • Partition reassignment
      • "What's broken?" debugging

2. Confluent Control Center

  • Enterprise-grade monitoring and management (part of Confluent Platform)
  • Features:
      • Detailed metrics and alerting
      • Schema Registry integration
      • End-to-end stream monitoring
      • Governance and audit features
  • Best suited for teams invested in the Confluent ecosystem.

3. Kafka UI (Open Source)

  • Modern, lightweight web UI
  • Features:
      • Browse topics and messages
      • Inspect consumer groups and lag
      • Manage schemas (if using Schema Registry)
  • Easy to run via Docker; great for dev and staging.

4. Prometheus + Grafana

  • De facto standard for metrics + dashboards:
      • Prometheus pulls metrics from Kafka (via a JMX exporter)
      • Grafana visualizes them as graphs and panels
  • Benefits:
      • Flexible dashboards
      • Alerting rules (via Prometheus Alertmanager)
      • Works with the rest of your infra (DBs, apps, etc.)

📊 Key Metrics to Monitor

You don't need all metrics. You need the right ones.

1️⃣ Broker Metrics

Throughput

  • Messages per second – how many records you are ingesting/serving
  • Bytes in/out per second – data volume
  • Requests per second – produce + fetch requests
  • Network I/O – broker network utilization

These help answer: _"Are we hitting capacity?"_
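
Once these counters land in Prometheus, capacity questions become rate queries. A sketch, assuming the metric name produced by the JMX exporter rules shown later in this module:

promql
    # Per-broker message ingest rate, averaged over the last 5 minutes
    sum by (instance) (rate(kafka_server_BrokerTopicMetrics_MessagesIn_per_sec[5m]))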

Latency

  • Produce latency – time to write messages
  • Fetch latency – time to read messages
  • Request latency – overall request round-trip
  • Replication latency – delay between leader and replicas

These help answer: _"Are producers/consumers seeing slow responses?"_

Resource Usage

  • CPU usage – per broker
  • Memory usage – JVM heap + OS page cache
  • Disk I/O – read/write ops on log directories
  • Network I/O – per-broker bandwidth

These help answer: _"Is Kafka the bottleneck or the hardware?"_


2️⃣ Topic & Partition Metrics

Partition-Level

  • Partition count – enough partitions for parallelism?
  • Leader distribution – evenly spread across brokers?
  • ISR size (In-Sync Replicas) – replication health
  • Under-replicated partitions – indicator of replication issues

Message-Level

  • Messages per second per topic
  • Bytes per second per topic
  • Compression ratio – how effective compression is
  • Retention behavior – how data is being deleted/compacted

These help answer: _"Is this topic healthy and well-balanced?"_


3️⃣ Consumer Metrics

Consumer Group Metrics

  • Consumer lag – how far behind consumers are
  • Active consumers per group
  • Partition assignment – balanced or skewed?
  • Offset commit rate – how often offsets are committed

Processing Metrics

  • Messages processed per second
  • Error rate – failed processing attempts
  • Processing time per message
  • Batch size – how many messages processed per poll/batch

These help answer: _"Is our processing keeping up with the input?"_
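
Dashboards usually surface lag for you, but it's worth knowing it is just log-end offset minus committed offset. A minimal sketch with Kafka's Java AdminClient (bootstrap address and group ID are placeholders):

java
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Committed offsets for the group ("my-group" is a placeholder)
                Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group")
                         .partitionsToOffsetAndMetadata().get();

                // Log-end offsets for the same partitions
                Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();

                // Lag per partition = log-end offset minus committed offset
                committed.forEach((tp, meta) -> System.out.printf(
                    "%s lag=%d%n", tp, ends.get(tp).offset() - meta.offset()));
            }
        }
    }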


📡 JMX Metrics Collection

Kafka exposes metrics via JMX (Java Management Extensions). Prometheus can scrape those using a JMX exporter.

Enable JMX (Conceptual Example)

Typically enabled via JVM options:

bash
    export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false \
      -Dcom.sun.management.jmxremote.port=9999 \
      -Djava.rmi.server.hostname=localhost"

Then Kafka will expose JMX on port 9999.
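
Remote JMX is mostly useful for ad-hoc inspection (e.g. with jconsole). For Prometheus, a common alternative is to attach the JMX exporter as a Java agent so metrics are served over plain HTTP. A sketch, where the jar path, port 7071, and rules filename are assumptions:

bash
    # Attach the Prometheus JMX exporter agent at broker startup
    export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-rules.yml"

    # Metrics then appear over HTTP
    curl http://localhost:7071/metrics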

Useful JMX Metric Names

text
    # Broker metrics
    kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
    kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
    
    # Consumer metrics
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1
    kafka.consumer:type=consumer-coordinator-metrics,client-id=consumer-1
    
    # Producer metrics
    kafka.producer:type=producer-metrics,client-id=producer-1
    

Prometheus JMX Exporter Rules (Example)

yaml
    # jmx_prometheus_javaagent configuration
    ---
    startDelaySeconds: 0
    ssl: false
    lowercaseOutputName: false
    lowercaseOutputLabelNames: false
    rules:
      - pattern: kafka.server<type=(.+), name=(.+)PerSec><>Value
        name: kafka_server_$1_$2_per_sec
        type: COUNTER
      - pattern: kafka.server<type=(.+), name=(.+)PerSec, topic=(.+)><>Value
        name: kafka_server_$1_$2_per_sec
        type: COUNTER
        labels:
          topic: "$3"
    
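For example, with `lowercaseOutputName: false` the first rule maps the MBean `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec` to a counter named `kafka_server_BrokerTopicMetrics_MessagesIn_per_sec`, and the second rule adds a `topic` label for the per-topic variants.
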

🚀 Performance Optimization

Now let's look at knobs you can tune for producers, consumers, and brokers.

🧪 Producer Optimization

Batching & Compression

properties
    # producer.properties
    
    # Larger batches = better throughput (but slightly higher latency)
    batch.size=65536
    linger.ms=10
    compression.type=lz4
    
    # Buffer capacity
    buffer.memory=33554432
    max.block.ms=60000
    
  • `batch.size` – maximum size of a batch per partition, in bytes
  • `linger.ms` – how long to wait to accumulate a batch
  • `compression.type` – reduces bandwidth & disk usage, often improves throughput

Acknowledgment Strategy

Trade-off between throughput and durability:

properties
    # Higher throughput, lower durability
    acks=1
    
    # Higher durability, lower throughput
    acks=all
    min.insync.replicas=2
    
  • For critical data: use `acks=all` + `min.insync.replicas`
  • For less critical, high-volume telemetry: `acks=1` may be acceptable
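
As a concrete illustration, here is a minimal Java producer combining both tables above: batched, compressed, and durable. The bootstrap address, topic name, and payload are placeholders:

java
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            // Batching & compression: throughput over per-record latency
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

            // Durability: wait for all in-sync replicas to acknowledge
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key-1", "hello"),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                    });
            }
        }
    }

Here `linger.ms=10` deliberately trades up to 10 ms of extra latency per send for fewer, larger requests.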

🧪 Consumer Optimization

Fetch & Poll Configuration

properties
    # consumer.properties
    
    # Fetch configuration
    fetch.min.bytes=1
    fetch.max.wait.ms=500
    max.partition.fetch.bytes=1048576
    
    # Session management
    session.timeout.ms=30000
    heartbeat.interval.ms=3000
    max.poll.records=500
    
  • Increase `max.poll.records` for batch processing
  • Use `fetch.max.wait.ms` and `fetch.min.bytes` to trade off latency against throughput

Offset Management

properties
    # Take manual control over exactly when offsets are committed
    enable.auto.commit=false
    # Only used when auto-commit is enabled
    auto.commit.interval.ms=5000
    
    # Where to start if no committed offset exists
    auto.offset.reset=latest
    
    
  • For serious systems, you usually want manual commits after successful processing.
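
A minimal Java sketch of that manual-commit pattern (bootstrap address, group, and topic are placeholders); offsets are committed only after the whole polled batch has been processed:

java
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

            try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // your business logic
                    }
                    // Commit only after the whole batch succeeded
                    if (!records.isEmpty()) consumer.commitSync();
                }
            }
        }

        static void process(ConsumerRecord<String, String> record) {
            System.out.printf("%s -> %s%n", record.key(), record.value());
        }
    }

If processing throws before the commit, the batch is redelivered after a restart or rebalance, which gives you at-least-once semantics.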

🧪 Broker Optimization

JVM Tuning

bash
    # Example heap size for a 32GB RAM server
    export KAFKA_HEAP_OPTS="-Xmx6G -Xms6G"
    
    # G1GC tuning (good defaults for Kafka)
    export KAFKA_JVM_PERFORMANCE_OPTS="-server \
      -XX:+UseG1GC \
      -XX:MaxGCPauseMillis=20 \
      -XX:InitiatingHeapOccupancyPercent=35"
    

Goal: consistent low GC pause times, not necessarily minimal memory usage.

Log Configuration

properties
    # broker log settings (server.properties)
    
    # Segment size (1GB)
    log.segment.bytes=1073741824
    
    # Retention: 7 days or 1GB per partition, whichever limit is hit first
    log.retention.hours=168
    log.retention.bytes=1073741824
    
    # Cleanup policy
    log.cleanup.policy=delete
    log.cleaner.enable=true
    
    
  • For compacted topics, use `log.cleanup.policy=compact` or `compact,delete` (see the per-topic example below).
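
For example, switching an existing topic to compaction with the stock admin CLI (`my-topic` is a placeholder):

bash
    # Set cleanup.policy on a single topic instead of broker-wide
    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
      --alter --entity-type topics --entity-name my-topic \
      --add-config cleanup.policy=compact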

📦 Monitoring Setup Example (Dev Stack)

Docker Compose with Kafka + UI + Monitoring

yaml
    version: '3.8'
    services:
      zookeeper:
        image: confluentinc/cp-zookeeper:latest
        environment:
          ZOOKEEPER_CLIENT_PORT: 2181
          ZOOKEEPER_TICK_TIME: 2000
    
      kafka:
        image: confluentinc/cp-kafka:latest
        depends_on:
          - zookeeper
        ports:
          - "9092:9092"
          - "9999:9999"
        environment:
          KAFKA_BROKER_ID: 1
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
          # Two listeners: one for other containers (kafka:29092), one for the host
          KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
          KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
          KAFKA_JMX_PORT: 9999
          KAFKA_JMX_HOSTNAME: localhost
    
      kafka-ui:
        image: provectuslabs/kafka-ui:latest
        depends_on:
          - kafka
        ports:
          - "8080:8080"
        environment:
          KAFKA_CLUSTERS_0_NAME: local
          KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092
    
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
    
      grafana:
        image: grafana/grafana:latest
        ports:
          - "3000:3000"
        environment:
          GF_SECURITY_ADMIN_PASSWORD: admin
    

Prometheus Scrape Config (Example)

yaml
    # prometheus.yml
    global:
      scrape_interval: 15s
    
    scrape_configs:
      - job_name: 'kafka'
        static_configs:
          # Assumes a JMX exporter (e.g. the javaagent shown earlier) is serving
          # HTTP metrics on this port; Prometheus cannot scrape raw JMX directly.
          - targets: ['kafka:9999']
        metrics_path: /metrics
        scrape_interval: 5s
    

🚨 Alerting Rules (Conceptual)

Alerts help catch problems early.

Example Prometheus Alerts

yaml
    groups:
      - name: kafka-critical
        rules:
          - alert: KafkaDown
            expr: up{job="kafka"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Kafka broker is down"
    
          - alert: HighConsumerLag
            expr: kafka_consumer_lag_sum > 10000
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High consumer lag detected"
    
          - alert: UnderReplicatedPartitions
            expr: kafka_cluster_partition_under_replicated_partitions > 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Under-replicated partitions detected"
    

Typical critical alerts:

  • Broker down
  • Under-replicated partitions
  • Very high consumer lag
  • Disk at critical usage

📏 Capacity Planning

You don't want to discover "we ran out of disk" during a traffic spike.

Hardware Guidelines (Rule-of-Thumb)

CPU

  • Minimum: 2 cores per broker
  • Recommended: 4–8 cores per broker
  • High throughput: 8+ cores per broker

Memory

  • Minimum: 4GB RAM
  • Recommended: 8–16GB
  • High throughput: 16–32GB
  • Keep the JVM heap modest (often 4–8GB, as in the JVM tuning example above); leave the rest of RAM for the OS page cache.

Disk

  • Prefer SSD for production
  • Size based on:
      • Write rate (MB/s)
      • Retention period (days)
      • Replication factor
  • Watch disk I/O metrics and retention settings.
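
As a rough worked example: 10 MB/s of writes with 7-day retention and replication factor 3 needs about 10 MB/s × 604,800 s × 3 ≈ 18 TB across the cluster, before the 20–30% headroom you should add for growth and rebalancing.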

Network

  • Bandwidth: 1 Gbps minimum for serious clusters
  • Latency: Ideally < 1ms for brokers in the same DC
  • Plan for peak writes + reads + replication.

Scaling Guidelines

Horizontal Scaling

  • Add more brokers to increase capacity and resilience
  • Increase partitions to enable more parallelism (see the example after this list)
  • Rebalance topics so leaders are distributed evenly
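
For example, growing a topic's partition count with the stock CLI (`my-topic` and the target count are placeholders; note that adding partitions changes the key-to-partition mapping for keyed topics):

bash
    # Raise the partition count (it can only grow, never shrink)
    bin/kafka-topics.sh --bootstrap-server localhost:9092 \
      --alter --topic my-topic --partitions 12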

Vertical Scaling

  • Add more CPU for throughput-heavy workloads
  • Add more memory for caches and buffers
  • Add more disk for longer retention or higher volume

🧯 Troubleshooting Common Issues

1. High Consumer Lag

bash
    # List consumer groups
    bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
    
    # Describe a specific group
    bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --describe
    
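In the `--describe` output, LAG is the per-partition difference between LOG-END-OFFSET and CURRENT-OFFSET; a lag that grows steadily over time means consumers are falling behind.
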

Possible fixes:

  1. Add more consumers to the group
  2. Increase partitions on the topic
  3. Optimize processing logic (DB calls, external APIs)
  4. Tune consumer configs (max.poll.records, fetch sizes)

2. Memory Issues / Large GC Pauses

bash
    # Check JVM GC and heap usage
    jstat -gc <kafka_pid>
    
    # Or sample GC utilization every second
    jstat -gcutil <kafka_pid> 1000
    

Potential fixes:

  1. Increase heap size (but not too large)
  2. Tune GC (e.g., G1GC options)
  3. Reduce batch sizes / buffer sizes
  4. Avoid too many active topics/partitions per broker

3. Disk Space Problems

bash
    # Check disk usage
    df -h /kafka-logs
    
    # Inspect log directories
    bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe
    

Possible fixes:

  1. Reduce retention time or retention bytes
  2. Enable or tune log compaction where applicable
  3. Add more disk or move logs to larger volumes
  4. Clean up unused topics

✅ Best Practices Summary

Monitoring

  1. Set up monitoring and alerting from day one
  2. Track both Kafka metrics and app-level metrics
  3. Regularly review dashboards and tune alerts
  4. Periodically test your alerting (simulate failures)

Performance

  1. Tune producers, consumers, and brokers based on actual workloads
  2. Run load tests before big launches
  3. Monitor consumer lag, throughput, and latency continuously
  4. Keep an eye on hot partitions and imbalanced leaders

Reliability

  1. Use replication factor ≥ 3 for critical topics
  2. Monitor under-replicated partitions and ISR size
  3. Plan for broker failures and test failover
  4. Have a disaster recovery and backup strategy for critical data

✅ Key Takeaways

  • Monitoring is essential for production Kafka deployments: you can't optimize what you can't measure
  • Key metrics to track: throughput, latency, consumer lag, resource usage (CPU, memory, disk, network)
  • Kafka UI provides quick visual inspection, while Prometheus + Grafana offer powerful metrics and alerting
  • JMX is the primary way to expose Kafka metrics for collection
  • Performance tuning requires understanding your workload: batch size, compression, partition strategy, consumer parallelism
  • Consumer lag is a critical metric: high lag indicates processing bottlenecks
  • Hot partitions and imbalanced leaders can cause performance issues; monitor and rebalance as needed
  • Capacity planning involves estimating throughput, retention, and replication requirements
  • Proper monitoring and alerting help you catch issues before they impact users
  • The difference between "Kafka is running" and "Kafka is production-ready" is observability and optimization

🚀 Next Steps

After this module, you should be able to:

  1. Deploy Kafka with proper monitoring and dashboards
  2. Keep an eye on performance, lag, and resource usage
  3. Tune configuration for your specific workload
  4. Scale brokers, topics, and consumers intentionally, not reactively
  5. Debug the most common production issues with confidence

This is the difference between "Kafka is running" and "Kafka is a reliable backbone of our system".

📚 Continue Learning

  • Practice: Set up a monitoring stack for your local Kafka cluster and create custom dashboards
  • Quiz: Test your understanding of Kafka monitoring and optimization concepts
  • Next Module: Move on to Module 9 - Final Project to build a complete real-time analytics platform
  • Related Resources:
      • [Kafka Monitoring Best Practices](https://kafka.apache.org/documentation/#monitoring)
      • [Prometheus Kafka Exporter](https://github.com/danielqsj/kafka_exporter)
      • [Grafana Kafka Dashboards](https://grafana.com/grafana/dashboards/?search=kafka)

Hands-on Examples

Complete Monitoring Setup

    
      ## docker-compose.monitoring.yml
      version: '3.8'
    
      services:
        zookeeper:
          image: confluentinc/cp-zookeeper:latest
          environment:
            ZOOKEEPER_CLIENT_PORT: 2181
            ZOOKEEPER_TICK_TIME: 2000
    
        kafka:
          image: confluentinc/cp-kafka:latest
          depends_on:
            - zookeeper
          ports:
            - "9092:9092"
            - "9999:9999"
          environment:
            KAFKA_BROKER_ID: 1
            KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
            # Two listeners: one for other containers (kafka:29092), one for the host
            KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
            KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
            KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
            KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
            KAFKA_JMX_PORT: 9999
            KAFKA_JMX_HOSTNAME: localhost
            KAFKA_JMX_OPTS: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 -Djava.rmi.server.hostname=localhost
    
        kafka-ui:
          image: provectuslabs/kafka-ui:latest
          depends_on:
            - kafka
          ports:
            - "8080:8080"
          environment:
            KAFKA_CLUSTERS_0_NAME: local
            KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092
    
        jmx-exporter:
          image: solsson/kafka-prometheus-jmx-exporter@sha256:6f82e2b0464f50da8104acd7363a9ddd122f5f6e2d78a8b1bfe0f7d3e90e7c0a
          ports:
            - "5555:5555"
          environment:
            KAFKA_JMX_HOSTNAME: kafka
            KAFKA_JMX_PORT: 9999
    
        prometheus:
          image: prom/prometheus:latest
          ports:
            - "9090:9090"
          volumes:
            - ./prometheus.yml:/etc/prometheus/prometheus.yml
            - ./alerts.yml:/etc/prometheus/alerts.yml
          command:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
    
        grafana:
          image: grafana/grafana:latest
          ports:
            - "3000:3000"
          environment:
            GF_SECURITY_ADMIN_PASSWORD: admin
            GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-simple-json-datasource
          volumes:
            - grafana-storage:/var/lib/grafana
            - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
            - ./grafana/datasources:/etc/grafana/provisioning/datasources
    
      volumes:
        grafana-storage:
    
      ## prometheus.yml
      global:
        scrape_interval: 15s
        evaluation_interval: 15s
    
      rule_files:
        - "alerts.yml"
    
      scrape_configs:
        - job_name: 'prometheus'
          static_configs:
            - targets: ['localhost:9090']
    
        - job_name: 'kafka-jmx'
          static_configs:
            - targets: ['jmx-exporter:5555']
          scrape_interval: 5s
    
        - job_name: 'kafka-ui'
          static_configs:
            - targets: ['kafka-ui:8080']
    
      ## alerts.yml
      groups:
        - name: kafka-alerts
          rules:
            - alert: KafkaDown
              expr: up{job="kafka-jmx"} == 0
              for: 1m
              labels:
                severity: critical
              annotations:
                summary: "Kafka broker is down"
                description: "Kafka broker has been down for more than 1 minute."
    
            - alert: HighConsumerLag
              expr: kafka_consumer_lag_sum > 10000
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High consumer lag detected"
                description: "Consumer lag is {{ $value }} messages."
    
            - alert: UnderReplicatedPartitions
              expr: kafka_cluster_partition_under_replicated_partitions > 0
              for: 2m
              labels:
                severity: critical
              annotations:
                summary: "Under-replicated partitions detected"
                description: "{{ $value }} partitions are under-replicated."
    
      ## Start monitoring stack:
      docker-compose -f docker-compose.monitoring.yml up -d
    
      ## Access points:
      # Kafka UI: http://localhost:8080
      # Prometheus: http://localhost:9090
      # Grafana: http://localhost:3000 (admin/admin)

This setup provides comprehensive monitoring with real-time metrics, alerting, and visualization for Kafka clusters.