Kafka Retention, Cleanup, and Performance Tuning

Category: Administration → Kafka 

Applies to: Apache Kafka 2.x, 3.x 

Issue Summary 

This document outlines critical configurations and best practices for managing data retention, ensuring efficient cleanup, and optimizing the performance of Apache Kafka brokers. Proper configuration in these areas is essential for preventing disk space issues, maintaining broker stability, and achieving desired throughput and latency. 

Critical Properties (in server.properties) 

  • log.retention.hours 

  • Description: The default number of hours to retain log segments (data) before they are marked for deletion. This applies to topics that don't have a specific retention.ms or retention.bytes configured. 

  • Value: Default is 168 (7 days). Adjust based on your data retention requirements and disk capacity. For high-volume topics, a shorter retention period might be necessary. 

  • log.retention.bytes 

  • Description: The default maximum size (in bytes) of the log to retain per topic partition. Retention is enforced per partition; when both log.retention.hours and log.retention.bytes are set, whichever limit is reached first triggers deletion of the oldest segments. 

  • Value: Default is -1 (no limit). Set a specific byte limit for tighter disk space control, especially for topics with variable message sizes or bursty traffic. 

  • log.segment.bytes 

  • Description: The maximum size of a single log segment file. When a segment reaches this size, it's rolled over, and a new active segment is created. Smaller segments mean more files and more frequent rolls, potentially impacting performance. 

  • Value: Default is 1073741824 (1 GB). A common range is 536870912 (512 MB) to 1073741824 (1 GB). Larger segments mean fewer files and less frequent rolls, which can help sustained producer throughput; note that retention is applied at segment granularity. 

  • log.retention.check.interval.ms 

  • Description: The frequency (in milliseconds) with which the broker checks for log segments that are eligible for deletion under the retention policies. 

  • Value: Default is 300000 (5 minutes). For aggressive cleanup, you might reduce this, but too frequent checks can add overhead. 

  • message.max.bytes 

  • Description: The largest record batch size the broker will accept (after compression, if compression is enabled). This is a broker-side limit. 

  • Value: Default is roughly 1 MB (1048588 bytes in recent releases). Adjust if you need to send larger messages, and coordinate with the producer's max.request.size and the consumer's fetch.max.bytes / max.partition.fetch.bytes. 

  • num.partitions 

  • Description: The default number of partitions for newly created topics when no partition count is specified. 

  • Value: Default is 1. For better parallelism and throughput, set a higher default (e.g., 3 or 5) or specify the partition count explicitly per topic at creation time. 

  • num.recovery.threads.per.data.dir 

  • Description: The number of threads per data directory used for log recovery at broker startup and for flushing log segments at shutdown. 

  • Value: Default is 1. Increase for faster recovery of brokers with many partitions. 

  • num.io.threads 

  • Description: The number of threads the broker uses to process requests, which may include disk I/O. 

  • Value: Default is 8. Increasing this can improve throughput on brokers with many concurrent client connections or high message volumes, but it depends on available CPU cores. 

  • num.network.threads 

  • Description: The number of threads the broker uses to receive requests from the network and send responses. 

  • Value: Default is 3. Increase for very high connection counts or throughput, but ensure sufficient CPU headroom. 

  • log.dirs 

  • Description: A comma-separated list of directories where the log data for Kafka will be stored. 

  • Value: Multiple directories on separate physical drives (ideally dedicated, high-speed drives such as NVMe SSDs) are crucial for I/O parallelism. If RAID is used, prefer RAID 10 for both performance and resilience; RAID 0 improves performance but provides no redundancy. A consolidated server.properties example pulling these settings together follows this list. 
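
Putting the above together, a minimal server.properties sketch is shown below. All values are illustrative starting points (and the log.dirs paths are placeholders), not recommendations; tune them against your retention requirements, disk capacity, and hardware.

    # Retention: whichever limit is reached first triggers deletion
    log.retention.hours=168
    log.retention.bytes=-1
    log.segment.bytes=1073741824
    log.retention.check.interval.ms=300000

    # Broker-side record batch size limit (~1 MB)
    message.max.bytes=1048588

    # Parallelism and recovery
    num.partitions=3
    num.recovery.threads.per.data.dir=4
    num.io.threads=8
    num.network.threads=3

    # One directory per physical disk (placeholder paths)
    log.dirs=/data1/kafka-logs,/data2/kafka-logs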

Additional Notes: 

  • Topic-Level Overrides: Retention policies (retention.ms, retention.bytes, segment.bytes) can be set at the topic level when creating a topic (kafka-topics.sh --create --config ...) or afterwards with kafka-configs.sh --alter; see the examples below. This allows fine-grained control over the data lifecycle of different streams. Topic-level configurations override broker-level defaults. 
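
For example (topic name and values are illustrative; 86400000 ms = 24 hours):

    # Set retention at topic creation
    bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
      --topic clickstream --partitions 6 --replication-factor 3 \
      --config retention.ms=86400000 --config segment.bytes=536870912

    # Tighten retention on an existing topic
    bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
      --entity-type topics --entity-name clickstream \
      --add-config retention.ms=86400000,retention.bytes=10737418240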

  • Disk Space Management: Kafka's log cleanup relies primarily on the retention policies (log.retention.hours, log.retention.bytes). Once a segment exceeds the retention age or a partition exceeds its size limit, the oldest segments are marked for deletion. Keep sufficient free disk space (at least 20-30%) on the Kafka data directories to prevent broker crashes from disk-full errors. 
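
To check how much disk each partition is actually consuming, the kafka-log-dirs.sh tool shipped with the broker reports per-partition log sizes (the topic name below is illustrative):

    bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 \
      --describe --topic-list clickstream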

  • Hardware Considerations: 

  • Disks: Use fast, dedicated storage (NVMe SSDs are highly recommended) for log.dirs. Avoid sharing these disks with OS or other applications. Distribute partitions across multiple mount points (directories in log.dirs) backed by separate physical disks/RAIDs to parallelize I/O. 

  • RAM: Allocate sufficient RAM for the JVM heap (KAFKA_HEAP_OPTS) for Kafka brokers, typically 6-8 GB for production, but not excessively large to avoid long garbage collection pauses. Kafka leverages OS page cache heavily, so ample system RAM is more critical than a massive JVM heap. 

  • Network: Ensure high-bandwidth, low-latency network connectivity between brokers, and between brokers and clients. 

  • Garbage Collection Tuning: For JVM tuning, consider using G1GC (-XX:+UseG1GC) and setting appropriate heap sizes (-Xms, -Xmx). Monitor GC logs for pauses. 
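
A typical starting point, assuming a JDK 11+ broker (heap size and GC log path are illustrative; kafka-server-start.sh picks up both environment variables):

    export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
    export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
      -Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=100M"
    bin/kafka-server-start.sh config/server.properties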

  • Producer/Consumer Tuning: 

  • Producers: Adjust acks (acknowledgments), batch.size, linger.ms, compression.type to balance durability, throughput, and latency. 

  • Consumers: Tune fetch.min.bytes, fetch.max.wait.ms, and max.poll.records, and scale the number of consumer instances (up to the partition count) for parallel consumption. 
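
A sketch of client-side settings (all values are illustrative trade-offs, not recommendations), usable via --producer.config / --consumer.config with the console tools or loaded into client code:

    # producer.properties: durability plus batched, compressed sends
    acks=all
    batch.size=65536
    linger.ms=10
    compression.type=lz4

    # consumer.properties: batch fetches to reduce round-trips
    fetch.min.bytes=1048576
    fetch.max.wait.ms=500
    max.partition.fetch.bytes=1048576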

  • Monitoring: Implement comprehensive monitoring (e.g., JMX metrics for broker health, log size, message rates, CPU, disk I/O, network I/O) to identify bottlenecks and anticipate capacity needs. Tools like Prometheus/Grafana are commonly used. 
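
For example, to expose the broker's JMX metrics for a Prometheus/Grafana stack (the port is illustrative), set JMX_PORT before starting the broker, then scrape MBeans such as kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec or name=BytesInPerSec:

    export JMX_PORT=9999
    bin/kafka-server-start.sh config/server.properties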
