Kafka Retention, Cleanup, and Performance Tuning
Category: Administration → Kafka
Applies to: Apache Kafka 2.x, 3.x
Issue Summary
This document outlines critical configurations and best practices for managing data retention, ensuring efficient cleanup, and optimizing the performance of Apache Kafka brokers. Proper configuration in these areas is essential for preventing disk space issues, maintaining broker stability, and achieving desired throughput and latency.
Critical Properties (in server.properties)
log.retention.hours
Description: The default number of hours to retain log segments (data) before they are marked for deletion. This applies to topics that don't have a specific retention.ms or retention.bytes configured.
Value: Default is 168 (7 days). Adjust based on your data retention requirements and disk capacity. For high-volume topics, a shorter retention period might be necessary.
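For example, a broker-wide default of three days might look like this in server.properties (the value is illustrative, not a recommendation; size it to your retention needs and disk capacity):

    # server.properties -- illustrative value
    # Retain log segments for 72 hours before they become eligible for deletion.
    log.retention.hours=72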
log.retention.bytes
Description: The default maximum total size (in bytes) of log segments to retain per topic partition. Time- and size-based retention apply independently: when both log.retention.hours and log.retention.bytes are set, segments are deleted as soon as either limit is exceeded.
Value: Default is -1 (no limit). Set a specific byte limit for tighter disk space control, especially for topics with variable message sizes or bursty traffic.
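Because this limit applies per partition, total disk usage scales with partition count and replication factor. A hedged sketch of the sizing arithmetic, using illustrative numbers:

    # server.properties -- illustrative sizing
    # Cap each partition at ~10 GB of retained data.
    log.retention.bytes=10737418240
    #
    # Rough worst-case disk usage for one topic:
    #   partitions (12) x retention.bytes (10 GB) x replication factor (3)
    #   = ~360 GB across the cluster (plus the active segment per partition).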
log.segment.bytes
Description: The maximum size of a single log segment file. When a segment reaches this size, it's rolled over, and a new active segment is created. Smaller segments mean more files and more frequent rolls, potentially impacting performance.
Value: Default is 1073741824 (1 GB). Common values range from 536870912 (512 MB) to 1073741824 (1 GB). Larger segments reduce file counts and roll frequency, but retention operates on whole segments, so larger segments also make cleanup coarser.
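As an illustration, pairing a 512 MB segment size with the retention settings above keeps the delete granularity reasonably fine:

    # server.properties -- illustrative value
    # Roll to a new segment every 512 MB; smaller segments let retention
    # delete data in finer-grained chunks, at the cost of more files.
    log.segment.bytes=536870912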
log.retention.check.interval.ms
Description: The frequency (in milliseconds) with which the log cleaner checks for log segments that are eligible for retention-based deletion.
Value: Default is 300000 (5 minutes). For aggressive cleanup, you might reduce this, but too frequent checks can add overhead.
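For instance, halving the default interval makes retention react faster at a small scheduling cost (value illustrative):

    # server.properties -- illustrative value
    # Check for deletable segments every 2.5 minutes instead of 5.
    log.retention.check.interval.ms=150000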
message.max.bytes
Description: The largest size of a message that can be appended to the log. This is a broker-side limit.
Value: Default is 1048588 (about 1 MB) in recent releases. Adjust if you need to send larger messages. Must be coordinated with the producer's max.request.size and the consumer's max.partition.fetch.bytes / fetch.max.bytes.
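These limits must move together across broker, producer, and consumer; a hedged sketch for ~5 MB messages (values illustrative):

    # Broker (server.properties): largest record batch accepted.
    message.max.bytes=5242880

    # Producer config: a single request must fit the largest batch you send.
    max.request.size=5242880

    # Consumer config: per-partition fetch ceiling; keep it at least as
    # large as the broker limit so big batches are fetched efficiently.
    max.partition.fetch.bytes=5242880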
num.partitions
Description: The default number of partitions for newly created topics when no partition count is specified.
Value: Default is 1. For better parallelism and throughput, set a higher default (e.g., 3 or 5) or specify the partition count per topic at creation time, as in the example below.
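Rather than raising the broker-wide default, partition counts are usually set per topic at creation; a sketch (topic name and counts are hypothetical):

    # Create a topic with an explicit partition count instead of relying
    # on the broker default (num.partitions).
    bin/kafka-topics.sh --bootstrap-server localhost:9092 \
      --create --topic orders \
      --partitions 6 --replication-factor 3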
num.recovery.threads.per.data.dir
Description: The number of threads per data directory used for log recovery at broker startup and for flushing logs at shutdown.
Value: Default is 1. Increase for faster recovery of brokers with many partitions.
num.io.threads
Description: The number of threads the broker uses to process requests from producers and consumers, which may include disk I/O.
Value: Default is 8. Increasing this can improve throughput on brokers with many concurrent client connections or high message volumes, but it depends on available CPU cores.
num.network.threads
Description: The number of threads that handle network requests (receiving/sending data).
Value: Default is 3. Increase for very high connection counts or throughput, but ensure sufficient CPU headroom.
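The three thread-pool settings above might be raised together on a broker with ample cores and many partitions; a hedged sketch (values illustrative, validate against your CPU count):

    # server.properties -- illustrative values for a larger broker
    # Faster startup recovery / shutdown flush with many partitions.
    num.recovery.threads.per.data.dir=4
    # Request-processing threads (may include disk I/O); scale with cores.
    num.io.threads=16
    # Network receive/send threads.
    num.network.threads=6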
log.dirs
Description: A comma-separated list of directories where the log data for Kafka will be stored.
Value: Multiple directories on separate physical drives (ideally dedicated, high-speed drives such as NVMe SSDs) are crucial for I/O performance. Note that RAID 0 improves throughput but provides no redundancy; resilience typically comes from Kafka's replication, not the disk layer.
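For example, spreading data across two dedicated mount points (paths are hypothetical):

    # server.properties -- paths are hypothetical
    # Each directory should sit on its own physical disk; Kafka spreads
    # partitions across the listed directories.
    log.dirs=/mnt/kafka-data-1,/mnt/kafka-data-2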
Additional Notes:
Topic-Level Overrides: Retention policies (retention.ms, retention.bytes, segment.bytes) can be set at the topic level with kafka-configs.sh --alter, or via --config when creating a topic with kafka-topics.sh. This allows fine-grained control over the data lifecycle of different streams; topic-level configurations override broker-level defaults, as in the example below.
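A sketch of a topic-level override using kafka-configs.sh (topic name and value are illustrative):

    # Override retention for one topic to 24 hours without touching
    # the broker-wide defaults.
    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
      --entity-type topics --entity-name orders \
      --alter --add-config retention.ms=86400000

    # Verify the override.
    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
      --entity-type topics --entity-name orders --describe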
Disk Space Management: Kafka's log cleanup mechanism primarily relies on retention policies (log.retention.hours, log.retention.bytes). Once a segment is old enough, or its partition reaches the configured size limit, it is marked for deletion. Ensure sufficient free disk space (at least 20-30%) on your Kafka data directories to prevent broker crashes due to disk-full errors.
Hardware Considerations:
Disks: Use fast, dedicated storage (NVMe SSDs are highly recommended) for log.dirs. Avoid sharing these disks with OS or other applications. Distribute partitions across multiple mount points (directories in log.dirs) backed by separate physical disks/RAIDs to parallelize I/O.
RAM: Allocate sufficient RAM for the JVM heap (KAFKA_HEAP_OPTS) for Kafka brokers, typically 6-8 GB for production, but not excessively large to avoid long garbage collection pauses. Kafka leverages OS page cache heavily, so ample system RAM is more critical than a massive JVM heap.
Network: Ensure high-bandwidth, low-latency network connectivity between brokers, and between brokers and clients.
Garbage Collection Tuning: For JVM tuning, consider using G1GC (-XX:+UseG1GC) and setting appropriate heap sizes (-Xms, -Xmx). Monitor GC logs for pauses.
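A hedged example of heap and GC flags passed through Kafka's standard environment variables (sizes and flags are illustrative; measure GC behavior before adopting):

    # Set before invoking kafka-server-start.sh; sizes are illustrative.
    export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
    # G1GC with a modest pause-time goal; monitor GC logs to validate.
    export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20"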
Producer/Consumer Tuning:
Producers: Adjust acks (acknowledgments), batch.size, linger.ms, and compression.type to balance durability, throughput, and latency.
Consumers: Tune fetch.min.bytes, fetch.max.wait.ms, and parallelism (consumer count relative to partition count) for optimal consumption; a combined sketch follows below.
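A combined sketch of client-side settings (values illustrative, not a recommendation):

    # producer.properties -- throughput-leaning, still durable
    acks=all
    batch.size=65536        # bytes per partition batch
    linger.ms=10            # wait up to 10 ms to fill batches
    compression.type=lz4

    # consumer.properties -- batch fetches for throughput
    fetch.min.bytes=65536   # wait for at least 64 KB per fetch...
    fetch.max.wait.ms=500   # ...or 500 ms, whichever comes first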
Monitoring: Implement comprehensive monitoring (e.g., JMX metrics for broker health, log size, message rates, CPU, disk I/O, network I/O) to identify bottlenecks and anticipate capacity needs. Tools like Prometheus/Grafana are commonly used.
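Broker JMX metrics can be exposed by setting JMX_PORT before startup (port is illustrative); Prometheus exporters typically scrape from there:

    # Expose JMX for monitoring agents; the port is illustrative.
    export JMX_PORT=9999
    bin/kafka-server-start.sh config/server.properties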