Managing HDFS Space and Replication

Category: Troubleshooting → HDFS 

Applies To: Apache Hadoop HDFS 2.x, 3.x 

Issue Summary: 

Effective management of HDFS disk space and data replication is crucial for the stability, performance, and data availability of a Hadoop cluster. This document covers critical configurations, common challenges like excessive disk usage and under-replication, and best practices for maintaining a healthy HDFS environment. 

Critical Properties and Concepts: 

  • dfs.replication (in hdfs-site.xml): 

      • Description: The default replication factor for new files written to HDFS. 

      • Value: 3 is the standard for production environments. Lowering it saves space but reduces fault tolerance; increasing it improves availability and read throughput but consumes more disk space. 

      • Note: Can be overridden per file using hdfs dfs -setrep. 

  • dfs.blocksize (in hdfs-site.xml): 

      • Description: The default block size for files stored in HDFS. 

      • Value: 134217728 (128 MB) or 268435456 (256 MB) are common. Larger blocks reduce NameNode metadata overhead and improve sequential read performance for large files. 

  • dfs.namenode.safemode.threshold-pct (in hdfs-site.xml): 

      • Description: The minimum percentage of HDFS blocks that must be reported to the NameNode before it leaves Safe Mode (read-only state). 

      • Value: 0.999f (default). Ensure enough DataNodes are active to meet this threshold after a restart. 

  • fs.trash.interval (in core-site.xml): 

      • Description: The amount of time (in minutes) that deleted files are retained in the user's HDFS Trash directory before being permanently purged. 

      • Value: Default is 0 (disabled, immediate deletion). Common production values are 1440 (1 day) or 10080 (7 days). 

      • Note: If enabled, users must periodically run hdfs dfs -expunge, or wait for the interval to pass, before the space is actually freed. 

  • dfs.namenode.fs-limits.max-blocks-per-file (in hdfs-site.xml): 

      • Description: The maximum number of blocks a single file may consist of. This caps NameNode metadata growth from one enormous file (or from a file written with a misconfigured, very small block size). 

      • Value: Default 1048576 (1 million) in Hadoop 2.x; Hadoop 3.x lowered the default to 10000. Raise it only if you have legitimately enormous files that exceed the limit. 
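Taken together, the properties above might appear in a site configuration as follows. The values shown are illustrative starting points, not recommendations for every cluster:

```xml
<!-- hdfs-site.xml (illustrative values) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 MB -->
</property>

<!-- core-site.xml -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value> <!-- retain deleted files in trash for 1 day -->
</property>
```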

Best Practices & Administrative Tasks 

  • Monitor HDFS Disk Usage Regularly: 

      • hdfs dfsadmin -report: Provides a high-level overview of total capacity, used space, and free space across all DataNodes. It also shows DataNode health. 

      • hdfs dfs -du -h /<path>: Displays disk usage for specific directories and files in human-readable format. Use this to identify where space is being consumed (e.g., /user/hive/warehouse, /tmp). 

      • Monitoring Tools: Integrate with monitoring solutions (e.g., NameNode Web UI, Ambari, Cloudera Manager, Prometheus/Grafana) for historical trends and alerts on high utilization. 
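As a minimal sketch, the `hdfs dfs -du` output (without -h, so sizes stay numeric) can be piped through sort to rank the largest consumers. The pipeline below is demonstrated on captured sample output, since column layout varies slightly by version (the first field is always the logical size, the last is the path); on a live cluster you would pipe `hdfs dfs -du /user` into the same function:

```shell
#!/bin/sh
# Rank the N largest entries from `hdfs dfs -du` output.
# $1 = how many entries to show; input is read from stdin.
top_consumers() {
  sort -k1,1 -n -r | head -n "$1" | awk '{print $1, $NF}'
}

# Captured sample output (bytes, bytes-with-replication, path):
sample='1073741824 3221225472 /user/hive/warehouse
52428800 157286400 /tmp
268435456 805306368 /user/etl/staging'

printf '%s\n' "$sample" | top_consumers 2
```

On a real cluster: `hdfs dfs -du /user | top_consumers 3` (the function name is illustrative, not an HDFS command).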

  • Mitigate the Small Files Problem: 

      • Problem: HDFS is inefficient with many small files. Each file (and block) requires metadata in the NameNode's memory, leading to memory pressure and slow operations. 

      • Solutions: 

          • Combine Small Files: Before writing to HDFS, combine small files into larger ones (e.g., 128 MB to 256 MB) using tools like hadoop archive (HAR), SequenceFile, or Spark's repartition or coalesce followed by saveAs...File. 

          • Optimal File Formats: Use columnar formats like Parquet or ORC, which allow for efficient storage and compression, often grouping small records into larger block structures. 

          • Increase Block Size: If your workload consists mainly of very large files, consider increasing dfs.blocksize. 
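To see why file count matters so much, a back-of-the-envelope estimate helps. A commonly cited rule of thumb is roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the helper below is an illustrative approximation built on that assumption, not an exact sizing formula:

```shell
#!/bin/sh
# Rough NameNode heap estimate, assuming ~150 bytes of heap per
# namespace object (file or block). Approximation only.
estimate_heap_mb() {
  files=$1; blocks_per_file=$2
  objects=$(( files + files * blocks_per_file ))   # file objects + block objects
  echo $(( objects * 150 / 1024 / 1024 ))          # result in MB
}

estimate_heap_mb 100000000 1   # 100M small files, 1 block each -> 28610 (~28 GB)
estimate_heap_mb 1000000 4     # 1M large files, 4 blocks each  -> 715 (~0.7 GB)
```

The same logical data volume stored as a few large files instead of millions of small ones cuts the NameNode's memory footprint by orders of magnitude.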

  • Implement Data Retention & Cleanup Policies: 

      • Identify and Delete Old Data: Regularly review HDFS paths and delete old, unused, or temporary data that is no longer needed. 

      • Automate Cleanup: For temporary directories (e.g., /tmp), set up cron jobs or lifecycle policies to automatically clean up files older than a certain age. 

      • Empty HDFS Trash: If fs.trash.interval is enabled, remember that deleted files remain in trash and occupy space. Users should periodically run hdfs dfs -expunge to permanently remove them or wait for the configured interval. 
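A cron-driven cleanup typically parses the modification date that `hdfs dfs -ls` prints (column 6, YYYY-MM-DD) and removes paths older than a cutoff. The helper below is a sketch of just the age test (the function name is illustrative; GNU date is assumed); a wrapper would feed qualifying paths to `hdfs dfs -rm -r -skipTrash`:

```shell
#!/bin/sh
# Age test for cron-style HDFS cleanup scripts.
# $1 = file date (YYYY-MM-DD, as printed by `hdfs dfs -ls`)
# $2 = cutoff as epoch seconds
is_older() {
  [ "$(date -d "$1" +%s)" -lt "$2" ]
}

cutoff=$(date -d '2024-01-01' +%s)   # in cron, typically: date -d '-7 days' +%s
if is_older 2023-06-15 "$cutoff"; then echo old; else echo recent; fi
```

A wrapper might look like `hdfs dfs -ls /tmp | awk '{print $6, $8}'` piped into a loop that calls is_older on each date/path pair.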

  • Manage Replication Factor and Under/Over-Replication: 

      • Check Replication Health: 

          • hdfs fsck / -files -blocks -locations: Identifies under-replicated, over-replicated, or corrupt blocks. 

          • Append | grep -v 'OK' to the command above to filter the output down to problem entries. 

      • Adjust Replication Factor: 

          • Use hdfs dfs -setrep -w <replication_factor> <path> to change the replication factor for existing files/directories; -w waits for completion. 

          • Under-replication: If a block has fewer replicas than dfs.replication, HDFS automatically re-replicates it, provided DataNodes have free space. If not, add DataNodes or free up space. 

          • Over-replication: If a block has more replicas than desired, HDFS removes the excess replicas to free space. 
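The capacity impact of -setrep is simple arithmetic: raw disk consumed equals logical size times the replication factor. A trivial illustration (helper name is ours, not an HDFS command):

```shell
#!/bin/sh
# Raw capacity consumed = logical size x replication factor.
# $1 = logical size in TB, $2 = replication factor
raw_tb() { echo $(( $1 * $2 )); }

raw_tb 10 3   # -> 30 : a 10 TB dataset at replication 3 occupies 30 TB raw
raw_tb 10 2   # -> 20 : dropping to 2 frees 10 TB raw, at reduced fault tolerance
```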

  • Handle Corrupt Blocks: 

      • Identify: hdfs fsck / will report corrupt blocks. 

      • Resolution: If possible, restore from backup. If not, you may need to manually delete the corrupt file (if the data is not critical) or replace it. For lease recovery issues (when a file write didn't complete cleanly), hdfs debug recoverLease -path <file_path> can sometimes help. 
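For alerting, the counters in the `hdfs fsck /` summary can be extracted with standard text tools. The sketch below runs against captured sample output, since exact spacing varies by version; on a cluster you would pipe `hdfs fsck /` into the same function (whose name is illustrative):

```shell
#!/bin/sh
# Extract a numeric counter from an `hdfs fsck` summary.
# $1 = counter label, e.g. "Under-replicated blocks"; input on stdin.
fsck_counter() {
  grep -F "$1:" | awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+$/){print $i; exit}}'
}

# Captured sample of an fsck summary:
sample='Total blocks (validated): 4612 (avg. block size 13456 B)
 Under-replicated blocks: 12 (0.26 %)
 Mis-replicated blocks: 0 (0.0 %)
 Corrupt blocks: 0'

printf '%s\n' "$sample" | fsck_counter 'Under-replicated blocks'
```

A monitoring job could alert whenever the under-replicated or corrupt counter is non-zero.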

  • Manage HDFS Snapshots: 

      • List Snapshottable Directories: hdfs lsSnapshottableDir 

      • List Snapshots: hdfs dfs -ls /<snapshottable_dir>/.snapshot 

      • Delete Old Snapshots: hdfs dfs -deleteSnapshot /<snapshottable_dir> <snapshot_name>. Snapshots consume space by preserving blocks that would otherwise be deleted or overwritten. Regularly clean up old, unnecessary snapshots. 
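Snapshot cleanup is easy to automate when snapshots carry a date stamp in their name. The sketch below assumes a site-specific, hypothetical naming convention of sYYYYMMDD (this is not an HDFS feature, just a common pattern); a wrapper would pass each selected name to `hdfs dfs -deleteSnapshot /<snapshottable_dir> <name>`:

```shell
#!/bin/sh
# Select date-stamped snapshot names older than a cutoff.
# Assumes the (hypothetical) naming convention sYYYYMMDD.
# $1 = cutoff date as YYYYMMDD; snapshot names are read from stdin.
old_snapshots() {
  awk -v cutoff="$1" 'substr($0,1,1)=="s" && substr($0,2)+0 < cutoff+0 {print}'
}

printf 's20231201\ns20240315\ns20230704\n' | old_snapshots 20240101
```

On a cluster, the input would come from listing `/<snapshottable_dir>/.snapshot`.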

  • Capacity Planning and DataNode Management: 

      • Adding DataNodes: Plan for growth by adding new DataNodes to increase storage capacity and aggregate I/O throughput. 

      • Decommissioning DataNodes: Use the HDFS decommissioning process to safely remove DataNodes without data loss. 

  • Balance Data Across DataNodes: 

      • start-balancer.sh (or hdfs balancer): Use the HDFS Balancer to evenly distribute blocks across DataNodes, especially after adding new DataNodes or after significant data deletion/addition. Run it periodically during off-peak hours. 

      • dfs.datanode.balance.bandwidthPerSec: Set this property in hdfs-site.xml, or adjust it at runtime with hdfs dfsadmin -setBalancerBandwidth <bytes_per_sec>, to cap the bandwidth the balancer uses and prevent it from consuming too much network/disk resource. 
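The bandwidth cap is specified in bytes per second, so it is worth converting from a human-friendly MB/s figure explicitly. A small sketch (the helper name is ours; the hdfs commands are shown for illustration only):

```shell
#!/bin/sh
# Convert MB/s to the bytes/sec unit expected by
# dfs.datanode.balance.bandwidthPerSec and `hdfs dfsadmin -setBalancerBandwidth`.
mbps_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

bw=$(mbps_to_bytes 50)   # 50 MB/s
echo "$bw"               # -> 52428800

# On a live cluster you might then run:
#   hdfs dfsadmin -setBalancerBandwidth "$bw"
#   hdfs balancer -threshold 10
```

The -threshold argument (percentage deviation from average utilization that the balancer tolerates) defaults to 10; lowering it balances more aggressively at the cost of more data movement.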

    • Related Articles

    • Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components
    • Disk space full on /tmp Mount in Linux
    • Monitoring and Managing Resource Consumption on a Single Node
    • Fixing 'HDFS Command Not Found' Errors: Step-by-Step Guide
    • Namenode Not Exiting Safe Mode