Managing HDFS Space and Replication
Category: Troubleshooting → HDFS
Applies To: Apache Hadoop HDFS 2.x, 3.x
Issue summary:
Effective management of HDFS disk space and data replication is crucial for the stability, performance, and data availability of a Hadoop cluster. This document covers critical configurations, common challenges like excessive disk usage and under-replication, and best practices for maintaining a healthy HDFS environment.
Critical Properties and Concepts:
dfs.replication (in hdfs-site.xml):
Description: The default replication factor for new files written to HDFS.
Value: 3 is the standard for production environments. Lowering it saves space but reduces fault tolerance; increasing it improves availability and read throughput but consumes more disk space.
Note: Can be overridden per file using hdfs dfs -setrep.
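The space cost of replication is simple multiplication, but it is worth making explicit when planning capacity. A minimal sketch (assumes plain replication, no HDFS 3.x erasure coding, which changes this math):

```shell
# Physical bytes consumed = logical bytes written * replication factor.
logical_tb=10
replication=3
physical_tb=$((logical_tb * replication))
echo "${logical_tb} TB logical -> ${physical_tb} TB physical at replication ${replication}"

# Changing replication for an existing path (requires a running cluster):
# hdfs dfs -setrep -w 2 /data/archive
```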
dfs.blocksize (in hdfs-site.xml):
Description: The default block size for files stored in HDFS.
Value: 134217728 (128 MB) or 268435456 (256 MB) are common. Larger blocks reduce NameNode metadata overhead and improve sequential read performance for large files.
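The NameNode tracks one block object per block, so block count per file is what the block size actually controls. A quick ceiling-division sketch:

```shell
# Number of HDFS blocks for a file = ceil(file_size / dfs.blocksize).
block_bytes=$((128 * 1024 * 1024))          # 128 MiB (134217728)

file_bytes=$((1024 * 1024 * 1024))          # a 1 GiB file
blocks_1g=$(( (file_bytes + block_bytes - 1) / block_bytes ))

file_bytes=$((200 * 1024 * 1024))           # a 200 MiB file (not block-aligned)
blocks_200m=$(( (file_bytes + block_bytes - 1) / block_bytes ))

echo "1 GiB file: ${blocks_1g} blocks; 200 MiB file: ${blocks_200m} blocks"
```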
dfs.namenode.safemode.threshold-pct (in hdfs-site.xml):
Description: The minimum percentage of HDFS blocks that must be reported to the NameNode before it leaves Safe Mode (read-only state).
Value: 0.999f (default). Ensure enough DataNodes are active to meet this threshold after a restart.
fs.trash.interval (in core-site.xml):
Description: The amount of time (in minutes) that deleted files are retained in the user's HDFS Trash directory before being permanently purged.
Value: Default is 0 (disabled, immediate deletion). Common production value is 1440 (1 day) or 10080 (7 days).
Note: If enabled, users must periodically run hdfs dfs -expunge, or wait for the interval to pass, before space is actually freed.
dfs.namenode.fs-limits.max-blocks-per-file (in hdfs-site.xml):
Description: The maximum number of blocks a single file may consist of. This caps the NameNode metadata cost of any one file (for example, a file written with an unusually small block size); it does not address the small-files problem, which concerns many files rather than many blocks per file.
Value: Default 1048576 (1 million). Can be adjusted if you have extremely large files that exceed this.
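Taken together, the properties above might be set as in the illustrative fragment below. The values are the common ones discussed in this section, not a one-size-fits-all recommendation:

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>

<!-- core-site.xml -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value> <!-- keep deleted files in trash for 1 day -->
</property>
```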
Best Practices & Administrative Tasks
Monitor HDFS Disk Usage Regularly:
hdfs dfsadmin -report: Provides a high-level overview of total capacity, used space, and free space across all DataNodes. It also shows DataNode health.
hdfs dfs -du -h /<path>: Displays disk usage for specific directories and files in human-readable format. Use this to identify where space is being consumed (e.g., /user/hive/warehouse, /tmp).
Monitoring Tools: Integrate with monitoring solutions (e.g., NameNode Web UI, Ambari, Cloudera Manager, Prometheus/Grafana) for historical trends and alerts on high utilization.
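For scripted alerting, the summary section of hdfs dfsadmin -report is easy to parse. The report text below is an illustrative sample (field wording can vary slightly between releases, so verify against your version's output); on a live cluster, replace the literal with report="$(hdfs dfsadmin -report)":

```shell
# Extract "DFS Used%" from a (sample) `hdfs dfsadmin -report` summary
# and compare it against an alert threshold.
report='Configured Capacity: 53687091200 (50 GB)
Present Capacity: 50000000000 (46.57 GB)
DFS Remaining: 40000000000 (37.25 GB)
DFS Used: 10000000000 (9.31 GB)
DFS Used%: 20.00%
Under replicated blocks: 0'

used_pct=$(printf '%s\n' "$report" | awk '/^DFS Used%:/ {gsub(/%/, "", $3); print $3}')
alert=$(awk -v u="$used_pct" 'BEGIN { print (u + 0 > 80) ? "yes" : "no" }')
echo "DFS used: ${used_pct}% (alert: ${alert})"
```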
Mitigate the Small Files Problem:
Problem: HDFS is inefficient with many small files. Each file (and block) requires metadata in the NameNode's memory, leading to memory pressure and slow operations.
Solutions:
Combine Small Files: Before writing to HDFS, combine small files into larger ones (e.g., 128MB to 256MB) using tools like hadoop archive (HAR), SequenceFile, or Spark's repartition or coalesce followed by saveAs...File.
Optimal File Formats: Use columnar formats like Parquet or ORC, which allow for efficient storage and compression, often grouping small records into larger block structures.
Increase Block Size: If your workload consists mainly of very large files, consider increasing dfs.blocksize.
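A commonly cited rule of thumb (an estimate, not an exact figure) is roughly 150 bytes of NameNode heap per namespace object, where a file, a directory, and a block each count as one object. This sketch compares the same 10 TiB of data stored as 1 MiB files versus 128 MiB files:

```shell
# Rough NameNode heap cost: ~150 bytes per namespace object (rule of thumb).
bytes_per_object=150
total_mb=$((10 * 1024 * 1024))      # 10 TiB of data, in MiB

small_files=$((total_mb / 1))       # stored as 1 MiB files
small_objects=$((small_files * 2))  # each file = 1 file object + 1 block object
small_heap_mb=$((small_objects * bytes_per_object / 1024 / 1024))

large_files=$((total_mb / 128))     # stored as 128 MiB files
large_objects=$((large_files * 2))
large_heap_mb=$((large_objects * bytes_per_object / 1024 / 1024))

echo "1 MiB files: ~${small_heap_mb} MB heap; 128 MiB files: ~${large_heap_mb} MB heap"
```

Two orders of magnitude of NameNode heap for the same data volume is why consolidating small files matters.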
Implement Data Retention & Cleanup Policies:
Identify and Delete Old Data: Regularly review HDFS paths and delete old, unused, or temporary data that is no longer needed.
Automate Cleanup: For temporary directories (e.g., /tmp), set up cron jobs or lifecycle policies to automatically clean up files older than a certain age.
Empty HDFS Trash: If fs.trash.interval is enabled, remember that deleted files remain in trash and occupy space. Users should periodically run hdfs dfs -expunge to permanently remove them or wait for the configured interval.
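An age-based cleanup can be sketched by parsing the modification-date column of hdfs dfs -ls output. The listing below is a sample with placeholder paths; on a cluster, use listing="$(hdfs dfs -ls /tmp | tail -n +2)". This sketch assumes GNU date (for the -d flag):

```shell
# Select paths whose modification date (column 6 of `hdfs dfs -ls` output)
# is older than a cutoff. Sample listing; paths are placeholders.
listing='drwxr-xr-x   - etl  hadoop          0 2023-01-05 02:10 /tmp/job_a
-rw-r--r--   3 etl  hadoop  104857600 2024-06-01 11:45 /tmp/job_b.dat'

cutoff_epoch=$(date -d '2024-01-01' +%s)   # delete anything modified before this
old_paths=$(printf '%s\n' "$listing" | awk -v cutoff="$cutoff_epoch" '
  { cmd = "date -d \"" $6 "\" +%s"; cmd | getline ts; close(cmd);
    if (ts < cutoff) print $8 }')
echo "$old_paths"
# After reviewing the list:  hdfs dfs -rm -r -skipTrash $old_paths
```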
Manage Replication Factor and Under/Over-Replication:
Check Replication Health:
hdfs fsck / -files -blocks -locations: Identifies under-replicated, over-replicated, missing, or corrupt blocks.
Pipe the output through grep -v 'OK' to filter out healthy entries and surface only problems.
Adjust Replication Factor:
Use hdfs dfs -setrep -w <replication_factor> <path> to change the replication factor for existing files/directories. -w waits for completion.
Under-replication: If a block has fewer replicas than its target replication factor, HDFS re-replicates it automatically, provided DataNodes have free space. If not, add DataNodes or free up space.
Over-replication: If a block has more replicas than desired, HDFS will remove excess replicas to free space.
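For monitoring, the under-replicated count can be pulled out of the fsck summary. The summary text below is an illustrative sample (check the exact wording on your version); on a cluster, use summary="$(hdfs fsck /)":

```shell
# Extract the under-replicated block count from a (sample) fsck summary.
summary=' Total blocks (validated):      1024 (avg. block size 1048576 B)
 Under-replicated blocks:       12 (1.17 %)
 Corrupt blocks:                0'

under=$(printf '%s\n' "$summary" | awk '/Under-replicated blocks:/ {print $3}')
echo "under-replicated blocks: ${under}"
# If a specific path needs a higher factor:  hdfs dfs -setrep -w 3 /data/critical
```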
Handle Corrupt Blocks:
Identify: hdfs fsck / will report corrupt blocks.
Resolution: Restore from backup if possible. Otherwise you may need to delete the corrupt file (if the data is not critical) or regenerate it. For lease-recovery issues (a file write that did not complete cleanly), hdfs debug recoverLease -path <file_path> can sometimes help.
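A typical triage session looks like the following. The commands are standard hdfs tooling and must be run against a live cluster; the file path is a placeholder:

```shell
# List only the files that have corrupt blocks (less noisy than full fsck output)
hdfs fsck / -list-corruptfileblocks

# Inspect one affected file in detail
hdfs fsck /data/events/part-0042 -files -blocks -locations

# Last resort, when no backup exists and the data is expendable:
hdfs fsck /data/events/part-0042 -delete

# For a file stuck with an open lease after a failed write:
hdfs debug recoverLease -path /data/events/part-0042 -retries 5
```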
Manage HDFS Snapshots:
List Snapshottable Directories: hdfs lsSnapshottableDir
List Snapshots: hdfs dfs -ls /<snapshottable_dir>/.snapshot
Delete Old Snapshots: hdfs dfs -deleteSnapshot /<snapshottable_dir> <snapshot_name>. Snapshots consume space by preserving blocks that would otherwise be deleted or overwritten, so regularly clean up old, unnecessary snapshots.
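A full snapshot lifecycle, run against a live cluster (directory name is a placeholder):

```shell
# Enable snapshots on a directory (admin operation), take one, list, delete it.
hdfs dfsadmin -allowSnapshot /data/warehouse
hdfs dfs -createSnapshot /data/warehouse nightly-2024-06-01
hdfs lsSnapshottableDir
hdfs dfs -ls /data/warehouse/.snapshot
hdfs dfs -deleteSnapshot /data/warehouse nightly-2024-06-01
```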
Capacity Planning and DataNode Management:
Adding DataNodes: Plan for growth by adding new DataNodes to increase storage capacity and aggregate I/O throughput.
Decommissioning DataNodes: Use the HDFS decommissioning process to safely remove DataNodes without data loss.
Balance Data Across DataNodes:
start-balancer.sh: Use the HDFS Balancer to evenly distribute blocks across DataNodes, especially after adding new DataNodes or after significant data deletion/addition. Run it periodically during off-peak hours.
dfs.datanode.balance.bandwidthPerSec: Configure this property in hdfs-site.xml, or set it at runtime with hdfs dfsadmin -setBalancerBandwidth <bytes_per_second>, to cap the bandwidth each DataNode devotes to balancing and prevent the balancer from consuming too much network/disk resource.
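A typical off-peak balancer run against a live cluster:

```shell
# Cap balancer traffic at ~50 MB/s per DataNode (takes effect immediately,
# no restart needed), then balance until every DataNode is within 10
# percentage points of the cluster-average utilization.
hdfs dfsadmin -setBalancerBandwidth 52428800
hdfs balancer -threshold 10
```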