Applies To: Hadoop HDFS NameNode
Category: Troubleshooting → HDFS
Issue Summary
The HDFS NameNode remains in safemode even after startup, leaving the file system effectively read-only (no writes, deletes, or replication changes) and signaling that the cluster is not fully healthy. In safemode the NameNode is waiting for DataNodes to report a sufficient percentage of the cluster's block replicas before it allows modifications.
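Before working through the steps below, you can confirm the current state from any host with the HDFS client installed using the standard dfsadmin subcommand:
hdfs dfsadmin -safemode get
If the output reports that safe mode is ON, continue with the steps below.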
Possible Cause(s)
Common reasons why this issue may occur:
Insufficient number of DataNodes reporting blocks.
DataNode failures or slow DataNode startups.
Misconfigured dfs.namenode.safemode.threshold-pct (percentage of blocks required).
Network issues preventing DataNodes from communicating with the NameNode.
Disk space full on DataNodes preventing block reports.
Step-by-Step Resolution
1. Check NameNode Logs:
Examine the NameNode logs. Look for messages about entering/exiting safemode, block reports received, and any errors related to block corruption or DataNode communication.
cat $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log
Look for lines such as "Exiting Safe mode" or "Leaving safe mode" to see whether the NameNode is attempting to exit:
grep -E "Exiting Safe mode|Leaving safe mode" $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log
Read the DataNode logs at:
cat $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log
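To see whether block reports are actually being sent and processed, a case-insensitive search of both logs can help; the exact wording of block-report messages varies between Hadoop versions, so treat this pattern as a starting point:
grep -i "block report" $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log
grep -i "block report" $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log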
2. Check DataNode Status:
Access the NameNode web UI (port 9870 is the Hadoop 3.x default; Hadoop 2.x uses 50070) and verify that a sufficient number of DataNodes are live and registered.
http://<hostname>:9870/dfshealth.html#tab-datanode
On the Datanodes tab, check for any DataNodes listed as "dead" or otherwise unhealthy.
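As a command-line alternative to the UI, the standard admin report lists live and dead DataNodes along with their capacity and last-contact information (the -live and -dead filters are available in recent Hadoop releases):
hdfs dfsadmin -report
hdfs dfsadmin -report -dead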
3. Inspect Missing/Corrupted Blocks:
On the NameNode UI, check the “Summary” section for “Missing Blocks” or “Corrupt Blocks”.
If missing blocks are present, identify the files causing them. These files may need to be deleted or recovered if possible.
hdfs fsck / -files -blocks -locations
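To narrow the output to just the affected files, fsck can list corrupt file blocks directly; it can also delete the corrupted files once you are certain they cannot be recovered, so use the -delete flag with the same caution as the forced safemode exit in step 5:
hdfs fsck / -list-corruptfileblocks
hdfs fsck / -delete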
4. Verify DataNode Connectivity:
Ensure DataNodes can communicate with the NameNode.
To check connectivity from a DataNode host to the NameNode (port 8020 is the default NameNode RPC port; use the port configured in fs.defaultFS if yours differs):
ping <IP-address-of-namenode>
telnet <IP-address-of-namenode> 8020
Check DataNode logs for connection errors to the NameNode.
cat $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log
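If telnet is not available, nc can perform the same port check, and a grep of the DataNode log will surface repeated connection attempts; the exact error wording varies by Hadoop version, so treat the pattern below as a starting point:
nc -vz <IP-address-of-namenode> 8020
grep -iE "retrying connect|connection refused" $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log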
5. Force Exit Safemode (with caution):
Only do this if you understand the implications and are certain no critical data will be lost. Forcing an exit when blocks are truly missing can lead to data loss.
To force the NameNode to leave safemode:
hdfs dfsadmin -safemode leave
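After forcing the exit, it is worth confirming the state and re-checking the namespace for missing or corrupt blocks with the standard commands:
hdfs dfsadmin -safemode get
hdfs fsck /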
6. Adjust Safemode Threshold (if necessary):
If you have a very small cluster or a specific use case, you can temporarily lower dfs.namenode.safemode.threshold-pct in hdfs-site.xml so the NameNode exits safemode with fewer reported blocks. This is generally not recommended for production.
Edit hdfs-site.xml (the value shown below, 0.999f, is the default; set a lower value such as 0.90f to relax the requirement, and note that the NameNode must be restarted for the change to take effect):
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
  <description>Specifies the percentage of blocks that should satisfy the minimal replication requirement defined by dfs.namenode.replication.min. Values less than or equal to 0 mean not to wait for any particular percentage of blocks before exiting safemode. Values greater than 1 will make safe mode permanent.</description>
</property>
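To confirm which value the NameNode is actually using after the change and restart, you can query the effective configuration with the standard getconf tool:
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct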
7. Restart DataNodes (if they are stuck):
If DataNodes are sluggish or failing to report blocks, try restarting them one at a time. On each affected DataNode host, run:
hadoop-daemon.sh stop datanode
hadoop-daemon.sh start datanode
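On Hadoop 3.x, hadoop-daemon.sh is deprecated in favor of the hdfs launcher; the equivalent commands, run on the affected DataNode host, are:
hdfs --daemon stop datanode
hdfs --daemon start datanode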
Additional Notes:
Safemode is a protective measure. The NameNode enters it to prevent data loss and inconsistency when the proportion of reported block replicas is below the configured threshold.
The ideal solution is to bring up all DataNodes and ensure all blocks are reported successfully.
In an HA (High Availability) setup, safemode behavior is managed differently, as there are active and standby NameNodes.
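In an HA deployment it helps to first confirm which NameNode is currently active before checking or forcing safemode. The service IDs nn1 and nn2 below are placeholders; substitute the IDs defined under dfs.ha.namenodes.<nameservice> in your configuration:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2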