Namenode Not Exiting Safe Mode

Applies To: Hadoop HDFS NameNode 
Category: Troubleshooting → HDFS 

Issue Summary 

The HDFS NameNode remains in safemode even after startup, preventing write operations to HDFS and signaling that the cluster is not fully healthy. Safemode usually means the NameNode is still waiting for a sufficient percentage of blocks to be reported by the DataNodes. 
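
To confirm that the NameNode is still in safemode (rather than just passing through it briefly at startup), check its current status with the standard dfsadmin command: 

hdfs dfsadmin -safemode get 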

Possible Cause(s)  

Common reasons why this issue may occur: 

  1. Insufficient number of DataNodes reporting blocks. 

  2. DataNode failures or slow DataNode startups. 

  3. Misconfigured dfs.namenode.safemode.threshold-pct (percentage of blocks required). 

  4. Network issues preventing DataNodes from communicating with the NameNode. 

  5. Disk space full on DataNodes, preventing block reports. 

Step-by-Step Resolution  

  1. Check NameNode Logs: 

  • Examine the NameNode logs. Look for messages about entering/exiting safemode, block reports received, and any errors related to block corruption or DataNode communication. 

cat $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log  

  • Look for lines like Exiting Safe mode or Leaving safe mode to see whether it is attempting to exit. 

cat $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log | grep -E "Exiting Safe mode|Leaving safe mode" 
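
The NameNode log also reports how many blocks are still outstanding before the safemode threshold is reached; the exact wording varies between Hadoop versions, but a search along these lines usually finds it: 

grep -i "reported blocks" $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log 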

  2. Check DataNode Logs: 

  • Read the DataNode logs at: 

cat $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log
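
To narrow the output, filter for errors and block report activity (log message wording may differ slightly across versions): 

grep -iE "error|exception|block report" $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log 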

  3. Check DataNode Status: 

  • Access the NameNode UI and verify that a sufficient number of DataNodes are live and registered. 

  • Check for any DataNodes listed as dead or unhealthy. 

http://<hostname>:9870/dfshealth.html#tab-datanode 
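
If the UI is not reachable, the same information is available from the command line; the dfsadmin report lists live and dead DataNodes along with their capacity and last contact time: 

hdfs dfsadmin -report 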

  4. Inspect Missing/Corrupted Blocks: 

  • On the NameNode UI, check the Summary section for Missing Blocks or Corrupt Blocks. 

  • If missing blocks are present, identify the files causing them. These files may need to be deleted or recovered if possible. 

hdfs fsck / -files -blocks -locations 
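
To focus on just the affected files, fsck can list the corrupt file blocks directly; as a last resort, the -delete option permanently removes the corrupted files, so use it only after confirming they cannot be recovered: 

hdfs fsck / -list-corruptfileblocks 

hdfs fsck <path-to-affected-file> -delete 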

  5. Verify DataNode Connectivity: 

  • Ensure DataNodes can communicate with the NameNode. 

To check connectivity from a DataNode to the NameNode: 

ping <IP-address-of-namenode> 

telnet <IP-address-of-namenode> 8020 
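
The RPC port depends on your configuration (8020 is a common default, though some deployments use 9000); you can confirm the address clients are configured to use with: 

hdfs getconf -confKey fs.defaultFS 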

  • Check DataNode logs for connection errors to the NameNode. 

cat $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log 

  6. Force Exit Safemode (with caution): 

  • Only do this if you understand the implications and are certain no critical data will be lost. Forcing an exit when blocks are truly missing can lead to data loss. 

To leave the safe mode: 

hdfs dfsadmin -safemode leave 
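
Afterwards, confirm that safemode is actually off; in scripts, the wait option can be used instead, as it blocks until the NameNode leaves safemode: 

hdfs dfsadmin -safemode get 

hdfs dfsadmin -safemode wait 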

  7. Adjust Safemode Threshold (if necessary): 

  • If you have a very small cluster or a specific use case, you might temporarily lower dfs.namenode.safemode.threshold-pct in hdfs-site.xml to allow the NameNode to exit safemode with fewer reported blocks. This is generally not recommended for production. 

Edit hdfs-site.xml: 

<property> 
  <name>dfs.namenode.safemode.threshold-pct</name> 
  <value>0.999f</value> 
  <description>Specifies the percentage of blocks that should satisfy the minimal replication requirement defined by dfs.namenode.replication.min. Values less than or equal to 0 mean not to wait for any particular percentage of blocks before exiting safemode. Values greater than 1 will make safe mode permanent.</description> 
</property> 
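
After restarting the NameNode with the new value, you can confirm the setting in effect on that host (getconf reads the local configuration files): 

hdfs getconf -confKey dfs.namenode.safemode.threshold-pct 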

  8. Restart DataNodes (if they are stuck): 

  • If DataNodes are sluggish or have issues reporting blocks, try restarting them one by one. 

hadoop-daemon.sh stop datanode 

hadoop-daemon.sh start datanode 
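
Note that hadoop-daemon.sh is deprecated in Hadoop 3.x; on those versions the equivalent commands are typically: 

hdfs --daemon stop datanode 

hdfs --daemon start datanode 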

Additional Notes: 

  • Safemode is a protective measure. The NameNode enters it to prevent data corruption when the cluster block replica count is below the configured threshold. 

  • The ideal solution is to bring up all DataNodes and ensure all blocks are reported successfully. 

  • In an HA (High Availability) setup, safemode behavior is managed differently, as there are active and standby NameNodes; see the example checks below. 
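
For example, in an HA cluster you can check each NameNode's role and then the reported safemode state; the commands below assume the NameNode IDs are nn1 and nn2, as defined in dfs.ha.namenodes.<nameservice>: 

hdfs haadmin -getServiceState nn1 

hdfs haadmin -getServiceState nn2 

hdfs dfsadmin -safemode get 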

Related Articles 

  • Standby Namenode Startup Failures After a Recent System Crash 

  • Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components 

  • Managing HDFS Space and Replication 

  • Troubleshooting YARN Application Failures 

  • Resolving Delayed DataNode Initialization: Effective Strategies 