Standby NameNode Startup Failures After a Recent System Crash

Applies To: Hadoop HDFS (HA deployment with QuorumJournalManager or ZKFC) 
Category: Troubleshooting → HDFS, High Availability 

Issue Summary  

In a Hadoop High Availability (HA) cluster, after a failover or restart, the standby NameNode fails to transition to the standby state or remains in a faulty state, preventing automatic failover and potentially leading to service disruption. 

Possible Cause(s)  

Common reasons why this issue may occur: 

  1. JournalNode issues (QJM): JournalNodes are down, inaccessible, or have inconsistent data. 

  2. ZooKeeper issues (ZKFC): The ZooKeeper quorum is not established, ZKFC processes are not running, or ZooKeeper data is corrupted. 

  3. Network connectivity: The standby NameNode cannot communicate with the JournalNodes or ZooKeeper. 

  4. Configuration mismatch: hdfs-site.xml or core-site.xml is inconsistent across the NameNodes. 

  5. Standby NameNode metadata corruption: The standby NameNode's local state is corrupted. 

  6. Insufficient resources: The standby NameNode lacks sufficient memory or CPU. 

Step-by-Step Resolution 

  1. Verify JournalNode/ZooKeeper Health: 

  JournalNodes (QJM): 

  • Check JournalNode logs for errors. 

cat $HADOOP_HOME/logs/hadoop-hadoop-journalnode-<hostname>.log 

  • Ensure all JournalNode processes are running. 

jps 

  • Verify JournalNodes are reachable from the standby NameNode (the JournalNode RPC port defaults to 8485, set by dfs.journalnode.rpc-address). 

nc -z -w 2 <journalnode-hostname> 8485 

  • Check the HA state each NameNode reports (the argument is the NameNode ID, e.g. nn1, not the nameservice ID). 

hdfs haadmin -getServiceState <namenodeId> 
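The JournalNode checks above can be combined into one small sweep. A minimal POSIX shell sketch, assuming JOURNALNODES holds your hostnames and 8485 is the default JournalNode RPC port (both are assumptions to adapt to your cluster):

```shell
#!/bin/sh
# quorum_ok UP TOTAL -> succeeds if UP is a strict majority of TOTAL
quorum_ok() {
  [ "$1" -gt $(( $2 / 2 )) ]
}

# Probe each JournalNode's RPC port and report whether a quorum is reachable.
check_journalnodes() {
  total=0 up=0
  for h in $JOURNALNODES; do
    total=$((total + 1))
    if nc -z -w 2 "$h" 8485 2>/dev/null; then
      up=$((up + 1))
      echo "reachable:   $h"
    else
      echo "UNREACHABLE: $h"
    fi
  done
  quorum_ok "$up" "$total"
}

# Example (hypothetical hosts):
# JOURNALNODES="jn1 jn2 jn3" check_journalnodes && echo "JournalNode quorum OK"
```

Run it from the standby NameNode host, since that is the node whose connectivity is in question.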

  ZooKeeper (ZKFC): 

  • Check ZooKeeper logs and ZKFC logs on both NameNodes. 

To check the ZooKeeper logs: 

cat $ZOOKEEPER_HOME/logs/zookeeper-hadoop-server-<hostname>.out 

To check the ZKFC logs: 

cat $HADOOP_HOME/logs/hadoop-hadoop-zkfc-<hostname>.log   

  • Ensure the ZooKeeper quorum is established. 

To check whether the ZooKeeper quorum is established (a healthy ensemble reports exactly one leader in the Mode line): 

echo srvr | nc <hostname> 2181 

  • Verify that ZKFC processes are running on both NameNodes. 

jps 
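The ZooKeeper checks above can be sketched as a short sweep over the ensemble. ZK_HOSTS and the hostnames are assumptions; 2181 is the default client port, and srvr is a standard ZooKeeper four-letter command:

```shell
#!/bin/sh
# zk_mode: extract the Mode value (leader/follower/standalone) from srvr output.
zk_mode() {
  sed -n 's/^Mode: //p'
}

# Query each ensemble member; a healthy quorum shows one leader, rest followers.
check_zookeeper() {
  for h in $ZK_HOSTS; do
    mode=$(echo srvr | nc -w 2 "$h" 2181 2>/dev/null | zk_mode)
    echo "$h: ${mode:-no response}"
  done
}

# Example (hypothetical hosts):
# ZK_HOSTS="zk1 zk2 zk3" check_zookeeper
```

Note that four-letter commands may need to be whitelisted via the 4lw.commands.whitelist property on newer ZooKeeper releases.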

      2. Last resort: reformat the ZKFC state in ZooKeeper. This recreates the HA znode and discards the failover state stored there, so stop the ZKFC processes on both NameNodes before running it:

                  hdfs zkfc -formatZK
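Because -formatZK discards the HA failover state kept in ZooKeeper, it is worth guarding the last-resort step so it cannot run by accident. A minimal sketch, assuming Hadoop 3.x `hdfs --daemon` syntax; CONFIRM is a hypothetical safety variable, not a Hadoop setting:

```shell
#!/bin/sh
# Last-resort reformat of the ZKFC state znode. Refuses to run unless
# CONFIRM=yes, since this discards the HA failover state in ZooKeeper.
format_zkfc_state() {
  if [ "$CONFIRM" != "yes" ]; then
    echo "refusing: set CONFIRM=yes to reformat ZKFC state" >&2
    return 1
  fi
  hdfs --daemon stop zkfc        # stop ZKFC on this NameNode (repeat on the other)
  hdfs zkfc -formatZK -force     # recreate the HA znode without prompting
  hdfs --daemon start zkfc
}
```

Run the stop step on both NameNodes before formatting, then restart ZKFC on both once the znode has been recreated.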

Additional Notes: 

  • In an HA setup, automatic failover relies on the standby NameNode being healthy and in the standby state. 

  • If you are using QJM, ensure a majority of JournalNodes are running (e.g., at least 2 of 3) for the shared edit log to function correctly. 

  • If you are using ZKFC, ensure a ZooKeeper quorum is established. 

  • Regularly check the health of both NameNodes, JournalNodes (or ZooKeeper), and ZKFC processes. 

  • Monitor logs for any recurring errors or warnings. 

  • Backups of NameNode metadata are crucial in case of failures. 

Related Articles 

  • Troubleshooting YARN Application Failures 

  • NameNode Not Exiting Safe Mode 

  • Understanding Hadoop Logs – Types, Use Cases, and Common Locations 

  • Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components 

  • Disk Space Full on /tmp Mount in Linux 