Standby NameNode Startup Failures After a Recent System Crash

Applies To: Hadoop HDFS (HA deployment with QuorumJournalManager or ZKFC) 
Category: Troubleshooting → HDFS, High Availability 

Issue Summary  

In a Hadoop High Availability (HA) cluster, after a failover or restart, the standby NameNode fails to transition to the standby state or remains in a faulty state, preventing automatic failover and potentially leading to service disruption. 

Possible Cause(s)  

Common reasons why this issue may occur: 

  1. JournalNode issues (QJM): JournalNodes are down, inaccessible, or have inconsistent data. 

  2. ZooKeeper issues (ZKFC): The ZooKeeper quorum is not established, ZKFC processes are not running, or ZooKeeper data is corrupted. 

  3. Network connectivity: The standby NameNode cannot communicate with the JournalNodes or ZooKeeper. 

  4. Configuration mismatch: hdfs-site.xml or core-site.xml is inconsistent across the NameNodes. 

  5. Standby NameNode metadata corruption: The standby NameNode's local state is corrupted. 

  6. Insufficient resources: The standby NameNode lacks sufficient memory or CPU. 

Step-by-Step Resolution 

  1. Verify JournalNode/ZooKeeper Health: 

  JournalNodes (QJM): 

  • Check JournalNode logs for errors. 

cat $HADOOP_HOME/logs/hadoop-hadoop-journalnode-<hostname>.log 

  • Ensure all JournalNode processes are running. 

jps 

  • Verify JournalNodes are reachable from the standby NameNode (the JournalNode RPC port defaults to 8485, set by dfs.journalnode.rpc-address). 

nc -z -w 2 <journalnode-hostname> 8485 

  • Check the HA state each NameNode reports (the argument is the NameNode ID, e.g. nn1, not the nameservice ID). 

hdfs haadmin -getServiceState <namenodeId> 
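The JournalNode checks above can be combined into one small sweep. A minimal POSIX shell sketch, assuming JOURNALNODES holds your hostnames and 8485 is the default JournalNode RPC port (both are assumptions to adapt to your cluster):

```shell
#!/bin/sh
# quorum_ok UP TOTAL -> succeeds if UP is a strict majority of TOTAL
quorum_ok() {
  [ "$1" -gt $(( $2 / 2 )) ]
}

# Probe each JournalNode's RPC port and report whether a quorum is reachable.
check_journalnodes() {
  total=0 up=0
  for h in $JOURNALNODES; do
    total=$((total + 1))
    if nc -z -w 2 "$h" 8485 2>/dev/null; then
      up=$((up + 1))
      echo "reachable:   $h"
    else
      echo "UNREACHABLE: $h"
    fi
  done
  quorum_ok "$up" "$total"
}

# Example (hypothetical hosts):
# JOURNALNODES="jn1 jn2 jn3" check_journalnodes && echo "JournalNode quorum OK"
```

Run it from the standby NameNode host, since that is the node whose connectivity is in question.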

  ZooKeeper (ZKFC): 

  • Check ZooKeeper logs and ZKFC logs on both NameNodes. 

To check the ZooKeeper logs: 

cat $ZOOKEEPER_HOME/logs/zookeeper-hadoop-server-<hostname>.out 

To check the ZKFC logs: 

cat $HADOOP_HOME/logs/hadoop-hadoop-zkfc-<hostname>.log   

  • Ensure the ZooKeeper quorum is established. 

To check whether the ZooKeeper quorum is established (a healthy ensemble reports exactly one leader in the Mode line): 

echo srvr | nc <hostname> 2181 

  • Verify that ZKFC processes are running on both NameNodes. 

jps 
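The ZooKeeper checks above can be sketched as a short sweep over the ensemble. ZK_HOSTS and the hostnames are assumptions; 2181 is the default client port, and srvr is a standard ZooKeeper four-letter command:

```shell
#!/bin/sh
# zk_mode: extract the Mode value (leader/follower/standalone) from srvr output.
zk_mode() {
  sed -n 's/^Mode: //p'
}

# Query each ensemble member; a healthy quorum shows one leader, rest followers.
check_zookeeper() {
  for h in $ZK_HOSTS; do
    mode=$(echo srvr | nc -w 2 "$h" 2181 2>/dev/null | zk_mode)
    echo "$h: ${mode:-no response}"
  done
}

# Example (hypothetical hosts):
# ZK_HOSTS="zk1 zk2 zk3" check_zookeeper
```

Note that four-letter commands may need to be whitelisted via the 4lw.commands.whitelist property on newer ZooKeeper releases.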

      2. Last resort: reformat the ZKFC state in ZooKeeper. This recreates the HA znode and discards the failover state stored there, so stop the ZKFC processes on both NameNodes before running it:

                  hdfs zkfc -formatZK
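Because -formatZK discards the HA failover state kept in ZooKeeper, it is worth guarding the last-resort step so it cannot run by accident. A minimal sketch, assuming Hadoop 3.x `hdfs --daemon` syntax; CONFIRM is a hypothetical safety variable, not a Hadoop setting:

```shell
#!/bin/sh
# Last-resort reformat of the ZKFC state znode. Refuses to run unless
# CONFIRM=yes, since this discards the HA failover state in ZooKeeper.
format_zkfc_state() {
  if [ "$CONFIRM" != "yes" ]; then
    echo "refusing: set CONFIRM=yes to reformat ZKFC state" >&2
    return 1
  fi
  hdfs --daemon stop zkfc        # stop ZKFC on this NameNode (repeat on the other)
  hdfs zkfc -formatZK -force     # recreate the HA znode without prompting
  hdfs --daemon start zkfc
}
```

Run the stop step on both NameNodes before formatting, then restart ZKFC on both once the znode has been recreated.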

Additional Notes: 

  • In an HA setup, automatic failover relies on the standby NameNode being healthy and in the standby state. 

  • If you are using QJM, ensure a majority of JournalNodes are running (e.g., at least 2 of 3) for the shared edit log to function correctly. 

  • If you are using ZKFC, ensure a ZooKeeper quorum is established. 

  • Regularly check the health of both NameNodes, JournalNodes (or ZooKeeper), and ZKFC processes. 

  • Monitor logs for any recurring errors or warnings. 

  • Backups of NameNode metadata are crucial in case of failures. 

Related Articles 

  • Troubleshooting YARN Application Failures 

  • NameNode Not Exiting Safe Mode 

  • Understanding Hadoop Logs – Types, Use Cases, and Common Locations 

  • Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components 

  • Disk Space Full on /tmp Mount in Linux 