Applies To: Hadoop HDFS (HA deployment with QuorumJournalManager or ZKFC)
Category: Troubleshooting → HDFS, High Availability
Issue Summary
In a Hadoop High Availability (HA) cluster, after a failover or restart, the standby NameNode fails to transition to the standby state or remains in a faulty state, preventing automatic failover and potentially leading to service disruption.
Possible Cause(s)
JournalNode issues (QJM): JournalNodes are down, inaccessible, or have inconsistent data.
ZooKeeper issues (ZKFC): ZooKeeper quorum is not established, ZKFC processes are not running, or ZooKeeper data is corrupted.
Network connectivity: Standby NameNode cannot reach the JournalNodes or ZooKeeper.
Configuration mismatch: hdfs-site.xml or core-site.xml differs between the NameNodes.
Standby NameNode metadata corruption: The standby NameNode's local state is corrupted.
Insufficient resources: Standby NameNode lacks sufficient memory or CPU.
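To rule out the configuration mismatch listed above, one option is to copy the other NameNode's configuration directory to a local path and compare the two files byte for byte. The following is a minimal sketch under that assumption; the function name diff_conf_dirs and the example paths are hypothetical, and some deployments intentionally keep small per-host differences, so a reported difference is a prompt to inspect the file rather than proof of a fault.

```shell
# Print the name of each HA-relevant config file that differs between two directories.
# A file that is missing on either side is also reported, because cmp fails on it.
diff_conf_dirs() {
  local dir_a="$1" dir_b="$2" f
  for f in hdfs-site.xml core-site.xml; do
    if ! cmp -s "$dir_a/$f" "$dir_b/$f"; then
      echo "$f"
    fi
  done
}

# Example (hypothetical paths): /tmp/nn2-conf holds a copy of the other NameNode's config.
# diff_conf_dirs "$HADOOP_CONF_DIR" /tmp/nn2-conf
```

No output means both files are identical in the two directories.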
Step-by-Step Resolution
1. Verify JournalNode/ZooKeeper Health:
JournalNodes (QJM):
Check JournalNode logs for errors.
cat $HADOOP_HOME/logs/hadoop-hadoop-journalnode-<hostname>.log
Ensure all JournalNode processes are running.
jps
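Verify JournalNodes are reachable from the standby NameNode (the JournalNode RPC port defaults to 8485).
nc -z -w 3 <journalnode-hostname> 8485
Check the HA state of each NameNode. The argument is the NameNode ID listed in dfs.ha.namenodes.<nameserviceId> (for example nn1), not the nameservice ID.
hdfs haadmin -getServiceState <serviceId>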
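The JournalNode reachability check can be scripted so that every JournalNode in the configured quorum is probed in one pass. The sketch below assumes the shared edits directory is configured as a qjournal:// URI and that an nc build supporting the -z and -w options is installed; the function names are illustrative, not part of Hadoop. Run check_journalnodes on the standby NameNode.

```shell
# Split a qjournal:// URI into one "host port" pair per line.
# Example input: qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster
parse_qjournal_uri() {
  local uri="$1" hosts entry
  hosts="${uri#qjournal://}"   # drop the scheme
  hosts="${hosts%%/*}"         # drop the trailing /<journalId>
  local IFS=';'
  for entry in $hosts; do
    case "$entry" in
      *:*) echo "${entry%%:*} ${entry##*:}" ;;
      *)   echo "$entry 8485" ;;   # assumption: no port given means the default 8485
    esac
  done
}

# Probe every JournalNode listed in dfs.namenode.shared.edits.dir from this host.
check_journalnodes() {
  local uri host port
  uri="$(hdfs getconf -confKey dfs.namenode.shared.edits.dir)" || return 1
  parse_qjournal_uri "$uri" | while read -r host port; do
    if nc -z -w 3 "$host" "$port" </dev/null; then
      echo "OK          $host:$port"
    else
      echo "UNREACHABLE $host:$port"
    fi
  done
}
```

Reading the host list from the running configuration rather than typing it by hand also confirms that the standby NameNode is pointed at the JournalNodes you expect.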
ZooKeeper (ZKFC):
Check ZooKeeper logs and ZKFC logs on both NameNodes.
To check the ZooKeeper logs:
cat $ZOOKEEPER_HOME/logs/zookeeper-hadoop-server-<hostname>.out
To check the zkfc logs:
cat $HADOOP_HOME/logs/hadoop-hadoop-zkfc-<hostname>.log
Ensure the ZooKeeper quorum is established by querying each ZooKeeper server. In a healthy ensemble exactly one server reports Mode: leader and the rest report Mode: follower.
echo srvr | nc <hostname> 2181
Verify that ZKFC processes are running on both NameNodes (jps lists the ZKFC as DFSZKFailoverController).
jps
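The quorum check can be automated by collecting the Mode line from each server's srvr response and counting leaders and followers. This is a rough sketch, assuming the srvr four-letter command is enabled on the servers; the function names and the host names zk1, zk2, zk3 in the usage comment are hypothetical.

```shell
# Print the value of the "Mode:" line (leader, follower, standalone, ...) from srvr output on stdin.
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2 }'
}

# Usage: zk_quorum_ok <ensemble size> <mode> [<mode> ...]
# Succeeds only if exactly one leader answered and more than half of the
# ensemble is serving as leader or follower.
zk_quorum_ok() {
  local total="$1" leaders=0 serving=0 mode
  shift
  for mode in "$@"; do
    case "$mode" in
      leader)   leaders=$((leaders + 1)); serving=$((serving + 1)) ;;
      follower) serving=$((serving + 1)) ;;
    esac
  done
  [ "$leaders" -eq 1 ] && [ "$serving" -gt $((total / 2)) ]
}

# Example for a three-server ensemble (hypothetical host names):
# modes=""
# for h in zk1 zk2 zk3; do
#   modes="$modes $(echo srvr | nc -w 3 "$h" 2181 | zk_mode)"
# done
# zk_quorum_ok 3 $modes && echo "quorum established" || echo "quorum NOT established"
```

A server that is down or not serving requests returns no Mode line, so it simply does not count toward the majority.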
Additional Notes:
In an HA setup, automatic failover relies on the standby NameNode being healthy and in the standby state.
If you are using QJM, ensure a majority of JournalNodes is running, that is, at least floor(N/2) + 1 out of N (2 of 3, or 3 of 5). Without a majority the active NameNode cannot write edits.
If you are using ZKFC, ensure a ZooKeeper quorum is established.
Regularly check the health of both NameNodes, JournalNodes (or ZooKeeper), and ZKFC processes.
Monitor logs for any recurring errors or warnings.
Back up NameNode metadata regularly so that it can be restored if a NameNode's local state is corrupted.
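For the regular health checks mentioned in these notes, a small helper can report which expected daemons are absent from a host's jps listing. This is a sketch only; the function name missing_daemons is hypothetical, and it assumes jps is run as the user that owns the Hadoop processes so that they are visible.

```shell
# Read `jps` output on stdin and print each expected daemon name that is absent.
# Usage: jps | missing_daemons <name> [<name> ...]
missing_daemons() {
  local listing name
  listing="$(cat)"
  for name in "$@"; do
    # jps prints "<pid> <main class name>", so compare against the second field.
    if ! printf '%s\n' "$listing" | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      echo "$name"
    fi
  done
}

# On each NameNode host:
# jps | missing_daemons NameNode DFSZKFailoverController
# On each JournalNode host:
# jps | missing_daemons JournalNode
```

Empty output means every named daemon is running; any name printed is a process to investigate in the corresponding log.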