Category: Troubleshooting → Logging and monitoring
Applies To: Hadoop HA clusters.
Issue Summary
In a distributed Hadoop HA cluster, component logs are the primary source of truth for monitoring system health, diagnosing failures, and troubleshooting performance issues. This document outlines the key log files, their locations, and how to use them effectively for clusters with High Availability.
Common Log File Locations
All Hadoop component logs are typically located in the $HADOOP_HOME/logs directory on their respective nodes.
The log file name follows the format:
hadoop-<user>-<component>-<hostname>.log.
For example, on a master node named master1, the NameNode log would be $HADOOP_HOME/logs/hadoop-hadoop-namenode-master1.log.
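For example, to see which daemon logs exist on a node and follow one in real time, commands like the following can be used (a minimal sketch assuming the default "hadoop" user and the master1 hostname from the example above):

    # list all component logs on this node
    ls $HADOOP_HOME/logs/
    # follow the NameNode log as new entries arrive
    tail -f $HADOOP_HOME/logs/hadoop-hadoop-namenode-master1.log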
Step-by-Step Resolution: Log File Breakdown by Node Role
NameNode Log
Logs are generated in $HADOOP_HOME/logs/hadoop-hadoop-namenode-<namenode_hostname>.log
What it logs:
The complete lifecycle of the HDFS namespace, including block management, DataNode heartbeats, file system operations, and most importantly, the synchronization state with the other NameNode.
Use for troubleshooting:
Failover Events: On both master nodes, this log will show when a NameNode transitions from active to standby or vice versa. Look for messages like "Transitioning to active state" or "Transitioning to standby state" (see the example commands after this list).
Synchronization: On the standby NameNode, this log confirms it is successfully reading EditLog transactions from the JournalNodes to stay in sync with the active NameNode.
Safemode: Tracks when a NameNode enters or exits safemode, which is a key indicator of cluster health.
DataNode Health: Records heartbeats from DataNodes. Missing heartbeats from a DataNode will be logged here, indicating a potential worker node failure.
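A quick way to review recent failover transitions and safemode activity is to search the NameNode log directly; the commands below are a sketch assuming the default log path and the master1 hostname, and the dfsadmin call reports the live safemode state:

    # find state transitions recorded by this NameNode
    grep -iE "transitioning to (active|standby) state" $HADOOP_HOME/logs/hadoop-hadoop-namenode-master1.log
    # show the most recent safemode messages
    grep -i "safemode" $HADOOP_HOME/logs/hadoop-hadoop-namenode-master1.log | tail -20
    # confirm the current safemode status from the cluster itself
    hdfs dfsadmin -safemode get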
JournalNode Log
Logs are generated in $HADOOP_HOME/logs/hadoop-hadoop-journalnode-<journalnode_hostname>.log
What it logs:
These logs track the writes from the active NameNode and reads from the standby NameNode, ensuring consistent metadata across the cluster.
Use for troubleshooting:
EditLog Write Errors: If the active NameNode fails to write EditLog entries to the JournalNodes, it will be logged here. This is a critical issue that can prevent failover (see the example after this list).
Standby Read Errors: Shows if the standby NameNode is having trouble reading from the JournalNodes, which would prevent it from staying in sync and taking over as active.
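To spot EditLog write or read problems quickly, scan the JournalNode log for warnings and errors; a sketch assuming the default log location and a JournalNode running on a host named master1:

    # surface the latest warnings and errors from this JournalNode
    grep -E "WARN|ERROR|FATAL" $HADOOP_HOME/logs/hadoop-hadoop-journalnode-master1.log | tail -20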
ZKFC Log
Logs are generated in $HADOOP_HOME/logs/hadoop-hadoop-zkfc-<namenode_hostname>.log
What it logs:
The ZKFC (ZooKeeper Failover Controller) manages the automatic failover process. Its log provides details on communication with ZooKeeper and the NameNode's health checks.
Use for troubleshooting:
ZooKeeper Session: Verifies that the ZKFC has a healthy session with the ZooKeeper quorum.
Health Checks: Logs the results of periodic health checks on the NameNode. If a NameNode is unhealthy, the ZKFC will attempt a failover.
Failover Attempts: Detailed logs of failover attempts, including election results and fencing actions. This is the first place to check if an automatic failover fails (see the example after this list).
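When an automatic failover fails, searching the ZKFC log for election, fencing, and health-check messages usually narrows down the cause; exact message text varies by Hadoop version, so the search terms below are only illustrative, and nn1/nn2 are placeholder NameNode IDs from hdfs-site.xml:

    # look for fencing, election, and health-check activity around the failure
    grep -iE "fenc|elect|health" $HADOOP_HOME/logs/hadoop-hadoop-zkfc-master1.log | tail -30
    # confirm which NameNode is currently active and which is standby
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2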
ResourceManager Log
Logs are generated in $HADOOP_HOME/logs/hadoop-hadoop-resourcemanager-<resourcemanager_hostname>.log
What it logs:
The ResourceManager tracks overall cluster resources, application submissions, and the health of all NodeManagers.
Use for troubleshooting:
Application Lifecycle: Tracks jobs from submission to completion. If a job is stuck, this log will show if it's due to resource limitations.
NodeManager Health: Records heartbeats from NodeManagers. A lack of heartbeats indicates a worker node is down (see the example after this list).
Resource Scheduling: Provides insight into how resources are allocated to applications.
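The yarn CLI can confirm what the ResourceManager log reports about NodeManager health; the grep below is only illustrative because the exact heartbeat-expiry wording varies by version, and master1 is an assumed ResourceManager host:

    # list all NodeManagers and their states (RUNNING, LOST, UNHEALTHY, ...)
    yarn node -list -all
    # look for NodeManagers whose heartbeats lapsed
    grep -i "expired" $HADOOP_HOME/logs/hadoop-hadoop-resourcemanager-master1.log | tail -20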
DataNode Log
Logs are generated in $HADOOP_HOME/logs/hadoop-hadoop-datanode-<datanode_hostname>.log
What it logs:
Data block storage, serving read/write requests, and communication with the NameNode.
Use for troubleshooting:
Block Corruption: Reports if any data blocks on the local disk are corrupted (see the example after this list).
Disk I/O Errors: Records any issues with the local disks where data blocks are stored.
Heartbeats: Confirms that heartbeats are being successfully sent to the NameNode.
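Corruption reported in a DataNode log can be cross-checked cluster-wide with fsck; a sketch assuming the default log path and a worker host named worker1 (a placeholder):

    # report any blocks HDFS currently considers corrupt
    hdfs fsck / -list-corruptfileblocks
    # check this DataNode's log for disk and block errors
    grep -iE "error|corrupt" $HADOOP_HOME/logs/hadoop-hadoop-datanode-worker1.log | tail -20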
NodeManager Log
Logs are generated in $HADOOP_HOME/logs/hadoop-hadoop-nodemanager-<datanode_hostname>.log
What it logs:
The lifecycle of YARN containers on the worker node, resource usage by containers, and communication with the ResourceManager.
Use for troubleshooting:
Container Failures: Logs container-level failures, often due to exceeding resource limits (e.g., OutOfMemoryError).
Local Disk Issues: Reports issues with the local directories used for storing container logs and data.
Application-Specific Logs: Provides pointers to the location of individual application container logs (see the example after this list).
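Once the NodeManager log gives you the failing container's application ID, the application's own container logs can be pulled with the yarn CLI (assuming log aggregation is enabled; the application ID and the worker1 hostname below are placeholders, and exact message wording varies by version):

    # retrieve all aggregated container logs for one application
    yarn logs -applicationId application_1234567890123_0001
    # check whether containers on this worker were killed for exceeding memory limits
    grep -i "beyond physical memory" $HADOOP_HOME/logs/hadoop-hadoop-nodemanager-worker1.log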
Metastore Log
Logs are generated in /var/log/mysql/mysqld.log
What it logs:
All interactions with the metastore database, including table creation, partition management, and schema updates.
Use for troubleshooting:
Essential for diagnosing any metadata-related issues, such as "Table not found" errors, permission issues on the database, or failures to connect to the external MySQL metastore (see the example below).
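To confirm that the metastore database is reachable and that its schema matches the Hive version, checks like the following can complement the MySQL log; this is a sketch that assumes an external MySQL metastore and that Hive's schematool is on the PATH:

    # review recent MySQL server messages (connection failures, permission errors, ...)
    tail -50 /var/log/mysql/mysqld.log
    # verify connectivity to the metastore and report the schema version
    schematool -dbType mysql -info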
Hive CLI / Beeline Client Logs
Logs are generated in /tmp/<user_name>/hive.log
What it logs:
The client-side activity, including commands issued and responses received from HiveServer2.
Use for troubleshooting:
Hive uses log4j for logging. These logs are not emitted to the standard output by default but are instead captured to a log file specified by Hive's log4j properties file. By default, Hive will use hive-log4j.default in the conf/ directory of the Hive installation, which writes logs to /tmp/<user_name>/hive.log and uses the WARN level.
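For a one-off troubleshooting session, the client log level can be raised on the command line instead of editing the log4j properties file; a sketch for the Hive CLI:

    # send DEBUG-level logging to the console for this session only
    hive --hiveconf hive.root.logger=DEBUG,console
    # or follow the default client log for the current user
    tail -f /tmp/$(whoami)/hive.log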
Additional Notes:
Logging Levels: To get more detailed information, you can adjust the logging level for any component in its respective log4j.properties file (e.g., from INFO to DEBUG); see the example after these notes.
Correlating Timestamps: When troubleshooting a cluster-wide issue, always compare timestamps across logs from different nodes to build a chronological sequence of events; see the example after these notes.
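For example, raising a Hadoop daemon's verbosity typically means editing its log4j.properties and restarting the daemon; the lines below are an illustrative sketch (the file lives under $HADOOP_HOME/etc/hadoop/ in recent releases, and RFA is the rolling file appender defined in the stock file):

    # $HADOOP_HOME/etc/hadoop/log4j.properties
    # switch the root logger from INFO to DEBUG, still writing to the rolling file appender
    hadoop.root.logger=DEBUG,RFA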
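A practical way to correlate events is to pull the same time window out of every log on each node and then compare the nodes side by side; the timestamp prefix below is purely illustrative:

    # extract everything logged on this node between 10:10 and 10:19 on the given date
    grep "2024-05-01 10:1" $HADOOP_HOME/logs/*.log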