Troubleshooting Yarn Application Failures
Category: Troubleshooting → YARN
Applies To: Apache YARN 2.x, 3.x
Issue Summary:
YARN applications (such as Spark, MapReduce, or Tez jobs) fail to complete successfully: they exit with a FAILED status, get stuck in the ACCEPTED state, or fail to launch any containers. This can impact data processing, analytics, and overall cluster utility.
Possible Cause(s):
Resource Exhaustion:
Insufficient memory or CPU on NodeManagers to run requested containers.
Cluster-wide resource starvation preventing application master or tasks from being allocated.
Application requesting more resources than YARN or NodeManager limits allow.
Application Code Errors:
Bugs in the application logic leading to exceptions (e.g., NullPointerException, ArrayIndexOutOfBoundsException).
Application-level OutOfMemoryError due to inefficient code or data handling.
Unhandled exceptions or infinite loops causing the application to crash or hang.
Configuration Mismatches:
Application resource requests (spark.executor.memory, mapreduce.map.memory.mb, etc.) exceed YARN's maximum allocation limits.
Incorrect application-specific settings (e.g., wrong input/output paths, invalid parameters).
Incorrect classpath or missing dependencies for the application.
Network Issues:
Communication failures between the Application Master and NodeManagers.
Containers unable to communicate with each other or with external services (HDFS, databases, Kafka).
Firewall restrictions between cluster components.
Disk Space Issues:
NodeManager local disk full (yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs).
HDFS disk full preventing new data writes or temporary file creation.
Container Launch Failures:
Incorrect permissions for creating container directories or running processes.
Missing JAVA_HOME or incorrect environment setup on NodeManagers.
Corrupted application JARs or libraries.
YARN Component Health:
Unhealthy or unresponsive NodeManagers.
ResourceManager issues (e.g., instability, high load, or HA failover problems).
Data Locality Issues:
Tasks failing due to inability to access data on local DataNodes (e.g., DataNode failures, network isolation).
Step-by-Step Resolution:
Initial Triage: Check YARN UI:
Access the YARN ResourceManager UI (typically http://<ResourceManager_hostname>:8088).
Go to the "Applications" tab and locate the failed application.
Application Status: Note the State (e.g., FAILED, KILLED, ACCEPTED) and FinalStatus for clues.
Attempts: Check the "Attempts" column. Multiple attempts usually indicate a recurring issue or Application Master failure.
Logs: Click the application ID, then click "Logs" next to the Application Master attempt to view its logs. Also review the logs of any failed containers.
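The same triage can also be done from the command line via the ResourceManager REST API. A minimal sketch (the hostname and state filter are placeholders; on a secured cluster add the appropriate authentication options, e.g., curl --negotiate -u :):
curl "http://<ResourceManager_hostname>:8088/ws/v1/cluster/apps?states=FAILED,KILLED"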
Analyze Application Logs (Most Critical Step):
Use yarn logs -applicationId <application_id> to fetch all logs for the application. Alternatively, you can browse specific container logs via the YARN UI.
If log aggregation is not enabled, open the log files directly on the relevant hosts, e.g., under the directories set by yarn.nodemanager.log-dirs or the daemon logs under $HADOOP_HOME/logs/.
Search for Keywords: Look for ERROR, FATAL, Exception, OutOfMemoryError, Container killed by YARN, Exit code, Permission denied, No such file or directory.
Stack Trace: Identify the full stack trace to pinpoint the exact line of code or component causing the failure.
Exit Codes: Look for container exit codes. Exit code 143 (128 + SIGTERM) usually means the container was killed by YARN, commonly for exceeding memory limits; exit code 137 (128 + SIGKILL) typically indicates a forced kill, often by the Linux OOM killer; exit code 1 is a general application error.
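A quick way to combine the log fetch and keyword search above, assuming log aggregation is enabled (the application ID below is a placeholder for your failed application's ID):
yarn logs -applicationId application_1700000000000_0042 | grep -iE "error|fatal|exception|outofmemoryerror|exit code|killed"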
Address Resource-Related Failures:
Container killed by YARN for exceeding memory limits (Exit code 137/143):
Increase Application Memory: For MapReduce, increase mapreduce.map.memory.mb or mapreduce.reduce.memory.mb. For Spark, increase spark.executor.memory and spark.driver.memory.
Adjust Overhead: For Spark, increase spark.executor.memoryOverhead (default is max(384MB, 10% of spark.executor.memory)) or spark.driver.memoryOverhead.
Check YARN Limits: Ensure the requested memory doesn't exceed yarn.scheduler.maximum-allocation-mb or yarn.nodemanager.resource.memory-mb. If it does, either reduce application requests or increase YARN limits (requires cluster restart).
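For the memory-limit case above, the following is only a sketch of how these settings might be raised at submit time; the values, class names, JAR names, and paths are placeholders, the -D options for hadoop jar take effect only if the job's driver uses ToolRunner, and the total container size (memory plus overhead) must stay within yarn.scheduler.maximum-allocation-mb:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.executor.memory=6g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.driver.memory=4g \
  --class com.example.MyJob my-job.jar
hadoop jar my-mr-job.jar com.example.MyMRJob -D mapreduce.map.memory.mb=4096 -D mapreduce.reduce.memory.mb=8192 /input /output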
Application Stuck in ACCEPTED or SUBMITTED:
Check Cluster Resources: Go to YARN UI "Cluster Metrics" to see available vs. used memory and vcores. If resources are fully utilized, wait for other jobs to finish or increase cluster capacity.
Queue Capacity: Verify the configured capacity and maximum-capacity of the YARN queue (capacity-scheduler.xml) where the application is submitted.
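From the command line, queue and node headroom can be checked roughly as follows ("default" is only an example queue name; substitute the queue the application was submitted to):
yarn queue -status default
yarn node -list -all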
Verify Configuration and Classpath:
Application-Specific Config: Double-check configurations relevant to your specific application (e.g., Hive properties for Hive queries; Spark configurations for Spark jobs).
Missing JARs/Dependencies: Ensure all necessary JARs for your application are correctly packaged or supplied at submit time (e.g., --jars for spark-submit, -libjars for hadoop jar). Check the application's environment setup (HADOOP_CLASSPATH, JAVA_HOME) on NodeManagers.
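For example, the effective classpath on a host can be printed and extra dependencies attached at submit time; the JAR names and paths below are placeholders:
hadoop classpath
spark-submit --master yarn --jars /path/to/dep1.jar,/path/to/dep2.jar --class com.example.MyJob my-job.jar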
Investigate Disk Space and I/O:
NodeManager Local Disk: On NodeManager hosts, check free space in directories configured by yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. Full disks can prevent container launch or logging.
HDFS Disk: Run hdfs dfsadmin -report to check HDFS disk utilization. If close to full, clean up old data.
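A quick sketch of both checks; the local paths shown are examples and should be replaced with the directories actually configured in yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs:
df -h /hadoop/yarn/local /hadoop/yarn/log
hdfs dfs -df -h /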
Troubleshoot Network Issues:
From a NodeManager host, ping the ResourceManager, NameNode, and other NodeManagers.
Use telnet <hostname> <port> to verify connectivity on relevant ports (e.g., 8088 for the ResourceManager web UI, 8020 or 9000 for NameNode RPC depending on configuration, 8040 for the NodeManager localizer).
Check NodeManager and ResourceManager logs for java.net.SocketException or Connection refused errors.
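If telnet is not installed on the host, nc can perform the same port checks; the hostnames and ports below are placeholders:
nc -vz resourcemanager.example.com 8088
nc -vz namenode.example.com 8020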
Check YARN Component Health:
In YARN UI, navigate to the "Nodes" tab. If any NodeManagers are Unhealthy or Lost, investigate their host:
Check NodeManager logs.
less $HADOOP_HOME/logs/*nodemanager-<hostname>*.log
Verify that the NodeManager process is running.
jps
Check system resources (CPU, memory, disk I/O) on that host.
top
Restart the NodeManager service if it is safe to do so. On Hadoop 2.x:
yarn-daemon.sh stop nodemanager
yarn-daemon.sh start nodemanager
On Hadoop 3.x, yarn-daemon.sh is deprecated; use:
yarn --daemon stop nodemanager
yarn --daemon start nodemanager
Review ResourceManager logs for any instability, OOM errors, or issues handling requests.
less $HADOOP_HOME/logs/*resourcemanager-<hostname>*.log
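To surface severe entries quickly, the daemon logs can be filtered; the filename glob below is an example and depends on how your daemons are named and rotated:
grep -iE "fatal|error|outofmemoryerror" $HADOOP_HOME/logs/*resourcemanager*.log | tail -n 50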
Data Locality and HDFS Health:
If tasks fail consistently on specific nodes and the failures are related to data access, check the DataNodes on those hosts (hdfs dfsadmin -report, DataNode logs).
hdfs dfsadmin -report
Run hdfs fsck / to check the overall health of your HDFS filesystem.
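If fsck reports missing or corrupt blocks, the affected files can be listed explicitly (this only lists them; deciding whether to repair or remove the files is a separate step):
hdfs fsck / -list-corruptfileblocks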
Additional Notes:
Iterative Process: Troubleshooting YARN failures is often an iterative process. Start with the logs, identify the root cause, apply a fix, and re-run the application.
Version Compatibility: Ensure all components (Hadoop, Spark, Hive, JDBC drivers) are compatible with each other. Incompatible versions are a common source of obscure errors.
Monitoring Tools: Robust monitoring (e.g., Prometheus/Grafana, Ganglia) providing historical data for CPU, memory, network, and disk I/O metrics on all cluster nodes can be invaluable for pinpointing resource bottlenecks or correlating failures with system events.
System and Daemon Logs: Don't rely solely on application logs. OS logs (/var/log/messages, dmesg), YARN daemon logs (yarn-nodemanager.log, yarn-resourcemanager.log), and HDFS daemon logs can provide crucial context, especially for container kills (see the dmesg check at the end of this section).
Container Re-attempts: YARN can re-attempt failed containers or even restart the Application Master. Check how many attempts were made for a container or application to determine if it's a transient or persistent issue.
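For the container-kill scenario noted under System and Daemon Logs, a quick check on the affected NodeManager host shows whether the Linux OOM killer terminated the process (dmesg may require root):
dmesg -T | grep -i "killed process"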