Troubleshooting YARN Application Failures

Category: Troubleshooting → YARN 

Applies To: Apache YARN 2.x, 3.x 

Issue Summary: 

YARN applications (such as Spark, MapReduce, or Tez jobs) fail to complete successfully: they exit with a FAILED status, remain stuck in the ACCEPTED state, or never launch any containers. This impacts data processing, analytics, and overall cluster utility. 

Possible Cause(s): 

  • Resource Exhaustion: 

  • Insufficient memory or CPU on NodeManagers to run requested containers. 

  • Cluster-wide resource starvation preventing application master or tasks from being allocated. 

  • Application requesting more resources than YARN or NodeManager limits allow. 

  • Application Code Errors: 

  • Bugs in the application logic leading to exceptions (e.g., NullPointerException, ArrayIndexOutOfBoundsException). 

  • Application-level OutOfMemoryError due to inefficient code or data handling. 

  • Unhandled exceptions or infinite loops causing the application to crash or hang. 

  • Configuration Mismatches: 

  • Application resource requests (spark.executor.memory, mapreduce.map.memory.mb, etc.) exceed YARN's maximum allocation limits. 

  • Incorrect application-specific settings (e.g., wrong input/output paths, invalid parameters). 

  • Incorrect classpath or missing dependencies for the application. 

  • Network Issues: 

  • Communication failures between the Application Master and NodeManagers. 

  • Containers unable to communicate with each other or with external services (HDFS, databases, Kafka). 

  • Firewall restrictions between cluster components. 

  • Disk Space Issues: 

  • NodeManager local disk full (yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs). 

  • HDFS disk full preventing new data writes or temporary file creation. 

  • Container Launch Failures: 

  • Incorrect permissions for creating container directories or running processes. 

  • Missing JAVA_HOME or incorrect environment setup on NodeManagers. 

  • Corrupted application JARs or libraries. 

  • YARN Component Health: 

  • Unhealthy or unresponsive NodeManagers. 

  • ResourceManager issues (e.g., instability, high load, or HA failover problems). 

  • Data Locality Issues: 

  • Tasks failing due to inability to access data on local DataNodes (e.g., DataNode failures, network isolation). 

Step-by-Step Resolution: 

  • Initial Triage: Check YARN UI: 

  • Access the YARN ResourceManager UI (typically http://<ResourceManager_hostname>:8088). 

  • Go to the "Applications" tab and locate the failed application. 

  • Application Status: Note the State (e.g., FAILED, KILLED, ACCEPTED) and FinalStatus for clues. 

  • Attempts: Check the "Attempts" column. Multiple attempts usually indicate a recurring issue or Application Master failure. 

  • Logs: Click the application ID, then click "Logs" next to the Application Master attempt to view its logs. Also view the logs of any failed containers. (A command-line alternative for this triage is sketched below.) 
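
  • If the UI is unreachable, the same triage can be done with the YARN CLI (a minimal sketch): 

# List recently failed or killed applications
yarn application -list -appStates FAILED,KILLED 

# Show the state, final status, and diagnostics for one application
yarn application -status <application_id> 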

  • Analyze Application Logs (Most Critical Step): 

  • Use yarn logs -applicationId <application_id> to fetch all logs for the application. Alternatively, you can browse specific container logs via the YARN UI. 

  • If log aggregation is disabled, container logs remain on each NodeManager host under the directories set by yarn.nodemanager.log-dirs (typically $HADOOP_HOME/logs/userlogs/<application_id>/). 

  • Search for Keywords: Look for ERROR, FATAL, Exception, OutOfMemoryError, Container killed by YARN, Exit code, Permission denied, No such file or directory. 

  • Stack Trace: Identify the full stack trace to pinpoint the exact line of code or component causing the failure. 

  • Exit Codes: Look for container exit codes. Exit code 143 means the container received SIGTERM (commonly a YARN kill, e.g., for exceeding memory limits or preemption), exit code 137 means SIGKILL (often the OS OOM killer or a forced kill), and exit code 1 is a general application error. 
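
  • For example, the aggregated logs can be fetched once and scanned for these keywords (a sketch, assuming log aggregation is enabled): 

yarn logs -applicationId <application_id> > app.log 

grep -inE "error|fatal|exception|outofmemory|exit code|killed|permission denied" app.log | less 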

  • Address Resource-Related Failures: 

  • Container killed by YARN for exceeding memory limits (Exit code 137/143): 

  • Increase Application Memory: For MapReduce, increase mapreduce.map.memory.mb or mapreduce.reduce.memory.mb. For Spark, increase spark.executor.memory and spark.driver.memory. 

  • Adjust Overhead: For Spark, increase spark.executor.memoryOverhead (default is max(384MB, 10% of spark.executor.memory)) or spark.driver.memoryOverhead. 

  • Check YARN Limits: Ensure the requested memory doesn't exceed yarn.scheduler.maximum-allocation-mb or yarn.nodemanager.resource.memory-mb. If it does, either reduce application requests or increase YARN limits (requires cluster restart). 
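
  • As an illustration, the memory settings for a Spark job can be raised at submit time (a sketch; the class and JAR are placeholders, and the values must stay within the YARN limits above): 

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.driver.memory=2g \
  --class <main_class> <application_jar> 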

  • Application Stuck in ACCEPTED or SUBMITTED: 

  • Check Cluster Resources: Go to YARN UI "Cluster Metrics" to see available vs. used memory and vcores. If resources are fully utilized, wait for other jobs to finish or increase cluster capacity. 

  • Queue Capacity: Verify the configured capacity and maximum-capacity of the YARN queue (capacity-scheduler.xml) where the application is submitted. 
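
  • Queue utilization can also be checked from the command line, for example: 

# Shows the configured capacity, current capacity, and state of the queue
yarn queue -status <queue_name> 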

  • Verify Configuration and Classpath: 

  • Application-Specific Config: Double-check configurations relevant to your specific application (e.g., Hive properties for Hive queries; Spark configurations for Spark jobs). 

  • Missing JARs/Dependencies: Ensure all necessary JARs for your application are correctly packaged or provided (e.g., via --jars for spark-submit, or -libjars for hadoop jar). Check the application's environment setup (HADOOP_CLASSPATH, JAVA_HOME) on the NodeManagers. 
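
  • A quick sanity check of the environment on a NodeManager host (a sketch; exact paths depend on your installation): 

# Classpath that Hadoop/YARN commands resolve on this node
hadoop classpath 

# JAVA_HOME as seen by the shell; it must also be set for the daemons (hadoop-env.sh)
echo $JAVA_HOME 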

  • Investigate Disk Space and I/O: 

  • NodeManager Local Disk: On NodeManager hosts, check free space in directories configured by yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. Full disks can prevent container launch or logging. 

  • HDFS Disk: Run hdfs dfsadmin -report to check HDFS disk utilization. If close to full, clean up old data. 
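
  • For example, on a NodeManager host, read the configured directories from yarn-site.xml and check their free space (substitute the directory values found): 

grep -A 2 "yarn.nodemanager.local-dirs\|yarn.nodemanager.log-dirs" $HADOOP_HOME/etc/hadoop/yarn-site.xml 

df -h <local_dir> <log_dir> 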

  • Troubleshoot Network Issues: 

  • From a NodeManager host, ping the ResourceManager, NameNode, and other NodeManagers. 

  • Use telnet <hostname> <port> to verify connectivity on relevant ports (e.g., 8088 for RM, 9000 for NameNode RPC, 8040 for NM). 

  • Check NodeManager and ResourceManager logs for java.net.SocketException or Connection refused errors. 
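
  • A minimal connectivity check from a NodeManager host (ports are examples; use your cluster's configured values): 

ping -c 3 <ResourceManager_hostname> 

telnet <ResourceManager_hostname> 8088 

# Scan the local NodeManager log for connection errors
grep -iE "SocketException|Connection refused" $HADOOP_HOME/logs/*nodemanager*.log | tail -n 20 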

  • Check YARN Component Health: 

  • In YARN UI, navigate to the "Nodes" tab. If any NodeManagers are Unhealthy or Lost, investigate their host: 

  • Check NodeManager logs. 

tail -n 200 $HADOOP_HOME/logs/*nodemanager-<hostname>*.log 

  • Verify that the NodeManager process is running. 

jps 

  • Check system resources (CPU, memory, disk I/O) on that host. 

top 

  • Restart the NodeManager service if it is safe to do so (yarn-daemon.sh is the Hadoop 2.x script; on Hadoop 3.x, use yarn --daemon stop nodemanager and yarn --daemon start nodemanager). 

yarn-daemon.sh stop nodemanager 

yarn-daemon.sh start nodemanager 

  • Review ResourceManager logs for any instability, OOM errors, or issues handling requests. 

tail -n 200 $HADOOP_HOME/logs/*resourcemanager-<hostname>*.log 
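
  • Node health can also be checked without the UI (a sketch; the node ID comes from the -list output): 

# List all nodes with their current state (RUNNING, UNHEALTHY, LOST, ...)
yarn node -list -all 

# Detailed status, including the last health report, for a single node
yarn node -status <node_id> 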

  • Data Locality and HDFS Health: 

  • If tasks consistently fail on specific nodes and the failures are related to data access, check the DataNodes on those hosts (hdfs dfsadmin -report, DataNode logs). 

hdfs dfsadmin -report 

  • Run hdfs fsck / to check the overall health of your HDFS filesystem. 
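
  • For example: 

hdfs fsck / 

# Confirm the DataNode on a suspect host is live and has free space
hdfs dfsadmin -report -live | grep -A 10 "<hostname>" 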

Additional Notes: 

  • Iterative Process: Troubleshooting YARN failures is often an iterative process. Start with the logs, identify the root cause, apply a fix, and re-run the application. 

  • Version Compatibility: Ensure all components (Hadoop, Spark, Hive, JDBC drivers) are compatible with each other. Incompatible versions are a common source of obscure errors. 

  • Monitoring Tools: Robust monitoring (e.g., Prometheus/Grafana or Ganglia) that provides historical CPU, memory, network, and disk I/O metrics for all cluster nodes is invaluable for pinpointing resource bottlenecks and correlating failures with system events. 

  • System and Daemon Logs: Don't rely on application logs alone. System logs (/var/log/messages, dmesg), YARN daemon logs (yarn-nodemanager.log, yarn-resourcemanager.log), and HDFS daemon logs can provide crucial context, especially for container kills. 

  • Container Re-attempts: YARN can re-attempt failed containers or even restart the Application Master. Check how many attempts were made for a container or application to determine if it's a transient or persistent issue. 

    • Related Articles

    • How to Debug Spark Application Logs (YARN UI)

    • Hadoop/Yarn Jobs Not starting - stuck in accepted state

    • Troubleshooting Out-of-Memory or Slow Execution – spark

    • Standby Namenode Startup Failures After a recent system Crash.

    • Understanding Hadoop Logs – Types, Use Cases, and Common Locations.
