How to Debug Spark Application Logs (YARN UI)

Category: Troubleshooting → Apache Spark 

Applies To: Apache Spark 2.x, 3.x running on Apache Hadoop YARN 2.x, 3.x 

Issue Summary: 

When a Spark application fails on a YARN cluster, the application logs are the primary source of information for diagnosing the root cause. The YARN UI provides a web-based interface to access and analyze these logs from the Spark driver (Application Master) and individual executors, enabling effective troubleshooting of errors, resource issues, and performance bottlenecks. 

Key Areas to Check in YARN UI 

  1. YARN ResourceManager UI Home Page (http://<ResourceManager_hostname>:8088): 

  • Applications Tab: Lists all submitted and completed applications. Pay attention to the ID, Name, State (RUNNING, FAILED, KILLED), and FinalStatus columns. 

  • Nodes Tab: Shows the health status of all NodeManagers (Running, Unhealthy, Lost). Unhealthy NodeManagers can cause container failures. 

  2. Application Details Page (Click on the Application ID): 

  • Overview: Provides a summary of the application, including start/finish times, user, queue, and final status. 

  • Application Master (AM) Link: A direct link to the logs of the Application Master. In cluster deploy mode the AM hosts the Spark driver, so these are effectively the driver logs; in client deploy mode the driver log lives on the machine that ran spark-submit. 

  • Attempts: If the Application Master itself failed and restarted, multiple "Attempts" will be listed, each with its own logs. 

  • Containers Tab: Lists all containers launched for the application (including the AM and all executors). This tab is crucial for identifying which containers failed and accessing their logs. 

  3. Container Logs Page (Click on Logs for a specific Container): 

  • stdout: Standard output from the container's process. Often contains Spark's progress messages and some application print statements. 

  • stderr: Standard error output. This is where most error messages, exceptions, and stack traces will be found. 

  • log4j (or other logger output): The main log file for Spark and your application, configured via log4j.properties. This contains detailed Spark events, task successes/failures, and application-specific logging. 

Step-by-Step Debugging Process: 

  1. Access the YARN UI and Locate Your Application: 

  • Open your web browser to http://<ResourceManager_hostname>:8088. 

  • Navigate to the "Applications" tab. Find your Spark application by its ID or Name. 
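
  The same information is also available from the yarn CLI, which can be handy when the UI is slow or unreachable. A minimal sketch, assuming the yarn client is on your PATH; the application name is illustrative:

      # List applications that ended in the FAILED or KILLED state
      yarn application -list -appStates FAILED,KILLED

      # Or search the full list for a specific application name
      yarn application -list -appStates ALL | grep "my_spark_job"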

  2. Identify the Failure Status: 

  • Look at both the State and FinalStatus columns. A Spark application that fails internally often shows State FINISHED with FinalStatus FAILED, while State FAILED or KILLED usually means YARN itself ended the application. 

  • A KILLED state usually means the application was terminated explicitly (for example via yarn application -kill or the UI). A FAILED final status, or containers exiting with memory-related codes, typically points to an application error or a resource issue such as a container exceeding its memory limit. 
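
  Optionally, the application report includes a Diagnostics field that often states why YARN ended the application. A minimal sketch from the command line, using the same <application_id> placeholder as elsewhere in this article:

      # Print the application report, including final status and diagnostics
      yarn application -status <application_id>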

  3. Start with the Application Master (AM) Logs: 

  • Click on the ID of your failed application. 

  • On the Application Details page, find the "Logs" link next to the Application Master and click it. 

  • What to look for: 

  • Root Cause Exception: Search the logs for ERROR, FATAL, Exception, and Caused by:. The earliest exception in the log is often the root cause, and within a chained stack trace the innermost Caused by: entry is the underlying error (a command-line version of this search is sketched after this list). 

  • Driver-side Errors: Errors here indicate problems with the Spark driver itself (e.g., driver OOM, inability to connect to Metastore, issues with collecting large results). 

  • Executor Launch Failures: Messages indicating that executors could not be launched or failed immediately after launch (e.g., "Container launch failed for container..."). 
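
  The same keyword search can be run against the aggregated logs from the command line (log aggregation must be enabled); the grep pattern simply mirrors the keywords above:

      # Surface the first errors/exceptions in the aggregated logs
      yarn logs -applicationId <application_id> 2>/dev/null | \
        grep -n -E "ERROR|FATAL|Exception|Caused by" | head -n 40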

  4. Inspect Failed Executor Container Logs (if AM logs point to executor issues): 

  • Go back to the Application Details page and click on the "Containers" tab. 

  • Identify Failed Containers: Look for containers with a FAILED or KILLED status, or those with a non-zero Exit Code (e.g., 137, 143 for memory issues, 1 for general application errors). 

  • Access Logs: Click the "Logs" link for these failed containers. 

  • What to look for in stderr and log4j: 

  • java.lang.OutOfMemoryError: Indicates that the executor's JVM ran out of heap space. This could be due to spark.executor.memory being too low or a memory leak in the application. 

  • Container killed by YARN for exceeding memory limits: This message (or similar variations) means the YARN NodeManager killed the container because its total memory usage (heap + off-heap) exceeded its allocated YARN container size. This points to spark.executor.memory or spark.executor.memoryOverhead being too low. 

  • Task lost: Indicates a task failed to run successfully on an executor. The preceding log entries will often explain why. 

  • java.io.NotSerializableException: Common Spark error when non-serializable objects are captured in closures. 

  • Network errors: (java.net.SocketTimeoutException, Connection refused) can indicate network issues between executors or to external services. 

  • GC overhead limit exceeded: The JVM spent too much time doing garbage collection. Adjust spark.executor.memory or try using G1GC. 
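
  If the errors above point to executor memory pressure, a common first step is to raise the executor heap and/or the off-heap overhead at submit time. A minimal sketch; the values and your_app.jar are illustrative, and Spark 2.2 or older uses the property name spark.yarn.executor.memoryOverhead instead:

      # Increase executor heap and YARN overhead, and switch to G1GC (illustrative values)
      spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.executor.memory=6g \
        --conf spark.executor.memoryOverhead=1g \
        --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
        your_app.jar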

  5. Correlate with Spark UI (if the application partially runs): 

  • If the application started successfully (even if it later failed), the Spark UI (link usually available on the YARN Application Details page) can provide deeper insights. 

  • Jobs/Stages Tab: Identify which stages failed. Click on the stage to see details about individual tasks. 

  • Task List (on a stage's detail page): Look for tasks that took a long time, failed, or had unusually large shuffle read/write sizes (a sign of data skew). 

  • Executors Tab: Monitor memory usage, GC activity, and active/failed tasks for each executor. 
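
  Note that the live Spark UI goes away when the application exits. To replay it after a failure, enable event logging so the Spark History Server can rebuild the UI. A minimal sketch; the HDFS path is an assumption and must already exist on your cluster:

      # Persist Spark event logs for the History Server
      spark-submit \
        --master yarn \
        --conf spark.eventLog.enabled=true \
        --conf spark.eventLog.dir=hdfs:///spark-event-logs \
        your_app.jar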

  6. Check YARN Node Status: 

  • If multiple applications or containers are failing across the cluster, check the "Nodes" tab in the YARN UI. 

  • If NodeManagers are Unhealthy or Lost, investigate the underlying host's logs and resources. 
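
  Node health is also available from the CLI, which is convenient on large clusters:

      # Show all NodeManagers and their states
      yarn node -list -all

      # Show only problematic nodes
      yarn node -list -states UNHEALTHY,LOST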

  7. Use yarn logs CLI Utility: 

  • The yarn logs utility downloads or prints aggregated logs from the command line. It is especially useful for automation or when UI access is restricted, and it requires log aggregation to be enabled (see Additional Notes). 

  • yarn logs -applicationId <application_id>: Fetches all container logs for the application. 

  • yarn logs -applicationId <application_id> -containerId <container_id>: Fetches logs for a specific container. 
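
  Because the combined output can be very large, a typical workflow is to dump it to a file and search it. The application and container IDs below are illustrative:

      # Save all container logs for the application, then search for errors
      yarn logs -applicationId application_1700000000000_0042 > app_logs.txt
      grep -n -E "ERROR|Exception|Caused by|killed" app_logs.txt | less

      # Fetch the logs of a single container
      yarn logs -applicationId application_1700000000000_0042 \
        -containerId container_1700000000000_0042_01_000002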

Additional Notes: 

  • Log Retention: Ensure your YARN cluster is configured to retain logs for a sufficient period (yarn.log-aggregation-enable=true and yarn.log-aggregation.retain-seconds). If logs are aggregated to HDFS, they will persist even after the application is finished. 
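
  With aggregation enabled, the logs are copied to HDFS after the application finishes. Under default settings they typically land beneath /tmp/logs (the exact sub-directory layout varies between Hadoop versions), so verify the path against your yarn.nodemanager.remote-app-log-dir setting:

      # Check that aggregated logs for your user exist in HDFS (default location shown)
      hdfs dfs -ls /tmp/logs/<username>/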

  • Logging Levels: Temporarily increasing the logging level (e.g., from INFO to DEBUG) in Spark's log4j.properties can provide more verbose output for troubleshooting but remember to revert it in production to avoid excessive log volume. 
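
  One way to raise the level for a single run, without changing the cluster-wide configuration, is to ship a custom logging file with the job. A minimal sketch for Spark versions that use log4j 1.x (Spark 3.3+ uses Log4j 2, a log4j2.properties file, and -Dlog4j.configurationFile); log4j-debug.properties is a hypothetical file whose root logger is set to DEBUG:

      # Ship a DEBUG-level log4j config to the driver and executors for this run only
      spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --files log4j-debug.properties \
        --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-debug.properties" \
        --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-debug.properties" \
        your_app.jar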

  • Common Exit Codes: 

  • 137: Container process killed with SIGKILL (128 + 9), commonly the OS OOM killer or a forced kill by YARN. 

  • 143: Container process terminated with SIGTERM (128 + 15), typically YARN asking the container to shut down, for example after exceeding its memory limit. 

  • 1: Generic application error. 

  • 255: Unknown or internal error. 

  • Iterative Debugging: Debugging is often an iterative process. Identify a potential cause from the logs, make a small configuration change or code fix, and re-run to see the effect. 

Related Articles: 

  • Understanding Hadoop Logs – Types, Use Cases, and Common Locations 

  • Troubleshooting YARN Application Failures 

  • Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components 

  • Memory Overhead and OOM in Spark – Tuning Executor Settings 

  • Troubleshooting Out-of-Memory or Slow Execution – Spark 