Identifying Causes and Solutions for Job Slowness in Hadoop

Category: Troubleshooting Performance, Job Management 

Applies To: Distributed Processing Systems (Hadoop, Spark, etc.), Databases, Any application with batch jobs 

Issue Summary  

A batch job, data pipeline, or long-running process is executing significantly slower than expected, impacting SLAs, data freshness, or resource utilization. 

Possible Cause(s)  

Common reasons why this issue may occur:

  1. Resource bottlenecks: Insufficient CPU, memory, disk I/O, or network bandwidth on nodes.

  2. Data skew: Uneven distribution of data, causing some tasks/nodes to process much more data than others.

  3. Input/Output bottlenecks: Slow reads from source systems or slow writes to destination systems.

  4. Configuration issues: Suboptimal job configuration parameters (e.g., too few executors, incorrect parallelism).

  5. Network latency/congestion: High latency or saturation in the network between cluster components.

  6. External dependencies: Slow performance of external databases, APIs, or services the job interacts with.

  7. Small files problem (Hadoop/Spark): Processing many tiny files creates high overhead.

Step-by-Step Resolution  

  1. Monitor Job Progress and Resources:

  • Use the job UI (YARN ResourceManager UI or Spark UI) to monitor task progress and identify slow stages/tasks.

  • Monitor resource utilization on cluster nodes (CPU, memory, disk I/O, network). For a quick cluster-wide view of running applications and their resource usage, use the YARN CLI (node-level checks are sketched below):

yarn top
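If the job UI points at a specific slow node, standard Linux tools on that node can confirm whether the hardware is the bottleneck (this assumes shell access to the node; adjust the sampling intervals as needed):

top              # per-process CPU and memory pressure
iostat -x 1      # per-device disk utilization and I/O wait
sar -n DEV 1     # per-interface network throughput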

  2. Analyze Job Logs:

  • Review driver and executor logs for errors, warnings, or performance indicators.

  • For Spark jobs, the Spark History Server UI (default port 18080) lists completed applications with their stages and executors:

http://<hostname>:18080/
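For jobs that ran on YARN, the aggregated container logs can also be retrieved from the command line once log aggregation has completed; the application ID comes from the ResourceManager UI or from yarn application -list:

yarn logs -applicationId <application_id>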

  3. Check for Data Skew:

  • Look at task durations in the job UI. If a few tasks run significantly longer than others, it is a strong indicator of data skew.

  • To check task durations for a Spark job, open the slow stage in the Spark UI and compare the minimum, median, and maximum values in the task summary; a large gap between the median and maximum usually means a few partitions hold most of the data. A programmatic check is sketched below.
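A quick programmatic check is to count rows per join or grouping key and inspect the heaviest keys. This is a minimal PySpark sketch; the input path and the key column user_id are assumptions standing in for your own data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Placeholder input; point this at the dataset feeding the slow stage.
df = spark.read.parquet("/data/events")

# Count rows per key and show the heaviest keys first; a handful of keys
# holding most of the rows is the classic data-skew signature.
(df.groupBy("user_id")
   .count()
   .orderBy(F.desc("count"))
   .show(20, truncate=False))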

  4. Review Job Configuration:

  • For PySpark: set Spark configuration inside the program file (.py) when the SparkSession is created (a sketch follows the property list below).

  • For Spark in general: set cluster-wide defaults in the Spark configuration file:

nano $SPARK_HOME/conf/spark-defaults.conf

Key properties to review:

  • spark.executor.memory: Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g).

  • spark.executor.cores: The number of cores to use on each executor. On YARN this defaults to 1; in standalone and Mesos coarse-grained modes it defaults to all available cores on the worker.

  • spark.dynamicAllocation.enabled: Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.

  • spark.default.parallelism: Spark's default parallelism is not a fixed number; it is determined by the environment. In local mode it defaults to the number of cores on the machine; in cluster mode it is typically the total number of cores available, but it can be overridden.
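A minimal sketch of the PySpark route mentioned above; the property values are illustrative assumptions, not recommendations, and should be sized to the cluster:

from pyspark.sql import SparkSession

# Example values only; match executor memory/cores to what the cluster offers.
spark = (SparkSession.builder
         .appName("nightly-aggregation")                    # hypothetical job name
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

The same properties can also be set cluster-wide in spark-defaults.conf or passed per run with spark-submit --conf key=value.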

  5. Optimize I/O:

  • File formats: Use efficient, columnar file formats (e.g., Parquet, ORC) for large datasets instead of row-oriented text formats such as CSV.
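For example, converting a CSV source to Parquet once and having downstream jobs read the Parquet copy usually reduces both scan volume and read time. A minimal PySpark sketch with placeholder paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the original text-based data (placeholder path).
df = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

# Rewrite it as Parquet; later jobs read the columnar copy instead.
df.write.mode("overwrite").parquet("/data/curated/events_parquet")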

  6. Refine Code/Queries:

  • Review the job code for inefficient loops, unnecessary computations, or redundant operations.

  • For SQL-based jobs, analyze query plans and optimize indexes, joins, and filtering.
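In Spark SQL, the query plan can be inspected directly from the DataFrame or SQL statement with explain(); in Spark 3.x the formatted mode gives a readable physical plan. The events table name below is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-check").getOrCreate()

# "events" is a placeholder table/view name registered elsewhere.
query = spark.sql("SELECT user_id, count(*) AS cnt FROM events GROUP BY user_id")

# Print the physical plan; look for full scans, wide shuffles (Exchange),
# and joins that could be broadcast instead of shuffled.
query.explain(mode="formatted")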

  7. Address Small Files:

  • If dealing with many small files, consider combining them into larger files before processing.
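One way to compact them is to read the small files and rewrite the data with a controlled number of output partitions. A minimal PySpark sketch; the paths and the target of 32 output files are assumptions to adjust for your data volume:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Read the directory full of small files (placeholder path).
df = spark.read.parquet("/data/landing/small_files/")

# coalesce() lowers the number of output files without a full shuffle;
# use repartition() instead if the data also needs to be rebalanced evenly.
df.coalesce(32).write.mode("overwrite").parquet("/data/landing/compacted/")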

Additional Notes: 

  • Profiling the application code can reveal hotspots causing slowness (a driver-side profiling sketch follows these notes).

  • Baseline performance measurements are crucial to identify when a job is actually “slow”. 

  • Break down complex jobs into smaller, manageable stages to isolate performance issues. 
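As a driver-side illustration of the profiling note above, Python's built-in cProfile can show where time is spent in the submitting script. This only covers driver-side Python; executor-side work is better examined in the Spark UI. The run_job function is a stand-in for your own driver logic:

import cProfile
import pstats

def run_job():
    # Placeholder for the driver-side logic of the job being investigated.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
run_job()
profiler.disable()

# Print the 10 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)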

Related Articles

  • Job Not Progressing - Stuck in NEW_SAVING After Submission

  • Hadoop/YARN Jobs Not Starting - Stuck in ACCEPTED State

  • Understanding Hadoop Logs – Types, Use Cases, and Common Locations

  • Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components

  • Disk Space Full on /tmp Mount in Linux