Category: Troubleshooting → Performance, Job Management
Applies To: Distributed Processing Systems (Hadoop, Spark, etc.), Databases, Any application with batch jobs
Issue Summary
A batch job, data pipeline, or long-running process is executing significantly slower than expected, impacting SLAs, data freshness, or resource utilization.
Possible Cause(s)
Common reasons why this issue may occur include:
Resource bottlenecks: Insufficient CPU, memory, disk I/O, or network bandwidth on nodes.
Data skew: Uneven distribution of data, causing some tasks/nodes to process much more data than others.
Input/Output bottlenecks: Slow reads from source systems or slow writes to destination systems.
Configuration issues: Suboptimal job configuration parameters (e.g., too few executors, incorrect parallelism).
Network latency/congestion: High latency or saturation in the network between cluster components.
External dependencies: Slow performance of external databases, APIs, or services the job interacts with.
Small files problem (Hadoop/Spark): Processing many tiny files creates high overhead.
Step-by-Step Resolution
1. Monitor Job Progress and Resources:
Use the job UI (for Spark, the application UI served on port 4040 of the driver host by default) to monitor task progress and identify slow stages or tasks.
Monitor resource utilization on cluster nodes (CPU, memory, disk I/O, network).
On YARN-managed clusters, a quick overview of running applications and their resource usage is available with:
yarn top
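Spark also exposes a monitoring REST API on the application UI port (4040 by default for a running application; the history server serves the same endpoints on port 18080 for completed applications), which is useful for inspecting progress programmatically. A minimal sketch, assuming the default port and a driver reachable as localhost, that prints stage timings sorted by total executor run time:

    import requests  # assumes the third-party 'requests' package is installed

    BASE = "http://localhost:4040/api/v1"  # adjust host/port to your driver

    # List the applications known to this UI, then dump stage-level timings.
    for app in requests.get(f"{BASE}/applications").json():
        app_id = app["id"]
        stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()
        # Sort by total executor run time to surface the most expensive stages first.
        for s in sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True):
            print(app_id, s["stageId"], s["status"], s.get("numTasks"), "tasks,",
                  s.get("executorRunTime", 0), "ms executor run time -", s["name"])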
2. Analyze Job Logs:
Review driver and executor logs for errors, warnings, or performance indicators.
For completed Spark applications, the Spark History Server exposes the same UI and event logs, by default at:
http://<hostname>:18080/
On YARN, aggregated container logs can also be retrieved with yarn logs -applicationId <application_id>.
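If the logs are large, a quick triage pass over a saved log file (for example, the output of yarn logs -applicationId <application_id> redirected to a file) can surface error counts and possible GC pressure before a full read-through. A minimal sketch; the file name app.log is hypothetical:

    import re
    from collections import Counter

    LOG_PATH = "app.log"  # hypothetical: a saved driver/executor log

    levels = Counter()
    gc_lines = []
    with open(LOG_PATH, errors="replace") as f:
        for line in f:
            match = re.search(r"\b(WARN|ERROR|FATAL)\b", line)
            if match:
                levels[match.group(1)] += 1
            if "Full GC" in line or "GC pause" in line:
                gc_lines.append(line.strip())

    print("Log level counts:", dict(levels))
    print("Possible GC-related lines:", len(gc_lines))
    for line in gc_lines[:10]:
        print(" ", line)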
3. Check for Data Skew:
Look at task durations in the job UI. If a few tasks run significantly longer than others, it's a strong indicator of data skew.
Task-level durations are listed in the Stages tab of the job UI; a programmatic way to check whether the data itself is skewed is sketched below.
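A minimal PySpark sketch for checking skew in the data itself; the input path /data/events and the key column customer_id are hypothetical placeholders for your own dataset and join/grouping key. If a handful of keys carry most of the rows, the tasks that process those keys will dominate the stage runtime:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("skew-check").getOrCreate()

    df = spark.read.parquet("/data/events")  # hypothetical input path

    (df.groupBy("customer_id")   # the suspected skew key
       .count()
       .orderBy(F.desc("count"))
       .show(20, truncate=False))  # heavily loaded keys appear at the top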
4. Review Job Configuration:
For PySpark: set the configuration in the application file (.py) when building the SparkSession (see the sketch after the parameter list below).
For cluster-wide defaults (spark-submit, spark-shell): edit the Spark defaults file:
nano $SPARK_HOME/conf/spark-defaults.conf
spark.executor.memory: Amount of memory to use per executor process, in the same format as JVM memory strings with a size-unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g).
spark.executor.cores: The number of cores to use on each executor.
spark.dynamicAllocation.enabled: Whether to use dynamic resource allocation, which scales the number of executors registered with the application up and down based on the workload.
spark.default.parallelism: The default number of partitions used by RDD operations such as joins and reduceByKey when not set explicitly. It is not a fixed number: in local mode it defaults to the number of cores on the machine, and in cluster modes it is typically derived from the total cores available in the cluster, but it can be overridden.
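A minimal PySpark sketch of setting these parameters in code; the values shown are illustrative placeholders, not recommendations, and should be tuned to your cluster and workload:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("tuned-batch-job")
             .config("spark.executor.memory", "4g")              # per-executor memory (placeholder)
             .config("spark.executor.cores", "4")                # cores per executor (placeholder)
             .config("spark.dynamicAllocation.enabled", "true")  # scale executor count with load
             .config("spark.default.parallelism", "200")         # default partition count (placeholder)
             .getOrCreate())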
5. Optimize I/O:
File formats: Use efficient file formats (e.g., Parquet, ORC) for large datasets.
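For example, a one-off conversion of a CSV dataset to Parquet could look like the following sketch; the source and destination paths are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Hypothetical paths -- replace with your own source and destination.
    events = spark.read.option("header", "true").csv("/data/raw/events_csv")
    events.write.mode("overwrite").parquet("/data/curated/events_parquet")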
6. Refine Code/Queries:
Review the job code for inefficient loops, unnecessary computations, or redundant operations.
For SQL-based jobs, analyze query plans and optimize indexes, joins, and filtering.
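In Spark SQL, the optimizer's plan can be inspected directly, which often reveals full scans, missed filter pushdown, or an unexpected join strategy. A minimal sketch; the input path and the columns status, customer_id, and amount are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("plan-inspection").getOrCreate()
    df = spark.read.parquet("/data/events")  # hypothetical input

    # Show the plan Spark will execute; look for scans, shuffles, and join strategies.
    filtered = df.filter(df["status"] == "ACTIVE").select("customer_id", "amount")
    filtered.explain(mode="formatted")  # Spark 3.x; use filtered.explain(True) on Spark 2.x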
7. Address Small Files:
If dealing with many small files, consider combining them into larger files before processing.
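One common compaction pattern in PySpark is to read the small files and rewrite them with a controlled number of output partitions; the paths and the target partition count below are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    small = spark.read.parquet("/data/landing/small_files")  # hypothetical source with many tiny files
    # Rewrite as fewer, larger files; choose the count so each output file is roughly 128-512 MB.
    small.repartition(32).write.mode("overwrite").parquet("/data/landing/compacted")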
Additional Notes:
Profiling the application code can reveal hotspots causing slowness.
Baseline performance measurements are crucial to identify when a job is actually “slow”.
Break down complex jobs into smaller, manageable stages to isolate performance issues.