Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components

Category: Configuration → Hadoop Platform 
Applies To: Hadoop 3.x, Spark 3.x 

Issue Summary 

This document provides a comprehensive list of critical properties and essential configurations for the core components of the Hadoop ecosystem: HDFS, YARN, and Spark. These properties are fundamental to performance, stability, and resource management and should be configured based on the cluster's hardware and workload. 

  1. HDFS (Hadoop Distributed File System) 

Critical Properties (in hdfs-site.xml and core-site.xml) 

To edit the configuration file: 

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml 

nano $HADOOP_HOME/etc/hadoop/core-site.xml 

For the hdfs-site.xml file (a sample snippet follows the property list): 

  • dfs.replication: 

  • Description: The default number of replicas for each block of a file. 

  • Value: 3 (default and recommended for production). A higher value increases fault tolerance and read throughput but consumes more storage. 

  • dfs.nameservices: 

  • Description: Comma-separated list of nameservices. 

  • Value: mycluster 

  • dfs.ha.namenodes.mycluster: 

  • Description: The list of NameNode IDs in the nameservice; DataNodes and clients use this to determine all the NameNodes in the cluster. 

  • Value: nn1,nn2 

  • dfs.namenode.rpc-address.mycluster.nn1: 

  • Description: The RPC address on which NameNode nn1 listens for client requests. The NameNode's default RPC port is 8020. 

  • Value: <active_namenode_hostname>:8020 

  • dfs.namenode.rpc-address.mycluster.nn2: 

  • Description: The RPC address on which NameNode nn2 listens for client requests. 

  • Value: <standby_namenode_hostname>:8020 

  • dfs.namenode.http-address.mycluster.nn1: 

  • Description: The address and base port on which the NameNode web UI listens. The default HTTP port is 9870. 

  • Value: <active_namenode_hostname>:9870 

  • dfs.namenode.http-address.mycluster.nn2: 

  • Description: The address and base port on which the NameNode web UI listens. 

  • Value: <standby_namenode_hostname>:9870 

  • dfs.client.failover.proxy.provider.mycluster: 

  • Description: The Java class that HDFS clients use to determine which NameNode is currently Active and serving client requests. 

  • Value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider 

  • dfs.ha.fencing.methods: 

  • Description: A list of scripts or Java classes which will be used to fence the Active NameNode during a failover.   

  • Value: sshfence   

  • dfs.ha.fencing.ssh.private-key-files: 

  • Description: A comma-separated list of SSH private key files used by the sshfence method.   

  • Value: /home/<user_name>/.ssh/id_rsa 

  • dfs.journalnode.edits.dir: 

  • Description: The directory where the journal edit files are stored. 

  • Value: /opt/hadoop/journal 

  • dfs.namenode.shared.edits.dir: 

  • Description: A directory on shared storage between the multiple namenodes in an HA cluster. This directory will be written by the active and read by the standby in order to keep the namespaces synchronized. 

  • Value: qjournal://<active_namenode_hostname>:8485;<standby_namenode_hostname>:8485;<datanode_hostname>:8485/<nameservice> 

  • dfs.ha.automatic-failover.enabled: 

  • Description: Whether automatic failover is enabled. 

  • Value: true 

  • dfs.blocksize: 

  • Description: The default size of an HDFS block. 

  • Value: 134217728 (128 MB) or 268435456 (256 MB). Increasing this value for large files reduces NameNode metadata overhead and improves I/O efficiency for sequential reads. 

  • dfs.namenode.datanode.registration.ip-hostname-check: 

  • Description: Verifies that a DataNode's IP address and hostname match during registration. 

  • Value: true. It's a critical security measure to prevent unauthorized DataNodes from joining the cluster. 

  • dfs.namenode.safemode.threshold-pct: 

  • Description: The percentage of HDFS blocks that must be reported to the NameNode before it exits Safe Mode (read-only state). 

  • Value: 0.999f (default). Lowering this value can make the cluster available sooner after a restart, but with a higher risk of data unavailability. 

  • dfs.namenode.handler.count: 

  • Description: The number of NameNode RPC server threads. 

  • Value: Typically 20. For large clusters with many concurrent clients, increasing this value can improve NameNode responsiveness. 
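
For reference, a minimal hdfs-site.xml fragment tying the HA properties above together might look like the following sketch; the hostnames are placeholders, not defaults: 

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <!-- placeholder hostname; repeat for nn2 with the standby host -->
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>active-nn-host:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>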

For the core-site.xml file (a sample snippet follows the property list): 

  • ha.zookeeper.quorum: 

  • Description: A list of ZooKeeper server addresses, separated by commas, that are to be used by the ZKFailoverController in automatic failover. 

  • Value: <active_namenode_hostname>:2181,<standby_namenode_hostname>:2181,<datanode_hostname>:2181 

  • fs.defaultFS: 

  • Description: The URI of the default file system (the NameNode). 

  • Value: hdfs://<namenode_hostname>:<port> or hdfs://<nameservice_id> for HA. 
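
A matching core-site.xml fragment for the HA setup might look like this; the ZooKeeper hostnames are placeholders (in this article's topology they run on the NameNode and DataNode hosts): 

<configuration>
  <property>
    <!-- in an HA cluster, use the nameservice ID, not a single host -->
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>zk-host1:2181,zk-host2:2181,zk-host3:2181</value>
  </property>
</configuration>

Once both files are distributed and the services restarted, the HA state of each NameNode can be verified with hdfs haadmin -getServiceState nn1 (and likewise for nn2). 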

  2. YARN (Yet Another Resource Negotiator) 

Critical Properties (in yarn-site.xml) 

To edit the configuration file (a sample yarn-site.xml snippet follows the property list): 

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml 

  • yarn.nodemanager.resource.memory-mb: 

  • Description: The total amount of physical memory (in MB) that can be allocated for containers on a NodeManager. 

  • Value: Should be less than the total physical memory of the host machine to leave room for the OS and other daemons. Common practice is 75-80% of total RAM. 

  • yarn.nodemanager.aux-services: 

  • Description: A comma-separated list of auxiliary services run by the NodeManager. Service names may contain only letters, digits, and underscores, and cannot start with a digit. 

  • Value: mapreduce_shuffle 

  • yarn.resourcemanager.hostname: 

  • Description: The hostname of the ResourceManager. 

  • Value: <Namenode_hostname> 

  • yarn.resourcemanager.ha.enabled: 

  • Description: Enable RM high-availability. 

  • Value: true 

  • yarn.resourcemanager.cluster-id: 

  • Description: Name of the cluster. In an HA setting, this is used to ensure the RM participates in leader election for this cluster and does not affect other clusters. 

  • Value: yarn-cluster 

  • yarn.resourcemanager.ha.rm-ids: 

  • Description: The list of RM nodes in the cluster when HA is enabled. 

  • Value: rm1,rm2 

  • yarn.resourcemanager.hostname.<rm-id>: 

  • Description: The hostname of the ResourceManager identified by <rm-id>; define one such property for each ID listed in yarn.resourcemanager.ha.rm-ids (e.g., yarn.resourcemanager.hostname.rm1). 

  • Value: <active_namenode_hostname> (for rm1) and <standby_namenode_hostname> (for rm2) 

  • yarn.resourcemanager.recovery.enabled: 

  • Description: Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified. 

  • Value: true 

  • yarn.resourcemanager.ha.automatic-failover.enabled: 

  • Description: Enable automatic failover. By default, it is enabled only when HA is enabled. 

  • Value: true 

  • yarn.nodemanager.resource.cpu-vcores: 

  • Description: The number of CPU cores that can be allocated for containers on a NodeManager. 

  • Value: Should be set to the number of physical cores available on the machine. 

  • yarn.scheduler.minimum-allocation-mb / yarn.scheduler.minimum-allocation-vcores: 

  • Description: The minimum memory (in MB) and vcore allocation for a container. Container requests are rounded up to multiples of these values. 

  • Value: 1024 (1GB) and 1. Setting these too low can lead to excessive container requests and scheduling overhead. 

  • yarn.scheduler.maximum-allocation-mb / yarn.scheduler.maximum-allocation-vcores: 

  • Description: The maximum allocation of units for a container. 

  • Value: Should not exceed the NodeManager's total resources. 

  • yarn.resourcemanager.scheduler.class: 

  • Description: The scheduler used by the ResourceManager to allocate resources. 

  • Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (default) or ...fair.FairScheduler. 
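
To illustrate, a yarn-site.xml fragment combining the HA and resource settings above might look like the following; the hostnames and the memory figure are placeholders to be tuned to the node's hardware: 

<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <!-- placeholder hostname; repeat for rm2 -->
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>rm1-host</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- roughly 75% of a 16 GB node; adjust to your hardware -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
  </property>
</configuration>

With HA enabled, the currently active ResourceManager can be checked with yarn rmadmin -getServiceState rm1. 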

  3. Spark 

Critical Properties (in spark-defaults.conf) 

To edit the configuration file (a sample snippet follows the property list): 

nano $SPARK_HOME/conf/spark-defaults.conf 

  • spark.executor.memory: 

  • Description: The amount of memory to be allocated to each executor. 

  • Value: Depends on the workload. Typically, 2g to 8g. For large jobs, this can be increased, but it should be less than the NodeManager's max container size. 

  • spark.executor.cores: 

  • Description: The number of CPU cores to be allocated to each executor. 

  • Value: Typically 1 to 5. A higher value allows an executor to run more tasks concurrently. 

  • spark.driver.memory: 

  • Description: The memory allocated to the Spark driver process. 

  • Value: 1g is a common starting point. Increase for applications with large datasets that require local operations on the driver. 

  • spark.default.parallelism: 

  • Description: The default number of partitions for RDDs and shuffle output. 

  • Value: Should be 2-4 times the number of cores in the cluster. A higher value leads to more parallelism but also more scheduling overhead. 

  • spark.serializer: 

  • Description: The serializer used for data serialization. 

  • Value: org.apache.spark.serializer.KryoSerializer. Using Kryo can significantly improve performance over Java serialization. 
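
As a starting point, a spark-defaults.conf along these lines reflects the values above; the numbers are illustrative and should be tuned to the workload and to YARN's container limits: 

# Illustrative starting values; tune per workload and cluster size.
spark.executor.memory        4g
spark.executor.cores         4
spark.driver.memory          2g
spark.default.parallelism    200
spark.serializer             org.apache.spark.serializer.KryoSerializer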

  4. Hive 

Critical Properties (in hive-site.xml) 

To edit the configuration file (a sample snippet follows the property list): 

nano $HIVE_HOME/conf/hive-site.xml 

  • javax.jdo.option.ConnectionURL: 

  • Description: This property specifies the JDBC connection string for the metastore database (MySQL in this setup). 

  • Value: jdbc:mysql://<hostname>:<port>/<database> 

  • javax.jdo.option.ConnectionDriverName: 

  • Description: This property defines the JDBC driver class for MySQL. 

  • Value: com.mysql.cj.jdbc.Driver 

  • javax.jdo.option.ConnectionUserName: 

  • Description: This property sets the username for connecting to the Hive metastore database in MySQL. 

  • Value: <HIVE_USERNAME> 

  • javax.jdo.option.ConnectionPassword: 

  • Description: This property specifies the password for the Hive metastore database user. 

  • Value: <HIVE_PASSWORD> 

  • hive.metastore.uris: 

  • Description: This property specifies the Thrift URI(s) of the Hive metastore service(s). If using a single metastore server, this would be a single URI. If using a high-availability setup with multiple metastore servers, you'd list multiple URIs. 

  • Value: thrift://<hostname>:9083 
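
Putting these together, a minimal hive-site.xml fragment for a MySQL-backed metastore might look like this sketch; the hostnames, database name, and credentials are placeholders: 

<configuration>
  <property>
    <!-- placeholder host, port, and database name -->
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-db-host:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <!-- placeholder credentials -->
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>your_password</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>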


  5. Other Essential Configurations

  • Disable SELinux:

    • Edit the SELinux configuration file:

      • nano /etc/selinux/config

    • Set SELINUX=disabled (a reboot is required for the change to take effect).

  • Stop the firewall service:

    • sudo systemctl stop firewalld

    • sudo systemctl disable firewalld

  • Network Time Protocol (NTP) Synchronization: 

  • Configuration: Ensure all nodes are synchronized to a common NTP server using a service like chronyd. 

sudo yum install chrony 

sudo systemctl enable --now chronyd 

sudo timedatectl set-ntp true 

timedatectl status 

  • Reason: Time drift between nodes can cause scheduling issues in YARN. 

  • Linux Swappiness: 

  • Configuration: Set vm.swappiness to a low value (e.g., 1) in /etc/sysctl.conf; the Linux default of 60 is too aggressive for Hadoop nodes. A minimal sketch follows this item. 

  • Reason: The vm.swappiness parameter influences when the kernel moves data from RAM to swap space (disk). A higher value makes the kernel swap more aggressively, and swapping out Hadoop daemon memory causes severe performance degradation and timeouts, so it should be kept low. 
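
A minimal sketch of applying this persistently (assumes root privileges; the value 1 is the commonly recommended setting for Hadoop nodes): 

# Append the setting, reload kernel parameters, and verify.
echo "vm.swappiness=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
cat /proc/sys/vm/swappiness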

  • Log and Disk Management: 

  • Configuration: Implement log rotation policies for all Hadoop daemons so log files cannot fill the disk (an illustrative logrotate sketch follows this item). 

  • Reason: Unmanaged logs can consume all disk space, leading to service failures. 
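
As an illustration, a logrotate policy along the following lines can cap daemon log growth. The log directory is an assumption and must match your installation; rotating via the log4j settings shipped with Hadoop is an alternative: 

# /etc/logrotate.d/hadoop -- assumed path and log directory
/var/log/hadoop/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # copytruncate avoids restarting the daemons that hold the log files open
    copytruncate
}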

    • Related Articles

    • Troubleshooting Yarn Application Failures

      Troubleshooting Yarn Application Failures Category: Troubleshooting → YARN Applies To: Apache YARN 2.x, 3.x Issues Summary: YARN applications (such as Spark, MapReduce, Tez jobs) fail to complete successfully, often exiting with a FAILED status, or ...
    • How to Debug Spark Application Logs (YARN UI)

      How to Debug Spark Application Logs (YARN UI) Category: Troubleshooting → Apache Spark Applies To: Apache Spark 2.x, 3.x running on Apache Hadoop YARN 2.x, 3.x Issue summary: When a Spark application fails on a YARN cluster, the application logs are ...
    • Managing HDFS Space and Replication

      Managing HDFS Space and Replication Category: Troubleshooting → HDFS Applies To: Apache Hadoop HDFS 2.x, 3.x Issue summary: Effective management of HDFS disk space and data replication is crucial for the stability, performance, and data availability ...
    • Resource Allocation and Scheduler Configuration

      Resource Allocation and Scheduler Configuration Category: Administration → Resource Management Applies to: Apache Hadoop 2.x, 3.x Issue Summary This document outlines critical configurations for resource allocation and scheduler management within ...
    • Understanding Hadoop Logs – Types, Use Cases, and Common Locations.

      Category: Troubleshooting → Logging and montoring Applies To: Hadoop HA cluster. Issue Summary In a distributed Hadoop HA cluster, component logs are the primary source of truth for monitoring system health, diagnosing failures, and troubleshooting ...