Critical Configuration Properties for HDFS, YARN, Spark, and Other Hadoop Components

Category: Configuration → Hadoop Platform 
Applies To: Hadoop 3.x, Spark 3.x 

Issue Summary 

This document provides a comprehensive list of critical properties and essential configurations for the core components of the Hadoop ecosystem: HDFS, YARN, and Spark. These properties are fundamental to performance, stability, and resource management and should be configured based on the cluster's hardware and workload. 

  1. HDFS (Hadoop Distributed File System) 

Critical Properties (in hdfs-site.xml and core-site.xml) 

To edit the configuration file: 

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml 

nano $HADOOP_HOME/etc/hadoop/core-site.xml 

For the hdfs-site.xml file (a sample snippet follows the property list): 

  • dfs.replication: 

  • Description: The default number of replicas for each block of a file. 

  • Value: 3 (default and recommended for production). A higher value increases fault tolerance and read throughput but consumes more storage. 

  • dfs.nameservices: 

  • Description: Comma-separated list of nameservices. 

  • Value: mycluster 

  • dfs.ha.namenodes.mycluster: 

  • Description: The list of NameNode IDs in the nameservice; DataNodes and clients use this to determine all the NameNodes in the cluster. 

  • Value: nn1,nn2 

  • dfs.namenode.rpc-address.mycluster.nn1: 

  • Description: The RPC address on which NameNode nn1 listens for client requests. The NameNode's default RPC port is 8020. 

  • Value: <active_namenode_hostname>:8020 

  • dfs.namenode.rpc-address.mycluster.nn2: 

  • Description: The RPC address on which NameNode nn2 listens for client requests. 

  • Value: <standby_namenode_hostname>:8020 

  • dfs.namenode.http-address.mycluster.nn1: 

  • Description: The address and base port on which the NameNode web UI listens. The default HTTP port is 9870. 

  • Value: <active_namenode_hostname>:9870 

  • dfs.namenode.http-address.mycluster.nn2: 

  • Description: The address and base port on which the NameNode web UI listens. 

  • Value: <standby_namenode_hostname>:9870 

  • dfs.client.failover.proxy.provider.mycluster: 

  • Description: The Java class that HDFS clients use to determine which NameNode is currently Active and serving client requests. 

  • Value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider 

  • dfs.ha.fencing.methods: 

  • Description: A list of scripts or Java classes which will be used to fence the Active NameNode during a failover.   

  • Value: sshfence   

  • dfs.ha.fencing.ssh.private-key-files: 

  • Description: A comma-separated list of SSH private key files used by the sshfence method.   

  • Value: /home/<user_name>/.ssh/id_rsa 

  • dfs.journalnode.edits.dir: 

  • Description: The directory where the journal edit files are stored. 

  • Value: /opt/hadoop/journal 

  • dfs.namenode.shared.edits.dir: 

  • Description: A directory on shared storage between the multiple namenodes in an HA cluster. This directory will be written by the active and read by the standby in order to keep the namespaces synchronized. 

  • Value: qjournal://<active_namenode_hostname>:8485;<standby_namenode_hostname>:8485;<datanode_hostname>:8485/<nameservice> 

  • dfs.ha.automatic-failover.enabled: 

  • Description: Whether automatic failover is enabled. 

  • Value: true 

  • dfs.blocksize: 

  • Description: The default size of an HDFS block. 

  • Value: 134217728 (128 MB) or 268435456 (256 MB). Increasing this value for large files reduces NameNode metadata overhead and improves I/O efficiency for sequential reads. 

  • dfs.namenode.datanode.registration.ip-hostname-check: 

  • Description: Verifies that a DataNode's IP address and hostname match during registration. 

  • Value: true. It's a critical security measure to prevent unauthorized DataNodes from joining the cluster. 

  • dfs.namenode.safemode.threshold-pct: 

  • Description: The percentage of HDFS blocks that must be reported to the NameNode before it exits Safe Mode (read-only state). 

  • Value: 0.999f (default). Lowering this value can make the cluster available sooner after a restart, but with a higher risk of data unavailability. 

  • dfs.namenode.handler.count: 

  • Description: The number of NameNode RPC server threads. 

  • Value: Typically 20. For large clusters with many concurrent clients, increasing this value can improve NameNode responsiveness. 
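
For reference, a minimal hdfs-site.xml fragment tying the HA properties above together might look like the following sketch; the hostnames are placeholders, not defaults: 

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <!-- placeholder hostname; repeat for nn2 with the standby host -->
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>active-nn-host:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>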

For the core-site.xml file (a sample snippet follows the property list): 

  • ha.zookeeper.quorum: 

  • Description: A list of ZooKeeper server addresses, separated by commas, that are to be used by the ZKFailoverController in automatic failover. 

  • Value: <active_namenode_hostname>:2181,<standby_namenode_hostname>:2181,<datanode_hostname>:2181 

  • fs.defaultFS: 

  • Description: The URI of the default file system (the NameNode). 

  • Value: hdfs://<namenode_hostname>:<port> or hdfs://<nameservice_id> for HA. 
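
A matching core-site.xml fragment for the HA setup might look like this; the ZooKeeper hostnames are placeholders (in this article's topology they run on the NameNode and DataNode hosts): 

<configuration>
  <property>
    <!-- in an HA cluster, use the nameservice ID, not a single host -->
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>zk-host1:2181,zk-host2:2181,zk-host3:2181</value>
  </property>
</configuration>

Once both files are distributed and the services restarted, the HA state of each NameNode can be verified with hdfs haadmin -getServiceState nn1 (and likewise for nn2). 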

  2. YARN (Yet Another Resource Negotiator) 

Critical Properties (in yarn-site.xml) 

To edit the configuration file (a sample yarn-site.xml snippet follows the property list): 

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml 

  • yarn.nodemanager.resource.memory-mb: 

  • Description: The total amount of physical memory (in MB) that can be allocated for containers on a NodeManager. 

  • Value: Should be less than the total physical memory of the host machine to leave room for the OS and other daemons. Common practice is 75-80% of total RAM. 

  • yarn.nodemanager.aux-services: 

  • Description: A comma-separated list of auxiliary services run by the NodeManager. Service names may contain only letters, digits, and underscores, and cannot start with a digit. 

  • Value: mapreduce_shuffle 

  • yarn.resourcemanager.hostname: 

  • Description: The hostname of the ResourceManager. 

  • Value: <Namenode_hostname> 

  • yarn.resourcemanager.ha.enabled: 

  • Description: Enable RM high-availability. 

  • Value: true 

  • yarn.resourcemanager.cluster-id: 

  • Description: Name of the cluster. In an HA setting, this is used to ensure the RM participates in leader election for this cluster and does not affect other clusters. 

  • Value: yarn-cluster 

  • yarn.resourcemanager.ha.rm-ids: 

  • Description: The list of RM nodes in the cluster when HA is enabled. 

  • Value: rm1,rm2 

  • yarn.resourcemanager.hostname.<rm-id>: 

  • Description: The hostname of the ResourceManager identified by <rm-id>; define one such property for each ID listed in yarn.resourcemanager.ha.rm-ids (e.g., yarn.resourcemanager.hostname.rm1). 

  • Value: <active_namenode_hostname> (for rm1) and <standby_namenode_hostname> (for rm2) 

  • yarn.resourcemanager.recovery.enabled: 

  • Description: Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified. 

  • Value: true 

  • yarn.resourcemanager.ha.automatic-failover.enabled: 

  • Description: Enable automatic failover. By default, it is enabled only when HA is enabled. 

  • Value: true 

  • yarn.nodemanager.resource.cpu-vcores: 

  • Description: The number of CPU cores that can be allocated for containers on a NodeManager. 

  • Value: Should be set to the number of physical cores available on the machine. 

  • yarn.scheduler.minimum-allocation-mb / yarn.scheduler.minimum-allocation-vcores: 

  • Description: The minimum memory (in MB) and vcore allocation for a container. Container requests are rounded up to multiples of these values. 

  • Value: 1024 (1GB) and 1. Setting these too low can lead to excessive container requests and scheduling overhead. 

  • yarn.scheduler.maximum-allocation-mb / yarn.scheduler.maximum-allocation-vcores: 

  • Description: The maximum allocation of units for a container. 

  • Value: Should not exceed the NodeManager's total resources. 

  • yarn.resourcemanager.scheduler.class: 

  • Description: The scheduler used by the ResourceManager to allocate resources. 

  • Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (default) or ...fair.FairScheduler. 
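
To illustrate, a yarn-site.xml fragment combining the HA and resource settings above might look like the following; the hostnames and the memory figure are placeholders to be tuned to the node's hardware: 

<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <!-- placeholder hostname; repeat for rm2 -->
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>rm1-host</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- roughly 75% of a 16 GB node; adjust to your hardware -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
  </property>
</configuration>

With HA enabled, the currently active ResourceManager can be checked with yarn rmadmin -getServiceState rm1. 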

  3. Spark 

Critical Properties (in spark-defaults.conf) 

To edit the configuration file (a sample snippet follows the property list): 

nano $SPARK_HOME/conf/spark-defaults.conf 

  • spark.executor.memory: 

  • Description: The amount of memory to be allocated to each executor. 

  • Value: Depends on the workload. Typically, 2g to 8g. For large jobs, this can be increased, but it should be less than the NodeManager's max container size. 

  • spark.executor.cores: 

  • Description: The number of CPU cores to be allocated to each executor. 

  • Value: Typically 1 to 5. A higher value allows an executor to run more tasks concurrently. 

  • spark.driver.memory: 

  • Description: The memory allocated to the Spark driver process. 

  • Value: 1g is a common starting point. Increase for applications with large datasets that require local operations on the driver. 

  • spark.default.parallelism: 

  • Description: The default number of partitions for RDDs and shuffle output. 

  • Value: Should be 2-4 times the number of cores in the cluster. A higher value leads to more parallelism but also more scheduling overhead. 

  • spark.serializer: 

  • Description: The serializer used for data serialization. 

  • Value: org.apache.spark.serializer.KryoSerializer. Using Kryo can significantly improve performance over Java serialization. 
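
As a starting point, a spark-defaults.conf along these lines reflects the values above; the numbers are illustrative and should be tuned to the workload and to YARN's container limits: 

# Illustrative starting values; tune per workload and cluster size.
spark.executor.memory        4g
spark.executor.cores         4
spark.driver.memory          2g
spark.default.parallelism    200
spark.serializer             org.apache.spark.serializer.KryoSerializer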

  4. Hive 

Critical Properties (in hive-site.xml) 

To edit the configuration file (a sample snippet follows the property list): 

nano $HIVE_HOME/conf/hive-site.xml 

  • javax.jdo.option.ConnectionURL: 

  • Description: This property specifies the JDBC connection string for the metastore database (MySQL in this setup). 

  • Value: jdbc:mysql://<hostname>:<port>/<database> 

  • javax.jdo.option.ConnectionDriverName: 

  • Description: This property defines the JDBC driver class for MySQL. 

  • Value: com.mysql.cj.jdbc.Driver 

  • javax.jdo.option.ConnectionUserName: 

  • Description: This property sets the username for connecting to the Hive metastore database in MySQL. 

  • Value: <HIVE_USERNAME> 

  • javax.jdo.option.ConnectionPassword: 

  • Description: This property specifies the password for the Hive metastore database user. 

  • Value: <HIVE_PASSWORD> 

  • hive.metastore.uris: 

  • Description: This property specifies the Thrift URI(s) of the Hive metastore service(s). If using a single metastore server, this would be a single URI. If using a high-availability setup with multiple metastore servers, you'd list multiple URIs. 

  • Value: thrift://<hostname>:9083 
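
Putting these together, a minimal hive-site.xml fragment for a MySQL-backed metastore might look like this sketch; the hostnames, database name, and credentials are placeholders: 

<configuration>
  <property>
    <!-- placeholder host, port, and database name -->
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-db-host:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <!-- placeholder credentials -->
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>your_password</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>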


  5. Other Essential Configurations

  • Disable SELinux:

    • Edit the SELinux configuration file:

      • nano /etc/selinux/config

    • Set SELINUX=disabled (a reboot is required for the change to take effect).

  • Stop the firewall service:

    • sudo systemctl stop firewalld

    • sudo systemctl disable firewalld

  • Network Time Protocol (NTP) Synchronization: 

  • Configuration: Ensure all nodes are synchronized to a common NTP server using a service like chronyd. 

sudo yum install chrony 

sudo systemctl enable --now chronyd 

sudo timedatectl set-ntp true 

timedatectl status 

  • Reason: Time drift between nodes can cause scheduling issues in YARN. 

  • Linux Swappiness: 

  • Configuration: Set vm.swappiness to a low value (e.g., 1) in /etc/sysctl.conf; the Linux default of 60 is too aggressive for Hadoop nodes. A minimal sketch follows this item. 

  • Reason: The vm.swappiness parameter influences when the kernel moves data from RAM to swap space (disk). A higher value makes the kernel swap more aggressively, and swapping out Hadoop daemon memory causes severe performance degradation and timeouts, so it should be kept low. 
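
A minimal sketch of applying this persistently (assumes root privileges; the value 1 is the commonly recommended setting for Hadoop nodes): 

# Append the setting, reload kernel parameters, and verify.
echo "vm.swappiness=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
cat /proc/sys/vm/swappiness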

  • Log and Disk Management: 

  • Configuration: Implement log rotation policies for all Hadoop daemons so log files cannot fill the disk (an illustrative logrotate sketch follows this item). 

  • Reason: Unmanaged logs can consume all disk space, leading to service failures. 
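
As an illustration, a logrotate policy along the following lines can cap daemon log growth. The log directory is an assumption and must match your installation; rotating via the log4j settings shipped with Hadoop is an alternative: 

# /etc/logrotate.d/hadoop -- assumed path and log directory
/var/log/hadoop/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # copytruncate avoids restarting the daemons that hold the log files open
    copytruncate
}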

    • Related Articles

    • Troubleshooting Yarn Application Failures

      Troubleshooting Yarn Application Failures Category: Troubleshooting → YARN Applies To: Apache YARN 2.x, 3.x Issues Summary: YARN applications (such as Spark, MapReduce, Tez jobs) fail to complete successfully, often exiting with a FAILED status, or ...
    • How to Debug Spark Application Logs (YARN UI)

      How to Debug Spark Application Logs (YARN UI) Category: Troubleshooting → Apache Spark Applies To: Apache Spark 2.x, 3.x running on Apache Hadoop YARN 2.x, 3.x Issue summary: When a Spark application fails on a YARN cluster, the application logs are ...
    • Managing HDFS Space and Replication

      Managing HDFS Space and Replication Category: Troubleshooting → HDFS Applies To: Apache Hadoop HDFS 2.x, 3.x Issue summary: Effective management of HDFS disk space and data replication is crucial for the stability, performance, and data availability ...
    • Resource Allocation and Scheduler Configuration

      Resource Allocation and Scheduler Configuration Category: Administration → Resource Management Applies to: Apache Hadoop 2.x, 3.x Issue Summary This document outlines critical configurations for resource allocation and scheduler management within ...
    • Understanding Hadoop Logs – Types, Use Cases, and Common Locations.

      Category: Troubleshooting → Logging and montoring Applies To: Hadoop HA cluster. Issue Summary In a distributed Hadoop HA cluster, component logs are the primary source of truth for monitoring system health, diagnosing failures, and troubleshooting ...