Category: Configuration → Hadoop Platform
Applies To: Hadoop 3.x, Spark 3.x
Issue Summary
This document provides a comprehensive list of critical properties and essential configurations for the core components of the Hadoop ecosystem: HDFS, YARN, Spark, and Hive. These properties are fundamental to performance, stability, and resource management and should be configured based on the cluster's hardware and workload.
HDFS (Hadoop Distributed File System)
Critical Properties (in hdfs-site.xml and core-site.xml)
To edit the configuration file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
nano $HADOOP_HOME/etc/hadoop/core-site.xml
For the hdfs-site.xml file:
dfs.replication:
Description: The default number of replicas for each block of a file.
Value: 3 (default and recommended for production). A higher value increases fault tolerance and read throughput but consumes more storage.
dfs.nameservices:
Description: Comma-separated list of nameservices.
Value: mycluster
dfs.ha.namenodes.mycluster:
Description: The list of NameNode IDs for the nameservice. It is used by DataNodes and clients to determine all the NameNodes in the cluster.
Value: nn1,nn2
dfs.namenode.rpc-address.mycluster.nn1:
Description: The RPC address that handles all client requests. The NameNode's default RPC port is 8020.
Value: <active_namenode_hostname>:8020
dfs.namenode.rpc-address.mycluster.nn2:
Description: The RPC address that handles all client requests. The NameNode's default RPC port is 8020.
Value: <standby_namenode_hostname>:8020
dfs.namenode.http-address.mycluster.nn1:
Description: The address and base port on which the NameNode web UI listens.
Value: <active_namenode_hostname>:9870
dfs.namenode.http-address.mycluster.nn2:
Description: The address and base port on which the NameNode web UI listens.
Value: <standby_namenode_hostname>:9870
dfs.client.failover.proxy.provider.mycluster:
Description: The Java class that HDFS clients use to determine which NameNode is currently Active. The property name is a prefix plus the nameservice ID.
Value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.fencing.methods:
Description: A list of scripts or Java classes which will be used to fence the Active NameNode during a failover.
Value: sshfence
dfs.ha.fencing.ssh.private-key-files:
Description: A comma-separated list of SSH private key files.
Value: /home/<user_name>/.ssh/id_rsa
dfs.journalnode.edits.dir:
Description: The directory where the journal edit files are stored.
Value: /opt/hadoop/journal
dfs.namenode.shared.edits.dir:
Description: The qjournal URI identifying the JournalNode group shared by the NameNodes in an HA cluster. The active NameNode writes edits to it and the standby reads from it to keep the namespaces synchronized. The hosts listed are those running JournalNodes (here colocated with the NameNodes and a DataNode).
Value: qjournal://<active_namenode_hostname>:8485;<standby_namenode_hostname>:8485;<datanode_hostname>:8485/<nameservice>
dfs.ha.automatic-failover.enabled:
Description: Whether automatic failover is enabled.
Value: true
dfs.blocksize:
Description: The default size of an HDFS block.
Value: 134217728 (128 MB) or 268435456 (256 MB). Increasing this value for large files reduces NameNode metadata overhead and improves I/O efficiency for sequential reads.
dfs.namenode.datanode.registration.ip-hostname-check:
Description: Verifies that a DataNode's IP address and hostname match during registration.
Value: true. It's a critical security measure to prevent unauthorized DataNodes from joining the cluster.
dfs.namenode.safemode.threshold-pct:
Description: The percentage of HDFS blocks that must be reported to the NameNode before it exits Safe Mode (read-only state).
Value: 0.999f (default). Lowering this value can make the cluster available sooner after a restart, but with a higher risk of data unavailability.
dfs.namenode.handler.count:
Description: The number of NameNode RPC server threads.
Value: 10 (default). For large clusters with many clients, increasing this (e.g., to 20 or more) can improve NameNode responsiveness.
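Putting the hdfs-site.xml properties above together, a minimal HA sketch might look like the following. All hostnames are illustrative placeholders, and only a subset of the properties is shown:

```xml
<!-- hdfs-site.xml: illustrative HA sketch; hostnames are placeholders -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```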
For the core-site.xml file:
ha.zookeeper.quorum:
Description: A list of ZooKeeper server addresses, separated by commas, that are to be used by the ZKFailoverController in automatic failover.
Value: <active_namenode_hostname>:2181,<standby_namenode_hostname>:2181,<datanode_hostname>:2181
fs.defaultFS:
Description: The URI of the default file system (the NameNode).
Value: hdfs://<namenode_hostname>:<port> or hdfs://<nameservice_id> for HA.
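A matching core-site.xml sketch, assuming the mycluster nameservice above and a three-node ZooKeeper ensemble (hostnames are placeholders):

```xml
<!-- core-site.xml: illustrative sketch; hostnames are placeholders -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- For HA, point at the nameservice ID, not a single NameNode host -->
    <value>hdfs://mycluster</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
</configuration>
```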
YARN (Yet Another Resource Negotiator)
Critical Properties (in yarn-site.xml)
To edit the configuration file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
yarn.nodemanager.resource.memory-mb:
Description: The total amount of physical memory (in MB) that can be allocated for containers on a NodeManager.
Value: Should be less than the total physical memory of the host machine to leave room for the OS and other daemons. Common practice is 75-80% of total RAM.
yarn.nodemanager.aux-services:
Description: A comma-separated list of auxiliary services. Service names may contain only letters (a-z, A-Z), digits (0-9), and underscores, and cannot start with a digit.
Value: mapreduce_shuffle
yarn.resourcemanager.hostname:
Description: The hostname of the ResourceManager.
Value: <Namenode_hostname>
yarn.resourcemanager.ha.enabled:
Description: Enable RM high-availability.
Value: true
yarn.resourcemanager.cluster-id:
Description: Name of the cluster. In an HA setting, this ensures the RM participates in leader election for this cluster only and does not affect other clusters.
Value: yarn-cluster
yarn.resourcemanager.ha.rm-ids:
Description: The list of RM nodes in the cluster when HA is enabled.
Value: rm1,rm2
yarn.resourcemanager.hostname.<rm-id>:
Description: The hostname of each ResourceManager; set one such property per ID listed in yarn.resourcemanager.ha.rm-ids (e.g., yarn.resourcemanager.hostname.rm1).
Value: the host running that RM (e.g., <active_namenode_hostname> for rm1, <standby_namenode_hostname> for rm2)
yarn.resourcemanager.recovery.enabled:
Description: Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified.
Value: true
yarn.resourcemanager.ha.automatic-failover.enabled:
Description: Enable automatic failover. By default, it is enabled only when HA is enabled.
Value: true
yarn.nodemanager.resource.cpu-vcores:
Description: The number of CPU cores that can be allocated for containers on a NodeManager.
Value: Should be set to the number of physical cores available on the machine.
yarn.scheduler.minimum-allocation-mb / yarn.scheduler.minimum-allocation-vcores:
Description: The minimum memory (MB) and vcore allocation for a container. All container requests are rounded up to these values.
Value: 1024 (1GB) and 1. Setting these too low can lead to excessive container requests and scheduling overhead.
yarn.scheduler.maximum-allocation-mb / yarn.scheduler.maximum-allocation-vcores:
Description: The maximum allocation of units for a container.
Value: Should not exceed the NodeManager's total resources.
yarn.resourcemanager.scheduler.class:
Description: The scheduler used by the ResourceManager to allocate resources.
Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (default) or ...fair.FairScheduler.
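As a sketch, a yarn-site.xml for a hypothetical 64 GB, 16-core worker that applies the sizing guidance above (roughly 75-80% of RAM reserved for containers) might look like:

```xml
<!-- yarn-site.xml: illustrative sizing for a 64 GB / 16-core NodeManager -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <!-- 48 GB, ~75% of a 64 GB host; the rest stays for OS and daemons -->
    <value>49152</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <!-- Must not exceed yarn.nodemanager.resource.memory-mb -->
    <value>49152</value>
  </property>
</configuration>
```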
Spark
Critical Properties (in spark-defaults.conf)
To edit the configuration file:
nano $SPARK_HOME/conf/spark-defaults.conf
spark.executor.memory:
Description: The amount of memory to be allocated to each executor.
Value: Depends on the workload. Typically, 2g to 8g. For large jobs, this can be increased, but it should be less than the NodeManager's max container size.
spark.executor.cores:
Description: The number of CPU cores to be allocated to each executor.
Value: Typically 1 to 5. A higher value allows an executor to run more tasks concurrently.
spark.driver.memory:
Description: The memory allocated to the Spark driver process.
Value: 1g is a common starting point. Increase for applications with large datasets that require local operations on the driver.
spark.default.parallelism:
Description: The default number of partitions for RDDs and shuffle output.
Value: Should be 2-4 times the number of cores in the cluster. A higher value leads to more parallelism but also more scheduling overhead.
spark.serializer:
Description: The serializer used for data serialization.
Value: org.apache.spark.serializer.KryoSerializer. Using Kryo can significantly improve performance over Java serialization.
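The Spark properties above can be collected in spark-defaults.conf. The values below are illustrative starting points for a YARN deployment, not tuned recommendations:

```properties
# spark-defaults.conf: illustrative starting values; tune per workload
spark.master                  yarn
spark.executor.memory         4g
spark.executor.cores          4
spark.driver.memory           2g
# Roughly 2-4x the total cores in the cluster
spark.default.parallelism     200
spark.serializer              org.apache.spark.serializer.KryoSerializer
```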
Hive
Critical properties (hive-site.xml)
To edit the configuration file:
nano $HIVE_HOME/conf/hive-site.xml
javax.jdo.option.ConnectionURL:
Description: This property specifies the JDBC connection string for the MySQL database.
Value: jdbc:mysql://<hostname>:<port>/<database>
javax.jdo.option.ConnectionDriverName:
Description: This property defines the JDBC driver class for MySQL.
Value: com.mysql.cj.jdbc.Driver
javax.jdo.option.ConnectionUserName:
Description: This property sets the username for connecting to the Hive metastore database in MySQL
Value: <HIVE_USERNAME>
javax.jdo.option.ConnectionPassword:
Description: This property specifies the password for the Hive metastore database user.
Value: <hive_password>
hive.metastore.uris:
Description: This property specifies the Thrift URI(s) of the Hive metastore service(s). If using a single metastore server, this would be a single URI. If using a high-availability setup with multiple metastore servers, you'd list multiple URIs.
Value: thrift://<hostname>:9083
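A hive-site.xml sketch combining the metastore properties above; hostnames, database name, and credentials are placeholders to substitute:

```xml
<!-- hive-site.xml: illustrative metastore configuration; values are placeholders -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db.example.com:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>change_me</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore.example.com:9083</value>
  </property>
</configuration>
```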
Other Essential Configurations
Disable SELinux:
Edit the SELinux configuration file:
nano /etc/selinux/config
Set SELINUX=disabled, then reboot for the change to take effect.
Stop the firewall service:
sudo systemctl stop firewalld
sudo systemctl disable firewalld
Network Time Protocol (NTP) Synchronization:
Configuration: Ensure all nodes are synchronized to a common NTP server using a service like chronyd.
sudo yum install chrony
sudo systemctl enable --now chronyd
sudo timedatectl set-ntp true
timedatectl status
Reason: Time drift between nodes can cause scheduling issues in YARN.
Linux Swappiness:
Configuration: Set vm.swappiness to a low value (e.g., 1) in /etc/sysctl.conf rather than the kernel default of 60.
Reason: The vm.swappiness parameter influences the kernel's decision on when to move data from RAM to swap space (disk). A higher value means the kernel will be more aggressive in using swap, which stalls JVM-based Hadoop daemons; a low value keeps daemon memory resident.
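The setting can be made persistent in /etc/sysctl.conf; a sketch assuming a low value such as 1, which is common practice on Hadoop nodes to keep daemon memory resident:

```properties
# /etc/sysctl.conf fragment: discourage swapping of Hadoop daemon memory
vm.swappiness=1
```

Load it into the running kernel with sudo sysctl -p (or sudo sysctl -w vm.swappiness=1), and verify with cat /proc/sys/vm/swappiness.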
Log and Disk Management:
Configuration: Implement log rotation policies for all Hadoop daemons to prevent log files from filling up the disk.
Reason: Unmanaged logs can consume all disk space, leading to service failures.
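One way to cap daemon log growth is through the RollingFileAppender settings in $HADOOP_HOME/etc/hadoop/log4j.properties. The fragment below is a sketch assuming the stock Hadoop log4j layout; the size and backup-count values are illustrative:

```properties
# log4j.properties fragment (sketch): cap each daemon log file and
# keep a fixed number of rotated copies so logs cannot fill the disk
hadoop.root.logger=INFO,RFA
hadoop.log.maxfilesize=256MB
hadoop.log.maxbackupindex=20
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.RFA.MaxFileSize=${hadoop.log.maxfilesize}
log4j.appender.RFA.MaxBackupIndex=${hadoop.log.maxbackupindex}
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```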